TWI546686B

TWI546686B - Progressive sequence data mining method and system with dynamic granularity and automatic labeling

Info

Publication number: TWI546686B
Application number: TW104103101A
Authority: TW
Inventors: 蔡曉萍; 盧釔辰
Original assignee: 國立中興大學
Priority date: 2015-01-29
Filing date: 2015-01-29
Publication date: 2016-08-21
Also published as: TW201627885A

Description

Progressive sequence data mining method and system with dynamic granularity and automatic labeling

一種資料探勘方法，尤其是一種具動態粒度與自動標籤之漸進式序列資料探勘方法及系統。 A data exploration method, in particular, a progressive sequence data mining method and system with dynamic granularity and automatic labeling.

先前許多移動軌跡資料的探勘技術採用網格來描述地理，然而先前技術多半採用固定大小的N×M個網格來規劃地理空間，這些技術於實際應用時，可能會遭遇下列問題：1. 網格大小與網格的切割位置不易規劃，網格太大可能會移失一些比較精細的物件移動資訊；網格太小除了會導致總網格數上升，使落在網格內的軌跡經緯度座標點變得稀少，失去統計上的意義，因而丟失一些移動型樣；2. 地理空間的大小及網格的大小直接決定網格之總數，網格的總數影響資料分析時的計算複雜度和所需要的記憶體空間；3. 序列資料的資料量隨著時間不斷增加，使得每次探勘所需的運算成本及時間不斷增加，最終會超出硬體本身所能負荷計算上限；及4. 由於軌跡資料探勘技術的計算複雜度較高，先前技術多半採用客戶端與伺服器端的運作模式(client/servermodel)，將使用者之移動軌跡資料上傳到伺服器端做運算後回傳結果，但此做法除了增加網路通訊的成本外，更增加使用者隱私曝光的風險；5. 先前移動軌跡資料的探勘技術多以整體軌跡資料進行運算，因此當物件的移動習慣有劇烈變動時，例如：使用者突然搬家或離開平常的生活圈，於短時間內的探勘結果容易被先前資料主導，不易快速反應出行為的變化，影響探勘結果的效用。 Previous exploration techniques for moving trajectory data used grids to describe geography. However, most of the prior art used fixed-size N×M grids to plan geospatial space. These techniques may encounter the following problems in practical applications: 1. The grid size and the cutting position of the grid are not easy to plan. If the grid is too large, some fine object movement information may be lost. If the grid is too small, the total number of grids will increase, and the latitude and longitude coordinates of the trajectory falling within the grid will be caused. Points become scarce, lose statistical significance, and thus lose some moving patterns; 2. Geospatial size and grid size directly determine the total number of grids, the total number of grids affects the computational complexity and the data analysis The required memory space; 3. The amount of data of the sequence data increases with time, so that the computational cost and time required for each exploration increase continuously, and eventually exceeds the upper limit of the load calculation of the hardware itself; and 4. Because of the trajectory Data mining technology has higher computational complexity and more prior art Half adopts the client/server model (client/servermodel), uploads the user's mobile track data to the server and performs the operation to return the result. However, in addition to increasing the cost of network communication, it increases the use. The risk of privacy exposure; 5. The exploration technology of the previous moving trajectory data is mostly calculated by the overall trajectory data, so when the moving habit of the object changes drastically, for example, the user suddenly moves or leaves the usual living circle in a short time. The results of the exploration are easily dominated by previous data, and it is not easy to quickly reflect the changes in behavior and affect the effectiveness of the exploration results.

為解決先前技術既有的問題，本發明提出一種動態粒度與自動標籤之漸進式序列資料探勘方法，其包含一第一探勘階段及第二探勘階段，其建立一新累加重要網格表及一新累加機率尾置樹，以該累加機率尾置樹中的預測一序列資料未來產生於一目標空間中的移動軌跡，其中：該第一探勘階段包含下列步驟：步驟1-1、於一新取樣週期中，取得該新取樣週期中的該序列資料，該序列資料包含複數個分佈於該目標空間中的基礎資料；步驟1-2、執行一新增網格階層樹演算法，依據一網格粒度，該目標空間被分割為複數個網格，而每一網格皆對應該網格階層樹中的一網格階層樹節點；步驟1-3、執行一網格細分演算法，依據每一網格階層樹的葉節點對應之該網格中所包含的該基礎資料數目大於一細分門檻時，該網格進行網格分割，使該網格階層樹的葉節點對應產生複數個子節點，該葉節點對應之網格定義為一候選網格；新網格階層樹及該舊累加網格階層樹上有重疊關係且粒度不同的葉節點，調整為相同的粒度的葉節點，並將舊的該葉節點的該基礎資料之數目分配給複數個新產生的該葉節點，並將該葉節點的該基礎資料之數目分配該複數個葉節點；步驟1-4、執行一網格融合演算法，依據各該網格階層樹之葉節點所記錄統計的該基礎資料的數目，判斷該葉節點對應之該候選網格是否為一重要網格及調整該候選網格的大小，並給予該重要網格標籤，所有該重要網格形成一重要網格表；未被判定為重要網格的該網格所對應的的葉節點則與其兄弟節點進行網格融合；步驟1-5、依據該重要網格表產生該重要網格標籤序列資料，以該重要網格標籤序列資料表示於該新取樣週期中該序列資料所包含的該重要網格及該重要網格間的的時間順序關係；及步驟1-6、依據該重要網格標籤序列資料建立於該新取樣週期中一新機率尾置樹，其中該新機率尾置樹代表該新取樣週期中該重要網格標籤序列資料呈現的該移動型樣的集合；而該機率尾置樹中每一該節點代表一個該移動型樣，該移動型樣表示該序列資料中經過該節點所包含的標籤序列所對應的複數個重要網格之後，未來進到某一重要網格的條件機率分佈；該第二探勘階段包含下列步驟：步驟2-1、融合該新網格階層樹與舊累加網格階層樹，形成一新累加網格階層樹，融合該新網格階層樹及該舊累加網格階層樹中對應同一網格的二該網格階層樹葉節點所包含的該基礎資料的數目，以此形成該新累加網格階層樹；步驟2-2、重複步驟1-4之演算內容，依據該新累加網格階層樹中一葉節點所對應之網格中包含該基礎資料之數目，判斷該新累加階層樹的的該節點所對應的該網格是否為重要網格及給予該重要網格標籤；步驟2-3、融合該新取樣週期產生的該新機率尾置樹及舊累加取樣週期產生的該舊累加機率尾置樹，產生該新累加機率尾置樹，而該新累加機率尾置樹代表新累加取樣週期中的該移動型樣集。 In order to solve the problems existing in the prior art, the present invention provides a progressive sequence data mining method for dynamic granularity and automatic labeling, which comprises a first exploration stage and a second exploration stage, which establish a new accumulated important grid table and a A new cumulative probability tail tree, wherein the predicted sequence of data in the tail tree of the cumulative probability is generated in a moving track in a target space, wherein: the first exploration phase comprises the following steps: Step 1-1, New During the sampling period, the sequence data in the new sampling period is obtained, and the sequence data includes a plurality of basic data distributed in the target space; Step 1-2, performing a new grid hierarchy tree algorithm, according to a network Grid granularity, the target space is divided into a plurality of meshes, and each mesh corresponds to a mesh hierarchical tree node in the mesh hierarchical tree; steps 1-3, performing a mesh subdivision algorithm, according to each When the leaf node of a grid hierarchy tree corresponds to the number of the basic data contained in the grid is greater than a subdivision threshold, the grid is mesh-divided to make the grid order The leaf nodes of the layer tree correspondingly generate a plurality of child nodes, and the corresponding grid of the leaf nodes is defined as a candidate mesh; the new mesh hierarchical tree and the leaf nodes of the old accumulated mesh hierarchical tree having overlapping relationships and different granularities are adjusted a leaf node of the same granularity, and assigning the number of the basic data of the old leaf node to the plurality of newly generated leaf nodes, and allocating the number of the basic data of the leaf node to the plurality of leaf nodes; Step 1-4: Perform a mesh fusion algorithm, and determine, according to the number of the basic data recorded by the leaf nodes of the mesh tree, whether the candidate mesh corresponding to the leaf node is an important mesh Adjusting the size of the candidate mesh, and giving the important mesh label, all the important meshes form an important mesh table; the leaf nodes corresponding to the mesh that are not determined to be important mesh are performed with their sibling nodes Grid fusion; Step 1-5, generating the important grid label sequence data according to the important grid table, and expressing the important grid label sequence data in the sequence data in the new sampling period a time-order relationship between the important grid and the important grid; and steps 1-6, establishing a new probability tail tree in the new sampling period according to the important grid label sequence data, wherein the new probability tail tree Representing a set of the moving patterns presented by the important grid tag sequence data in the new sampling period; and each node in the probability tail tree represents a moving pattern, the moving pattern indicating that the sequence data passes through After the plurality of important meshes corresponding to the tag sequence included in the node, the conditional probability distribution of an important mesh is entered in the future; the second exploration phase includes the following steps: Step 2-1, merging the new mesh hierarchy The tree and the old accumulated grid hierarchy tree form a new accumulated grid hierarchy tree, and the new grid hierarchy tree and the corresponding grid of the same accumulated grid are included in the grid node of the same grid. Basic capital The number of materials, thereby forming the new accumulated grid hierarchy tree; step 2-2, repeating the calculation contents of steps 1-4, according to the grid corresponding to a leaf node in the newly accumulated grid hierarchy tree, the basic data is included a number, determining whether the grid corresponding to the node of the new accumulation tree is an important grid and giving the important grid label; Step 2-3, merging the new probability tail tree generated by the new sampling period And the old accumulated probability tail tree generated by the old accumulated sampling period, the new accumulated probability tail tree is generated, and the new accumulated probability tail tree represents the moving pattern set in the new accumulated sampling period.

其具有下列優點： It has the following advantages:

依據上述說明可知本發明之優點如下： According to the above description, the advantages of the present invention are as follows:

1. 透過執行一動態網格演算法，依據各該網格所包含的該基礎資料的數目動態的決定各該重要網格的粒度，使本發明有： 1. By performing a dynamic mesh algorithm, the granularity of each important mesh is dynamically determined according to the number of the basic data included in each grid, so that the present invention has:

(1)避免網格太大而導致遺失較精確的移動型樣的優點。 (1) Avoid the advantage that the mesh is too large to lose the more accurate moving pattern.

(2)避免網格太小而導致總網格數上升。 (2) Avoid the grid being too small and causing the total number of grids to rise.

(3)且每一網格中所包含的該基礎資料數目太少而失去統計意義的優點。 (3) The number of the basic data contained in each grid is too small to lose the statistical significance.

2. 僅對判斷為重要網格的網格進行標籤，大幅減少後續該機率尾置樹中節點之數目，有效的降低運算所需的資源及時間。 2. Label only the mesh that is judged to be an important mesh, greatly reducing the number of nodes in the subsequent tail tree, and effectively reducing the resources and time required for the operation.

3.由於先前技術之軌跡探勘演算過於複雜，使其演算過程必需依賴將該移動軌跡資料上傳至伺服器端運算，增加了個人隱私資料外洩的可能，而本發明透過前述簡化資訊量及降低運算資源需求的優點，使本發明之演算法可於使用者端的個人裝置進行運算，降低隱私曝光的風險。 3. Since the trajectory mapping calculation of the prior art is too complicated, the calculation process must rely on uploading the moving trajectory data to the server end operation, thereby increasing the possibility of leakage of personal privacy data, and the present invention simplifies the information volume and reduces the amount through the foregoing. The advantages of computing resource requirements enable the algorithm of the present invention to perform operations on the user's personal device, reducing the risk of privacy exposure.

4. 由於該移動軌跡資料會因時間而不斷累加，先前技術中以整體序列資料進行運算及探勘使每次資料探勘的成本不斷增加，最終會超出硬體可負荷的運算上限。而本發明以該新累加機率尾置樹進行該序列資料的探勘，使本發明僅需針對重要網格進行運算，無需重新對整體的序列資料進行運算。 4. Since the movement trajectory data will be accumulated over time, the calculation and exploration of the overall sequence data in the prior art increases the cost of each data exploration, and eventually exceeds the upper limit of the hardware loadable operation. The present invention performs the exploration of the sequence data by using the new cumulative probability tail tree, so that the present invention only needs to perform operations on the important grid without recalculating the entire sequence data.

5. 本發明以該時間因子於演算過程中進行調整，使本發明的該移動型樣可依據新進的該序列資料快速的調整，於實際物理意義中，若該目標物突然改變先前習慣，則可透過該時間因子使該新累加機率尾置樹快速的對應修正。 5. The present invention adjusts the time factor in the calculation process, so that the moving pattern of the present invention can be quickly adjusted according to the newly added sequence data. In the actual physical sense, if the target suddenly changes the previous habit, then The new accumulative probability tail tree can be quickly corrected by this time factor.

圖1為本發明較佳實施例之演算步驟示意圖。 1 is a schematic diagram of the calculation steps of a preferred embodiment of the present invention.

圖2為本發明較佳實施例之資料結構演算步驟示意圖。 2 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖3為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 3 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖4為本發明較佳實施例之資料結構演算步驟示意圖。 4 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖5為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 5 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖6為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 6 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖7為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 7 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖8為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 8 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖9為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 9 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖10為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 10 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

圖11為本發明較佳實施例之資料結構演算步驟示意圖。 FIG. 11 is a schematic diagram of a data structure calculation step according to a preferred embodiment of the present invention.

本發明提供一種具動態粒度與自動標籤之漸進式序列資料探勘方法及系統其包含二探勘階段，本發明透過二該探勘階段建立一新累加重要網格表(Incremental mapping table)及一新累加機率尾置樹(Incremental PST_DGAL)，該新累加機率尾置樹上每一節點儲存一移動型樣(Movement Pattern)的資訊，所有的節點表示所有該序列資料中的一移動型樣。一機率尾置樹表示一該序列資料中的所有移動行樣的集合。 The invention provides a progressive sequence data mining method and system with dynamic granularity and automatic labeling, which comprises two exploration stages. The invention establishes a new incremental grid table (Incremental mapping table) and a new cumulative probability through the two exploration stages. Incremental PST _DGAL , each node of the new cumulative probability tail tree stores a message of a Movement Pattern, all nodes representing a moving pattern in all of the sequence data. A probability tail tree represents a collection of all moving patterns in the sequence data.

請參考圖1及2，以該序列資料(SequencesData)由複數個該基礎資料(Data Element)依序組成，該基礎資料之順序可為時間、空間或任何依存關連的排序，例如：設備編碼的序列、物件存取ID(Identification)的序列、存取時間的序列、移動位置的序列、地理坐標位置序列、基因序列、股票走勢資料、商品交易紀錄序列、手寫筆順位置序列、按鍵鍵入序列、病徵出現序列序列、音樂和弦的序列等。該序列資料分析之應用可為分析商品交易記錄序列，以預測較佳之商品促銷方案，或者為醫學病徵分析、物種分析、手寫字辨識分析、音樂旋律辨識分析或輸入字辨識分析等。 Referring to FIG. 1 and FIG. 2, the sequence data (SequencesData) is sequentially composed of a plurality of the basic data (Data Element), and the order of the basic data may be time, space or any dependent order, for example, device coding. Sequence, sequence of ID access (Identification), sequence of access time, sequence of moving position, sequence of geographic coordinate position, gene sequence, stock trend data, commodity transaction record sequence, stylus sequence, key input sequence, symptom Sequence sequences, sequences of musical chords, and the like appear. The application of the sequence data analysis may be to analyze the commodity transaction record sequence to predict a better product promotion plan, or to analyze medical condition, species analysis, handwriting recognition analysis, music melody identification analysis or input word recognition analysis.

於本發明實施例中，該序列資料為一經緯度座標軌跡序列資料(Latitude and Longitude coordinate sequences)，以該序列資料演算出一重要網格標籤序列資料(Labelsequences)演算出一新取樣週期的新機率尾置樹。 In the embodiment of the present invention, the sequence data is a latitude and longitude coordinate sequence data, and a new probability of calculating a new sampling period by using an important grid label sequence data (Labelsequences) is calculated by using the sequence data. Tail tree.

二該探勘階段之第一探勘階段(phase1)為一動態粒度大小與自動標籤階段(DGAL algorithm(Dynamic Granularity and Auto-labeling))，該第二探勘階段(phase2)為該累加機率尾置樹之合併階段(incPST_DGAL(Incremental PST_DGAL Merge))。 The first exploration phase (phase1) of the exploration phase is a dynamic granularity and auto-labeling (DGAL algorithm), and the second exploration phase (phase2) is the cumulative probability tail tree. Consolidation phase (incPST _DGAL (Incremental PST _DGAL Merge)).

其中，該動態粒度與自動標籤之漸進式序列資料探勘方法可於一單一運算裝置中以迴圈的方式探勘個別取樣週期的資料和累加探勘結果，完成複數個取樣週期的運算，或依據不同的取樣週期分別分散至不同的子運算裝置進行分散式的運算，運算完成後再集中至一母運算裝置作統整運算。 The progressive sequence data mining method of the dynamic granularity and the automatic label can search the data of the individual sampling period and the accumulated exploration result in a loop operation in a single computing device, complete the calculation of the plurality of sampling periods, or according to different The sampling period is separately distributed to different sub-operating devices for decentralized operations, and then the operations are completed and then concentrated to a parent computing device for integration operations.

該第一階段依據一新取樣週期(period(i))中的該序列資料建立該新取樣週期中的一重要網格表(mapping table)及該新取樣週期中的該重要網格標籤序列茲了，該新取樣週期指每次於該序列資料中取得資料的時間區段。 The first stage establishes an important mapping table in the new sampling period according to the sequence data in a new sampling period (the period (i)) and the important grid label sequence in the new sampling period. The new sampling period refers to the time period in which the data is acquired in the sequence data each time.

該第一階段包含下列步驟： This first phase consists of the following steps:

步驟1-1、取得該新取樣週期(period(i))中的該序列資料。於本發明實施例中，該序列資料為該經緯度座標軌跡資料(Latitude and Longitude coordinate sequences)，其一基礎資料可包含經緯度及時間等資訊。 Step 1-1: Obtain the sequence data in the new sampling period (period(i)). In the embodiment of the present invention, the sequence data is the Latitude and Longitude coordinate sequences, and the basic data may include information such as latitude and longitude and time.

步驟1-2、請參考圖3、4及5，本步驟執行一新增網格階層樹演算法(newGT algorithm)，由該新增網格階層樹演算法建立該新取樣週期中的該網格階層樹，並以新取樣週期中的該網格階層樹統計該序列資料。 Step 1-2, please refer to FIG. 3, FIG. 4 and FIG. 5, this step performs a new grid function tree (new GT algorithm), and the new grid level tree algorithm is used to establish the network in the new sampling period. The hierarchical tree, and the sequence data is counted by the grid hierarchy tree in the new sampling period.

該序列資料中的該基礎資料對應分佈於一目標空間中，依據一網格粒度(granularity)，該目標空間被分割為複數個網格，使每一該網格均對應該目標空間的局部，該網格間彼此不相重疊，且全部該網格的集合即該目標空間，而每一網格皆對應該網格階層樹中的一網格階層樹節點(node)。於本發明實施例中，該目標空間對應實際的一地理空間，而該序列資料為分佈於該目標空間中的經緯度座標序列資料。 The basic data in the sequence data is correspondingly distributed in a target space, According to a granularity, the target space is divided into a plurality of meshes, so that each of the meshes corresponds to a part of the target space, the meshes do not overlap each other, and all the meshes are collected. That is, the target space, and each grid corresponds to a grid hierarchy tree node in the grid hierarchy tree. In the embodiment of the present invention, the target space corresponds to an actual geospatial, and the sequence data is latitude and longitude coordinate sequence data distributed in the target space.

該網格階層樹節點與該網格粒度具有下列關係： The mesh hierarchy tree node has the following relationship with the mesh granularity:

(1)依據k+1種不同的該網格粒度，該目標空間可有k+1種不同的切割方式，初始時，該網格階層樹是一個具有k+1階層的階層樹。 (1) According to k+1 different mesh granularities, the target space may have k+1 different cutting modes. Initially, the mesh hierarchical tree is a hierarchical tree with k+1 hierarchy.

(2)該網格階層樹中同一階層的網格階層樹節點皆對應相同大小的該網格粒度。 (2) The grid level tree nodes of the same level in the grid hierarchy tree all correspond to the grid size of the same size.

(3)該網格階層樹中較上層的該網格階層樹節點所對應的網格粒度會大於相對下層的該網格階層樹節點所對應的網格粒度。 (3) The mesh granularity corresponding to the upper hierarchical mesh tree node in the mesh hierarchical tree is greater than the mesh granularity corresponding to the lower hierarchical hierarchical hierarchical tree node.

(4)該網格階層樹中，沒有子節點的網格階層樹節點稱為葉節點(leaf node)，在同一階層且具有相同父節點的節點互為兄弟節點(sibling node)，兄弟節點對應的網格互為兄弟網格，該兄弟網格之網格粒度相同。 (4) In the grid hierarchy tree, the grid hierarchy tree nodes without child nodes are called leaf nodes, and the nodes with the same parent node in the same hierarchy are sibling nodes, and the sibling nodes correspond to each other. The meshes are mutually brother meshes, and the mesh size of the sibling grid is the same.

(5)該新網格階層樹中各葉節點所對應的該網格，定義為一候選網格，所有該候選網格的集合即為該目標空間。 (5) The grid corresponding to each leaf node in the new grid hierarchy tree is defined as a candidate grid, and the set of all the candidate grids is the target space.

一基礎資料的值屬於一網格對應的值域範圍，則該基礎資料為該網格所包含。例如一基礎資料的經緯度位於一網格對應的地理空間，則該網格包含該基礎資料。將該候選網格中所分布的該基礎資料的數目統計記錄於該網格所對應的該網格階層樹葉節點。 If the value of a basic data belongs to a range of values corresponding to a grid, the basic data is included in the grid. For example, the latitude and longitude of a basic data is located in a grid. Geospatial, the grid contains the underlying material. The number of the basic data distributed in the candidate grid is statistically recorded in the grid level leaf node corresponding to the grid.

於本發明實施例中，該網格階層樹第0階(Level 0)包含N ₀ ²個節點，對應N ₀ ²個網格。該網格階層樹第1階(Level 1)包含N ₀ ²×d ²之該網格階層樹節點，並對應N ₀ ²×d ²個網格，該網格階層樹第2階(Level 2)包含N ₀ ²×d ⁴之節點，對應N ₀ ²×d ⁴個網格，依此類推，該網格階層樹第i階(Level i)包含(2ⁱ×N ₀)²之節點，對應(2ⁱ×N ₀)²個網格，為易於說明，於本發明實施例中均以d=2為例，但不在此限。 In the embodiment of the present invention, the 0th order (Level 0) of the grid hierarchy tree includes N ₀ ² nodes, corresponding to N ₀ ² grids. The mesh of the first-order hierarchical tree (Level 1) comprising N ₀ ² × d ² of the mesh of the hierarchical tree node, and the corresponding N ₀ ² × d ² meshes, the mesh hierarchy tree of order 2 (Level 2 a node containing N ₀ ² × d ⁴ corresponding to N ₀ ² × d ⁴ grids, and so on, the i-th order (Level i) of the grid hierarchy tree contains nodes of (2 ⁱ × N ₀ ) ² , Corresponding to (2 ⁱ × N ₀ ) ² grids, for ease of description, in the embodiment of the present invention, d=2 is taken as an example, but not limited thereto.

其中： among them:

「N₀」表示矩形的該目標空間邊長與該粒度最大的網格之寬度的比例，網格等分數目 "N ₀ " indicates the ratio of the length of the target space of the rectangle to the width of the grid with the largest granularity, and the number of grids.

即等分數目； Equal to the number of points;

「i」表示階層編號。 "i" indicates the class number.

「d」表示上下階層的網格的寬度比例。 "d" indicates the width ratio of the grid of the upper and lower levels.

進一步的，該新增網格階層樹演算法可依據舊累加取樣週期(period(1~i-1))中所建立的一累加網格階層樹(incGT(1~i-1),incremental grid tree)的葉節點之階層層數決定該新取樣週期中(period(i))該網格階層樹(grid tree)的葉節點之階層層數。於初始取樣週期(period(1))中，新增網格階層樹演算法將該網格階層樹建立至一基礎階層(base level)，該基礎階層為該網格階層樹可建立之最大階層。 Further, the new mesh hierarchy tree algorithm can be based on an accumulated grid hierarchy tree (incGT(1~i-1), incremental grid established in the old accumulated sampling period (period(1~i-1)). The number of hierarchical layers of the leaf nodes of the tree determines the number of hierarchical layers of the leaf nodes of the grid tree in the new sampling period (period(i)). In the initial sampling period (period(1)), a new grid hierarchy tree algorithm is established to establish a grid hierarchy tree to a base level, and the base hierarchy is a grid hierarchy tree. The largest class.

接續於該初始取樣週期之後的該取樣週期(period(i>1))，其中所建立的該網格階層樹之階層數量則不直接建立至該基礎階層，而是參考舊累加取樣週期中所建立的該網格階層樹。於本發明實施例中，於新取樣週期中先建立與舊累加取樣週期相同結構的網格階層樹，並針對被對應一重要網格的一網格階層樹葉節點細分至下一階層。藉此，可透過降低該網格階層樹節點之數量提昇演算之速度。 Continuing the sampling period (period(i>1)) after the initial sampling period, wherein the number of levels of the established hierarchical tree of the grid is not directly established to the basic level, but is referred to in the old accumulated sampling period. The grid hierarchy tree is created. In the embodiment of the present invention, a grid hierarchy tree having the same structure as the old accumulation sampling period is first established in a new sampling period, and is segmented to a next level for a grid-level leaf node corresponding to an important grid. Thereby, the speed of the calculation can be increased by reducing the number of nodes of the grid hierarchy tree.

步驟1-3、請參考圖3及4，進一步的，該新增網格階層樹演算法執行一網格細分演算法，依據舊累加取樣週期的該網格階層樹的葉節點對應之該網格中所包含的該基礎資料(Data Element)數量大於一細分門檻時，該網格可進行網格分割，使該葉節點對應產生複數個子節點。 Steps 1-3, please refer to FIG. 3 and FIG. 4, further, the new mesh hierarchy tree algorithm performs a mesh subdivision algorithm, and the leaf node corresponding to the leaf hierarchy tree of the old accumulated sampling period corresponds to the network. When the number of the basic data (Data Element) included in the cell is greater than a subdivision threshold, the mesh may be mesh-divided so that the leaf node correspondingly generates a plurality of child nodes.

例如，網格對應網格階層樹第0階(Level 0)之一網格階層樹節點，其子網格對應網格階層樹第1階(Level 1)之該網格階層樹節點，以此類推。於本發明實施例中網格分割時，每一網格均d²等分為d²個粒度較小的網格。 For example, the grid corresponds to one of the grid level tree nodes of the 0th order (Level 0) of the grid hierarchy tree, and the subgrid corresponds to the grid hierarchy tree node of the first level (Level 1) of the grid hierarchy tree. analogy. When the grid is divided in the embodiment of the present invention, each of the grids are divided into d ² d ² smaller mesh size.

步驟1-4、請參考圖6及7，本步驟執行一網格融合演算法(gtMerge algorithm)，該網格融合演算法依據各該網格階層樹之葉節點所記錄統計的該基礎資料的數量，判斷一候選網格是否為一重要網格及調整該候選網格的大小，給予每一該重要網格標籤(labeling)(σ)，並依據該重要網格形成該重要網格表。 Steps 1-4, referring to FIG. 6 and FIG. 7, this step performs a grid fusion algorithm (gtMerge algorithm), and the grid fusion algorithm records the basic data according to the statistics of the leaf nodes of the grid hierarchy tree. The quantity, determining whether a candidate mesh is an important mesh and adjusting the size of the candidate mesh, giving each of the important mesh labels (σ), and forming the important mesh table according to the important mesh.

該網格融合演算法針對該序列資料於各該網格中分佈之該基礎資料數目動態決定複數個重要網格，使後續的演算過程僅針對複數個該重要網格進行演算，有效提昇運算效率。 The mesh fusion algorithm dynamically determines a plurality of important grids for the number of the basic data distributed in the grid by the sequence data, so that the subsequent calculation process only performs calculation for a plurality of the important grids, thereby effectively improving the operation efficiency. .

該網格融合演算法依據該候選網格中包含的該基礎資料之數目作為判斷該候選網格是否符合該重要網格之門檻，該候選網格判斷為一重要網格之判斷條件包含： The mesh fusion algorithm determines, according to the number of the basic data included in the candidate mesh, whether the candidate mesh meets the threshold of the important mesh, and the criterion for determining the candidate mesh as an important mesh comprises:

(1)於該網格階層樹上，對應於該網格的該網格階層樹節點為一葉節點。 (1) On the grid hierarchy tree, the grid hierarchy tree node corresponding to the grid is a leaf node.

(2)該候選網格包含之該基礎資料的數目數目大於一重要網格門檻(Cmin)。 (2) The number of the basic data included in the candidate grid is greater than an important grid threshold (Cmin).

(3)該候選網格與其兄弟網格的該基礎資料之數量的一亂度(e)小於一亂度門檻值(Emax)。其中，該亂度的定義如下： (3) A degree of disorder (e) of the amount of the base data of the candidate grid and its sibling grid is less than a chaos threshold (Emax). Among them, the degree of chaos is defined as follows:

「m」表示：兄弟網格的數量；「c_i」表示：第i個兄弟網格包含知該基礎資料的數目；「p_i」表示：第i個網格的機率，；例如：4個該基礎資料之數目分別都是2，其亂度計算如下： "m" means: the number of sibling grids; "c _i " means that the i-th sibling grid contains the number of basic data; "p _i " means: the probability of the i-th grid, For example: the number of the four basic materials is 2, and the turbidity is calculated as follows:

於網格階層樹上，葉節點的物理意義為其對應的網格為目前可能成為重要網格且粒度最小的網格。 On the grid hierarchy tree, the physical meaning of the leaf nodes is that their corresponding grids are the ones that may become important grids with the smallest granularity.

兄弟網格是可以融合成一個較大網格的網格集合，當兄弟網格的亂度低於亂度門檻值的物理意義是兄弟網格的該基礎資料的數目數目有相當的差異，和均勻分佈相反，代表一個較明顯的特徵。 The sibling grid is a grid set that can be merged into a larger grid. When the chaotic degree of the sibling grid is lower than the physical value of the chaotic threshold, the number of the basic data of the sibling grid is quite different, and Uniform distribution is opposite, representing a more pronounced feature.

例如於本發明實施例中定義每一該候選網格中包含的該基礎資料之數目大於2筆(Cmin=2)，且該候選網格與其兄弟網格之亂度低於0.6(即Emax=0.6)時(0.6為4網格資料量均勻分佈時的亂度)，該候選網格則被定義為一重要網格。 For example, in the embodiment of the present invention, the number of the basic data included in each candidate grid is greater than 2 (Cmin=2), and the chaos of the candidate grid and its sibling grid is less than 0.6 (ie, Emax= 0.6) (0.6 is the disorder when the 4 mesh data is evenly distributed), the candidate mesh is defined as an important mesh.

進一步的，前述每一該重要網格需包含之該基礎資料之筆數數目門檻值可依據使用者需求指定，或由使用者給予一比例換算得知，或根據系統整體運算資源考量而調整。於實際物理意義而言，該重要網格可表示該目標物出現於該網格之頻率極高或表示長時間停留於該網格。 Further, the threshold number of the basic data to be included in each of the important grids may be specified according to user requirements, or may be converted by a user, or adjusted according to the overall computing resources of the system. In practical physical terms, the important grid may indicate that the target is extremely high in the grid or that it stays in the grid for a long time.

該網格融合演算法以未被判定為重要網格的候選網格且其兄弟網格皆未被判定為重要網格為網格融合依據，使該網格及其鄰近的該兄弟網格皆符合前述網格融合依據時，該網格融合演算法對該網格進行融合。其中，前述網格融合係將該網格階層樹上對應該網格及其該兄弟網格的葉節點的該基礎資料之數目相加，並儲存於該葉節點的父節點，刪除該父節點的子節點，使該父節點成為一新的葉節點。 The mesh fusion algorithm uses a candidate mesh that is not determined to be an important mesh and its sibling mesh is not determined as an important mesh as a mesh fusion basis, so that the mesh and its neighboring brother meshes are both When the mesh fusion basis is met, the mesh fusion algorithm fuses the mesh. The mesh fusion system adds the number of the basic data corresponding to the grid and the leaf nodes of the sibling grid on the grid hierarchy tree, and stores the data on the parent node of the leaf node, and deletes the parent node. Child node, making the parent node a new leaf node.

步驟1-5、請參考圖7，此步驟依據該重要網格表產生該重要網格標籤序列資料(Label sequences)，以該重要網格標籤序列資料表示於該新取樣週期中該序列資料所包含的該重要網格及重要網格間的時間順序關係。 Step 1-5, please refer to FIG. 7. This step generates the important mesh label sequence data according to the important grid table, and the important grid label sequence data is used to represent the sequence data in the new sampling period. The chronological relationship between the important mesh and the important mesh contained.

步驟1-6、請參考圖8，此步驟依據該重要網格標籤序列資料建立於該新取樣週期中一新機率尾置樹(PST_DGAL)，其中該新機率尾置樹代表該新取樣週期中該重要網格標籤序列資料呈現的該移動型樣的集合。 Step 1-6, please refer to FIG. 8. This step is based on the important grid tag sequence data, and a new probability tail tree (PST _DGAL ) is established in the new sampling period, where the new probability tail tree represents the new sampling period. A collection of the moving patterns presented by the important grid tag sequence data.

該新機率尾置樹之建置過程「build_PST」可參考公式集7，然該機率尾置樹之建置方法不在此限。 The new build rate tree construction process "build_PST" can refer to the formula set 7, but the probability tail tree construction method is not limited to this limit.

其中該機率尾置樹中每一節點都有對應的一標籤序列，此標籤序列可直接儲存在該節點中或者藉由自根節點向下瀏覽(traverse)到該節點所經過的分支(branch)可組合得知。 Each node in the probability tail tree has a corresponding label sequence, and the label sequence can be directly stored in the node or traversed from the root node to a branch through which the node passes. Can be combined to know.

於本發明實施例中，該目標空間對應實際的一地理空間，該序列資料為對應該地理空間上的經緯度座標軌跡資料，該機率尾置樹中每一該節點代表一移動型樣，該移動型樣表示該序列資料中，依序經過該節點所包含的標籤序列所對應的複數個重要網格之後，未來進到某一重要網格的條件機率分佈。 In the embodiment of the present invention, the target space corresponds to an actual geographic space, and the sequence data is latitude and longitude coordinate trajectory data corresponding to the geographic space, and each node in the probability tail tree represents a moving pattern, the movement The pattern indicates the conditional probability distribution of the important mesh in the sequence after passing through the plurality of important meshes corresponding to the tag sequence included in the sequence.

舉例而言，若某一節點所表示之該移動型樣的標籤序列為s,s=ghi，而其節點所儲存的條件機率分佈表示該目標物於未來可能走向某一該重要網格的機率，例如該目標物於未來走向重要網格「a」的條件機率應為「p(a|s)」，代表該序列資料中依序經過該重要網格「g」、「h」、「i」後，進一步經過於重要網格「a」之機率。透過該機率尾置樹的建立。基於一重要網格標籤序列資料，一移動型樣中之判斷包含下列條件：移動型樣判斷條件1、p(s)>P_min移動型樣判斷條件2、 σ Σ_i，使 For example, if the tag sequence of the mobile pattern represented by a node is s, s=ghi, and the conditional probability distribution stored by the node indicates the probability that the target may move toward a certain important mesh in the future. For example, the probability that the target will move toward the important grid "a" in the future should be "p(a|s)", which means that the important data "g", "h", "i" are sequentially passed through the sequence data. After that, the probability of further passing through the important grid "a". Through the establishment of the probability tail tree. Based on an important grid tag sequence data, the judgment in a moving pattern includes the following conditions: moving pattern judgment condition 1, p (s) > P _min moving pattern judgment condition 2 σ Σ _i , make

其中，上述公式之代號意義如下：「Σ_i」為該新取樣週期之該新重要網格標籤集；「s」為該移動型樣的標籤序列，s(s=s₀s₁...s_l-1,s_i Σ_1,i)「p(s)」為標籤序列s在該重要網格標籤序列資料中出現之機率；「p_min」為標籤序列s出現機率的門檻值；「σ」為一標籤變數，對應該新重要網格標籤集之一標籤，即σ Σ_i；「p(σ|s)」為在該重要網格標籤序列資料中，目標物經過s之後，未來經過σ的條件機率；「γ」為父子節點差異比參數；「γ_min」為最小條件機率參數；及「suffix(s)」為標籤序列s的尾置序列，例如：suffix(ghi)=hi。 The code name of the above formula is as follows: "Σ _i " is the new important grid label set of the new sampling period; "s" is the label sequence of the moving pattern, s (s = s ₀ s ₁ ... s _l-1 , s _i Σ _1,i ) "p(s)" is the probability that the tag sequence s appears in the important grid tag sequence data; "p _min " is the threshold value of the probability of occurrence of the tag sequence s; "σ" is a tag variable, One of the labels of the new important grid label set, ie σ Σ _i ; "p(σ|s)" is the conditional probability of the future passing σ after the target passes s in the important grid tag sequence data; "γ" is the parent-child difference ratio parameter; "γ _min " is The minimum conditional probability parameter; and "suffix(s)" is the tail sequence of the tag sequence s, for example: suffix(ghi)=hi.

其中該移動型樣之判斷流程如下：移動型樣判斷條件1：前述移動型樣判斷條件1用於限定找出具代表性的該移動型樣，p(s)表示標籤序列s在該重要網格標籤序列資料出現的機率，當此機率值越大，表示該移動型樣越重要；移動型樣判斷條件2：進一步限定只要存在一條件機率p(σ|s)大於γ_min，且此條件機率與其父節點對應的條件機率p(σ|suffix(s))差γ倍，則該移動型樣表示相對特殊的該移動型樣。 The judgment process of the moving pattern is as follows: Moving pattern judgment condition 1: The foregoing moving pattern judgment condition 1 is used to define a representative movement pattern, and p(s) indicates that the label sequence s is in the important grid. The probability of the occurrence of the tag sequence data, the greater the probability value, the more important the mobile pattern; the moving pattern judgment condition 2: further defined as long as there is a conditional probability p(σ|s) greater than γ _min , and the conditional probability The conditional probability p(σ|suffix(s)) corresponding to its parent node is γ times different, and the moving pattern represents a relatively special moving pattern.

進一步的，該機率尾置樹的建置包含一階層限制(L_max)，限制移動型樣之標籤序列的最大長度，也就是該階層限制表示該機率尾置樹可建立的最深階層，例如若該階層限制為「3」，則該機率尾置樹最多可建立至第3層。p(s)L_max Further, the establishment of the probability tail tree includes a hierarchical constraint (L _max ), which limits the maximum length of the label sequence of the mobile pattern, that is, the hierarchy constraint indicates the deepest level that the probability tail tree can establish, for example, If the level is limited to "3", the probability tail tree can be established up to layer 3. p(s)L _max

請參考圖1，該第二階段之運算過程將該舊累加取樣週期(period(1)~period(i-1))的該舊累加網格階層樹(incGT(1~i-1))與新取樣週期(period(i))中所運算出的該新網格階層樹(GT(i))結合，得到代表一新累加取樣週期(period(1)~period(i))中的該新累加網格階層樹(incGT(1~i))，以及該新累加網格階層樹運算出對應之該新累加重要網格表(MT(1~i))。將舊累加取樣週期的該舊累加機率尾置樹(incPST_DGAL(1~i-1))與新取樣週期的該新機率尾置樹(PST_DGAL(i))結合，得到對應一新累加取樣週期(period(1~i))的一新累加機率尾置樹(incPST_DGAL(1~i))。 Referring to FIG. 1, the second stage operation process uses the old accumulated grid hierarchy tree (incGT(1~i-1)) of the old accumulated sampling period (period(1)~period(i-1)). The new mesh hierarchy tree (GT(i)) calculated in the new sampling period (period(i)) is combined to obtain the new one in the new accumulated sampling period (period(1)~period(i)). The accumulated grid level tree (incGT(1~i)), and the new accumulated grid level tree are calculated to correspond to the new accumulated important grid table (MT(1~i)). Combine the old cumulative probability tail tree (incPST _DGAL (1~i-1)) of the old accumulated sampling period with the new probability tail tree (PST _DGAL (i)) of the new sampling period to obtain a corresponding new cumulative sampling. A new cumulative probability tail tree of the period (1~i) (incPST _DGAL (1~i)).

而該新累加機率尾置樹表示於該新累加取樣週期中的該移動型樣集，並以該移動型樣集預測該目標物未來於該目標空間中移動的軌跡。 The new accumulated probability tail tree represents the moving pattern set in the new accumulated sampling period, and predicts the trajectory of the target moving in the target space with the moving pattern set.

進一步的，該新累加機率尾置樹係透過融合該新取樣週期的該新機率尾置樹及該與該舊取樣累加週期的該舊累加機率尾置樹而產生。其中該新機率尾置樹及該舊累加機率尾置樹可分別由不同的子運算裝置運算後，再回傳至母運算裝置運算出該新累加機率尾置樹。 Further, the new cumulative probability tail tree is generated by fusing the new probability tail tree of the new sampling period and the old accumulated probability tail tree with the old sampling accumulation period. The new probability tail tree and the old accumulated probability tail tree can be respectively operated by different sub-operation devices, and then returned to the parent computing device to calculate the new cumulative probability tail tree.

步驟2-1、請參考圖10，本步驟中執行一網格階層樹融合演算法(merge2GTs algorithm)，該網格階層樹演算法融合於舊累加取樣週期運算出的該舊累加網格階層樹及第一階段運算中於新取樣週期中演算出的該新網格階層樹，得到表示該新累加取樣週期的該新累加網格階層樹。 Step 2-1, please refer to FIG. 10, in this step, a mesh hierarchical tree fusion algorithm (merge2GTs algorithm) is implemented, and the grid hierarchical tree algorithm is integrated into the old accumulated grid hierarchical tree calculated by the old accumulated sampling period. And the new grid hierarchy tree calculated in the new sampling period in the first stage operation, and the new accumulated grid hierarchy tree indicating the new accumulated sampling period is obtained.

其中，該新網格階層樹與該舊累加網格階層樹之融合為加總網格階層樹中對應同一網格的二該網格階層樹葉節點所包含的該基礎資料的數目數目，以此形成該新累加網格階層樹。 The fusion of the new grid hierarchy tree and the old accumulation grid hierarchy tree is the number of the basic data included in the two grid-level leaf nodes corresponding to the same grid in the aggregate grid hierarchy tree. The new accumulated grid hierarchy tree is formed.

其中，該新累加取樣週期即為由該初始週期至新取樣週期的累加，故當下時點的舊累加取樣週期即為前一取樣時點所產生的該新累加取樣週期。舉例而言，若當下時點的新取樣週期為第i週期，則舊累加取樣週期為第1~i-1週期，而該新取樣週期與該舊累加取樣週期形成的該新累加取樣週期為第1~i週期。 The new accumulated sampling period is the accumulation from the initial period to the new sampling period, so the old accumulated sampling period at the current time point is the new accumulated sampling period generated by the previous sampling point. For example, if the new sampling period of the current time point is the ith period, the old accumulated sampling period is the 1~i-1 period, and the new sampling period formed by the new sampling period and the old accumulated sampling period is the first 1~i period.

進一步的，若新網格階層數與舊累加網格階層數的二葉節點的階層不同且二該葉節點所對應的網格有重疊關係，則將層數編號較低的該葉節點分割，使新產生的葉節點的階層與二該葉節點階層編號較高者之階層一致，並將該葉節點的該基礎資料的數目分配給新產生的子節點，再加總對應葉節點的該基礎資料的數目。 Further, if the number of new mesh layers is different from the level of the two-leaf node of the old accumulated mesh hierarchy and the mesh corresponding to the two leaf nodes has an overlapping relationship, the leaf node with a lower number of layers is segmented, so that The level of the newly generated leaf node is identical to the level of the node whose node number is higher, and the number of the basic data of the leaf node is The destination is assigned to the newly generated child node, and the total number of the base data corresponding to the leaf node is added.

例如，於圖9中，該新網格階層樹之階層0編號5的網格與該舊累加網格階層樹階層1編號16,17,22和23的網格有重疊關係，先將該新網格階層樹之階層0編號5的網格分割成4個子網格16,17,22和23，再與該舊累加網格階層樹之階層1編號16,17,22和23的網格各自加總，形成該新累加網格階層樹；其中，將該葉節點的該基礎資料的數目分配給新產生的葉節點時，可按新產生的葉節點對應之網格粒度等比例分配。 For example, in FIG. 9, the grid of the hierarchy 0 number 5 of the new grid hierarchy tree overlaps with the grid of the old accumulation grid hierarchy tree hierarchy numbers 16, 17, 22, and 23, first new The grid of the grid level tree 0 is divided into 4 sub-grids 16, 17, 22 and 23, and then the grids of the hierarchical 1 of the old accumulated grid hierarchy are numbered 16, 17, 22 and 23 respectively. Adding, the new accumulated grid hierarchy tree is formed; wherein, when the number of the basic data of the leaf node is allocated to the newly generated leaf node, the grid granularity corresponding to the newly generated leaf node may be equally distributed.

步驟2-2、本步驟重複步驟1-3~步驟1-5所述之內容，依據每該累加網格階層樹的葉節點所包含的該基礎資料的數目判斷該重要網格及調整候選網格大小，給予該重要網格該標籤(labeling)並依據該重要網格形成對應新累加取樣週期的該新累加重要網格表(MT(i))。 Step 2-2. In this step, the content described in steps 1-3 to 1-5 is repeated, and the important grid and the adjustment candidate network are determined according to the number of the basic data included in the leaf nodes of the accumulated grid hierarchy tree. The grid size is given to the important grid labeling and the new accumulated vital grid table (MT(i)) corresponding to the new accumulated sampling period is formed according to the important grid.

進一步的，由於第二階段中係融合該新網格階層樹及該舊累加網格階層樹，故使用於判斷該重要網格的門檻值可對應修正。於本發明實施例中，該新累加網格階層樹的每一該網格階層樹節點所包含的該基礎資料數目大於4筆時，該節點對應之網格為一重要網格。 Further, since the new mesh hierarchical tree and the old accumulated mesh hierarchical tree are merged in the second phase, the threshold value used for determining the important mesh may be correspondingly corrected. In the embodiment of the present invention, when the number of the basic data included in each of the mesh hierarchical tree nodes of the newly accumulated mesh hierarchical tree is greater than four, the grid corresponding to the node is an important mesh.

步驟2-3，如圖9，本步驟執行建置累加機率尾置樹(build_incPST_DGAL)演算法，其融合步驟1-5產生的該新累加機率尾置樹和該舊累加取樣週期產生的該舊累加機率尾置樹，運算出一新累加機率尾置樹(insPST_DGAL(1~i))，而該新累加機率尾置樹代表於該新累加取樣週期)中的該移動型樣集。 Step 2-3, as shown in FIG. 9, this step performs an implementation of an accumulation probability tail tree (build_incPST _DGAL ) algorithm, which combines the new accumulated probability tail tree generated in steps 1-5 and the old accumulated sampling period. The old cumulative probability tail tree computes a new cumulative probability tail tree (insPST _DGAL (1~i)), and the new accumulated probability tail tree represents the moving pattern set in the new accumulated sampling period.

於步驟1-6中執行的建置機率尾置樹演算法係透過該重要網格標籤序列資料進行機率計算，而本步驟執行之該建置累加機率尾置樹(build_incPST_DGAL)演算法中的機率計算，係基於步驟1-5產生的該新機率尾置樹和該舊累加取樣週期中產生的該舊累加機率尾置樹計算而得，並同時於演算過程中考量權重計算。 The build probability tail tree algorithm executed in steps 1-6 performs probability calculation through the important grid tag sequence data, and the built-in cumulative probability tail tree (build_incPST _DGAL ) algorithm executed in this step The probability calculation is calculated based on the new probability tail tree generated in steps 1-5 and the old accumulated probability tail tree generated in the old accumulated sampling period, and the weight calculation is considered in the calculation process at the same time.

於本發明實施例中，由於該網格階層樹演算法動態地調整該重要網格的大小，使該重要網格大小不一，而產生一網格匹配問題，使該新機率尾置樹和舊累加機率尾置樹之融合無法以單純地將該網格標籤序列資料相同的該節點所儲存的條件機率相加完成。該網格匹配問題係指，因每一取樣週期中相同的目標空間所對應劃分的該重要網格的數目和大小不盡相同，進而產生於新機率尾置樹和舊累加機率尾置樹中具有相同的該標籤序列的節點卻實際對應不同大小或範圍的網格。因此為解決該網格匹配問題，在融合該新機率尾置樹和該舊累加機率尾置樹時，包含一新累加機率(p(σ))和一新累加條件機率(p(σ|s))的計算，以及一重疊標籤序列的融合估計權重(W)。 In the embodiment of the present invention, since the grid hierarchy tree algorithm dynamically adjusts the size of the important grid, the important grid size is different, and a grid matching problem is generated, so that the new probability tail tree and The fusion of the old cumulative probability tail tree cannot be completed by simply adding the conditional probability stored by the node with the same grid tag sequence data. The grid matching problem means that the number and size of the important grids corresponding to the same target space in each sampling period are different, and thus are generated in the new probability tail tree and the old cumulative probability tail tree. Nodes with the same sequence of tags actually correspond to meshes of different sizes or ranges. Therefore, in order to solve the mesh matching problem, when the new probability tail tree and the old accumulated probability tail tree are merged, a new cumulative probability (p(σ)) and a new cumulative condition probability (p(σ|s) are included. )), and the fusion estimate weight (W) of an overlapping tag sequence.

該新累加機率p(σ)的計算係根據與該新累加週期的標籤σ分別有重疊關係的新週期重疊標籤的集合和舊累加取樣週期重疊標籤的集合計算。 The calculation of the new cumulative probability p(σ) is calculated based on the set of new periodic overlapping labels and the overlapping set of old accumulated sampling periods, respectively, in an overlapping relationship with the label σ of the new accumulation period.

該新累加條件機率(p(σ|s))的計算，係根據與該新累加取樣週期的標籤序列「s」與「sσ」分別有重疊關係的新取樣週期重疊標籤序列集和舊累加取樣週期重疊標籤序列集計算。 The calculation of the new cumulative conditional probability (p(σ|s)) is based on the new accumulated sampling The new sample period overlap label sequence set and the old accumulation sample period overlap label sequence set in which the periodic label sequence "s" and "sσ" have overlapping relationships, respectively.

其中，新取樣週期重疊標籤序列集或舊累加重疊標籤序列集的定義如下：給予該新累加取樣週期之一標籤序列s(s=s(s=s₀s₁...s_l-1,s_i Σ_1,i))，一新取樣週期或舊累加取樣週期標籤序列q(q=q₀q₁...q_l-1,q_i Σ_i或，q_i Σ_1,i-1)。如符合下列條件，則該新取樣週期或該舊累加取樣週期之該標籤序列「q」是該標籤序列「s」的新取樣週期重疊標籤序列或舊累加重疊標籤序列：標籤序列q的每一標籤q_i對應的網格和s_i對應的網格部分或全部重疊。該標籤序列s所有的該新週期重疊標籤序列/舊累加重疊標籤序列稱為新週期重疊標籤序列集/舊累加重疊標籤序列集。 Wherein, the new sampling period overlap label sequence set or the old accumulated overlap label sequence set is defined as follows: a label sequence s is given to one of the new accumulated sampling periods (s=s(s=s ₀ s ₁ ...s _l-1 , s _i Σ _1,i )), a new sampling period or an old accumulated sampling period label sequence q (q = q ₀ q ₁ ... q _l-1 , q _i Σ _i or, q _i Σ _1,i-1 ). If the following conditions are met, the tag sequence "q" of the new sampling period or the old accumulated sampling period is a new sampling period overlapping label sequence or an old accumulated overlapping label sequence of the label sequence "s": each of the label sequences q The grid corresponding to the label q _{i and} the grid corresponding to the s _i overlap partially or completely. The new cycle overlap tag sequence/old accumulation overlap tag sequence of the tag sequence s is referred to as a new cycle overlap tag sequence set/old accumulation overlap tag sequence set.

其中，一新累加取樣週期的該標籤序列的融合估計權重計算包括新取樣週期重疊標籤序列合估計權重()和舊累加取樣週期重疊標籤序列的融合估計權重()的計算，將依序對應之標籤的匹配權重相乘得之。 The fusion estimation weight calculation of the tag sequence of a new accumulated sampling period includes a new sampling period overlapping tag sequence and an estimated weight ( And the fusion estimation weight of the overlapped tag sequence with the old accumulated sampling period ( The calculation of the matching weights of the labels corresponding to the order is multiplied.

該新取樣週期重疊標籤序列融合估計權重計算公式如下：該舊累加取樣週期重疊標籤序列的融合估計權重計算公式如下：其中，計算公式的代號意義如下：「s」：一新累加取樣週期的標籤序列，s=s₀s₁...s_l-1,s_i Σ_1,i，該標籤序列的長度係該序列中標籤的個數，即|s|=1；新重要網格標籤集(Σ_i)：新取樣週期的重要網格的標籤集；舊累加重要網格標籤集(Σ_1,i-1)：舊累加取樣週期(period(1)~period(i-1))的重要網格的標籤集；及新累加重要網格標籤集(Σ_1,i)：新累加取樣週期(period(1)~period(1))的重要網格的標籤集。 The new sampling period overlap label sequence fusion estimation weight calculation formula is as follows: The calculation formula of the fusion estimation weight of the overlapped label sequence of the old accumulated sampling period is as follows: Among them, the meaning of the formula of the calculation formula is as follows: "s": a sequence of labels of a new accumulated sampling period, s = s ₀ s ₁ ... s _l-1 , s _i Σ _1,i , the length of the label sequence is the number of labels in the sequence, ie |s|=1; the new important grid label set (Σ _i ): the label set of the important grid of the new sampling period; the old accumulation Important grid label set (Σ _1,i-1 ): the label set of the important grid of the old accumulated sampling period (period(1)~period(i-1)); and the new accumulated important grid label set (Σ _{1 , i} ): The set of labels for the important grid of the new cumulative sampling period (period(1)~period(1)).

標籤匹配型態1、單一網格對應單一網格：請參考圖10，第一標籤所對應的網格和第二標籤所對應的網格相同，其於不同取樣週期之機率尾置樹融合運算中所對應的該匹配權重為「1」，即；標籤匹配型態2、多重網格對應單一網格：請參考圖10，第一標籤所對應的網格和第二標籤所對應的網格部分重疊，且第一標籤所對應的網格粒度較小，其於不同取樣週期之機率尾置樹融合運算中所對應的該匹配權重為「1」，即；及標籤匹配型態3、單一網格對應多重網格：請參考圖10，第一標籤所對應的網格和第二標籤所對應的網格部分重疊，且第一標籤對應的網格粒度較大，其於不同取樣週期之機率尾置樹融合運算中所對應的該匹配權重為「第二標籤對應的網格和與第一標籤對應的網格重疊之新累加週期的重要網格的粒度和的比例」，即，其中，...為與σ_i重疊之新累加週期的標籤，且σ_j {...}。 Label matching type 1, single grid corresponds to a single grid: Please refer to FIG. 10, the grid corresponding to the first label is the same as the grid corresponding to the second label, and the probability of tail sampling is different in different sampling periods. The matching weight corresponding to this is "1", that is, ; tag matching type 2, multi-grid corresponding to a single grid: Please refer to FIG. 10, the grid corresponding to the first label and the grid corresponding to the second label partially overlap, and the mesh size corresponding to the first label Smaller, the matching weight corresponding to the probability tailing tree fusion operation in different sampling periods is "1", that is, And the label matching type 3, the single grid corresponds to the multi-grid: Please refer to FIG. 10, the grid corresponding to the first label and the grid corresponding to the second label partially overlap, and the mesh size corresponding to the first label Larger, the matching weight corresponding to the probabilistic tail-tree fusion operation in different sampling periods is "the important grid of the new accumulation period in which the grid corresponding to the second label overlaps with the grid corresponding to the first label. Granularity and ratio", ie ,among them, ... a label for the new accumulation period that overlaps σ _i , and σ _j { ... }.

請參考圖11，其中，新取樣週期或舊累加取樣週期中標籤「g」與新累加取樣週期的標籤「e」屬於標籤匹配型態1，該融合估計權重為「1」。 Please refer to FIG. 11 , in which the label “g” in the new sampling period or the old accumulated sampling period and the label “e” in the new accumulated sampling period belong to the label matching type 1, and the fusion estimation weight is “1”.

新取樣週期或舊累加取樣週期中標籤「d」與新累加取樣週期的標籤「c」屬於標籤匹配型態2，該融合估計權重為「1」。 The label "d" in the new sampling period or the old accumulated sampling period and the label "c" in the new accumulated sampling period belong to the label matching type 2, and the fusion estimation weight is "1".

新取樣週期或舊累加取樣週期中標籤「h」與新累加取樣週期的標籤「g」及「f」屬標籤匹配型態3，該融合估計權重為「」，其運算式為： The label "h" in the new sampling period or the old accumulated sampling period and the labels "g" and "f" of the new accumulated sampling period are label matching type 3, and the fusion estimation weight is " ", its expression is:

更進一步，融合的該新機率尾置樹和該舊累加機率尾置樹最基本的計算是新累加機率p(σ)和新累加條件機率(p(σ|s))的計算。 Further, the most basic calculation of the merged new probability tail tree and the old cumulative probability tail tree is the calculation of the new cumulative probability p(σ) and the new accumulated conditional probability (p(σ|s)).

「build_incPSTDGAL」演算法為本發明融合新機率尾置樹和舊累加機率尾置樹的方法之一，在此不限定。 The "build_incPSTDGAL" algorithm is one of the methods for the fusion of the new probability tail tree and the old cumulative probability tail tree in the present invention, and is not limited herein.

該「build_incPSTDGAL」演算法過程請參考「公式集1」。 Please refer to "Formula Set 1" for the "build_incPSTDGAL" algorithm process.

如圖10所示，假設舊累加取樣週期的資料長度為10，以該舊累加機率尾置樹預測bc和bcd出現的機率分別為和，該新取樣週期的長度為5，以該新機率尾置樹預測be和bef出現的機率為分別為和P₂(bef)=0.25，新累加週期之一節點cd的條件機率p(a|cd)的計算步驟請參考公式集4。 As shown in FIG. 10, assuming that the data length of the old accumulated sampling period is 10, the probability of occurrence of bc and bcd in the old accumulated probability tail tree is respectively with The length of the new sampling period is 5, and the probability that the new probability tail tree predicts be and bef is And P ₂ (bef)=0.25, the calculation procedure of the conditional probability p(a|cd) of one of the nodes cd of the new accumulation period is referred to Equation Set 4.

請參考圖12，於本發明實施例中，以該累加機率尾置樹預測形成一軌跡“nokjfb”的機率(P(“nokjfb”))，其運算式為請參考公式集8。 Referring to FIG. 12, in the embodiment of the present invention, the probability of forming a track "nokjfb" (P("nokjfb")) is predicted by the accumulated probability tail tree, and the calculation formula is Refer to Equation 8.

本發明實施例於新機率尾置樹、舊累加機率尾置樹融合的運算中加入一時間因子(aging parameter)，以該時間因子對不同取樣週期索取的該標籤網格序列資料作不同權重的調整，使其運算結果可依使用者需求進行調整。 In the embodiment of the present invention, an aging parameter is added to the operation of the new probability tail tree and the old cumulative probability tail tree fusion, and the label grid sequence data obtained by different sampling periods is weighted differently by the time factor. Adjust so that the results of the calculation can be adjusted according to user needs.

其中該時間因子的應用可為：對應資料容量比例，例如，新進的該標籤網格序列資料暫總資料容量的5分之1，則融合時該新機率尾置樹乘上的權重植為1/5；對應資料產生時間，例如對應突然改變的移動習慣，為使該新累加機率尾置樹快速的調整，給予新機率尾置樹較高權重值；又例如，為保持舊累加機率尾置樹的預測結果，給予舊累加機率尾置樹較高的權重值；或者可依使用者經驗自行設定該時間因子。以上為該時間因子之實施範例，但實際應用不在此限。 The application of the time factor may be: corresponding data capacity ratio, for example, 1/1 of the temporary data capacity of the newly added tag grid sequence data, and the weight of the new probability tail tree multiplication is 1 when the fusion is performed. /5; corresponding data generation time, for example, corresponding to the sudden change of mobile habits, in order to make the new accumulative probability tail tree quickly adjust, giving the new probability tail tree a higher weight value; for example, to keep the old accumulative probability tail The predicted result of the tree gives a higher weight value to the old cumulative probability tail tree; or the time factor can be set according to the user experience. The above is an example of the implementation of the time factor, but the actual application is not limited to this.

結合一時間因子參數「α」的該新累加機率的計算公式請參考公式集5。 For the calculation formula of the new cumulative probability combined with a time factor parameter "α", please refer to Equation 5.

結合該時間因子參數的該新累加條件機率的計算公式請參考公式集6。 For the calculation formula of the new cumulative condition probability combined with the time factor parameter, please refer to Equation 6.

公式集1：「build_incPSTDGAL」演算法過程Formula set 1: "build_incPSTDGAL" algorithm process

於「build_incPSTDGAL」演算法中getProb(σ,T₁,MT₁,T₂,MT₂)係計算新累加機率p(σ)，於計算該新累加機率p(σ)時，T₁對應該新機率尾置樹T_i，MT₁對應該新週期重要網格表，T₂對應該舊累加機率尾置樹T_1,i-1，MT₂對應該舊累加重要網格表，該新累加機率的計算公式請參考公式集2。演算法中getCProb(s,σ,T₁,MT₁,T₂,MT₂)係計算的新累加機率p(σ|s)，該新累加條件機率的計算公式請參考公式集3。 In the "build_incPSTDGAL" algorithm, getProb(σ, T ₁ , MT ₁ , T ₂ , MT ₂ ) calculates the new cumulative probability p(σ). When calculating the new cumulative probability p(σ), T ₁ corresponds to the new one. The probability tail tree T _i , MT ₁ corresponds to the new cycle important grid table, T ₂ corresponds to the old cumulative probability tail tree T _{1, i-1} , MT ₂ corresponding to the old accumulated important grid table, the new cumulative probability Please refer to Equation 2 for the calculation formula. In the algorithm, getCProb(s, σ, T ₁ , MT ₁ , T ₂ , MT ₂ ) is the new cumulative probability p(σ|s) calculated. For the calculation formula of the new cumulative condition probability, please refer to Equation 3.

公式集2：新累加機率的計算公式Formula 2: Calculation formula for the new cumulative probability

其中， among them,

算式(1)為「」 Equation (1) is " "

算式(2)為「」 Equation (2) is " "

公式集3：新累加條件機率的計算公式Formula 3: Formula for calculating the probability of a new cumulative condition

其中， among them,

算式(3)為「」 Equation (3) is " "

算式(4)為「」 Equation (4) is " "

算式(5)為「」 Equation (5) is " "

算式(6)為「」 Equation (6) is " "

「s」：一新累加週期的標籤序列。一標籤序列的長度係該序列中標籤的個數，例如s=s₀s₁...s_l-1的長度以|s|表示，|s|=1；「cnt_i(n)/cnt_1,i-1(n)」：新個別週期/舊累加週期中長度為n的標籤序列的出現次數，其中，該長度為n的標籤序列的出現次數可以該標籤序列資料的總長度L估算，或以L-n+1估算，亦可另行紀錄該標籤序列資料中的標籤序列數目seqNo，以L-seqNo(n-1)估算。 "s": A sequence of tags for a new accumulation cycle. The length of a tag sequence is the number of tags in the sequence, for example, the length of s=s ₀ s ₁ ... s _l-1 is represented by |s|, |s|=1; "cnt _i (n)/cnt _{1, i-1} (n)": the number of occurrences of the tag sequence of length n in the new individual cycle/old accumulation cycle, wherein the number of occurrences of the tag sequence of length n can be estimated from the total length L of the tag sequence data Or, estimated by L-n+1, the number of tag sequences in the tag sequence data seqNo may be separately recorded and estimated by L-seqNo(n-1).

「T_i/T_1,i-1」：新機率尾置樹/舊累加機率尾置樹。 "T _i /T _1,i-1 ": new probability tail tree / old cumulative probability tail tree.

「」：以新機率尾置樹/舊累加機率尾置樹預測新週期重疊標籤序列集q/舊累加重疊標籤序列集t出現的機率。 " ”: The probability of occurrence of the new period overlap label sequence set q/old accumulated overlap label sequence set t is predicted by the new probability tail tree/old accumulation probability tail tree.

「」(新週期重疊標籤序列集)：一新累加週期的標籤序列s的新週期重疊標籤序列的集合。 " (New Period Overlapping Label Sequence Set): A collection of new periodic overlapping label sequences for a new accumulated period of the label sequence s.

「」(舊累加重疊標籤序列集)：一新累加週期的標籤序列s的舊累加重疊標籤序列的集合。 " (Old Accumulated Overlapping Tag Sequence Set): A collection of old accumulated overlapping tag sequences for a new accumulated cycle of the tag sequence s.

公式集4：Formula set 4:

其中， among them,

算式(7)為「」 Equation (7) is " "

算式(8)為「」 Equation (8) is " "

算式(9)為「」 Equation (9) is " "

算式(10)為「」 Equation (10) is " "

公式集5：Formula 5:

其中， among them,

算式(11)為「」 Equation (11) is " "

算式(12)為「」 Equation (12) is " "

公式集6：Formula 6:

其中：among them:

算式(13)為「cnt1,i-1(|sσ|)」 Equation (13) is " Cnt1,i-1(|sσ|)"

算式(14)為「」 Equation (14) is " "

算式(15)為「」 Equation (15) is " "

算式(16)為「」 Equation (16) is " "

公式集7：「build_PST」演算法過程 Formula 7: "build_PST" algorithm process

公式集8：Formula 8:

Claims

A progressive sequence data mining method for dynamic granularity and automatic labeling, comprising a first exploration stage and a second exploration stage, which establish a new accumulated important grid table and a new cumulative probability tail tree, with the cumulative probability tail The predicted sequence data in the tree is generated in a moving trajectory in a target space, wherein: the first exploration phase comprises the following steps: Step 1-1: obtaining the new sampling period in a new sampling period Sequence data, the sequence data includes a plurality of basic data distributed in the target space; Step 1-2, performing a new grid hierarchy tree algorithm, and the target space is divided into a plurality of networks according to a grid granularity a grid, and each grid corresponds to a grid hierarchy tree node in the grid hierarchy tree; steps 1-3, performing a grid subdivision algorithm, according to the network corresponding to the leaf nodes of each grid hierarchy tree When the number of the basic data included in the cell is greater than a subdivision threshold, the mesh is mesh-divided, so that the leaf nodes of the mesh hierarchical tree correspondingly generate a plurality of child nodes, and the mesh corresponding to the leaf nodes It is defined as a candidate mesh; mesh hierarchy tree and a new accumulation grid hierarchical tree old overlapping relationship and different particle sizes leaf nodes, leaf node is adjusted to the same particle size Pointing, and assigning the number of the basic data of the old leaf node to the plurality of newly generated leaf nodes, and allocating the number of the basic data of the leaf node to the plurality of leaf nodes; Step 1-4, performing a mesh fusion algorithm determines whether the candidate mesh corresponding to the leaf node is an important mesh and adjusts the candidate mesh according to the number of the basic data recorded by the leaf nodes of the mesh hierarchical tree. Size, and given the important grid label, all the important grids form an important grid table; the leaf nodes corresponding to the grid that are not determined to be important grids are mesh-fused with their sibling nodes; Step 1 -5, generating the important grid tag sequence data according to the important grid table, wherein the important grid tag sequence data is represented by the important mesh included in the sequence data and the important grid in the new sampling period a chronological relationship; and steps 1-6, establishing a new probability tail tree in the new sampling period based on the important grid tag sequence data, wherein the new probability tail tree represents the important in the new sampling period a set of the moving patterns presented by the tag sequence data; and each node in the probability tail tree represents a moving pattern, the moving pattern indicating a sequence of tags included in the sequence data passing through the node After a number of important grids, the future will enter a certain important The conditional probability distribution of the grid; and the second exploration stage comprises the following steps: Step 2-1: merging the new grid hierarchy tree with the old accumulation grid hierarchy tree to form a new cumulative grid hierarchy tree, and merging the new network And the number of the basic data included in the grid-level leaf node corresponding to the same grid in the grid-level tree and the old accumulated grid-level tree, thereby forming the new accumulated grid-level tree; step 2-2 The calculation content of the step 1-4 determines whether the grid corresponding to the node of the new accumulated hierarchical tree is the number according to the number of the basic data included in the grid corresponding to the leaf node in the new accumulated grid tree The important grid and the important grid label are given; Step 2-3, the new probability tail tree generated by the new sampling period and the old accumulated probability tail tree generated by the old accumulated sampling period are generated, and the new accumulated probability tail is generated. The tree is set, and the new cumulative probability tail tree represents the set of moving patterns in the new accumulated sampling period.

For example, the dynamic granularity of the first application of the patent scope and the progressive sequence data mining method of the automatic label, the calculation of the overlapping node fusion estimated weight is based on the label sequence included in the two nodes, and the matching weights of the labels corresponding to each other are sequentially matched. Multiply, the matching weight of the tag is calculated according to the tag matching type of the tag, and includes the following three tag matching types: Label matching type 1. Single grid corresponds to a single grid: the grid corresponding to the two labels is the same, and the matching weight corresponding to the probability tailing tree fusion operation in different sampling periods is "1"; the label matching pattern 2. The multi-grid corresponds to a single grid: the grids corresponding to the two labels overlap, and the grid size corresponding to the first label is small, and the matching weight corresponding to the probabilistic tail-tree fusion operation of different sampling periods It is "1"; and the label matching type 3, the single grid corresponds to the multi-grid: the grid corresponding to the two labels overlaps partially, and the grid corresponding to the second label overlaps with the grid corresponding to the first label. The granularity and proportion of the important grid of the cycle.

For example, in the dynamic granularity and automatic label progressive sequence data mining method of claim 2, the candidate grid determines that the criterion of the important grid includes: the number of the basic data included in the candidate grid is greater than an important The grid threshold; and the number of the basic data of the candidate grid and its sibling grid is less than a chaotic threshold.

Such as the dynamic granularity of the patent application scope 3 and the progressive label data mining method of the automatic label, the candidate grid and the base of the sibling grid thereof When the amount of the underlying data is higher than the chaotic threshold, it can be merged into a larger grid.

For example, in the method of claim 1, 2, 3 or 4, the dynamic granularity and the automatic label progressive sequence data mining method, a time factor is added to the operation of the new probability tail tree and the old cumulative probability tail tree fusion. The important grid label sequence data obtained by different time periods is adjusted by the time factor by different time weights.

For example, in the fifth aspect of the patent application, the dynamic granularity and automatic label progressive sequence data mining method is established according to the structure of an old accumulated grid hierarchical tree established in the old accumulated sampling period. The new grid hierarchy tree in the new sampling period, and the important grid label sequence data is counted by the new grid hierarchy tree in the new sampling period.

A progressive sequence data mining system with dynamic granularity and automatic labeling, wherein an arithmetic device performs a progressive sequence data mining method with dynamic granularity and automatic labeling according to any one of the foregoing claims 1 to 6 wherein the operation The device comprises at least one parent computing device and a plurality of sub-computing devices. The arithmetic algorithm is distributed to different sub-operating device operations according to different sampling periods, and then integrated into a parent computing device for integration operations.