TWI506456B

TWI506456B - System and method for dispatching hadoop jobs in multi-cluster environment

Info

Publication number: TWI506456B
Application number: TW102118168A
Authority: TW
Inventors: Chun Hsiang Huang; Wei Ting Lin; Hsiu Min Lin; jing ying Huang; Ching Tang Tsai
Original assignee: Chunghwa Telecom Co Ltd
Priority date: 2013-05-23
Filing date: 2013-05-23
Publication date: 2015-11-01
Also published as: TW201445332A

Description

Work distribution system and method based on Hadoop multi-cluster environment

本發明係關於一種基於Hadoop多叢集環境的工作分派系統及方法，特別係指一種結合叢集特徵分析、工作特徵分析，對叢集與運算任務進行特徵的分析判斷，並透過執行環境的選擇進行運算任務的派送，以達成簡便且效率高之服務運算資源之調配。The invention relates to a work assignment system and method based on Hadoop multi-cluster environment, in particular to a combination cluster feature analysis, work feature analysis, analysis and judgment of cluster and operation tasks, and computing tasks through execution environment selection. Delivery to achieve a simple and efficient deployment of service computing resources.

近年來因為大量的資訊化，使得一般企業與政府機構面對的是爆炸性成長的資料量，無論是在資料儲存、資料庫或資料檢索與資料探勘的領域中，都遭遇相同的問題，資料過濾與整理的龐大且耗時的工作，已無法由一台超級電腦負荷，轉而導向透過大量的群組電腦同時進行運算，進而獲得最大的效益。現今的資訊領域採用雲端服務的技術提供分散式運算來解決上述的問題，其中又以Apache Hadoop為主要的開放原始碼解決方案之一。In recent years, due to the large amount of informationization, the average enterprise and government agencies are facing an explosive growth in data volume. In the fields of data storage, database or data retrieval and data exploration, the same problem is encountered. The huge and time-consuming work that has been done is no longer burdened by a supercomputer, but instead is directed to computing through a large number of group computers, so as to obtain maximum benefits. Today's information field uses cloud-based services to provide decentralized computing to solve the above problems, including Apache Hadoop as one of the main open source solutions.

Hadoop實做出一個分散式運算的處理框架概念稱為MapReduce，透過將對數據進行的運算工作分發給網路上的每個節點處理，每個節點會周期性的把完成的工作和狀態的更新報告回來，進而達成大規模的數據運算分析。在此處理框架之下，工作的排程與分派預設為FIFO(First In First Out)演算法，雖然架構上簡單，卻因此忽略運算工作本質上需求的差異，可能造成某項工作長期占用資源的情況。此外，系統參數的調校是否能與運算工作本質上的需求相符合，也是另一項在Hadoop系統當中相當重要的因素，但是若需要滿足此項條件，使用者往往需要針對不同的運算工作重新設定整體系統環境參數，以便讓整體系統的效能與運作可以配合運算工作的需求。由此可見，上述習用的方法仍有諸多缺失，實非一良善之設計者，而亟待加以改良。如何大量的群組電腦同時進行運算，卻又能針對運算工作進行最佳化，本案發明人鑑於上述習用方法所衍生的各項缺點，乃亟思加以改良創新，並經多年苦心孤詣潛心研究後，終於成功研發完成本件基於Hadoop多叢集環境的工作分派之系統。Hadoop actually makes a decentralized operation processing framework called MapReduce. By distributing the data processing work to each node on the network, each node periodically reports the completed work and status updates. Come back, and then achieve a large-scale data operation analysis. Processing here Under the framework, the scheduling and assignment of work is preset to the FIFO (First In First Out) algorithm. Although the architecture is simple, it ignores the difference in the essential requirements of the computing work, which may cause a certain task to occupy resources for a long time. In addition, whether the adjustment of the system parameters can be consistent with the essential requirements of the computing work is another important factor in the Hadoop system, but if this condition is met, the user often needs to re-work for different operations. Set the overall system environment parameters so that the overall system performance and operation can match the needs of computing work. It can be seen that there are still many shortcomings in the above-mentioned methods, which are not a good designer, but need to be improved. How can a large number of group computers perform calculations at the same time, but they can optimize the computing work. The inventors of the present invention have improved and innovated in view of the shortcomings derived from the above-mentioned conventional methods, and after years of painstaking research. Finally, I successfully developed this system based on Hadoop multi-cluster environment.

本發明之目的即在於提供一種裝置與系統，特別是應用在多個大量資料處理的分散式電腦叢集，能夠根據執行程式特徵，待處理資料特性，與電腦叢集的動態行為，選擇最佳的執行環境。可以降低不同運算特性的工作之排程等待時間，有效的加快運算分析的速度，並提升整體資源使用效率。The object of the present invention is to provide an apparatus and system, in particular, a distributed computer cluster applied to a plurality of large-scale data processing, which can select the best execution according to the characteristics of the execution program, the characteristics of the data to be processed, and the dynamic behavior of the computer cluster. surroundings. It can reduce the scheduling waiting time of work with different computing characteristics, effectively speed up the calculation and analysis, and improve the overall resource utilization efficiency.

可達成上述發明目的之基於Hadoop多叢集環境的工作分派系統及方法，係利用一組叢集特徵與監控模組、工作資料與程式分析模組以及執行環境選擇模組的結合，提供最佳化的Hadoop多叢集環境工作分派系統給使用者執行大資料運算服務。其方法係透過掌握叢集特徵、監控叢集運作情形、分析運算資料特性與程式運算特性等影響參數，進而運算比對找出最合適的叢集，再透過執行環境選擇模組找到對應的叢集，並將用戶工作，包含用戶程式與輸入資料派送到對應的叢集中執行。A work distribution system and method based on the Hadoop multi-cluster environment that achieves the above objects, and utilizes a combination of a cluster feature and a monitoring module, a work data and a program analysis module, and an execution environment selection module to provide optimization. The Hadoop multi-cluster environment work dispatching system performs large data computing services for users. The method is to grasp the cluster characteristics and monitor the cluster operation. Shape, analysis of the operating data characteristics and program operation characteristics and other influencing parameters, and then the operation to find the most appropriate cluster, and then through the execution environment selection module to find the corresponding cluster, and the user work, including user program and input data is sent to The corresponding cluster is executed.

1‧‧‧工作分派系統1‧‧‧Work assignment system

11‧‧‧特徵資料庫模組11‧‧‧Characteristic database module

12‧‧‧叢集特徵模組12‧‧‧ Cluster Feature Module

13‧‧‧叢集監控模組13‧‧‧ Cluster Monitoring Module

14‧‧‧工作資料分析模組14‧‧‧Work data analysis module

15‧‧‧工作程式分析模組15‧‧‧Working program analysis module

16‧‧‧執行環境選擇模組16‧‧‧Execution Environment Selection Module

2‧‧‧用戶操作介面2‧‧‧User interface

3‧‧‧客戶程式3‧‧‧Client

4‧‧‧輸入資料4‧‧‧ Input data

5‧‧‧迷你叢集5‧‧‧Mini Cluster

6‧‧‧主機叢集6‧‧‧ Host cluster

第1圖為本發明之基於Hadoop多叢集環境的工作分派系統架構圖；第2圖為本發明基於Hadoop多叢集環境的工作分派系統之運作流程圖；第3圖為本發明基於Hadoop多叢集環境的工作分派系統之執行環境選擇流程圖。1 is a schematic diagram of a work distribution system based on a Hadoop multi-cluster environment of the present invention; FIG. 2 is a flowchart of operation of a work distribution system based on a Hadoop multi-cluster environment; FIG. 3 is a Hadoop multi-cluster environment according to the present invention; The execution environment selection flow chart of the work distribution system.

如第1圖所示，為本發明基於Hadoop多叢集環境的工作分派系統及方法之一種實施範例的架構示意圖，包括有：一特徵資料庫模組11，係用以儲存後述叢集特徵模組12、叢集監控模組13、工作資料分析模組14、工作程式分析模組15的矩陣方程式；一叢集特徵模組12，係用以收集叢集中不會隨著時間改變的靜態特徵，並以叢集靜態特徵矩陣方程式來描述其收集到的靜態特徵；一叢集監控模組13，用以定期收集每個叢集的動態特徵，並分析動態特徵曲線，以建立叢集動態特徵矩陣方程式來描述叢集特徵分析結果；一工作資料分析模組14，係用以收集工作執行中不會隨著時間改變的靜態特徵，並以工作靜態特徵矩陣方程式來描述其收集到的靜態特徵；一工作程式分析模組15，係用以分析用戶程式在執行時使用資源的情形，主要用以建立工作動態特徵矩陣方程式來描述用戶程式行為特徵；一執行環境選擇模組16，係用以由工作程式分析模組15與叢集特徵模組12建立的矩陣方程式中選出最適合用戶工作之叢集，並將其送往對應的叢集；本發明基於Hadoop多叢集環境的工作分派系統運作流程如第2圖所示，客戶將其工作(包含客戶程式3與輸入資料4)透過用戶操作介面2送至Hadoop多叢集環境的工作分派系統1，工作分派系統1由客戶工作特性與各叢集6特性找出最適合之叢集在將其送往此叢集執行，工作分派系統1中各個模組之說明如下。FIG. 1 is a schematic structural diagram of an implementation example of a work distribution system and method based on a Hadoop multi-cluster environment according to the present invention, including: a feature database module 11 for storing a cluster feature module 12 described later. , the cluster monitoring module 13, the work data analysis module 14, the work program analysis module 15 matrix equation; a cluster feature module 12, is used to collect static features that the cluster does not change over time, and clustered The static feature matrix equation is used to describe the collected static features; a cluster monitoring module 13 is used to periodically collect the dynamic features of each cluster and analyze the dynamic characteristic curves to establish a cluster dynamic feature matrix equation to describe the cluster feature analysis results. ; A work data analysis module 14 is configured to collect static features that do not change over time during work execution, and describe the collected static features by working static feature matrix equations; a work program analysis module 15, The method for analyzing the use of resources by the user program during execution is mainly used to establish a working dynamic characteristic matrix equation to describe the user program behavior feature; an execution environment selection module 16 is configured to analyze the module 15 and the cluster feature by the working program. The cluster equation established by the module 12 selects the cluster most suitable for the user's work and sends it to the corresponding cluster; the working process of the work distribution system based on the Hadoop multi-cluster environment of the present invention is shown in FIG. 2, and the customer works ( The work distribution system 1 including the client program 3 and the input data 4) is sent to the Hadoop multi-cluster environment through the user operation interface 2, and the work distribution system 1 finds the most suitable cluster by the customer work characteristics and the cluster 6 characteristics, and sends it to the cluster. This cluster is executed, and the description of each module in the work distribution system 1 is as follows.

首先，叢集監控模組13定期收集每個叢集的動態特徵(例如CPU頻率(GHz)、Disk空間、Memory的使用量)，並針對動態特徵曲線進行分析，將分析結果轉換成叢集動態特徵矩陣方程式，再儲存在特徵資料庫模組11。舉例來說，定期收集N個叢集(C_1…C_N)的n個動態特徵，如每秒CPU頻率(GHz)的使用量(%)、Disk空間的使用量(%)等，並以矩陣表示： First, the cluster monitoring module 13 periodically collects the dynamic characteristics of each cluster (such as CPU frequency (GHz), Disk space, Memory usage), and analyzes the dynamic characteristic curve to convert the analysis result into a cluster dynamic characteristic matrix equation. And stored in the feature database module 11. For example, periodically collect n dynamic features of N clusters (C_1...C_N), such as CPU usage (%) per second (%), usage (%) of Disk space, etc., and represent them in a matrix:

每個叢集各取時間間隔(t ₁ ~t _k )，其中k為間隔總數，計算出每個時間間隔的平均使用量，並以n×k 矩陣表示： Each cluster takes a time interval ( t ₁ ~ t _k ), where k is the total number of intervals, and the average usage of each time interval is calculated and expressed in an n × k matrix:

再將每個叢集在各時間間隔的1×n 矩陣儲存到特徵資料庫模組11。Each cluster is stored in the feature database module 11 at a 1×n matrix of each time interval.

而叢集特徵模組12主要負責收集叢集中不會隨著時間改變的靜態特徵，例如CPU核心數、CPU頻率(GHz)、Disk空間大小、Memory大小等規格，並將收集到的資料轉換成叢集靜態特徵矩陣方程式儲存在特徵資料庫模組11中，第i個叢集的矩陣方程式以1xn矩陣表示： The cluster feature module 12 is mainly responsible for collecting static features that are not changed over time in the cluster, such as CPU core number, CPU frequency (GHz), Disk space size, Memory size, etc., and converting the collected data into clusters. The static feature matrix equation is stored in the feature database module 11, and the matrix equation of the i-th cluster is represented by a 1xn matrix:

例如： E.g:

當有新叢集加入系統時，叢集特徵模組12會收集其靜態特徵，並同樣儲存在特徵資料庫模組11。When a new cluster is added to the system, the cluster feature module 12 collects its static features and also stores them in the feature database module 11.

工作資料分析模組14收集工作執行中不會隨著時間改變的靜態特徵，例如總資料量大小、總資料筆數、資料格式型態、是否壓縮等，描述工作的靜態特徵。當有新工作進入工作分派系統時，工作資料分析模組14會收集其靜態特徵，並將收集到的資料轉換成工作靜態特徵矩陣方程式儲存在特徵資料庫模組11中，工作靜態特徵矩陣方程式以1xn矩陣表示：[Js ₁ Js ₂ Js ₃ …Js _n ] (6) The work profile analysis module 14 collects static features that do not change over time during the execution of the work, such as the total amount of data, the total number of data, the format of the data format, whether it is compressed, etc., describing the static characteristics of the work. When a new job enters the work distribution system, the work data analysis module 14 collects its static features, and converts the collected data into a working static feature matrix equation stored in the feature database module 11, working static characteristic matrix equation Expressed in a 1xn matrix: [Js ₁ Js ₂ Js ₃ ... Js _n ] (6)

例如：[總資料筆數] _1×n (7) For example: [Total data] _1×n (7)

工作程式分析模組15，係用以分析客戶程式3在執行時使用資源的情形，係工作特徵分析模組中的一子模組，主要用以建立矩陣方程式來描述客戶程式3行為特徵；用戶於用戶操作介面2提交客戶程式3與輸入資料4的存放路徑後，工作程式分析模組15從輸入資料4擷取固定筆數的資料當作樣本，將客戶程式3與輸入資料樣本上載至迷你叢集5，要求迷你叢集5啟動程式開始處理輸入資料樣本，記錄客戶程式3在處理固定筆數資料樣本時使用迷你叢集5資源的情形(例如中央處理器、記憶體、檔案讀取與寫入要求、網路封包讀取與寫入要求)與花費時間，並將收集到的資料轉換成工作動態特徵矩陣方程式儲存在特徵資料庫模組11中，工作動態特徵矩陣方程式以1xn矩陣表示，其中Jd _n 為第n個工作動態特徵參數：[Jd ₁ Jd ₂ Jd ₃ …Jd _n ] (8) The work program analysis module 15 is used to analyze the situation in which the client 3 uses resources during execution, and is a sub-module in the work feature analysis module, which is mainly used to establish a matrix equation to describe the behavior characteristics of the client 3; After the user operation interface 2 submits the storage path of the client 3 and the input data 4, the work program analysis module 15 takes a fixed number of data from the input data 4 as a sample, and uploads the client 3 and the input data sample to the mini. Cluster 5, requires the Mini Cluster 5 startup program to start processing input data samples, and record the case where the client 3 uses the Mini Cluster 5 resource when processing a fixed number of data samples (eg, CPU, memory, file read and write requirements). , network packet read and write requirements) and time spent, and the collected data is converted into a working dynamic characteristic matrix equation stored in the feature database module 11, the working dynamic characteristic matrix equation is represented by a 1xn matrix, where Jd _n is the nth working dynamic characteristic parameter: [Jd ₁ Jd ₂ Jd ₃ ... Jd _n ] (8)

例如： E.g:

執行環境選擇模組16之運作流程如第3圖所示，首先從特徵儲存庫模組11取得叢集監控模組13、叢集特徵模組12、工作資料分析模組14與工作程式分析模組15分析之叢集靜態特徵矩陣方程式、叢集動態特徵矩陣方程式、工作靜態特徵矩陣方程式[Js ₁ Js ₂ Js ₃ …Js _n ] 與工作動態特徵矩陣方程式[Jd ₁ Jd ₂ Jd ₃ …Jd _n ] 並計算出用戶程式特徵矩陣方程式與對應各叢集的叢集特徵矩陣方程式如第(10)與(11)式所示： The operation flow of the execution environment selection module 16 is as shown in FIG. 3 . First, the cluster monitoring module 13 , the cluster feature module 12 , the work data analysis module 14 and the work program analysis module 15 are obtained from the feature repository module 11 . Analysis cluster static characteristic matrix equation Cluster dynamic characteristic matrix equation The working static characteristic matrix equation [Js ₁ Js ₂ Js ₃ ... Js _n ] and the working dynamic characteristic matrix equation [Jd ₁ Jd ₂ Jd ₃ ... Jd _n ] and calculate the user program characteristic matrix equation and the cluster feature matrix corresponding to each cluster The equation is as shown in equations (10) and (11):

其中F₁ ^job 代表用戶程式特徵矩陣方程式中的第一個特徵，其值為工作靜態特徵矩陣方程式第一項Js ₁ 與工作動態特徵矩陣方程式第一項Jd ₁ 相乘之結果，後面依此類推，共有n個特徵值，而則是代表第i個叢集的叢集特徵矩陣方程式的第一個特徵，同樣也有n個特徵值，由於叢集動態特徵矩陣方程式的值為當時叢集的平均使用率，而我們分析需要的為叢集的剩餘使用率，所以以(1-)計算出叢集剩餘使用率，其值是由叢集靜態特徵矩陣方程式第一項 與第一項叢集動態特徵矩陣剩餘使用率(1-)相乘而來，這裡以叢集1為例，其叢集特徵矩陣方程式即為，為了避免混淆之後我們以J表示用戶程式特徵矩陣方程式而Ci表示第i個叢集的叢集特徵矩陣方程式，如第3圖說明第二步是針對Ci做分群的動作，首先將不適合的叢集過濾掉，由於部分用戶程式特徵有下限值，若叢集對應的特徵值低於下限值這些用戶程式就無法在叢集上執行，舉例來說用戶程式特徵中有disk使用量，若叢集特徵中的disk剩餘量低於用戶程式所需disk使用量時，此叢集就不適合執行此用戶程式。在判別不適合的叢集可透過比較用戶程式特徵矩陣方程式與各叢集特徵矩陣方程式，若其中的元素屬於有下限值的特徵，且叢集特徵矩陣方程式元素小於用戶程式特徵矩陣方程式，表示此叢集特徵矩陣方程式對應的叢集不適合執行目前的用戶程式，於是不適合之叢集集合Cluster_unsuitable 的表示如第(12)式所示： Where F ₁ ^job represents the first feature in the user program characteristic matrix equation, and the value is the result of multiplying the first term Js ₁ of the working static feature matrix equation with the first term Jd _{1 of the} working dynamic characteristic matrix equation, and so on. , there are n eigenvalues, and Then it is the first feature of the cluster feature matrix equation representing the i-th cluster, and there are also n eigenvalues. Since the value of the cluster dynamic feature matrix equation is the average usage of the cluster at the time, we need the remainder of the cluster. Usage rate, so take (1- Calculate the remaining usage of the cluster, the value of which is the first term of the cluster static characteristic matrix equation Residual usage rate with the first cluster dynamic feature matrix (1- Multiply, here is cluster 1 as an example, and the cluster characteristic matrix equation is In order to avoid confusion, we denote the user program characteristic matrix equation with J and Ci denote the cluster feature matrix equation of the i-th cluster. As shown in Figure 3, the second step is to perform clustering action on Ci. First, filter the clusters that are not suitable. Since some user program features have a lower limit value, if the feature value corresponding to the cluster is lower than the lower limit value, the user program cannot be executed on the cluster. For example, the user program feature has disk usage, if the disk in the cluster feature This cluster is not suitable for executing this user program when the remaining amount is lower than the disk usage required by the user program. In the discriminating unsuitable cluster, the user program characteristic matrix equation and the cluster feature matrix equation can be compared. If the element belongs to the feature with the lower limit value, and the cluster feature matrix equation element is smaller than the user program characteristic matrix equation, the cluster feature matrix is represented. the equation for the corresponding cluster is not currently executed user program, so it is not suitable for the set of cluster _cluster unsuitable as a first representation (12) represented by the formula:

其中Cluster _all 代表所有叢集特徵矩陣方程式集合，L代表有下限值的特徵集合，表示Cⁱ 叢集的叢集特徵矩陣方程式的第j個元素，而F_j ^job 為用戶程式特徵矩陣方程式的第j個元素，過濾掉不適合的叢集後針對剩餘的叢集特徵方程式再將其分為最優先叢集特徵矩陣方程式集合與次優先叢集特徵矩陣方程式集合，首先最優先叢集特徵矩陣方程式集合在這裡定義為叢集特徵矩陣方程式的各特徵元素皆滿足用戶程式特徵方程式的所有元素，剩餘叢集特徵矩陣方程式則是次優先叢集特徵矩陣方程式集合，這兩個集合定義如下： Where Cluster _all represents a set of all cluster feature matrix equations, and L represents a feature set with a lower limit value. The jth element of the cluster characteristic matrix equation representing the C ⁱ cluster, and F _j ^job is the jth element of the user program characteristic matrix equation. After filtering out the unsuitable cluster, it is divided into the highest priority for the remaining cluster characteristic equations. Cluster feature matrix set and sub-priority cluster feature matrix set, first highest priority cluster feature matrix equation set here is defined as cluster feature matrix equations, each feature element satisfies all elements of the user program characteristic equation, and the remaining cluster feature matrix equation This is the set of priority cluster feature matrix equations, which are defined as follows:

ClusterCluster _{second priortySecond priorty} =Cluster=Cluster _allAll -(Cluster-(Cluster _{first priortyFirst priorty} U ClusterU Cluster _{unsuitableUnsuitable} ) (14)) (14)

其中Cluster _{first priorty} 為最優先叢集特徵矩陣方程式集合而Cluster _{second priorty} 為次優先叢集特徵矩陣方程式集合，將叢集特徵矩陣方程式分群後，下一步開始從中選擇目標叢集，選擇目標叢集可分為以下步驟： Cluster _{first priorty} is the highest priority cluster feature matrix equation set and Cluster _{second priorty} is the sub-priority cluster feature matrix equation set. After the cluster feature matrix equations are grouped, the next step is to select the target cluster, and the target cluster can be divided into the following steps:

A.檢查最優先叢集特徵矩陣方程式集合，若非空集合，則從集合中選擇最適合之叢集特徵矩陣方程式，在這裡用戶程式特徵矩陣方程式可視為一存在於n維空間的向量，同時最優先叢集特徵矩陣方程式集合也可視為多組存在於n為空間的向量集合，於是利用向量的距離做為選擇上的依據；一般而言距離越大代表叢集會有更充裕的資源供用戶程式執行，但本發明的目的在於降低工作等待(wait to run)時間，為了避免大量的用戶程式都配置到某個特定的叢集上降低了執行效率，於是在此選擇了向量距離最近，也就是最符合當時用戶程式執行的叢集，選擇如(15)所示：其中dist(C ⁱ ,J) 為叢集特徵矩陣方程式與用戶程式特徵矩陣方程式的向量距離，算法如下式所示： A. Check the set of the most preferred cluster feature matrix equations. If it is not an empty set, select the most suitable cluster feature matrix equation from the set. Here, the user program feature matrix equation can be regarded as a vector existing in n-dimensional space, and the highest priority cluster. The set of characteristic matrix equations can also be regarded as multiple sets of vectors existing in n space, so the distance of the vector is used as the basis for selection; generally speaking, the larger the distance, the more resources the cluster will have for the user to execute, but The purpose of the present invention is to reduce the wait to run time. In order to avoid a large number of user programs being configured to a specific cluster, the execution efficiency is lowered, so that the vector distance is selected closest to the user at that time. The cluster of program execution, as shown in (15): Where dist(C ⁱ , J) is the vector distance between the cluster feature matrix equation and the user program feature matrix equation. The algorithm is as follows:

B.如最優先叢集特徵矩陣方程式集合不存在任何的叢集特徵方程式，則從次優先叢集特徵矩陣方程式集合中做選擇，在這集合中皆是無法完全滿足用戶程式的叢集集合但還是可以順利完成用戶工作的需求，這裡的選擇方法如第一步相同，將各矩陣方程式視為存在於n維空間的向量，為了避免因叢集特徵與用戶程式特徵差別過大導致用戶程式執行時間過長，這邊一樣是以選擇兩向量空間距離最少做為選擇的依據，選擇如(17)式所示： B. If there is no cluster characteristic equation in the set of the most preferential cluster feature matrix, then the selection is made from the set of sub-priority cluster feature matrix, in which the cluster set of the user program cannot be fully satisfied but can be successfully completed. The user's work needs, the selection method here is the same as the first step, and the matrix equations are regarded as the vectors existing in the n-dimensional space. In order to avoid the user program execution time is too long due to the difference between the cluster features and the user program features, this is too long. The same is to choose the minimum distance between the two vector spaces as the basis for selection, as shown in (17):

C.如最優先與次優先叢集特徵矩陣方程式集合皆不存在任何的叢集特徵方程式，則表示目前所有存在的叢集皆不適合執行用戶工作，此時執行環境選擇模組退回用戶工作要求，並通知使用者。C. If there is no cluster characteristic equation in the set of the highest priority and the second priority cluster feature matrix, it means that all the existing clusters are not suitable for performing user work. At this time, the execution environment selection module returns the user work request and notifies the use. By.

如有找出最適合的叢集特徵矩陣方程式，執行環境選擇模組16由選出的叢集特徵矩陣方程式找到對應的叢集，並將客戶工作，包含客戶程式3與輸入資料4派送到對應的叢集中執行。If the most suitable cluster feature matrix equation is found, the execution environment selection module 16 finds the corresponding cluster from the selected cluster feature matrix equation, and sends the client work, including the client 3 and the input data 4, to the corresponding cluster to execute. .

本發明所提供之基於Hadoop多叢集環境的工作分派之系統，與其他習用技術相互比較時，更具有下列之優點：The system of work assignment based on Hadoop multi-cluster environment provided by the invention has the following advantages when compared with other conventional technologies:

1.本發明可根據待處理資料特性、運算程式之特徵與電腦叢集的動態行為，提供最佳化的執行環境給使用者，有效降低工作等待時間，提供可行、可靠、高效率之運算服務。1. The invention can provide an optimized execution environment to the user according to the characteristics of the data to be processed, the characteristics of the computing program and the dynamic behavior of the computer cluster, effectively reducing the waiting time of the work, and providing a feasible, reliable and efficient computing service.

2.本發明的工作分派之系統與方法可根據待處理資料特性進而充分使用運算設備硬體資源，降低運算服務建置成本，確保服務的穩定性與可靠性，解決運算工作本質上需求的差異之問題，進而提昇整體服務速度與效率，其經濟效益非常明顯。2. The system and method for the work assignment of the present invention can fully utilize the hardware resources of the computing device according to the characteristics of the data to be processed, reduce the cost of computing service construction, ensure the stability and reliability of the service, and solve the difference in the essential requirements of the computing work. The problem, and thus the overall service speed and efficiency, its economic benefits are very obvious.

上列詳細說明係針對本發明之一可行實施例之具體說明，惟該實施例並非用以限制本發明之專利範圍，凡未脫離本發明技藝精神所為之等效實施或變更，均應包含於本案之專利範圍中。The detailed description of the preferred embodiments of the present invention is intended to be limited to the scope of the invention, and is not intended to limit the scope of the invention. The patent scope of this case.

綜上所述，本案不但在空間型態上確屬創新，並能較習用物品增進上述多項功效，應已充分符合新穎性及進步性之法定發明專利要件，爰依法提出申請，懇請貴局核准本件發明專利申請案，以勵發明，至感德便。In summary, this case is not only innovative in terms of space type, but also can enhance the above-mentioned multiple functions compared with the customary items, and should fully comply with the novelty and progress. The statutory invention patent requirements of the step, apply in accordance with the law, and ask your office to approve the invention patent application, in order to invent invention, to the sense of virtue.

1‧‧‧工作分派系統1‧‧‧Work assignment system

11‧‧‧特徵資料庫模組11‧‧‧Characteristic database module

12‧‧‧叢集特徵模組12‧‧‧ Cluster Feature Module

13‧‧‧叢集監控模組13‧‧‧ Cluster Monitoring Module

14‧‧‧工作資料分析模組14‧‧‧Work data analysis module

15‧‧‧工作程式分析模組15‧‧‧Working program analysis module

2‧‧‧用戶操作介面2‧‧‧User interface

3‧‧‧客戶程式3‧‧‧Client

4‧‧‧輸入資料4‧‧‧ Input data

5‧‧‧迷你叢集5‧‧‧Mini Cluster

6‧‧‧主機叢集6‧‧‧ Host cluster

Claims

A system for job assignment based on Hadoop multi-cluster environment includes: a feature database module for storing static and dynamic feature matrix equations of clusters and working static and dynamic feature matrix equations; a cluster feature module, It is mainly responsible for analyzing the static characteristics of each cluster; a cluster monitoring module is mainly responsible for analyzing the dynamic characteristics of each cluster; a working data analysis module is mainly responsible for analyzing the static characteristics of the computing work; a working program analysis module is used for Analyze the situation in which the user program uses resources during execution; and an execution environment selection module is used to select the cluster that is most suitable for the user's work and send it to the corresponding cluster for execution.

According to the Hadoop multi-cluster environment-based work dispatching system described in claim 1, the cluster monitoring module periodically collects dynamic features of each cluster, analyzes the dynamic characteristic curves, and converts the analysis results into a cluster dynamic feature matrix. The equation is stored in the feature database module.

The system of work assignment based on Hadoop multi-cluster environment according to claim 1, wherein the cluster feature module is mainly responsible for analyzing static features that do not change over time in the cluster, and establishing a matrix equation to describe static features of the cluster; When a new cluster is added to the system, the cluster feature module analyzes its static features and converts the data into a matrix equation stored in the feature database module.

The system of work assignment based on the Hadoop multi-cluster environment described in claim 1, wherein the work data analysis module is mainly responsible for analyzing data characteristics and static features in the execution of the calculation work, and establishing a matrix equation to describe the work. The static feature; when a new job enters the work distribution system, the data analysis module analyzes its static features and converts the data into a matrix equation stored in the feature database module.

The system for scheduling work based on a Hadoop multi-cluster environment according to claim 1, wherein the worker program analysis module is configured to analyze a situation in which a client uses resources when processing data, and spends time, and converts the collected data into The working dynamic characteristic matrix equation is stored in the feature database module.

The system for scheduling work based on a Hadoop multi-cluster environment according to claim 1, wherein the execution environment selection module obtains a cluster monitoring module, a cluster feature module, a work data analysis module, and a work program from a feature database module. Analyze the results of the module analysis and use the user program feature matrix equation to work with the user, including the user program and input data to the corresponding cluster.

A work assignment method based on a Hadoop multi-cluster environment includes the following steps: obtaining a result of a cluster monitoring module, a cluster feature module, a work data analysis module, and a work program analysis module from a feature database module; calculating a user program feature The matrix equation and the cluster feature matrix equation corresponding to each cluster; the cluster feature matrix equations corresponding to each cluster are classified into the highest priority cluster feature matrix equation set, the second priority cluster feature matrix equation set and the unsuitable cluster feature matrix through the user program characteristic matrix equation A set of equations; wherein the user program characteristic matrix equation and each cluster feature matrix equation are compared, and if the element belongs to a feature having a lower limit value, and the cluster feature matrix equation element is smaller than the user program feature matrix equation, the cluster feature matrix is classified as unsuitable Equation; filter out discomfort After the cluster characteristic matrix equation is combined, all the elements of the cluster characteristic matrix equation satisfy the user program characteristic equation and all the elements are classified into the highest priority cluster feature matrix equation set, and the remaining cluster feature matrix equation classifies the sub-priority cluster feature matrix equation set; If the set of the most preferential cluster feature matrix is not an empty set, the most suitable cluster feature matrix equation is selected according to the user program feature matrix equation from the highest priority cluster feature matrix equation set; if the most preferential cluster feature matrix equation set is an empty set, then check Whether the set of sub-priority cluster feature matrix is an empty set, if not an empty set, then select a suitable cluster feature matrix equation; calculate the corresponding cluster by the selected cluster feature matrix equation, and work the user, including user program and input The data is sent to the corresponding cluster for execution; if the highest priority and the second priority cluster feature matrix are all empty sets, it means that all existing clusters are not suitable for performing user work. User job requirements, and notify the user.