1355168

IX. Description of the Invention:

[Technical Field of the Invention]
The present invention relates to a method of classifying network traffic, and in particular to a method of identifying, within network traffic, the application to which the traffic belongs.

[Prior Art]
In the analysis of network traffic, one detection method is already known.
This method uses variations in packet inter-arrival time and packet size to determine the type of traffic. However, it can only determine whether the traffic is an instant messaging (IM) transmission, a command transmission of some application, or a data transmission of some application; it cannot identify which application the traffic belongs to.

Traditional network detection techniques all rely on matching an application's well-known port numbers and packet content signatures. This approach has two known drawbacks: (1) it cannot detect applications that choose their port numbers dynamically, and (2) if the application encrypts the packet content, the content cannot be identified by signature matching.

Another approach, based on P2P flow patterns, first checks whether two endpoints communicate over both TCP and UDP. It then filters out the connections of well-known applications (e.g., HTTP, SMTP, FTP). If the number of connections between the two endpoints equals the number of port pairs, these connections are treated as P2P traffic.
The limitation of this approach is that much current P2P traffic runs on well-known ports, so eliminating well-known ports cannot guarantee detection accuracy for P2P applications.

Furthermore, a method called BLINC analyzes traffic at three levels (social, functional, and application). The social level identifies which destinations each source communicates with. If a group of sources communicates with the same large set of destinations at the same time, the traffic is likely to be virus attack traffic (e.g., Blaster); if the number of destinations is normal, it is likely that a group of users is browsing the same website or using a streaming application. The functional level determines whether a host's role leans toward server, client, or P2P. The application level uses the 4-tuple of source IP, destination IP, source port, and destination port to further determine which application the traffic belongs to. Although this method is highly accurate, it is quite time-consuming because it requires a large amount of transport-layer information for its analysis.
U.S. Patent No. 6,157,955 proposes an architecture that adds a classification engine to a network interface in order to analyze and classify packets on the network. The classification engine has two main parts: packet header parsing and hash table lookups. Which application packets may pass is determined by definitions set on the host side. The patent provides a flexible mechanism in which new filtering policies can be added at will and decided dynamically, minimizing the amount of packet content that must be inspected. Its architecture is similar to that of the present application (a classification mechanism is used to classify traffic in the network), but it provides no further mechanism for detecting applications that use encrypted protocols.
U.S. Patent No. 6,597,660 proposes an architecture for analyzing, predicting, and distinguishing network traffic. It includes a device that stores and processes packet timing information at different points in time and over different time ranges.
The device accumulates the timing information of the packets it receives and then classifies packets using the resulting statistics. Both that patent and the present application base their decisions on statistically computed information; however, that patent uses packet arrival times, which differs from the present application. In addition, that patent cannot detect encrypted packets.

U.S. Patent No. 6,754,662 classifies packets using a dispatch engine and a set of hash tables. After receiving a packet, the engine computes a hash value from the packet's information and uses that value as an index to look up the stored hash table. Hash table entries are ranked according to characteristics of the network traffic, including access frequency, most recent access time, application type, and flow length, to decide how long an entry is kept in the hash table.

U.S. Patent No. 6,839,751 uses a packet acquisition device and a database. The database mainly stores conversational flows that have already been processed. After the packet acquisition device receives a packet, it queries the database to check whether the flow has already been processed. If so, it updates the database using statistics that include the total number of packets in the connection, the packet arrival time, and the difference between the arrival time of this packet and that of the previously received packet. If not, it adds a new entry to the database.

Traditional detection and analysis techniques identify applications by their well-known ports. However, much of the harmful traffic now found on networks uses dynamic ports and cannot be identified this way.

The widely used method of matching packet content signatures cannot inspect the packet content of the growing number of P2P/IM programs that encrypt their packets, which leaves a gap in network management.
In addition, some malware disguises its packet content to evade detection by content signature matching, so traditional methods may block legitimate traffic by mistake or fail to block malicious traffic. Present-day packet content inspection also raises personal privacy concerns.

Current transport-layer feature matching methods share the drawback that enough transport-layer information must be collected before an accurate decision can be made. The decision therefore takes too long for use on gateways or firewalls that must decide network traffic management policy quickly.

[Summary of the Invention]
To solve the above problems, one object of the present invention is to provide a way to detect applications that encrypt or deliberately hide their communication protocols, so as to give network administrators sufficient information for handling traffic.

To achieve the above object, a method according to an embodiment of the present invention includes: [...]; and, if the connection is not present in the port association table, [...].

[Embodiments]
FIG. 1 shows the steps of a network traffic classification method according to an embodiment of the present invention, comprising a training process in a first phase 100 and a classification process in a second phase 200.

In the training process of the first phase 100, recorded traffic is analyzed and grouped by application to obtain representative feature values for each group. The process begins with step 110, traffic collection, in which traffic of the applications to be matched is collected until enough packets have been obtained (more than 400 packets are needed). Step 120, connection characterizing, follows.
In step 120, the traffic is split into multiple connections. In step 130, computing the application's representative feature values, each connection is treated as a processing unit and its feature values are computed, comprising the dominating size (DS), the dominating size proportion (DSP), and the change cycle (CC). Finally, step 140 produces a set of application representative feature values and stores the values computed in the preceding steps, to serve as the baseline against which traffic is matched online in the classification process of the second phase 200.

In step 110, traffic collection, an application traffic collection technique based on the concept of a network traffic filter is used. The application to be matched is run, and the application and the ports it uses are specified, so that only packets of the required application can pass through the network interface. The required traffic is then recorded at the network's ingress/egress point with a packet capture technique for later analysis.
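The port-based filtering idea in step 110 can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the packet records are assumed to be plain dictionaries with hypothetical `src_port` and `dst_port` keys, and the function name and the `min_packets` parameter are likewise assumptions.

```python
def collect_application_traffic(packets, app_ports, min_packets):
    """Keep only packets that use one of the application's ports.

    packets     -- iterable of dicts with "src_port" and "dst_port" (assumed layout)
    app_ports   -- set of port numbers the application under study is known to use
    min_packets -- number of packets that must be exceeded before training
    Returns (kept_packets, enough), where enough is True once more than
    min_packets packets have been collected.
    """
    kept = [
        pkt for pkt in packets
        if pkt["src_port"] in app_ports or pkt["dst_port"] in app_ports
    ]
    return kept, len(kept) > min_packets
```

In practice the filtering would happen at capture time rather than on an in-memory list; the list form is used here only to keep the sketch self-contained.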
In step 120, connection characterizing, the recorded traffic is classified by source IP, source port, destination IP, and destination port, and split into multiple connections. Each connection is treated as a processing unit, and its feature value is computed. The feature value is a vector comprising the dominating size (DS), the dominating size proportion (DSP), and the change cycle (CC). The dominating sizes are the packet sizes that account for the larger shares of packets in the connection, and the dominating size proportions are the corresponding shares. The change cycle serves as an auxiliary basis for identification when the packet sizes within a connection vary sharply.

In step 130, computing the application's representative feature values, once the feature values of the individual connections are available, candidate representative feature values are derived from the connections that belong to the same session.
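The per-connection characterization of step 120 can be sketched as follows. This is a hedged illustration under stated assumptions: packets are dictionaries with hypothetical keys, each direction of a 4-tuple is treated as its own connection, and `top_k` (how many dominating sizes to keep) is an assumed parameter. The change cycle (CC) is left out because the text does not say how it is computed.

```python
from collections import Counter, defaultdict


def split_into_connections(packets):
    """Group packet sizes by (src_ip, src_port, dst_ip, dst_port)."""
    connections = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["src_port"], pkt["dst_ip"], pkt["dst_port"])
        connections[key].append(pkt["size"])
    return dict(connections)


def dominating_sizes(sizes, top_k=2):
    """Return [(size, proportion), ...] for the top_k most frequent sizes."""
    counts = Counter(sizes)
    total = len(sizes)
    # Most frequent first; ties broken by smaller size so the order is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(size, count / total) for size, count in ranked[:top_k]]


def feature_vector(sizes, top_k=2):
    """Flatten the DS/DSP pairs into one vector, zero-padded to 2 * top_k."""
    pairs = dominating_sizes(sizes, top_k)
    pairs += [(0, 0.0)] * (top_k - len(pairs))
    return [value for pair in pairs for value in pair]
```

For a connection whose packets are three of 60 bytes and one of 1500 bytes, `feature_vector` gives `[60, 0.75, 1500, 0.25]`.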
In this embodiment, the feature values of the connections are averaged, and the resulting average is used as the representative feature value of a given class of application.

Next, in the classification process of the second phase 200, the application representative feature values obtained from the training process of the first phase 100 serve as the baseline for matching against real traffic on the network, and the distance between feature values is used to infer which application the collected packets belong to. The process includes: step 205, taking in real traffic from the network; step 210, traffic splitting, in which the traffic is split into multiple connections and the features of each connection are computed as in step 120 of the first phase; and step 220, the port association table (PAT), in which the PAT is searched for an existing entry for the connection. If there is none, the process enters step 230, packet identification: the feature value of each connection is computed and compared, by Euclidean distance, with the application representative feature values obtained in the first phase 100, and the application whose feature value is nearest is taken as the application to which the connection belongs. If the connection already has an entry in the PAT, the application to which it belongs is determined directly from that entry.
Finally, if a connection has no entry in the PAT and its distance from every entry in the set of application feature values obtained in the first phase 100 is too large to determine which application it belongs to, the connection is classified as an "unknown application". In the final step 240, application, the packets awaiting identification are thus determined to be either "a known class of application" or "an unknown application".
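The averaging of step 130 and the nearest-representative decision of step 230 can be sketched together. This is an illustrative sketch only: the function names, the `max_distance` threshold, and the `"unknown"` label are assumptions, since the text says only that a connection whose distance is "too large" is treated as unknown.

```python
import math


def average_vectors(vectors):
    """Component-wise mean of equal-length feature vectors (training, step 130)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify(vector, representatives, max_distance):
    """Return the nearest application name, or "unknown" if none is close enough.

    representatives -- dict mapping application name -> representative vector
    max_distance    -- assumed cut-off beyond which no application is assigned
    """
    best_app, best_dist = "unknown", float("inf")
    for app, rep in representatives.items():
        dist = euclidean(vector, rep)
        if dist < best_dist:
            best_app, best_dist = app, dist
    return best_app if best_dist <= max_distance else "unknown"
```

Note that packet sizes and proportions sit on very different numeric scales; the text does not say whether the components are normalized before the distance is taken, so this sketch leaves them as they are.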
FIG. 2 is a schematic diagram of the classification process in the second phase 200 of the network traffic classification method according to an embodiment of the present invention; its steps are the same as those in FIG. 1. The classification process includes: step 205, taking in real packet traffic from the network; step 210, splitting the packet traffic into multiple connections; and step 220, looking up each connection in the port association table. If the connection already has an entry in the table, the final step 240, application, determines it to be program A or program B. If it has no entry, the process enters the packet identification of step 230 to determine it to be program A, program B, or an unknown program.
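The decision flow of FIG. 2 can be sketched roughly as follows. This is a simplified illustration: the table is assumed to be a plain dictionary keyed by the connection 4-tuple, `identify` is a hypothetical callback standing in for the packet identification of step 230, and the `"unknown"` label is an assumed convention.

```python
def classify_connection(key, table, identify):
    """Decide which application a connection belongs to.

    key      -- (src_ip, src_port, dst_ip, dst_port) of the connection
    table    -- dict acting as a simplified port association table
    identify -- callable(key) -> application name or "unknown" (step 230)
    """
    app = table.get(key)              # step 220: is the connection already known?
    if app is not None:
        return app                    # step 240: decided directly from the table
    app = identify(key)               # step 230: fall back to packet identification
    if app != "unknown":
        table[key] = app              # remember the result for later connections
    return app
```

Consulting the table first means the more expensive feature comparison runs only for connections that have not been seen before.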
Following the steps above, in step 210, traffic splitting, after real traffic is taken in from the network, the traffic to be analyzed is split into multiple connections according to source IP (SrcIP), source port (SrcPort), destination IP (DstIP), and destination port (DstPort).

In step 220, the port association table, the connection's <SrcIP, SrcPort> and <DstIP, DstPort> are first looked up in the port association table (PAT). The PAT stores the connections that have already been identified and the session information they belong to. With <SrcIP, DstIP, SrcPort, DstPort> representing an identified connection, the following steps are performed:
1. Record that the hosts using this SrcIP and DstIP are using the identified application.
2. Record the connection's SrcPort and DstPort in the PAT.
3. If another connection matches <SrcPort, SrcPort+1> or <DstPort, DstPort+1>, that connection is deemed to belong to the same session.
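One way to read the three steps above is as a small lookup structure. The sketch below is an interpretation, not the patent's data layout: the class and method names are hypothetical, a match on either endpoint is assumed to be sufficient, and the `port + 1` rule is modelled by checking whether the same host is already recorded at `port - 1`.

```python
class PortAssociationTable:
    """Simplified port association table (PAT) keyed by (ip, port) endpoints."""

    def __init__(self):
        self._endpoints = {}  # (ip, port) -> application name

    def record(self, src_ip, src_port, dst_ip, dst_port, app):
        """Steps 1 and 2: remember both endpoints of an identified connection."""
        self._endpoints[(src_ip, src_port)] = app
        self._endpoints[(dst_ip, dst_port)] = app

    def lookup(self, src_ip, src_port, dst_ip, dst_port):
        """Return the application for a connection, or None if it is not known."""
        endpoints = ((src_ip, src_port), (dst_ip, dst_port))
        # Direct hit on either endpoint.
        for endpoint in endpoints:
            if endpoint in self._endpoints:
                return self._endpoints[endpoint]
        # Step 3: a known endpoint on the same host at the preceding port
        # is taken to mean that this connection is part of the same session.
        for ip, port in endpoints:
            neighbour = (ip, port - 1)
            if neighbour in self._endpoints:
                return self._endpoints[neighbour]
        return None
```

A connection from a host recorded at port 5000 that reappears on port 5001 would then be attributed to the same application without recomputing its features.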
In step 230, packet identification, the feature value of each connection split out in step 210 is computed, and its Euclidean distance to each application representative feature value in the set obtained in the first phase is calculated. If the packet size distribution of a connection resembles an application's representative feature value, the Euclidean distance between them will be small, so it is possible to determine which application each connection most resembles. Session association analysis is also performed on the identified connections.
Connections that belong to the same session are grouped together to obtain more complete information.

The above describes the procedure for matching a single application. If several applications need to be matched, the procedure simply needs to be carried out once for each application.

In summary, the present invention is a method of classifying the application to which network traffic belongs, using the packet size distribution of each connection combined with port association as the means of identification.
The feature value (a vector) of each connection is matched against known representative feature points for identification, [...] packet content identification [...], providing an identification mechanism that can be used on an online gateway.

The embodiments described above merely illustrate the technical ideas and features of the present invention, so that those skilled in the art can understand and practice it; they do not limit the scope of the invention. All equivalent changes or modifications made in the spirit disclosed by the present invention shall fall within the scope of the patent.

[Brief Description of the Drawings]
FIG. 1 shows the steps of a network traffic classification method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of the classification process of a network traffic classification method according to an embodiment of the present invention.

[Description of Main Reference Numerals]
100          First phase
200          Second phase
S110-S140    Steps of the training process
S205-S240    Steps of the classification process