JP5856988B2 - Communication classification apparatus, method, and program

Info

Publication number: JP5856988B2
Application number: JP2013019824A
Authority: JP
Inventors: 下田 晃弘; 石橋 圭介
Original assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Current assignee: Nippon Telegraph and Telephone Corp; NTT Inc
Priority date: 2013-02-04
Filing date: 2013-02-04
Publication date: 2016-02-10
Anticipated expiration: 2033-02-04
Also published as: JP2014154888A

Description

The present invention relates to a communication classification apparatus, method, and program, and in particular to a communication classification apparatus, method, and program for classifying the wide variety of communications flowing on a network into communications caused directly by a user's operation and communications sent mechanically in conjunction with them.

More specifically, it relates to a technique that makes per-user communication analysis easier by aggregating multiple communications issued by the same user and classifying them.

In recent years, various communication classification methods suited to particular analysis purposes have been devised in order to analyze diversified communication traffic efficiently.

Common examples of communication aggregation include flows, which aggregate packets having the same source and destination addresses, and sessions, which aggregate the series of packets exchanged between two communicating nodes from the start to the end of a protocol's state transitions, as typified by TCP. These methods aggregate packets into more meaningful communication units based on the nature of the protocol. This simplifies certain analysis processing and reduces the storage needed to save communication logs. On the other hand, the number of flows and sessions generated varies greatly with the communication software, the destination network, and the structure of the content. Therefore, when a higher-level analysis such as user behavior analysis is attempted, the communications need to be aggregated further.

The technique of Non-Patent Document 1 attempts a higher-order aggregation of communications by focusing on Web server communication logs and user behavior patterns. Specifically, based on the time information recorded in the Web server log, the user's IP address, and information on the accessed content, it measures how long each user's continuous access lasts. It then detects the duration and the boundaries of each user's Web sessions and attempts to aggregate communications on the basis of those user sessions.

D He, A Goker, "Detecting session boundaries from web user logs", Proceedings of the BCS-IRSG 22nd annual colloquium on information retrieval research, pp.57--66, 2000.

However, the technique of Non-Patent Document 1 has the following three problems.

1) In this prior study, the purpose of separating sessions by time interval is to measure session duration. The communication cluster proposed by the present invention aims to make the relationships between communications visible by aggregating multiple communications into a cluster; it does not aim to obtain the duration of the communications. Although both simply use time intervals to separate communications, the present invention and the prior study differ in purpose.

2) This prior study uses only the time interval between adjacent log entries as the criterion for separating each user's sessions. It does not separate communications on the basis of which communication triggered each session. Because the method cannot identify the communication that triggered a session, it cannot distinguish communications caused by an actual user operation from the mechanically generated communications that software issues as a consequence.

3) Because this prior study uses Web server logs, the measurement point must be a server located at the edge of the network. The method therefore cannot be applied to measurement at intermediate nodes in the network. In addition, the protocols that can be measured are limited to HTTP (Hypertext Transfer Protocol) and HTTPS (Hypertext Transfer Protocol over Secure Socket Layer), so the method lacks generality with respect to the target protocol.

The present invention has been made in view of the above points. Its object is to provide a communication classification apparatus, method, and program that, based on a communication log collected at an arbitrary measurement point on a network and without being limited to a specific communication protocol, distinguish communications caused by user actions from mechanically caused communications and aggregate multiple communications, thereby producing a classification result suited to user behavior analysis.

In order to solve the above problem, the present invention (claim 1) is a communication classification apparatus for classifying the wide variety of communications flowing on a network into communications caused directly by a user's operation (hereinafter, "representative communications") and communications sent mechanically in conjunction with them (hereinafter, "dependent communications"), the apparatus comprising:
input means for, when communication data obtained by capturing communications flowing on the network, or a collected communication log, is input, storing it in communication data storage means in input order;
time interval comparison means for acquiring two consecutive communication data X_n and X_{n+1} from the communication data storage means and, when the time interval between X_n and X_{n+1} is equal to or greater than a predetermined threshold Tc, determining that the two communication data belong to different communication clusters and that X_{n+1} is a representative communication, and, when the time interval is smaller than Tc, determining that they belong to the same communication cluster and that X_{n+1} is a dependent communication;
representative communication identification means for acquiring from the communication data storage means the communication data X_{n+2} that follows the communication data X_{n+1} determined to be a representative communication and, when the difference between X_{n+2} and X_{n+1} is smaller than a predetermined representative communication identification threshold Tf, making X_{n+1} a dependent communication; and
result storage means for storing the classification results of the time interval comparison means and the representative communication identification means in classification result storage means together with a communication identifier that uniquely indicates the communication data.

The present invention (claim 2) further comprises communication classification complementing means that, when the communication clusters in the classification result storage means include an incomplete cluster lacking a representative communication, takes as the representative communication of the incomplete cluster the representative communication of the complete communication cluster in the classification result storage means whose group of dependent communications has a similarity to that of the incomplete cluster that exceeds a predetermined threshold and is the highest.

In the present invention (claim 3), the communication data storage means further includes means for deleting a user's communication data when no communication data for that user is observed for a certain time or longer.

As described above, the present invention makes it possible, based on a communication log collected at an arbitrary measurement point on a network and without being limited to a specific communication protocol, to distinguish communications caused by user actions from mechanically caused communications and to aggregate multiple communications, and thus to obtain a classification result suited to user behavior analysis.

FIG. 1 is a diagram showing outline 1 of the communication analysis method in one embodiment of the present invention.
FIG. 2 is a diagram showing outline 2 of the communication analysis method in one embodiment of the present invention.
FIG. 3 is a diagram showing outline 3 of the communication analysis method in one embodiment of the present invention.
FIG. 4 is a configuration example of the communication classification apparatus in one embodiment of the present invention.
FIG. 5 is a flowchart of the operation of the communication classification unit in one embodiment of the present invention.
FIG. 6 is an example of output data of the communication classification unit and the communication classification complementing unit in one embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

First, an outline of the present invention is given.

The communications to which the present invention is applicable are those in which, when attention is paid to the communications sent by a particular user, a specific communication acts as a trigger and further communications are observed to follow it continuously and in a concentrated manner. Examples include connection requests in TCP (Transmission Control Protocol) communication, DNS (Domain Name System) query request packets, and HTTP sessions.

The present invention identifies, within the series of communications that a user generates to achieve a particular purpose, the communication that acts as the initial trigger and the communications that arise in dependence on it.

When the target communication uses a stateful protocol, the first communication start request packet in the series of packets that make up each session is extracted and used as input to the system of the present invention. Examples are the packet with the SYN flag set for TCP communication and the GET request packet for HTTP communication.

When the target communication uses a stateless protocol, each packet with which the user makes a request to a server is used as input to the system of the present invention. For DNS communication, for example, this is the DNS query request packet.
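The input selection described above can be sketched as a simple packet filter. The following is a minimal illustration only, not part of the disclosed apparatus: the `Packet` record, its field names, and the protocol labels are hypothetical, and treating only a SYN without ACK as the TCP connection request is an added assumption (the text says only "the packet with the SYN flag set").

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical, already-parsed packet summary (not a real capture API)."""
    protocol: str              # "tcp", "http", or "dns" (labels are assumptions)
    syn: bool = False          # TCP SYN flag
    ack: bool = False          # TCP ACK flag
    http_method: str = ""      # e.g. "GET" for an HTTP request
    dns_is_query: bool = False # True for a DNS query, False for a response


def is_classifier_input(p: Packet) -> bool:
    """Return True if the packet should be fed to the classifier."""
    if p.protocol == "tcp":
        # Stateful: keep only the connection request.
        # Assumption: SYN without ACK, so the server's SYN-ACK is excluded.
        return p.syn and not p.ack
    if p.protocol == "http":
        # Stateful: the text names the GET request as the example.
        return p.http_method == "GET"
    if p.protocol == "dns":
        # Stateless: every request from the user is an input.
        return p.dns_is_query
    return False
```

Filtering at this stage keeps one event per session or request, so the later time-interval comparison operates on request events rather than on every packet.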

Next, the communication classification method of the present invention is described.

FIG. 1 shows an outline of the communication analysis method in one embodiment of the present invention.

In a continuous series of communications issued by the same user for the same purpose, the communication that acts as the initial trigger is called the "representative communication", and the communications that follow in dependence on it are called "dependent communications".

A communication cluster in the figure consists of one representative communication and multiple dependent communications. The ultimate goal of the present invention is to classify the communications in a communication log accurately into communication clusters.

Looking at communications A1 to A3 in FIG. 1, communications A2 and A3 occur in succession after communication A1, so A1 is the representative communication and A2 and A3 are dependent communications. The same applies to communications C1 to C3.

Communication B1 is not continuous with the communications before and after it and is separated from them by a sufficient time, so it forms an independent communication cluster (communication cluster B) containing a single representative communication. This "sufficient time" is an important parameter for classifying communication clusters and is described later with reference to FIG. 2.

As with D1 to D4, there are cases in which all communications are regarded as dependent communications. Such a cluster is treated as an incomplete communication cluster. Details of such cases are given in the figure descriptions below, and an applied technique that uses incomplete clusters is also described later.

[1] Communication classification method:
The communication classification method is described below.

FIG. 2 shows outline 2 of the communication analysis method in one embodiment of the present invention. In the figure, thick arrows indicate representative communications and broken arrows indicate dependent communications.

FIG. 2 shows the intervals t1 to t9 between the communications X1 to X10. The communication cluster to which each communication belongs is assumed to be still unknown.

(1) Classification into communication clusters:
To classify communications into communication clusters, an inter-cluster threshold Tc is set as a parameter. Tc represents the minimum time interval at which two consecutive communications are regarded as belonging to different communication clusters.

When the time interval between any two consecutive communications is equal to or greater than Tc, the two communications are regarded as belonging to different communication clusters. For example, in FIG. 2, when t1 ≧ Tc, X1 and X2 are regarded as belonging to different communication clusters. When t2 < Tc, on the other hand, X2 and X3 are regarded as belonging to the same communication cluster.
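The Tc rule can be illustrated with a short sketch. This is an illustration under stated assumptions rather than the disclosed implementation: the function name is hypothetical, timestamps are plain numbers for one user in ascending order, and the first communication, which has no predecessor, is assumed to open the first cluster.

```python
from typing import List


def split_into_clusters(timestamps: List[float], tc: float) -> List[List[int]]:
    """Group indices of time-ordered timestamps into communication clusters.

    A gap of tc or more between consecutive communications starts a new
    cluster; a smaller gap keeps the communication in the current cluster.
    """
    clusters: List[List[int]] = []
    for i, t in enumerate(timestamps):
        # Assumption: the very first communication opens the first cluster.
        if i == 0 or t - timestamps[i - 1] >= tc:
            clusters.append([i])
        else:
            clusters[-1].append(i)
    return clusters


# One isolated request, a burst of three, then another isolated request.
print(split_into_clusters([0.0, 10.0, 10.5, 11.0, 30.0], tc=5.0))
# [[0], [1, 2, 3], [4]]
```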

With the above method, when a user carries out several communications in parallel, different kinds of communications from that user may be mixed into one communication cluster, and the communication cluster may not be identified correctly. The method only obtains a rough classification of communications from the time differences between them and does not guarantee classification accuracy. Handling misclassified communication clusters is therefore outside the scope of the present invention and is left to a higher-level analysis system that uses it.

(2) Detection of representative communications:
To classify communication clusters, a representative communication identification threshold Tf is also set as a parameter. Tf is a time interval used to detect communication clusters that have no representative communication. Note that Tf < Tc.

In principle, the first communication of each communication cluster obtained by the method of (1) above is regarded as the representative communication. For example, in communication cluster B of FIG. 2, X2 is the first communication to appear, so it is regarded as the representative communication, and X3 and later are dependent communications. The exception, in which no representative communication is considered to exist, is described next.

When the time interval between the representative communication and the dependent communication that follows it is smaller than the threshold Tf, the representative communication is considered to be missing, and the communication initially taken as representative is changed to a dependent communication. In addition, the communication cluster is marked as an incomplete cluster. For example, in communication cluster C of FIG. 2, when t7 < Tf, X7 is regarded as a dependent communication. The reason is described later with reference to FIG. 3.
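The Tf check can be sketched as a pass over clusters that have already been separated by Tc. The function name and the list-of-indices representation are assumptions made for illustration; `None` stands for a cluster whose head was demoted, that is, an incomplete cluster.

```python
from typing import List, Optional


def find_representatives(
    timestamps: List[float], clusters: List[List[int]], tf: float
) -> List[Optional[int]]:
    """For each cluster, return the index of its representative communication,
    or None when the cluster is judged incomplete.

    The head of a cluster is the representative unless the next member
    follows it by less than tf, in which case the true trigger is assumed
    to be missing and the head is treated as a dependent communication.
    """
    representatives: List[Optional[int]] = []
    for members in clusters:
        head = members[0]
        if len(members) > 1 and timestamps[members[1]] - timestamps[head] < tf:
            representatives.append(None)   # incomplete cluster
        else:
            representatives.append(head)
    return representatives


ts = [0.0, 10.0, 11.0, 11.2, 30.0, 30.05, 30.1]
clusters = [[0], [1, 2, 3], [4, 5, 6]]
print(find_representatives(ts, clusters, tf=0.5))
# [0, 1, None]
```

In the example, the third cluster begins with two requests only 0.05 apart, which is shorter than Tf, so its head is not accepted as a user-triggered communication.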

(3) Parameter selection method:
The thresholds Tc and Tf are set to the values that classify communication clusters and representative communications most appropriately, determined by prior trials according to the network environment and protocol being measured. Tc and Tf need not be fixed values. They can also be determined dynamically by setting aside a learning period and learning the optimal Tc or Tf.

[2] Method of complementing representative communications:
Next, as an extension of the above technique, a method of complementing a representative communication that was not observed is described.

FIG. 3 shows outline 3 of the communication analysis method in one embodiment of the present invention. In the figure, black circles indicate representative communications and white circles indicate dependent communications.

There are cases in which the methods of FIG. 1 and FIG. 2 cannot select a representative communication. These are cases in which a communication that would normally be sent is not sent because its result is cached on the sender side. For example, in DNS communication, when the record to be queried already exists in the sender's local cache, no DNS query is issued.

The drawback of losing the representative communication is that the characteristic communication that triggered the communication cluster is missing, which makes it difficult to determine the attributes of that cluster.

The procedure for complementing a representative communication is described with reference to FIG. 3. In FIG. 3, the communication cluster in the upper half is an incomplete communication cluster whose representative communication is missing, and the three communication clusters in the lower half are complete communication clusters whose representative communications were detected.

1) When the representative communication is missing, the similarity between the group of dependent communications of that communication cluster and the groups of dependent communications of other communication clusters is determined. Whether a representative communication is missing can be determined because its presence or absence is recorded when a representative communication is changed to a dependent communication, and that record can be consulted.

2) As a result, the representative communication of the complete communication cluster whose similarity exceeds a certain threshold and is the highest is regarded as the provisional representative communication of the incomplete communication cluster.

The communication classification apparatus of the present invention is described below.

FIG. 4 shows a configuration example of the communication classification apparatus in one embodiment of the present invention.

The communication classification apparatus 10 shown in the figure has a communication classification unit 11, memories 12A and 12B, a communication log database 13, a classification result database 14, and a communication classification complementing unit 15.

Data can be input to the communication classification apparatus 10 either by feeding communications flowing on the network in real time from the communication capture device 1, or by using already collected communications or communication logs as input data.

The memory 12A temporarily stores the input communication data for each user (for example, the source address). The memory 12B stores, in advance, the parameters for the inter-cluster threshold Tc and the representative communication identification threshold Tf, and also stores the classification results produced by the communication classification unit 11.

The communication classification unit 11 classifies the communications in the input data into communication clusters by the method described above and stores the classification results in the classification result database 14 together with communication identifiers. In doing so, it holds each user's source address (an IP address or the like) in the memory 12A for a certain period, which makes per-user classification of communications possible. Here, a "communication identifier" is a tuple of information that uniquely identifies a communication; its contents depend on the protocol of the target communication. The per-user communications held in the memory 12A are deleted from it when no communication from that user is observed for a certain time. This time is determined with the traffic volume of the target network and the performance of the system taken into account.
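The per-user retention and deletion described here can be sketched as a small in-memory buffer keyed by source address. The class name, method names, and the choice of a plain dictionary are illustrative assumptions; the text specifies only that a user's data is deleted after a period with no observed communication.

```python
from typing import Dict, List, Tuple


class PerUserBuffer:
    """Holds recent communications per source address (a stand-in for memory 12A)."""

    def __init__(self, idle_timeout: float) -> None:
        self.idle_timeout = idle_timeout
        # source address -> list of (timestamp, communication identifier)
        self._events: Dict[str, List[Tuple[float, str]]] = {}

    def add(self, source: str, timestamp: float, comm_id: str) -> None:
        """Record one communication for the given user."""
        self._events.setdefault(source, []).append((timestamp, comm_id))

    def evict_idle(self, now: float) -> List[str]:
        """Delete users whose last communication is idle_timeout or more old."""
        idle = [
            source
            for source, events in self._events.items()
            if now - events[-1][0] >= self.idle_timeout
        ]
        for source in idle:
            del self._events[source]
        return idle

    def users(self) -> List[str]:
        """Return the source addresses currently held, sorted."""
        return sorted(self._events)


buf = PerUserBuffer(idle_timeout=60.0)
buf.add("10.0.0.1", 0.0, "a")
buf.add("10.0.0.2", 50.0, "b")
print(buf.evict_idle(now=70.0))  # ['10.0.0.1']
print(buf.users())               # ['10.0.0.2']
```

A shorter timeout lowers memory use on busy networks at the cost of forgetting slow users sooner, which matches the text's remark that the value depends on traffic volume and system performance.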

FIG. 5 is a flowchart of the operation of the communication classification unit in one embodiment of the present invention.

When communication capture data from the communication capture device 1, or a communication log from an external device, is input and stored in the memory 12A (step 101), the communication classification unit 11, if communication data is present in the memory 12A (step 102, No), stores that communication data in the communication log database 13 and takes two consecutive communication data X_n and X_{n+1} out of the memory 12A (step 103). When the data is stored in the communication log database 13, at least a communication identifier that uniquely identifies the communication and time information are attached. If the time interval between the two communication data (X_{n+1} − X_n) is smaller than the inter-cluster threshold Tc stored in the memory 12B (step 104, No), X_n and X_{n+1} are determined to belong to the same communication cluster, X_{n+1} is made a dependent communication (step 105), and the classification result is stored in the memory 12B (step 110).

On the other hand, if the time interval between the two communication data (X_{n+1} − X_n) is equal to or greater than the inter-cluster threshold Tc stored in the memory 12B (step 104, Yes), X_n and X_{n+1} are determined to belong to different communication clusters, and X_{n+1} is determined to be a representative communication (step 106). Next, the data X_{n+2} that follows X_{n+1} is taken out of the memory 12A (step 107). If the value of X_{n+2} − X_{n+1} is smaller than the representative communication identification threshold Tf (step 108, Yes), X_{n+1}, which was determined to be a representative communication in step 106, is changed to a dependent communication, and the classification result in the memory 12B is updated (step 110). If the value of X_{n+2} − X_{n+1} is equal to or greater than Tf (step 108, No), the "representative communication" result determined in step 106 is stored in the memory 12B (step 110). The value of n is then incremented by 1 and the process returns to step 102 (step 111). When there is no more communication capture data or log data to be read into the memory 12A (step 102, Yes), the classification results stored in the memory 12B are stored in the classification result database 14 (step 112), and the process ends.
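The flow of FIG. 5 can be sketched for a single user's time-ordered log as follows. This is a simplified illustration, not the disclosed implementation: the function and constant names are hypothetical, the databases and memories are replaced by a returned dictionary, and the first record, which the flowchart does not address, is assumed to follow an arbitrarily long silence.

```python
from typing import Dict, List, Tuple

REPRESENTATIVE = "representative"
DEPENDENT = "dependent"


def classify_log(
    records: List[Tuple[str, float]], tc: float, tf: float
) -> Dict[str, str]:
    """Classify (communication identifier, timestamp) records of one user.

    Returns a mapping from communication identifier to its label,
    standing in for the classification result database.
    """
    if not tf < tc:
        raise ValueError("Tf must be smaller than Tc")
    results: Dict[str, str] = {}
    for i, (comm_id, t) in enumerate(records):        # steps 102, 103, 111
        # Assumption: the first record has an infinite gap before it.
        gap_before = t - records[i - 1][1] if i > 0 else float("inf")
        if gap_before < tc:                            # step 104, No
            results[comm_id] = DEPENDENT               # step 105
            continue
        label = REPRESENTATIVE                         # step 106
        if i + 1 < len(records):                       # step 107
            gap_after = records[i + 1][1] - t
            if gap_after < tf:                         # step 108, Yes
                label = DEPENDENT                      # demote the head
        results[comm_id] = label                       # step 110
    return results                                     # step 112


log = [("x1", 0.0), ("x2", 10.0), ("x3", 11.0), ("x4", 30.0), ("x5", 30.1)]
for comm_id, label in classify_log(log, tc=5.0, tf=0.5).items():
    print(comm_id, label)
# x1 representative
# x2 representative
# x3 dependent
# x4 dependent
# x5 dependent
```

Here x4 opens a new cluster but is followed by x5 only 0.1 later, so it is demoted and its cluster is left without a representative communication.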

The communication classification complementing unit 15 refers to the classification result database 14, extracts the records that have no representative communication, and, for each such incomplete communication cluster, complements the missing representative communication from another communication cluster of the user who sent the communication or from a communication cluster of another user.

The procedure for complementing a missing representative communication is shown using the example of FIG. 6. FIG. 6 shows example data for the output of the communication classification unit (A) and the output of the communication complementing unit (B). In the output of the communication classification unit (A), communication cluster A consists of communications X0, X1, … Xn, and the same applies to communication clusters B and C. For communication cluster B, however, it is assumed that the representative communication Y0 has been judged missing by the procedure described in "[1] (2) Detection of representative communications". The communication complementing unit then compares each communication of communication cluster B, whose representative communication is missing, with the communications of the other communication clusters.

Next, the method of comparing communication clusters is described. As indices for the comparison, parameters are chosen that appear in common whenever the destination server is the same, even if the users differ. Let the indices used for comparison be Pn (n: the number of each index). Examples of indices are the destination address and byte count of each communication, or the total data volume and total number of all communications in the communication cluster other than the representative communication. With these indices denoted P1 to Pn, a weight (W1 to Wn) is defined for each index. Letting S denote the similarity between communication clusters, S is expressed as the sum, over all indices, of each index's comparison result multiplied by its weight. The similarity Sab between communication cluster a and communication cluster b is derived as follows.
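
Written out from the definitions given for it, the similarity is a weighted sum of per-index matches. The form below is a reconstruction from those definitions, not the formula as printed in the source. The subscript on W_k is an assumption based on the phrase "weight corresponding to index k".

```latex
S_{ab} = \sum_{k=1}^{N} W_k \cdot \mathrm{Eq}\left(P_{ka},\, P_{kb}\right),
\qquad
\mathrm{Eq}(x, y) =
\begin{cases}
1 & \text{if } x = y \\
0 & \text{otherwise}
\end{cases}
```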

Here, N is the number of indices, and Pka denotes the value of index k in communication cluster a. Eq is a function that returns 1 if the two given index values match and 0 otherwise, and W is the weight corresponding to index k.
The similarity S between communication cluster B and each of the other communication clusters is computed. A communication cluster qualifies if its S is equal to or greater than a threshold (set arbitrarily). If multiple communication clusters qualify, the one with the largest S is selected as the complement source communication cluster. If no S reaches the threshold, it is considered that no similar representative communication exists. Assuming that communication cluster C is selected as the complement source communication cluster in FIG. 6, the representative communication Z0 of communication cluster C is used to fill in the representative communication Y0 of communication cluster B.
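
The selection of a complement source cluster can be sketched in Python as follows. This is an illustration under stated assumptions, not the method of the embodiment. Each cluster is assumed to be summarised as a tuple of index values (here destination address, total byte count, and number of subordinate communications). The names `similarity` and `pick_source`, the example weights, and the rule that the first cluster wins a tie are all hypothetical choices of this sketch.

```python
from typing import Dict, Hashable, Optional, Sequence


def similarity(a: Sequence[Hashable], b: Sequence[Hashable],
               weights: Sequence[float]) -> float:
    """Sum the weights of the indices whose values match in both clusters."""
    return sum(w for pa, pb, w in zip(a, b, weights) if pa == pb)


def pick_source(target: Sequence[Hashable],
                candidates: Dict[str, Sequence[Hashable]],
                weights: Sequence[float],
                threshold: float) -> Optional[str]:
    """Return the candidate with the largest S at or above the threshold."""
    best_name: Optional[str] = None
    best_score: Optional[float] = None
    for name, indices in candidates.items():
        score = similarity(target, indices, weights)
        # Assumption: ties keep the first qualifying cluster encountered.
        if score >= threshold and (best_score is None or score > best_score):
            best_name, best_score = name, score
    return best_name  # None means no similar representative communication


if __name__ == "__main__":
    weights = [5, 3, 2]  # hypothetical weights W1..W3
    cluster_b = ("203.0.113.5", 1200, 8)  # incomplete cluster
    others = {
        "A": ("198.51.100.7", 1200, 3),
        "C": ("203.0.113.5", 1200, 8),
    }
    print(pick_source(cluster_b, others, weights, threshold=6))
```

In this example cluster A matches only on byte count (S = 3) and falls below the threshold, while cluster C matches on every index (S = 10) and is selected, mirroring the FIG. 6 case in which Z0 is used as Y0.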

In the communication classification complementing unit 15 described above, a missing representative communication may be erroneously filled in with a communication that has no direct relation to it. To address this, when the similarity described above falls below a certain value, the incomplete communication cluster is discarded in subsequent analysis processing (outside the scope of the present invention), so that other communication clusters are not affected.

The operations of the components of the communication classification apparatus shown in FIG. 4 above can be built as a program, which can be installed and executed on a computer used as the communication classification apparatus, or distributed via a network.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible within the scope of the claims.

Description of symbols
1 Communication capture device
10 Communication classification apparatus
11 Communication classification unit
12A, 12B Memory
13 Communication log database
14 Classification result database
15 Communication classification complementing unit

Claims

A communication classification apparatus that classifies the wide variety of communications flowing on a network into communications directly attributable to user operations (hereinafter, "representative communications") and communications transmitted mechanically (hereinafter, "subordinate communications"), the apparatus comprising:
input means for storing, in communication data storage means in input order, communication data obtained by capturing communications flowing on the network, or a collected communication log, when such data is input;
time interval comparison means for acquiring two consecutive communication data X_n and X_{n+1} from the communication data storage means, determining, when the time interval between the communication data X_n and X_{n+1} is equal to or greater than a predetermined threshold Tc, that the two communication data belong to different communication clusters and that the communication data X_{n+1} is a representative communication, and determining, when the time interval is smaller than the threshold Tc, that the two communication data belong to the same communication cluster and that X_{n+1} is a subordinate communication;
representative communication identification means for acquiring, from the communication data storage means, the communication data X_{n+2} following the communication data X_{n+1} determined to be the representative communication, and making the communication data X_{n+1} a subordinate communication when the difference between the communication data X_{n+2} and the communication data X_{n+1} is smaller than a predetermined representative communication identification threshold Tf; and
result storage means for storing, in classification result storage means, the classification results of the time interval comparison means and the representative communication identification means together with a communication identifier that uniquely identifies the communication data.

The communication classification apparatus according to claim 1, further comprising communication classification complementing means that, when the communication clusters in the classification result storage means include an incomplete cluster lacking a representative communication, sets, as the representative communication of the incomplete cluster, the representative communication of the complete communication cluster whose group of subordinate communications has the highest similarity among the other communication clusters in the classification result storage means, the similarity exceeding a predetermined threshold.

The communication classification apparatus according to claim 1, wherein the communication data storage means includes means for deleting the communication data of a user when no communication data for that user has been observed for a certain period of time.

A communication classification method for classifying the wide variety of communications flowing on a network into communications directly attributable to user operations (hereinafter, "representative communications") and communications transmitted mechanically (hereinafter, "subordinate communications"),
the method being performed in an apparatus having input means, time interval comparison means, representative communication identification means, and result storage means, and comprising:
an input step in which the input means, when communication data obtained by capturing communications flowing on the network, or a collected communication log, is input, stores it in communication data storage means in input order;
a time interval comparison step in which the time interval comparison means acquires two consecutive communication data X_n and X_{n+1} from the communication data storage means, determines, when the time interval between the communication data X_n and X_{n+1} is equal to or greater than a predetermined threshold Tc, that the two communication data belong to different communication clusters and that the communication data X_{n+1} is a representative communication, and determines, when the time interval is smaller than the threshold Tc, that the two communication data belong to the same communication cluster and that X_{n+1} is a subordinate communication;
a representative communication identification step in which the representative communication identification means acquires, from the communication data storage means, the communication data X_{n+2} following the communication data X_{n+1} determined to be the representative communication, and makes the communication data X_{n+1} a subordinate communication when the difference between the communication data X_{n+2} and the communication data X_{n+1} is smaller than a predetermined representative communication identification threshold Tf; and
a result storage step of storing, in classification result storage means, the classification results of the time interval comparison step and the representative communication identification step together with a communication identifier that uniquely identifies the communication data.

The communication classification method according to claim 4, further comprising a communication classification complementing step in which, when the communication clusters in the classification result storage means include an incomplete cluster lacking a representative communication, communication classification complementing means sets, as the representative communication of the incomplete cluster, the representative communication of the complete communication cluster whose group of subordinate communications has the highest similarity among the other communication clusters in the classification result storage means, the similarity exceeding a predetermined threshold.

The communication classification method according to claim 4, wherein, in the input step, the data of a user in the communication data storage means is deleted when no communication data for that user has been observed for a certain time or longer.

A communication classification program for causing a computer to function as each means of the communication classification apparatus according to any one of claims 1 to 3.