Detailed Description
The invention is described in further detail below with reference to the figures and examples:
FIG. 1 is a schematic diagram of the internal flow of the offline anomaly detection modeling module according to the present invention. It comprises ① preprocessing and grouping, ② time-based partitioning, ③ descriptive statistics, ④ descriptive statistics analysis, and ⑤ possible recombination. Double circles indicate the input and output of offline security anomaly detection: the original input is the alarms from security devices (e.g., firewalls, intrusion detection devices, and routers), and the final output is guidance for selecting a security anomaly detection algorithm. The gray boxes are parameters input by security analysts; different parameter values adapt the module to different application scenarios and security analysis purposes.
① Preprocessing and grouping depends primarily on the network topology and the goals of the security analyst, e.g., only a certain subnet or class of alarms may need to be monitored.
② Time-based partitioning calculates the alarm time series and segments them by time (e.g., dividing a day into day and night).
③ Descriptive statistics extracts, for each alarm time series, descriptive statistics of its distribution and of its timing dependence. The distribution is represented by the central tendency (mean, median) and the dispersion of the data (variance, interquartile range, coefficient of variation).
④ Descriptive statistics analysis analyzes the extracted descriptive statistics to infer the applicability and effectiveness of anomaly detection algorithms.
For example, if the number of alarms depends on the time of operation, descriptive statistics may be extracted separately for different time periods (e.g., day, night).
Further, the input of ① preprocessing and grouping may be any type of alarm, such as raw alarms, super alarms, or meta alarms reported by the security devices.
Preprocessing standardizes the alarm information, eliminates repeated alarms, and so on. Alarm grouping is then realized by setting the initial combination parameters. The grouping method depends on the goals of the security analyst. For example:
⑴ alarm source, i.e., the source address of the alarm;
⑵ alarm type, either a normal alarm type or a super alarm type.
Regarding ⑴ alarm sources, alarms may be either internal or external. Internal alarms mainly reflect working hours and user behavior, while external alarms are mainly characterized by variability and noise.
Regarding ⑵ alarm types, different alarm types exhibit different behaviors; placing all alarms in a single group would introduce noise that hinders security anomaly detection.
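The preprocessing and grouping step can be illustrated with a minimal sketch. The alarm record layout `(timestamp, source, alarm_type)`, the function name `group_alarms`, and the `wanted_types` filter are assumptions made for illustration; the description does not define an alarm schema.

```python
from collections import defaultdict

# Hypothetical alarm record: (timestamp, source, alarm_type).
# The field layout is illustrative only.
def group_alarms(alarms, wanted_types=None):
    """Group alarms by (source, alarm_type), dropping exact duplicates."""
    groups = defaultdict(list)
    seen = set()
    for alarm in alarms:
        if alarm in seen:          # elimination of repeated alarms
            continue
        seen.add(alarm)
        _, source, alarm_type = alarm
        if wanted_types is not None and alarm_type not in wanted_types:
            continue               # analyst monitors only selected types
        groups[(source, alarm_type)].append(alarm)
    return dict(groups)

sample = [
    (0, "wired", "trojan"),
    (0, "wired", "trojan"),   # duplicate, removed by preprocessing
    (1, "wifi", "trojan"),
    (2, "external", "scan"),  # filtered out by wanted_types below
]
grouped = group_alarms(sample, wanted_types={"trojan"})
```

Each key of `grouped` then corresponds to one alarm group handed to the time-based partitioning step.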
The output of ① preprocessing and grouping is N alarm groups, i.e., $A_1, A_2, \ldots, A_N$. For example, considering the alarms generated by an enterprise IT network over 5 months, the alarms may be classified according to the previously defined criteria:
alarm source: wired-device alarms, WiFi alarms, and external alarms;
alarm type: Trojan, etc.
The wired alarms and the WiFi alarms are monitored separately because the PC clients of most internal employees and all servers are connected by wire, whereas most users of notebook computers and smartphones (including guests) are connected wirelessly. In addition, most network and WiFi devices enforce policies under which some PCs (or notebooks) can only access Web and mail applications. For these reasons, it is desirable for a security alarm analysis system to be able to derive different historical behaviors from the alarms generated by wired hosts and those generated by wireless hosts.
The selection of alarm types is related to the number of alarms of each type. FIG. 2 gives the percentage of alarms generated for each type (types accounting for less than 1% of the alarms are disregarded). As seen from FIG. 2, 80% of the generated alarms are of the Trojan type. This result is credible because the enterprise does not directly monitor most host devices. The flow of FIG. 1 can be applied to every alarm group regardless of its number of alarms, which is very useful for automatic analysis involving a large number of alarm groups. In the following, the example mainly considers the three most active alarm groups: wired Trojan, wireless Trojan, and external Trojan.
Further, the input of ② time-based partitioning is the alarm groups $A_1, A_2, \ldots, A_N$. In preparation for extracting descriptive statistics, it performs three operational steps: alarm time series calculation, active/inactive alarm sequence tagging, and time-based segmentation.
For each alarm group $A_i$, computing the alarm time series $X_i$ requires two input parameters:
⑴ the time window w, which determines the amount of alarm data to be analyzed;
⑵ the time granularity g, which is the minimum time unit over which alarms are counted (e.g., a time series of alarms per day, per hour, or per minute).
The above parameters are input by the security analyst and depend on the scenario and the analysis objective. For example, if the objective is to find out on which days anomalies occur, or to support situational awareness of the alarms, the time granularity g may be one day (the number of alarms per day) with a time window w of 6 months or more. On the other hand, if the objective is to evaluate whether day and night have different alarm distributions, the time granularity g may be one hour or less with a time window w of 1 month or more. In the context of security analysis, too fine a granularity g (e.g., seconds) should be avoided.
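As a minimal sketch of the alarm time series calculation, the function below counts alarms per bucket of length g inside a window of length w. The function name, the use of integer timestamps, and the choice of seconds as the common unit are assumptions of the sketch.

```python
def alarm_time_series(timestamps, start, window, granularity):
    """Count alarms per bucket of length `granularity` in [start, start + window).

    All arguments share one time unit (seconds here); that unit is an
    assumption of this sketch, not something fixed by the description.
    """
    n_buckets = window // granularity
    counts = [0] * n_buckets
    for ts in timestamps:
        index = (ts - start) // granularity
        if 0 <= index < n_buckets:   # ignore alarms outside the window w
            counts[index] += 1
    return counts

HOUR = 3600
series = alarm_time_series(
    [10, 20, 3700, 7300, 7400, 90000],
    start=0, window=3 * HOUR, granularity=HOUR,
)
```

Here the alarm at 90000 s falls outside the 3-hour window and is not counted.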
Then ② time-based partitioning evaluates whether each alarm time series $X_i$ is active within the time window w. The purpose of this step is mainly to remove inactive time series, so that they are excluded from further analysis. As the criterion, an alarm time series is considered active if alarms are generated in 50% or more of its time intervals, i.e., $\mathrm{median}(X_i) > 0$. Other criteria and thresholds for filtering inactive alarm sequences may be used, depending on the security analysis objectives and the conditions of the enterprise IT system.
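The activity check can be sketched in a few lines, assuming the criterion is that the median of the per-interval alarm counts is greater than zero. The function name `is_active` is hypothetical.

```python
from statistics import median

def is_active(series):
    """Return True when the median alarm count per interval is above zero."""
    return median(series) > 0

busy = [0, 0, 3, 5, 1]    # alarms in most intervals
quiet = [0, 0, 0, 4, 9]   # alarms in fewer than half of the intervals
```

With these inputs `is_active(busy)` is true and `is_active(quiet)` is false, so only the first series would be passed on to segmentation.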
After the alarm time series $X_i$ has been calculated, and if it is active, it is further divided according to the input time combination parameter $C$, where $C$ is defined as a set of time intervals (e.g., day, night). The alarm time series $X_i$ is divided into M subsequences $X_i^j$, $j \in \{1, 2, \ldots, M\}$. On the other hand, if the security analyst has no particular expectation about the time-series behavior of the alarms, a fine-grained time combination $C$ may be defined for all alarm groups (e.g., hourly segmentation in general). This is possible because ⑤ possible recombination can automatically suggest coarser-grained temporal recombinations by analyzing the descriptive statistics extracted in ③ descriptive statistics.
The output of ② time-based partitioning is the M subsequences $X_i^j$ and the sequence $X_i$, i.e., M + 1 alarm sequences are output for each alarm group $A_i$.
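The time-based segmentation can be sketched as follows for an hourly series and four time combinations. The 08:00-20:00 day boundary, the rule that Saturday and Sunday are holidays, and the label strings are all assumptions of this sketch; in practice they would come from the analyst's time combination parameter.

```python
from datetime import datetime, timedelta

def time_label(ts):
    """Map a timestamp to one of four hypothetical time combinations."""
    day_type = "holiday" if ts.weekday() >= 5 else "workday"  # assumed rule
    part = "day" if 8 <= ts.hour < 20 else "night"            # assumed boundary
    return f"{day_type}-{part}"

def segment(series, start, granularity=timedelta(hours=1)):
    """Split an alarm series into subsequences keyed by time label."""
    subsequences = {}
    for i, count in enumerate(series):
        label = time_label(start + i * granularity)
        subsequences.setdefault(label, []).append(count)
    return subsequences

# 2024-01-05 is a Friday, so 48 hourly counts cover a workday and a holiday.
hourly = list(range(48))
parts = segment(hourly, datetime(2024, 1, 5, 0, 0))
```

The dictionary `parts` holds the M = 4 subsequences; together with the full series `hourly` this gives the M + 1 sequences passed to the descriptive statistics step.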
Now, considering the previous example again, the 3 most active alarm groups are of main interest: wired Trojan, WiFi Trojan, and external Trojan. The time window w examined is 5 months and the time granularity g is 1 hour. This granularity allows the temporal behavior of different time intervals to be examined. FIG. 3 shows the hourly time series of the wired, WiFi, and external Trojan alarms. The X-axis represents time (hours), and the Y-axis represents the number of reported alarms (0-800 alarms/hour). Because the median of each of these three alarm sequences is greater than zero ($\mathrm{median}(X_i) > 0$, $i = 1, 2, 3$), they are active. As can be seen from FIG. 3, the WiFi Trojan sequence is the most active, the wired Trojan sequence is next, and the external Trojan sequence is the weakest.
Further, the inputs of ③ descriptive statistics are the alarm time series $X_i$ and its M subsequences $X_i^j$. This module extracts 3 sets of descriptive statistics, relating to the random distribution, the timing dependence, and the stability.
Random distribution. The characteristics of a distribution have 2 main attributes: central tendency and dispersion. For highly dynamic application scenarios, the following statistics are examined, which can be represented visually by box plots:
⑴ the median m (i.e., median(X)), which indicates the central tendency of the data;
⑵ the interquartile range iqr, which represents the dispersion around the central tendency.
To show the impact of outliers on data dispersion, the coefficient of variation $c_v = \sigma / \mu$ is also examined, where $\mu$ and $\sigma$ are respectively the mean and standard deviation of the distribution to which the alarm sequence belongs. A high $c_v$ indicates that the alarm sequence is dispersed and/or contains outliers, whereas a small $c_v$ indicates a concentrated (convergent) distribution.
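The three distribution statistics can be computed with the Python standard library, as in the sketch below. It assumes the usual definition of the coefficient of variation (standard deviation divided by mean), the population standard deviation, and the "inclusive" quantile method; none of these choices is fixed by the description.

```python
from statistics import mean, median, pstdev, quantiles

def distribution_stats(series):
    """Return median m, interquartile range iqr and coefficient of variation cv."""
    q1, _, q3 = quantiles(series, n=4, method="inclusive")
    mu = mean(series)
    # Assumed definition: population standard deviation over the mean.
    cv = pstdev(series) / mu if mu else float("inf")
    return {"m": median(series), "iqr": q3 - q1, "cv": cv}

calm = distribution_stats([10, 11, 9, 10, 10, 11, 9, 10])
bursty = distribution_stats([10, 11, 9, 10, 10, 11, 9, 200])
```

Both series have the same median, but the single burst of 200 alarms raises the coefficient of variation of `bursty` well above that of `calm`, which is the effect of outliers that this statistic is meant to expose.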
Again, for the most active alarm sequences (wired Trojan, WiFi Trojan, external Trojan), consider the time combination $C$ = {workday (day), workday (night), holiday (day), holiday (night)}. FIG. 4 gives the box plots for these time combinations, where the X-axis represents the time division (day, night) and the Y-axis represents the number of alarms per time unit (e.g., the number of alarms reported per hour). Each box plot gives the following statistical properties: lower quartile (q1), median (m), upper quartile (q3), interquartile range (iqr = q3 - q1), lower whisker (conventionally q1 - 1.5·iqr), and upper whisker (conventionally q3 + 1.5·iqr). All values above the upper whisker or below the lower whisker can be considered outliers.
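A minimal sketch of the box plot statistics follows. The whisker factor of 1.5 is the conventional Tukey choice and is an assumption here, as are the function name and the "inclusive" quantile method.

```python
from statistics import quantiles

def box_plot_stats(series, whisker_factor=1.5):
    """Quartiles, whiskers and outliers of one alarm subsequence.

    whisker_factor=1.5 is the conventional value and an assumption here.
    """
    q1, q2, q3 = quantiles(series, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - whisker_factor * iqr
    upper = q3 + whisker_factor * iqr
    outliers = [x for x in series if x < lower or x > upper]
    return {"q1": q1, "median": q2, "q3": q3, "iqr": iqr,
            "lower": lower, "upper": upper, "outliers": outliers}

stats = box_plot_stats([4, 5, 5, 6, 6, 7, 7, 8, 60])
```

For this sample the hourly count of 60 lies above the upper whisker and is reported as the only outlier.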
FIG. 5 shows the value of the coefficient of variation $c_v$ for the different time combinations. This statistic is useful for capturing the variability of the data.
As seen in FIG. 4, most of the alarms during workday daytime are generated by WiFi Trojans. On the other hand, WiFi Trojan alarms decrease during holiday daytime, and there are almost no alarms during holiday nights. As can be seen from FIG. 5, the coefficient of variation of the WiFi Trojan alarms is low during workday daytime and higher for the other time combinations, which indicates that those alarm sequences are noisy and/or contain some outliers.
In all four time combinations of FIG. 4, the wired Trojan alarms exhibit similar central tendency (m) and dispersion (iqr), with somewhat higher values during workday daytime. However, on workdays, both during the day and at night, there are high outliers. These outliers are almost an order of magnitude higher than the central tendency; as can be seen from FIG. 5, the coefficient of variation is correspondingly high.
On the other hand, external Trojan alarms are distributed almost equally during the day and at night, with somewhat lower values during workday daytime, which may be related to attacks originating from different time zones. The dispersion of the external Trojan alarms is low, and the coefficient of variation is close to 1.5 in all time combinations. This suggests that the external Trojan alarm sequence is independent of the time of detection and can be merged into a single time combination (no distinction between workdays and holidays or between day and night).
Timing dependence. Descriptive statistics related to timing dependence are useful for regression-based anomaly detection. An alarm sequence exhibits timing dependence if it shows a trend, periodicity, or seasonality. The trend is a general systematic component, and over sufficiently long time frames a time series may also show periodic or seasonal patterns.
To extract the descriptive statistics of timing dependence, filtering and autocorrelation time series analysis techniques are employed. Filtering reduces the noise of the time series; such noise may hide trends and temporal patterns that are useful for modeling anomaly detection. A simple filtering technique is employed here, because more advanced filtering techniques may change the nature of the data. For this reason, the present invention employs simple moving average (SMA) filtering based on a centered window of radius r hours. For clarity, let $X$ be an alarm time series and $x_t$ the number of alarms at time t (e.g., if the time granularity g equals 1 day, then $x_t$ is the number of alarms on day t). SMA filtering generates a new sequence SMA(t), in which each value $x_t$ of the alarm sequence $X$ is replaced by the average over the window formed by $x_t$ and its 2r neighbors:
$$\mathrm{SMA}(t) = \frac{1}{2r+1} \sum_{j=-r}^{r} x_{t+j}$$
where $x_t$ is the number of alarms at time t, and 2r + 1 is the size of the moving average window. The invention proposes light filtering with radius r = 1 or strong filtering with radius r = 5.
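The centered moving average can be sketched as below. The description does not say how the ends of the series are handled; truncating the window to the available samples is an assumption of this sketch, as is the function name.

```python
def sma_filter(series, r):
    """Centered simple moving average with radius r (window size 2r + 1).

    Near the edges the window is truncated to the samples that exist;
    this boundary handling is an assumption of the sketch.
    """
    n = len(series)
    smoothed = []
    for t in range(n):
        window = series[max(0, t - r): min(n, t + r + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed

raw = [1, 2, 3, 4, 5]
light = sma_filter(raw, r=1)    # light filtering
strong = sma_filter(raw, r=5)   # strong filtering
```

With r = 5 the window covers this whole short series, so every value collapses to the overall mean, which illustrates why stronger filtering exposes trends at the cost of detail.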
After filtering, the following autocorrelation function (ACF) is calculated:
$$\mathrm{ACF}(\tau) = \frac{E\left[(x_t - \mu)(x_{t+\tau} - \mu)\right]}{\sigma^2}$$
where $\tau$ is the autocorrelation lag (time interval), $X$ is the alarm time series with values $x_t$, E is the mathematical expectation operator, and $\mu$ and $\sigma^2$ are the mean and variance of $X$. When the autocorrelation is high and decays slowly, future values are correlated with historical values; conversely, when the autocorrelation between two values tends to zero, they are uncorrelated. If the ACF at lag k is above the predictability threshold (0.3 in the example of FIG. 6), the time series is considered predictable with sufficient prediction accuracy within the k-th window. When this condition is satisfied, regression-based anomaly detection algorithms can be used effectively.
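The sketch below estimates the sample autocorrelation and derives a predictable interval from it. Two points are assumptions rather than statements of the description: the biased sample estimator (dividing by the series length) and the reading of k as the largest lag up to which the ACF stays at or above the threshold.

```python
from statistics import mean, pvariance

def acf(series, lag):
    """Sample autocorrelation at `lag` (biased estimator, an assumption)."""
    n = len(series)
    mu = mean(series)
    var = pvariance(series, mu)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - mu) * (series[t + lag] - mu)
              for t in range(n - lag)) / n
    return cov / var

def predictable_interval(series, threshold=0.3, max_lag=None):
    """Largest k such that ACF(lag) >= threshold for all lags 1..k (assumed reading)."""
    if max_lag is None:
        max_lag = len(series) // 2
    k = 0
    for lag in range(1, max_lag + 1):
        if acf(series, lag) < threshold:
            break
        k = lag
    return k

trend = list(range(100))            # steadily growing alarm counts
flip = [0, 1] * 50                  # alternating counts, no persistence
daily = ([0] * 12 + [10] * 12) * 10  # 24-sample period
```

For `daily`, the ACF is strongly positive at lag 24 and negative at lag 12, the kind of pattern the description associates with a 24-hour period.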
Unlike the descriptive statistics of the random distribution, the statistics of timing dependence are extracted only from the entire alarm time series $X_i$, because the autocorrelation function requires a continuous alarm time series to identify predictability, trends, and periodicity.
In particular, with respect to timing dependence, the present invention extracts the following descriptive statistics:
⑴ the value k of the predictable interval;
⑵ the main period T of the time series $X_i$, if any.
There may be multiple periods (e.g., 24 hours, 7 days), or there may be no period (in this case, T = 0). Note again that each statistic can be extracted whether or not filtering is applied to the alarm sequence $X_i$. That is, there are 3 configurations (no SMA filtering, weak SMA filtering, strong SMA filtering) and, accordingly, 3 pairs of values (k, T) of predictable interval and main period.
FIG. 6 shows the ACF values of the wired Trojan, WiFi Trojan, and external Trojan alarm sequences. The X-axis represents the lag $\tau$ (hours) and the Y-axis the ACF value. The vertical dashed line marks the 24-hour lag, while the horizontal dashed line marks the threshold of 0.3 used to determine whether an alarm sequence is predictable. Results are given for three configurations: no filtering, r = 1, and r = 5.
FIG. 6 (a) shows that the wired Trojan alarms have a 24-hour periodicity, which is slightly enhanced by SMA filtering but still remains below the 0.3 threshold (hence, period T = 0). Filtering slightly improves the predictable interval k, in particular for r = 5; however, the alarm sequence remains weakly correlated. On the other hand, the WiFi Trojan alarms exhibit a strong 24-hour period, which is evident even without filtering. This means that the value observed at a given hour is most likely to be observed again 24 hours later. The ACF of the external Trojan alarm sequence shows a trend component, which is enhanced by filtering so that the predictable interval k exceeds 24 hours for r = 5.
Stability of the descriptive statistics. For each alarm time series $X_i$, the median (m) and the interquartile range (iqr) are considered in order to show the stability of the descriptive statistics of its distribution. In the invention, w is defined as the time window over which the alarm time series is analyzed, and it is verified how the distribution statistics evolve within w. For this purpose, two parameters are considered: the size S of the sliding window (e.g., 1 month) and the time shift $\Delta$ (e.g., 1 week), where $\Delta < S < w$. By assigning different values to these parameters, the security analyst can assess the stability of the descriptive statistics over different periods. This information is also useful for determining how often the anomaly detection parameters should be re-evaluated. The invention calculates the values of the median and the interquartile range starting from the time interval $I_0 = [0, S]$, then $I_1 = [\Delta, S + \Delta]$, then $I_2 = [2\Delta, S + 2\Delta]$, and so on, until the entire time window w is covered. This process yields the sequences of descriptive statistics $m_t$ and $iqr_t$.
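The sliding-window evaluation can be sketched as follows, with the window size and the time shift expressed in number of samples. The function name, the half-open window convention, and the "inclusive" quantile method are assumptions of the sketch.

```python
from statistics import median, quantiles

def sliding_stats(series, window, shift):
    """Median and iqr over windows [j*shift, j*shift + window), j = 0, 1, ...

    `window` and `shift` are given in samples; windows that would run
    past the end of the series are not evaluated.
    """
    results = []
    start = 0
    while start + window <= len(series):
        chunk = series[start:start + window]
        q1, _, q3 = quantiles(chunk, n=4, method="inclusive")
        results.append((median(chunk), q3 - q1))
        start += shift
    return results

# Alarm level jumps from 1 to 5 halfway through the analysis window.
evolution = sliding_stats([1] * 8 + [5] * 8, window=8, shift=4)
```

The middle window straddles the level change, so its median moves and its interquartile range widens, which is exactly the kind of evolution the stability analysis is meant to reveal.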
FIG. 7 gives the descriptive statistics for the alarm data set. The X-axis represents the time shift and the Y-axis represents the values of the median $m_t$ and the interquartile range $iqr_t$ (number of alarms/hour). In this example, w = 5 months, S = 1 month, and the time shift is 1 week. For example, X = 0 indicates $m_0$ and $iqr_0$ of the first month; X = 1 indicates $m_1$ and $iqr_1$ of the window shifted by one week; and so on. This makes it possible to evaluate how the descriptive statistics evolve on a weekly basis.
As can be seen from fig. 7, during the initial period, the statistics of the wired trojan during the day are unstable and then stable; on the other hand, the WIFI trojan has almost no alarm at night, but the alarm is increased sharply in the daytime. The external trojan is stable throughout the cycle.
Here, a criterion is given for automatically verifying whether the descriptive statistics of the alarm distribution are stable. Let d be a descriptive statistic (e.g., iqr), and let $d_t$ be the value of d at time shift t (e.g., $iqr_5$ is the value of iqr at t = 5). To assess the stability of d, a popular dispersion measure is used: the median absolute deviation (MAD). In particular, for each descriptive statistic d, the stability index is calculated by the following formula:
$$\phi_d = \frac{\mathrm{MAD}(d_t)}{m(d)}, \qquad \mathrm{MAD}(d_t) = \mathrm{median}_t\left(\left|d_t - m(d)\right|\right)$$
where $\mathrm{MAD}(d_t)$ denotes the MAD of the values $d_t$, and the denominator $m(d) = \mathrm{median}_t(d_t)$ is a normalization factor that allows descriptive statistics of different scales to be compared. A small $\phi_d$ (close to zero) means that the descriptive statistic d is stable, and vice versa. In particular, the time series $X_i$ is considered stable when its central tendency and dispersion satisfy the following relation:
$$0 \le \max(\phi_m, \phi_{iqr}) \le \theta$$
where $\theta$ is a stability threshold that can be adjusted by the security analyst according to the IT network environment. In the application scenario of the invention, it was verified heuristically that $\theta = 0.2$ is a sufficient threshold for automatically identifying stable and unstable descriptive statistics. The formula considers the maximum of the two stability indices, because a significant change in even one statistic is sufficient to consider the distribution unstable. In FIG. 8, the daytime wired Trojan and wireless Trojan alarms are both unstable, while the stability indices of the other four distributions are below the threshold.
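A minimal sketch of the stability check is given below. It takes the per-window values of a statistic, computes their median absolute deviation, and normalizes by their median. The function names and the handling of a zero median are assumptions; the default threshold of 0.2 is the value quoted in the text.

```python
from statistics import median

def stability_index(values):
    """MAD of a statistic across time shifts, normalized by its median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if center == 0:
        # Assumed edge-case handling: all-zero statistics count as stable.
        return 0.0 if mad == 0 else float("inf")
    return mad / center

def is_stable(medians, iqrs, threshold=0.2):
    """Stable when the larger of the two stability indices is within the threshold."""
    return max(stability_index(medians), stability_index(iqrs)) <= threshold

flat_m, flat_iqr = [10, 10, 11, 9, 10], [4, 4, 5, 4, 4]
growing_m = [10, 20, 30, 40, 50]
```

Taking the maximum means that a drifting median alone, as in `growing_m`, is enough to mark the series unstable even when the interquartile range is flat.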
Distribution-based anomaly detection methods model the alarm sequence by a parametric or non-parametric distribution (e.g., Gaussian distribution, Gamma distribution), and anomalous events are those that fall in a low-probability region of the stochastic model or that deviate greatly from the distribution. These algorithms are useful only when there is no timing dependence and regression-based methods are not applicable. An alarm sequence can be modeled by a distribution only if its central tendency and dispersion remain stable. If m and iqr are both stable, a distribution-based anomaly detection method may be employed. The distribution-based algorithm may be parametric or non-parametric.
Parametric techniques are useful only when there is some evidence or knowledge of the alarm sequence distribution. For example, if the median m is stable and the data are concentrated within the interquartile range iqr, the alarm sequence can be modeled by a Gaussian distribution, although further analysis is required, e.g., a chi-square test. Other commonly used parametric distributions are the Gamma distribution and long-tail distributions. More complex distributions may be approximated by mixtures, such as MoG (mixture of Gaussians).
Non-parametric techniques are only useful if there is no a priori knowledge of the alarm sequence distribution. Common examples are histogram-based techniques and kernel-based techniques (e.g., Parzen window estimation).
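As one possible illustration of a kernel-based non-parametric technique, the sketch below estimates the density of hourly alarm counts with a Gaussian-kernel Parzen window and flags values that fall in a low-probability region. The bandwidth, the density threshold, and the function names are hypothetical tuning choices, not values from the description.

```python
import math

def parzen_density(x, samples, bandwidth):
    """Gaussian-kernel Parzen window estimate of the density at x."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return total / norm

def is_anomalous(x, samples, bandwidth=2.0, min_density=1e-3):
    """Flag x when its estimated density is below min_density.

    bandwidth and min_density are illustrative assumptions.
    """
    return parzen_density(x, samples, bandwidth) < min_density

history = [10, 11, 9, 10, 12, 10, 11, 9]   # historical alarms per hour
```

A count close to the history, such as 10, has a high estimated density, while a count of 40 lies far from every sample and is flagged.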
In addition, if m is unstable and iqr is stable, CUSUM-like methods, which use the median as the descriptive statistic, are effective for anomaly detection.
Note that the external Trojan (day, night), wireless Trojan (night), and wired Trojan (night) alarms can be modeled by distributions, whereas the mean and variance of the wireless Trojan (day) alarms grow continuously, which makes such an alarm sequence more difficult to model by a distribution. The wired Trojan alarms are unstable only in an initial period and stable thereafter; that is, the distribution-based method is effective for the wired Trojan alarms once the initial unstable period has passed.
As shown in FIG. 9, the decision flow chart evaluates the convergence index in the first step: if the alarm sequence is not convergent but has no timing dependence, the distribution-based approach is also effective for anomaly detection.
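The selection rules stated in the text can be collected into a small decision function. This sketch covers only those stated rules, not the full flow chart of FIG. 9; the function name, the order in which the rules are tested, and the "undetermined" fallback are assumptions.

```python
def suggest_method(k, period, m_stable, iqr_stable):
    """Suggest a family of anomaly detection methods.

    k          -- predictable interval (0 if the series is not predictable)
    period     -- main period T (0 if none)
    m_stable   -- True when the median is stable
    iqr_stable -- True when the interquartile range is stable
    """
    if k > 0 or period > 0:
        return "regression-based"       # timing dependence is present
    if m_stable and iqr_stable:
        return "distribution-based"     # stable central tendency and dispersion
    if not m_stable and iqr_stable:
        return "CUSUM-like"             # drifting median, stable dispersion
    return "undetermined"               # case not covered by the stated rules

choice = suggest_method(k=0, period=0, m_stable=True, iqr_stable=True)
```

Testing timing dependence first reflects the statement that distribution-based methods are useful only when regression-based methods are not applicable.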
FIG. 10 is a diagram of the distribution-based information security anomaly detection according to the present invention, which includes a real-time alarm module, a historical alarm module, an offline anomaly detection modeling module, an online anomaly detection module, and a knowledge base.
The real-time alarm module receives, in real time, the alarms reported by the various security devices through protocols such as SNMP and syslog, and sends them to both the historical alarm module and the online anomaly detection module.
The historical alarm module serves as a backup of the alarm time series and also provides alarm data to the offline anomaly detection modeling module.
The offline anomaly detection modeling module models the alarm time series and provides guidance for choosing among threshold-based, regression-based, and random-distribution-based anomaly detection methods. For the distribution-based anomaly detection method, it calculates the median m, the interquartile range iqr, the predictable interval k, the period T, and the stability indices of m and iqr, and determines whether to select the distribution-based information security anomaly detection method.
The online anomaly detection module uses the distribution-based method to detect, online and in real time, anomalies in the alarm time series reported by the real-time alarm module, and reports the detection results to a related display module or to a security analyst for further processing.
The knowledge base stores the various statistical parameters, the anomaly detection methods, their application scenarios, and the like.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention; all equivalent changes and modifications made according to the present invention are considered to be covered by the scope of the present invention.