CN119203077B

CN119203077B - A method and system for detecting and processing genuine software based on big data analysis

Info

Publication number: CN119203077B
Application number: CN202411350700.3A
Authority: CN
Inventors: 赵宇; 夏岩; 冯钦华
Original assignee: Beijing Softmap Science & Technology Co ltd
Current assignee: Beijing Softmap Science & Technology Co ltd
Priority date: 2024-09-26
Filing date: 2024-09-26
Publication date: 2025-09-05
Anticipated expiration: 2044-09-26
Also published as: CN119203077A

Abstract

The application discloses a method and a system for detecting and processing genuine software based on big data analysis, and relates to the technical field of software management. The method comprises: obtaining installation source information of each piece of installed software in a client device to be detected; verifying the installation source information of each piece of installed software against a software installation authorization library to identify whether it has a potential piracy risk; for each piece of risk installation software identified as having a potential piracy risk, acquiring its online service log and extracting multidimensional risk features from the log; and inputting the multidimensional risk features of each piece of risk installation software into a piracy identification model to determine whether that software has a piracy risk, wherein the piracy identification model is a deep learning model. By combining big data analysis with a deep learning model, the method automatically detects whether each piece of software installed on a device is genuine.

Description

Method and system for detecting and processing genuine software based on big data analysis

Technical Field

The application relates to the technical field of software management, and in particular to a method and a system for detecting and processing genuine software based on big data analysis.

Background

To combat software piracy, governments and software enterprises are actively taking measures to promote the use of genuine software. In the field of genuine software management, tools based on manual auditing and simple scanning are currently on the market to help organizations and enterprises manage how software is installed and used.

At present, genuine software is mainly detected and managed as follows: the software installed on terminal devices is first collected by manual recording or automatic scanning; the validity and authorization status of the software are then confirmed by manual analysis; and finally a report is output so that managers of the enterprise or organization can carry out further compliance checks. Manual work therefore makes up a large part of the process: entering and classifying software information and checking its authorization status all depend on extensive manual effort, which raises management costs and greatly reduces checking efficiency. Moreover, in large enterprises or government institutions, thousands of computer terminals must be checked one by one, and manual detection and management are prone to oversights that cause omitted or erroneous data.

In addition, current genuine software inspection tools have limited functionality. They focus mainly on simple scanning and data comparison, rely on hard-coded rules to judge authorization status, and lack in-depth analysis and understanding of software usage behavior. They are therefore seriously deficient in flexibility and intelligence, and their detection results are often unsatisfactory.

No satisfactory technical solution to the above problems has yet been proposed.

Disclosure of Invention

The application provides a method, a system, a storage medium, a computer program product, and an electronic device for detecting and processing genuine software based on big data analysis, which at least solve the problems of cumbersome manual operation, poor detection flexibility, and low efficiency in the related prior art, and which can detect and manage genuine software efficiently, accurately, and intelligently.

In a first aspect, an embodiment of the application provides a method for detecting and processing genuine software based on big data analysis, comprising: obtaining installation source information of each piece of installed software in a client device to be detected; verifying the installation source information of each piece of installed software against a software installation authorization library to identify whether it has a potential piracy risk, wherein the software authorization library comprises a plurality of authorized genuine software names and the corresponding authorized installation channels; for each piece of risk installation software identified as having a potential piracy risk, obtaining its online service log and extracting multidimensional risk features from the log, wherein the multidimensional risk features comprise software online update response frequency, software function module usage frequency, software plug-in loading information, and software user behavior information; and inputting the multidimensional risk features of each piece of risk installation software into a piracy identification model to determine whether a piracy risk exists, wherein the piracy identification model is a deep learning model.

In a second aspect, an embodiment of the application provides a system for detecting and processing genuine software based on big data analysis, comprising an installation source acquisition unit, a piracy risk initial identification unit, a service log acquisition unit, and a piracy risk determination unit. The installation source acquisition unit is configured to obtain installation source information of each piece of installed software in a client device to be detected. The piracy risk initial identification unit is configured to verify the installation source information of each piece of installed software against a software installation authorization library to identify whether it has a potential piracy risk, wherein the software authorization library comprises a plurality of authorized genuine software names and the corresponding authorized installation channels. The service log acquisition unit is configured to obtain, for each piece of risk installation software identified as having a potential piracy risk, its online service log and to extract multidimensional risk features from the log, wherein the multidimensional risk features comprise software online update response frequency, software function module usage frequency, software plug-in loading information, and software user behavior information. The piracy risk determination unit is configured to input the multidimensional risk features of each piece of risk installation software into a piracy identification model to determine whether a piracy risk exists, wherein the piracy identification model is a deep learning model.

In a third aspect, an electronic device is provided that includes at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the big data analysis based method of the present application.

In a fourth aspect, an embodiment of the present application provides a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method for detecting and processing genuine software based on big data analysis of any of the embodiments of the present application.

In a fifth aspect, embodiments of the present application provide a computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method for detecting and processing genuine software based on big data analysis of any of the embodiments of the present application.

The method and the system for detecting and processing genuine software based on big data analysis provided by the application can produce at least the following technical effects:

(1) Big data analysis and deep learning turn traditional manual inspection and simple rule-based judgment into automated multidimensional data analysis. The software installation authorization library is used for preliminary verification of the software installation source information, and deep learning is then used to automatically identify the piracy risk of the potentially risky software. This greatly reduces the dependence on manual verification and is particularly suitable for managing large numbers of terminal devices in large enterprises and institutions, which improves management efficiency and avoids the omissions and misjudgments of manual operation.

(2) The technical scheme does not judge the software authorization status solely by traditional simple scanning and hard-coded rules; it also introduces a piracy identification model based on deep learning. The multidimensional risk features of the software (such as online update response frequency, function module usage frequency, and plug-in loading information) are extracted from the online service log, so that software usage behavior can be analyzed in depth and a variety of complex usage scenarios can be handled flexibly. This markedly improves the intelligence and flexibility of detection and avoids the limitations of detection by a single rule.

(3) The deep learning model can capture complex patterns and associations in large-scale data, effectively identifying potential risks that are difficult to find by rule decisions. By learning the relevance and regularity of various risk features, the model can autonomously optimize the recognition capability in the face of unknown software environments or continuously-changing software use behaviors, so that the overall detection effect is improved. Therefore, by the technical scheme, stronger dynamic adaptability can be realized, and new piracy means and technology can be identified in time, so that the continuously-changing piracy software environment can be effectively handled, and higher identification accuracy and robustness can be maintained.

With this technical scheme, big data analysis and the deep learning model allow massive amounts of data to be analyzed and processed quickly, so that whether installed software is genuine is detected automatically. Manual involvement is greatly reduced, which helps enterprises or institutions cut labor costs significantly. In addition, the detection system architecture scales well and can adapt to the complex software environments of large enterprises or government institutions.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 illustrates a flow chart of an example of a method for detecting and processing genuine software based on big data analysis according to an embodiment of the present application;

FIG. 2 illustrates a schematic diagram of structural connections of an example of a piracy identification model, in accordance with an embodiment of the present application;

FIG. 3 illustrates an operational flow diagram for an example of extraction of a timing feature matrix according to an embodiment of the application;

FIG. 4 illustrates an operational flow diagram of an example of adaptively determining convolution kernel scale weights based on information entropy weight calculation in accordance with an embodiment of the present application;

FIG. 5 illustrates a block diagram of an example of a system for detecting and processing genuine software based on big data analysis in accordance with an embodiment of the present application;

FIG. 6 is a schematic structural diagram of an embodiment of an electronic device of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In the technical scheme of the application, the related processes such as collection, storage, use, processing, transmission, provision, disclosure and the like of the personal information of the user accord with the regulations of related laws and regulations, and the public order is not violated.

FIG. 1 illustrates a flowchart of an example of a method for detecting and processing genuine software based on big data analysis according to an embodiment of the present application.

The method of the embodiment of the application may be executed by any controller or processor with computing or processing capability. By combining big data analysis, multidimensional risk feature extraction, and a deep learning model, it forms an intelligent and efficient method for detecting and processing genuine software that can improve detection accuracy, reduce manual intervention, and lower management costs.

In some examples, it may be integrally configured in an electronic device or terminal by means of software, hardware or a combination of software and hardware, and the type of terminal or electronic device may be diversified, such as a mobile phone, a tablet computer or a desktop computer, etc.

As shown in fig. 1, in step S110, installation source information of each installation software in the client device to be detected is acquired.

In some embodiments, the detection system (or detection tool) may first use a scanning tool (for example, API calls, system log analysis, or registry parsing) to automatically obtain information about the software installed on the client device, in particular its installation source information. The installation source information may take various forms, such as the download channel (e.g., official website download, third-party site download, USB installation, or network share), the installation time, and the installation path. The source information of every piece of installed software on the scanned client device needs to be collected accurately, so that all installed software is covered, the errors and omissions of manual recording are avoided, and detection efficiency and accuracy are improved.

In step S120, the installation source information of each installed software is respectively verified based on the software installation authorization library, so as to identify whether each installed software has a potential piracy risk.

It should be noted that the software authorization library includes a plurality of authorized genuine software names and the corresponding authorized installation channels. In addition, a system administrator can update and maintain the software authorization library periodically to ensure that the genuine software information is up to date and covers a wide range of authorized channels (such as official websites and trusted application stores).

Here, the installation source information of each piece of software is compared against the data in the software authorization library. Because the library contains the names of genuinely authorized software and the corresponding installation channels, comparing software names and installation channels shows whether a piece of software was installed through an authorized channel. For example, if a piece of software was installed through an unauthorized third-party channel, it is marked as risk installation software with a potential piracy risk. Comparison against the authorization library therefore screens out software with potential piracy risk quickly and accurately and reduces false alarms. It also avoids running the subsequent piracy risk analysis on all installed software, which improves piracy detection efficiency and reduces resource consumption.
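As an illustration of the comparison in step S120, the following is a minimal Python sketch, not the implementation of the application. The library contents, the `InstalledSoftware` record, and the channel labels are hypothetical, and treating software whose name is absent from the library as risky is an assumption made here for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

# Hypothetical authorization library: genuine software name -> authorized channels.
AUTHORIZATION_LIBRARY: Dict[str, Set[str]] = {
    "ExampleOffice": {"official_website", "trusted_app_store"},
    "ExampleCAD": {"official_website"},
}


@dataclass
class InstalledSoftware:
    name: str
    install_channel: str  # e.g. "official_website", "third_party_site", "usb"


def has_potential_piracy_risk(item: InstalledSoftware,
                              library: Dict[str, Set[str]]) -> bool:
    """Return True when the name/channel pair is not found in the library."""
    authorized_channels = library.get(item.name)
    if authorized_channels is None:
        # Assumption: software unknown to the library is also flagged.
        return True
    return item.install_channel not in authorized_channels


def screen_installed_software(installed: List[InstalledSoftware],
                              library: Dict[str, Set[str]]) -> List[InstalledSoftware]:
    """Keep only the risk installation software for the later analysis steps."""
    return [item for item in installed if has_potential_piracy_risk(item, library)]
```

Only the items returned by `screen_installed_software` would go on to the log analysis of step S130, which is what keeps the more expensive model-based check off the software that already passed the channel comparison.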

In step S130, for each identified risk installation software that is at risk of potential piracy, an online service log of the risk installation software is obtained, and multidimensional risk features are extracted from the online service log.

Here, for potentially pirated software whose installation source could not be verified as authorized, the online service log is obtained through a network interface. The online service log comprises the communication records between the software and the server during use, update response logs, plug-in loading logs, and the like. Multidimensional risk features are then extracted from these logs, including software online update response frequency, software function module usage frequency, software plug-in loading information, and software user behavior information.

The online update response frequency indicates whether the software is updated regularly and whether its updates coincide with the update cycle of the genuine software. Genuine software usually receives regular update pushes, and users update at the cadence set by the vendor. Pirated software may lack such update services or show abnormal update response behavior (e.g., updates skipped from time to time, or failed updates), so analyzing the user's update response behavior helps identify software piracy risk.

The function module usage frequency reflects how often the core functions of the software are used and whether that usage matches the normal usage behavior of genuine software. The function modules of genuine software generally run stably and completely, and a user can access and use all of them. Pirated software may have missing or unstable functions. Recording data such as the number of calls and the run success rate of each function module makes it possible to detect whether the user has gone without access to advanced functions for a long time, or whether abnormal behavior such as crashes and errors occurs frequently.

The plug-in loading information shows whether the software has loaded unauthorized plug-ins or extension modules. Some pirated software bypasses the security mechanisms of the genuine software through plug-ins, illegal patches, or cracking tools. These tend to tamper with the file structure of the software or load illegal plug-ins, so the software behaves differently from the genuine software at startup or during operation. Analyzing the file structure, memory usage, and plug-in loading makes it possible to extract the characteristics of illegal plug-ins.

The software user behavior information records user operations so that abnormal operations or abnormal usage patterns can be evaluated. The operating habits of genuine users, such as usage frequency and operating time periods, are relatively stable. Pirate users may show abnormal behavior such as frequent device switching, cross-region usage, and unusual operating time periods. Modeling the time, frequency, and other properties of user operations helps uncover piracy risk that does not match the behavior patterns of genuine users.

Introducing multidimensional risk features designed along these dimensions gives a better understanding of how the software runs and how users use it, which further improves the accuracy of piracy identification and makes it possible to identify software that appears legitimate but may in fact carry piracy characteristics.
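The feature extraction of step S130 could be sketched roughly as follows. The log schema (event dictionaries with a `type` field), the feature names, and the 08:00 to 20:00 working-hours window are all assumptions made for this illustration; the application does not specify a log format or concrete feature definitions.

```python
from collections import Counter
from typing import Dict, List, Set


def extract_risk_features(log: List[dict], authorized_plugins: Set[str]) -> Dict[str, object]:
    """Derive four illustrative risk features from a hypothetical event log."""
    # Online update response: share of update pushes the client responded to.
    pushes = [e for e in log if e["type"] == "update_push"]
    responded = [e for e in pushes if e.get("responded")]
    update_response_rate = len(responded) / len(pushes) if pushes else 0.0

    # Function module usage: number of calls per module.
    module_call_counts = Counter(e["module"] for e in log if e["type"] == "module_call")

    # Plug-in loading: loaded plug-ins that are not on the authorized list.
    loaded_plugins = {e["plugin"] for e in log if e["type"] == "plugin_load"}
    unauthorized_plugins = sorted(loaded_plugins - authorized_plugins)

    # User behavior: share of operations outside an assumed 08:00-20:00 window.
    operations = [e for e in log if e["type"] == "user_op"]
    off_hours = [e for e in operations if not 8 <= e["hour"] < 20]
    off_hours_ratio = len(off_hours) / len(operations) if operations else 0.0

    return {
        "update_response_rate": update_response_rate,
        "module_call_counts": dict(module_call_counts),
        "unauthorized_plugins": unauthorized_plugins,
        "off_hours_ratio": off_hours_ratio,
    }
```

In practice the first two features would be computed per time window so that they form a sequence, since the model described later treats them as time-series inputs.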

In step S140, the multidimensional risk features of each risk installation software are respectively input to a piracy identification model, so as to correspondingly determine whether the piracy risk exists, wherein the piracy identification model adopts a deep learning model.

Here, the piracy identification model can be trained in advance on a sample set containing a large amount of historical data and labeling information, so that it can accurately distinguish the behavior patterns of genuine software from those of pirated software.

In some embodiments, during the inference stage, the piracy identification model analyzes the input multidimensional risk features to generate a piracy risk score or a piracy risk probability value; if the probability value exceeds a preset threshold, the software is determined to be pirated. In addition, the deep learning model can be optimized continuously: retraining on new data improves its judgment accuracy and generalization capability.

With the embodiment of the application, the whole process from data collection to risk identification is highly automated, which greatly improves detection efficiency. First, the installation source information is used for global screening to identify the risk installation software on the client. Each piece of risk installation software is then analyzed in depth by a deep learning model combined with multidimensional feature analysis to identify potential pirated software, which improves detection accuracy. In addition, by continuing to train the deep learning model and updating the software authorization library, the detection system can quickly adapt to new software and new piracy techniques on the market while supporting the scaling needs of large enterprise environments.

In some examples of the embodiments of the present application, the piracy identification model is a multi-modal neural network that makes combined use of several types of risk features. Specifically, it analyzes both the time-series relationships among the time-series features and the nonlinear relationships among the static features in order to predict the final risk probability value that the software is pirated.

FIG. 2 illustrates a structural connection diagram of an example of a piracy identification model, according to an embodiment of the application.

As shown in FIG. 2, the multi-modal neural network 200 includes a CNN (Convolutional Neural Network) module 210, an MLP (Multi-Layer Perceptron) module 220, a feature fusion module 230, and a classification module 240.

The CNN module 210 is configured to extract a local dependency relationship between a software online update response frequency and a software function module usage frequency in the multidimensional risk feature, so as to generate a corresponding time sequence pattern feature.

Specifically, the CNN module first reshapes the software online update response frequency and the software function module usage frequency into a time-series structure suitable for convolution. Local features are then extracted from the time-series data by a plurality of convolution layers: the convolution kernels slide over the sequence and capture local correlations and frequency change patterns, such as periodic fluctuations in the online update response frequency or peaks and valleys in the function module usage frequency. Stacking convolution layers extracts higher-level time-series pattern features layer by layer and finally produces the time-series pattern features.
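The sliding-window operation of a single convolution layer can be illustrated with the pure-Python sketch below. It assumes the two frequencies are sampled at the same time steps and uses a fixed, hand-written kernel; in a trained CNN the kernels are learned and there are many of them. The function names are hypothetical.

```python
from typing import List, Sequence


def build_time_series(update_freq: Sequence[float],
                      module_freq: Sequence[float]) -> List[List[float]]:
    """Stack the two frequency sequences into a T x 2 time-series matrix."""
    if len(update_freq) != len(module_freq):
        raise ValueError("both sequences must cover the same time steps")
    return [[float(u), float(m)] for u, m in zip(update_freq, module_freq)]


def conv1d_relu(series: List[List[float]],
                kernel: List[List[float]],
                bias: float = 0.0) -> List[float]:
    """Slide a (k x channels) kernel over time without padding, then apply ReLU."""
    k = len(kernel)
    outputs = []
    for t in range(len(series) - k + 1):
        acc = bias
        for i in range(k):
            for c, weight in enumerate(kernel[i]):
                acc += series[t + i][c] * weight
        outputs.append(max(0.0, acc))  # ReLU keeps only positive responses
    return outputs
```

For example, the kernel `[[0, -1], [0, 1]]` responds to a step-to-step rise in the second channel (module usage), so peaks in usage show up as positive activations. Feeding the output of one such layer into another is what the layer stacking described above amounts to.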

The MLP module 220 is configured to extract nonlinear relationships between software plug-in loading information and software user behavior information in the multidimensional risk feature to generate a corresponding static pattern feature.

Specifically, the MLP module handles the static information contained in the multidimensional risk features: it processes the plug-in loading information and the user behavior information and extracts the complex, high-dimensional nonlinear patterns among these features. The MLP module includes an input layer, a hidden layer, and an output layer. The input layer takes the static features and embeds them into a high-dimensional space. In the hidden layer, the multi-layer fully connected structure captures the nonlinear relationships among complex features. For example, plug-in loading information may be linked to user operation behavior in complex, latent ways, and the hidden layers extract these deep associations layer by layer. Finally, the MLP module outputs a feature vector representing the learned combined pattern of the static features, i.e., the static pattern features.

It should be noted that, although the software plug-in loading information and the software user behavior information are static, they often imply complex software usage habits and behavior patterns. The MLP module can capture complex nonlinear relationships between plug-in loading and user behavior, such as combinations of different plug-ins and their loading order, which strengthens the ability to identify pirated software. Potential piracy risk under complex behavior patterns can therefore be identified better; in particular, when plug-in loading fails or user operation behavior is abnormal, the model can accurately detect the characteristics of pirated software and reduce false alarms.
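A forward pass through such an input/hidden/output structure can be sketched as follows. The weights are placeholders chosen for illustration (a real model learns them), and the use of ReLU in the hidden layers is an assumption, since the application does not name the hidden activation.

```python
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def dense(x: Sequence[float], weights: List[List[float]], biases: List[float]) -> List[float]:
    """Fully connected layer: one output per weight row."""
    return [sum(xi * w for xi, w in zip(x, row)) + b for row, b in zip(weights, biases)]


def relu(values: Sequence[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def mlp_forward(static_features: Sequence[float], layers: List[Layer]) -> List[float]:
    """Apply each layer in turn; ReLU follows every layer except the last."""
    hidden = list(static_features)
    for index, (weights, biases) in enumerate(layers):
        hidden = dense(hidden, weights, biases)
        if index < len(layers) - 1:
            hidden = relu(hidden)
    return hidden
```

The nonlinearity between layers is what lets the network represent interactions such as "this plug-in combination matters only when usage occurs at unusual hours", which a single linear layer cannot express.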

By combining the CNN module and the MLP module, time sequence features (such as online update response frequency and use frequency of the functional module) and static features (such as plug-in loading information and user operation behaviors) can be processed simultaneously, so that the local mode and the global nonlinear relation between the features can be fully mined and modeled, and the problem that a single model cannot capture a complex behavior mode is avoided. Therefore, the deep mining enables the system to comprehensively analyze the software behaviors, so that the accuracy of identifying pirated software is improved, and particularly, the system has strong identification capability on software with abnormal functions, abnormal updating response and unstable loading of plug-ins.

The feature fusion module 230 is configured to fuse the time sequence mode feature and the static mode feature to generate corresponding fusion features.

Here, the feature fusion module may fuse the time-series pattern features and the static pattern features in various non-limiting ways, such as simple feature concatenation or weighted fusion, so that the fused features contain information from different dimensions and different time scales and provide richer contextual feature data. Fusing the time-series pattern features with the static pattern features makes combined use of the multi-modal information, gives a fuller picture of software behavior, and allows different types of pirated software and user behavior patterns to be handled effectively.
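The two fusion options named above can be written in a few lines. This is a sketch only: the mixing weight `alpha` is a hypothetical parameter, and the weighted variant assumes that both feature vectors have already been brought to the same length.

```python
from typing import List, Sequence


def concat_fusion(time_series_features: Sequence[float],
                  static_features: Sequence[float]) -> List[float]:
    """Feature concatenation: the fused vector keeps every input dimension."""
    return list(time_series_features) + list(static_features)


def weighted_fusion(time_series_features: Sequence[float],
                    static_features: Sequence[float],
                    alpha: float = 0.5) -> List[float]:
    """Element-wise weighted sum; requires vectors of equal length."""
    if len(time_series_features) != len(static_features):
        raise ValueError("weighted fusion needs vectors of equal length")
    return [alpha * t + (1.0 - alpha) * s
            for t, s in zip(time_series_features, static_features)]
```

Concatenation preserves all information but enlarges the classifier input, whereas weighted fusion keeps the dimension fixed at the cost of mixing the two sources.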

The classification module 240 is configured to perform classification processing on the fusion feature through the full connection layer, so as to output a risk probability value of the risk installation software belonging to pirated software.

Here, the classification module may use a typical fully connected network to classify the fusion features. Specifically, the high-dimensional fusion features pass through several fully connected layers, and the last layer applies a Sigmoid activation function to generate the risk probability value that the risk installation software is pirated. Whether the corresponding software is pirated is then judged from the magnitude of the risk probability value. This deep-learning-based classification remains efficient when processing large amounts of data, produces piracy detection results quickly, and can provide real-time pirated software detection services for enterprises and users.
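The final Sigmoid layer and the threshold decision can be sketched as below. The single output layer, its weights, and the default threshold of 0.5 are illustrative assumptions; the application only states that a preset threshold is used.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Numerically stable logistic function mapping any real number into (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def piracy_probability(fused_features: Sequence[float],
                       weights: Sequence[float],
                       bias: float) -> float:
    """Last fully connected layer followed by Sigmoid."""
    logit = sum(w * x for w, x in zip(weights, fused_features)) + bias
    return sigmoid(logit)


def is_pirated(probability: float, threshold: float = 0.5) -> bool:
    """Flag the software when the probability exceeds the preset threshold."""
    return probability > threshold
```

Raising the threshold trades missed detections for fewer false alarms, so the value would normally be tuned on labeled validation data.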

According to the embodiment of the application, a multi-mode neural network structure is adopted, the system can be expanded according to the increase or change of feature dimensions (such as adding new behavior features or plug-in loading modes), and meanwhile, the model can be self-adaptive to software behavior change under different use scenes through a deep learning continuous training mechanism. Therefore, the detection system has good expansibility and self-adaptive capacity, can be continuously optimized and upgraded along with the changes of software markets and user behaviors, and effectively meets the detection requirements of novel pirated software.

In some examples of embodiments of the application, feature fusion module 230 employs a self-attention mechanism based feature fusion module. By introducing a self-attention mechanism, the feature fusion module can perform fine-granularity interaction analysis on the time-series features and the static features, so that feature fusion is not a simple linear combination any more, but a complex relationship among the features is dynamically captured through the attention mechanism.

Specifically, the time-series pattern features and the static pattern features corresponding to a first piece of risk installation software are received. The feature matrix of the time-series pattern features is $X_s \in \mathbb{R}^{T \times F_1}$, where $T$ is the total number of time steps and $F_1$ is the dimension of the time-series features; the feature matrix of the static pattern features is $X_p \in \mathbb{R}^{1 \times F_2}$, where $F_2$ is the dimension of the static features.

The time-series pattern features and the static pattern features are each projected through linear transformation matrices to generate the corresponding query, key, and value vectors:

$Q_s = X_s W_Q^s,\quad K_s = X_s W_K^s,\quad V_s = X_s W_V^s$, formula (1)

$Q_p = X_p W_Q^p,\quad K_p = X_p W_K^p,\quad V_p = X_p W_V^p$, formula (2)

In the formulas, $W_Q^s, W_K^s, W_V^s \in \mathbb{R}^{F_1 \times d}$ are the linear transformation weight matrices for the query, key, and value of the time-series features; $W_Q^p, W_K^p, W_V^p \in \mathbb{R}^{F_2 \times d}$ are the linear transformation weight matrices for the query, key, and value of the static features; $d$ is the dimension of the latent features and is used to unify the representation of all input features; $Q_s, K_s, V_s$ are the query, key, and value vectors corresponding to the time-series features; and $Q_p, K_p, V_p$ are the query, key, and value vectors corresponding to the static features.

For the time sequence features and the static features, self-attention is computed separately:

$$Z_s = \mathrm{Attention}(Q_s, K_s, V_s) \tag{3}$$

$$Z_p = \mathrm{Attention}(Q_p, K_p, V_p) \tag{4}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right) V \tag{5}$$

where Attention(·) denotes the dot-product attention mechanism; $\sqrt{d}$ is the scaling factor, which prevents large dot-product values from unduly affecting the gradient computation; softmax denotes the Softmax activation function, which converts the correlation values into weight coefficients that are used to form a weighted sum of $V$ and so generate the fused feature representation; $K^{T}$ is the transpose of the key vector $K$; $Z_s$ is the time sequence feature self-attention representation; and $Z_p$ is the static feature self-attention representation.

Here, through the self-attention mechanism, a weight may be automatically assigned according to the importance of each feature in the current scene. For features with a higher risk of piracy, such as abnormal modes of online software update frequency, the model can give the features higher weight, ensuring that the key features have a greater influence on the final judgment. In addition, dynamic weight adjustment of the self-attention mechanism ensures that the model is more sensitive to the response of key features. For example, if a piece of software is not updated normally for a long time or the operating frequency of a user is abnormally high, the model can focus attention on the features through an adaptive mechanism, so that the accuracy of piracy detection is improved.

The interaction between the time sequence features and the static features is further captured by mixing their query vectors and key vectors:

$$Z_{sp} = \mathrm{Attention}(Q_s, K_p, V_p) \tag{6}$$

$$Z_{ps} = \mathrm{Attention}(Q_p, K_s, V_s) \tag{7}$$

where $Z_{sp}$ represents the first interactive attention feature, of the time sequence features to the static features, and $Z_{ps}$ represents the second interactive attention feature, of the static features to the time sequence features.

It should be noted that, among multi-modal features, the interaction between time sequence features and static features is generally complex. Through the attention mechanism, the model can establish deep correlations between the time sequence features and the static features and capture potential interaction patterns between them. For example, fluctuations in the online update frequency may be linked to plug-in loading failures or to frequent user operations. Collaborative attention analysis across the time sequence features and the static features therefore helps the model understand the logic behind certain complex behavior patterns. For example, when a plug-in loading failure coincides with an abnormal update frequency, the detection system can capture the deep association between these behaviors, making it considerably more likely that pirated software is correctly identified.

The timing feature self-attention representation, the static feature self-attention representation, the first interactive attention feature and the second interactive attention feature are then fused:

$$Z_{final} = Z_s + Z_p + Z_{sp} + Z_{ps} \tag{8}$$

where $Z_{final}$ represents the fusion feature corresponding to the first risk installation software.
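By way of non-limiting illustration, the attention-based fusion of formulas (3) to (8) may be sketched in Python as follows. The helper names (`matmul`, `softmax`, `attention`, `fuse`), the list-of-rows matrix representation, and the broadcasting of the single static row across the T time steps when the four representations are summed are assumptions made for this sketch; the embodiment does not prescribe them.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def fuse(x_s, x_p, w_s, w_p):
    """Fuse timing features x_s (T x F1) with static features x_p (1 x F2).

    w_s and w_p each hold three projection matrices: (query, key, value).
    """
    q_s, k_s, v_s = (matmul(x_s, w) for w in w_s)
    q_p, k_p, v_p = (matmul(x_p, w) for w in w_p)
    z_s = attention(q_s, k_s, v_s)    # T x d, timing self-attention
    z_p = attention(q_p, k_p, v_p)    # 1 x d, static self-attention
    z_sp = attention(q_s, k_p, v_p)   # T x d, timing attends to static
    z_ps = attention(q_p, k_s, v_s)   # 1 x d, static attends to timing
    # Assumption: the single static row is broadcast across all T time steps.
    return [
        [a + b + c + e for a, b, c, e in zip(z_s[t], z_p[0], z_sp[t], z_ps[0])]
        for t in range(len(z_s))
    ]
```

In this sketch, a call such as `fuse(x_s, x_p, w_s, w_p)` with T = 3 and d = 2 returns a 3 × 2 fused matrix. When the key and value side has only one row, the softmax weight is 1 and the attention output equals that value row.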

According to the embodiment of the application, the self-attention mechanism gives the feature fusion module adaptive capability: the importance of each feature is adjusted automatically as the data change, so the model is more robust and extensible when facing diverse software usage scenarios. In addition, dynamic weighting and multidimensional interaction capture more accurate feature information, and the fused output has higher information density and relevance, so that piracy detection results are more accurate and the false alarm rate is lower.

In some examples of embodiments of the present application, the CNN module employs a multi-scale convolutional network module in which a plurality of different scale convolutional kernels are introduced to capture characteristic behavior patterns of both short-term fine granularity and long-term coarse granularity.

Specifically, convolution kernels of different scales act on the input feature matrix to generate a plurality of different feature maps reflecting short-term and long-term behavior patterns. For example, a smaller convolution kernel can better capture feature changes over a short time period, while a larger convolution kernel can capture features over a long time span.

$$F_i = \mathrm{Conv}\left(X_t, W_{k_i}\right), \qquad F_{multi} = \sum_{i=1}^{N} \alpha_i F_i$$

where $F_i$ is the feature matrix generated by the convolution kernel of scale $k_i \times k_i$; $W_{k_i}$ is the convolution kernel weight matrix of scale $k_i \times k_i$; $X_t$ is the input time sequence feature matrix; Conv(·) denotes convolution processing; $N$ is the total number of convolution kernels; $\alpha_i$ is the convolution kernel scale weight corresponding to the convolution kernel of scale $k_i \times k_i$; and $F_{multi}$ is the fused multi-scale feature matrix.

Here, the features at different scales are fused by summation, and the influence of each scale on the final fused feature is controlled by the convolution kernel scale weight $\alpha_i$, so that the model can flexibly adjust the importance of each scale and better capture the behavior patterns of pirated software.

It should be noted that the convolution kernel scale weights may be determined in various ways: they may be preset through user-supplied parameters, or they may be treated as learnable parameters and adjusted during the iterations of model training. Both fall within the implementation scope of the embodiment of the present application.

According to the embodiment of the application, the multi-scale convolution network module is adopted to carry out multi-scale convolution and feature fusion, so that a multi-level behavior mode from short-term abnormality to long-term stagnation behavior can be captured, behavior features with different time granularities can be identified at the same time, the detection capability of the model on complex behavior modes is improved, and particularly, the model is more sensitive to fine abnormality of pirate software in different periods. In addition, the importance of different convolution kernel scales can be automatically adjusted by the model according to actual scenes through the convolution kernel scale weights, and the adaptability of feature expression is enhanced.
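By way of non-limiting illustration, the multi-scale convolution and weighted summation described above may be sketched in Python as follows. The function names, the zero padding that keeps every feature map the same size as the input, and the restriction to odd-sized square kernels are assumptions made for this sketch; in practice the kernel weights would be learned during training.

```python
def conv2d_same(x, kernel):
    """Zero-padded 2D cross-correlation (the CNN 'convolution'); output has the size of x."""
    rows, cols = len(x), len(x[0])
    k = len(kernel)  # assumption: square kernel with odd side length
    pad = k // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += x[rr][cc] * kernel[i][j]
            out[r][c] = acc
    return out


def multi_scale_features(x_t, kernels, alphas):
    """Return sum_i alpha_i * Conv(x_t, W_i) together with the per-scale maps F_i."""
    maps = [conv2d_same(x_t, w) for w in kernels]
    fused = [
        [sum(a * m[r][c] for a, m in zip(alphas, maps)) for c in range(len(x_t[0]))]
        for r in range(len(x_t))
    ]
    return fused, maps
```

A small kernel (for example 1 × 1) responds to single time windows, while a larger kernel (for example 3 × 3) aggregates neighbouring windows; the weights in `alphas` decide how much each scale contributes to the fused matrix.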

FIG. 3 illustrates an operational flow diagram for an example of extraction of a timing feature matrix according to an embodiment of the application.

In step S310, a time sequence analysis is performed on the online service log according to a preset time window size to extract a time sequence feature set.

Here, the time sequence feature group includes time sequence features corresponding to a preset number of time windows, and the time sequence features include a software online update response frequency and a software function module use frequency, so that feature batches are constructed through the multiple time windows to provide a data basis for subsequent dependency analysis.
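By way of non-limiting illustration, step S310 may be sketched in Python as follows. The log representation (a list of `(timestamp, kind)` pairs) and the event kind names `update_response` and `module_use` are hypothetical; the embodiment does not specify the format of the online service log.

```python
def window_frequencies(events, start, window_size, num_windows,
                       kinds=("update_response", "module_use")):
    """Count events of each kind per time window.

    events: iterable of (timestamp, kind) pairs; timestamps and window_size
    share one unit. Returns a num_windows x len(kinds) matrix of counts.
    """
    counts = [[0] * len(kinds) for _ in range(num_windows)]
    column = {kind: i for i, kind in enumerate(kinds)}
    for timestamp, kind in events:
        index = int((timestamp - start) // window_size)
        # Events outside the observed span or of an unrelated kind are skipped.
        if 0 <= index < num_windows and kind in column:
            counts[index][column[kind]] += 1
    return counts
```

Each row of the returned matrix is one time window, and each column is one time sequence feature (update response frequency, functional module use frequency).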

In step S320, the standard deviation of each timing feature and the covariance between different timing features are calculated based on the timing feature group to analyze the interdependencies between the different timing features.

Here, the correlation between individual features is measured with the Pearson correlation coefficient (PCC) to quantify the dependencies between different features:

$$\rho_{uv} = \frac{\mathrm{Cov}(f_u, f_v)}{\sigma_{f_u} \sigma_{f_v}}$$

where $\mathrm{Cov}(f_u, f_v)$ is the covariance between timing feature $f_u$ and timing feature $f_v$; $\sigma_{f_u}$ and $\sigma_{f_v}$ represent the standard deviation of $f_u$ and the standard deviation of $f_v$, respectively; and $\rho_{uv}$ represents the interdependence between $f_u$ and $f_v$.

Here, the standard deviation measures the volatility of each feature in the sample set and reflects whether the feature changes significantly. A feature with a large standard deviation varies sharply, may carry richer information, and should be given a greater weight in the feature fusion process. The covariance measures the common trend of two features in the sample set: two features are positively correlated if they increase or decrease together, and negatively correlated if one increases while the other decreases. Covariance therefore identifies strongly correlated features, and the joint variation of such features may be an important indicator of a piracy pattern. For example, if the update frequency and the functional module usage frequency change together, this may indicate a particular type of pirated software behavior.

Thus, by means of the Pearson correlation coefficient, potential relationships between timing features and static features can be identified, so that features with a strong correlation are weighted more heavily in piracy detection.

In step S330, a correction weight for each timing feature is calculated from the correlation between different timing features.

The weight of each feature is adjusted dynamically according to its dependence on the other features: the weight $\beta_u$ of feature $f_u$ is determined by summing its correlations $\rho_{uv}$ with the other features $f_v$ and normalizing the result over all $S$ features, where $\beta_u$ is the weight of feature $f_u$ and $S$ is the total number of features.

It should be noted that features from different time periods may have different influence, and dynamic adjustment of the weights enables the model to adapt to changes in the time sequence features, especially in pirated software detection scenarios with complex behavior patterns. Unlike static weight allocation, this embodiment uses a dynamic weighting mechanism that takes the interdependencies between features into account, so that features that are highly correlated with other features receive higher weights, which reflect their importance.

In step S340, the corresponding timing characteristics are respectively weighted and corrected based on the correction weights, so as to obtain a timing characteristic matrix.

Finally, all the features are fused by weighted summation to obtain the final fused time sequence feature matrix:

$$X_t = \sum_{u=1}^{S} \beta_u \cdot f_u$$

According to the embodiment of the application, the specific behavior patterns in certain time periods can be identified through calculation of the standard deviation and the covariance. When a plurality of time sequence characteristics (such as the use frequency and the update frequency of the functional module) are changed together in different time periods, the interaction modes among the complex time sequence characteristics can be identified, and the detection comprehensiveness is further improved. Therefore, based on covariance and correlation analysis, the time sequence features are not simply weighted any more during fusion, but the weight distribution in the fusion process is dynamically adjusted by calculating the dependence of the time sequence features on other features, so that the rationality of the time sequence features in the whole feature fusion is ensured, and the response capability of the detection system to dynamic behaviors is improved.
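By way of non-limiting illustration, steps S320 to S340 may be sketched in Python as follows. The use of population statistics, the use of the absolute value of each correlation when summing, the handling of constant features, and the uniform fallback are assumptions made for this sketch; the exact normalization used by the embodiment is not reproduced here.

```python
import math


def pearson(f_u, f_v):
    """Pearson correlation coefficient: Cov(f_u, f_v) / (sigma_u * sigma_v)."""
    n = len(f_u)
    mean_u, mean_v = sum(f_u) / n, sum(f_v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(f_u, f_v)) / n
    std_u = math.sqrt(sum((a - mean_u) ** 2 for a in f_u) / n)
    std_v = math.sqrt(sum((b - mean_v) ** 2 for b in f_v) / n)
    if std_u == 0.0 or std_v == 0.0:
        return 0.0  # assumption: a constant feature is treated as uncorrelated
    return cov / (std_u * std_v)


def correction_weights(features):
    """Weight each feature by its summed |correlation| with the others, normalized to sum to 1."""
    s = len(features)
    raw = [
        sum(abs(pearson(features[u], features[v])) for v in range(s) if v != u)
        for u in range(s)
    ]
    total = sum(raw)
    if total == 0.0:
        return [1.0 / s] * s  # assumption: fall back to uniform weights
    return [r / total for r in raw]


def fuse_timing_features(features):
    """Weighted sum over features, giving one fused value per time window."""
    betas = correction_weights(features)
    windows = len(features[0])
    fused = [sum(b * f[t] for b, f in zip(betas, features)) for t in range(windows)]
    return fused, betas
```

Here each entry of `features` is one time sequence feature observed over the same time windows, for example the update response frequency and the functional module use frequency.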

In some preferred implementations of embodiments of the application, the convolution kernel scale weights are adaptively determined by means of weight calculations based on the information entropy (Entropy-based Weighting). The weight is distributed according to the information quantity (namely the information entropy) in the feature matrix generated by each convolution kernel scale by adopting the weight calculation mode of the information entropy, so that the feature matrix with high information quantity can be ensured to obtain higher weight, and the model is helped to focus on more important features.

FIG. 4 illustrates an operational flow diagram of an example of adaptively determining convolution kernel scale weights based on information entropy weight calculation in accordance with an embodiment of the present application.

As shown in fig. 4, in step S410, a probability distribution is constructed from the feature values in each scale's feature matrix:

$$p\left(x_j^{(i)}\right) = \frac{n\left(x_j^{(i)}\right)}{\sum_{j} n\left(x_j^{(i)}\right)}$$

In the formula, $x_j^{(i)}$ is the feature value of the j-th element in the feature matrix $F_i$; $p(x_j^{(i)})$ is the probability of the feature value $x_j^{(i)}$; $n(x_j^{(i)})$ is the frequency with which the feature value $x_j^{(i)}$ occurs; and $\sum_{j} n(x_j^{(i)})$ is the total frequency of all feature values in the feature matrix $F_i$.

In step S420, the information entropy of each scale's feature matrix is calculated:

$$H(F_i) = -\sum_{j} p\left(x_j^{(i)}\right) \log p\left(x_j^{(i)}\right)$$

where $H(F_i)$ represents the information entropy of the feature matrix $F_i$.

In step S430, the convolution kernel scale weight corresponding to each convolution kernel is calculated:

$$\alpha_i = \frac{H(F_i)}{\sum_{q=1}^{N} H(F_q)}$$

where $\alpha_i$ represents the convolution kernel scale weight for scale $k_i \times k_i$.

In this way, the weight of each convolution kernel scale can be dynamically adjusted according to the information entropy of the feature matrix, so that the feature matrix with high information entropy is ensured to occupy a larger proportion in the final fusion feature. Feature matrices with high information entropy typically contain more complex behavioral patterns or abnormal patterns, and the model can more sensitively capture potential piracy by giving these features higher weights. For example, during the use of software, the use frequency of the functional module in certain time periods is abnormally high, the information entropy can capture the complexity of the mode, the convolution kernel is given higher weight, and the sensitivity of the model to the mode is improved. In addition, the feature matrix with low information entropy usually contains noise or irrelevant information, and by automatically giving lower weight, the model can effectively ignore irrelevant features, so that the interference of noise on the model is reduced. For example, features that do not change over time may be considered as irrelevant behavior or noise, whereas such features may be given lower weight by information entropy calculation. In addition, the weight of each convolution kernel scale can dynamically change along with data, so that the model can flexibly cope with feature changes in different scenes, and the detection capability of the model on diversified behavior modes is improved.
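By way of non-limiting illustration, steps S410 to S430 may be sketched in Python as follows. Rounding the feature values before counting them (so that continuous values can share a bin), the natural logarithm, and the uniform fallback when every entropy is zero are assumptions made for this sketch.

```python
import math
from collections import Counter


def information_entropy(feature_map, decimals=2):
    """Shannon entropy of the value distribution in one feature matrix.

    Values are rounded to `decimals` places before counting; this
    discretization is an assumption of the sketch.
    """
    values = [round(v, decimals) for row in feature_map for v in row]
    counts = Counter(values)
    total = len(values)
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def entropy_scale_weights(feature_maps):
    """alpha_i = H(F_i) / sum_q H(F_q); uniform when every entropy is zero."""
    entropies = [information_entropy(m) for m in feature_maps]
    total = sum(entropies)
    if total == 0.0:
        return [1.0 / len(feature_maps)] * len(feature_maps)
    return [h / total for h in entropies]
```

A feature matrix whose values never change has zero entropy and therefore receives zero weight, while a matrix with a more varied value distribution receives a proportionally larger weight.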

Further, in some examples of embodiments of the application, the loss function of the piracy identification model is a hybrid loss function that includes both a classification loss and a regularization loss based on information entropy. Specifically, the classification loss is a cross entropy loss function, which measures the difference between the model's predicted value and the real label in the classification task, so that the model can accurately classify pirated software and genuine software. In addition, by integrating the information entropy regularization term, the model can dynamically adjust the convolution kernel scale weights according to the behavior of each sample during training, which prevents excessive dependence on certain highly weighted feature dimensions: the regularization term keeps the convolution kernel weights reasonably distributed, so that other feature dimensions are not ignored because the weights of certain convolution kernels become too large.

Illustratively, the loss function $L_{total}$ of the piracy identification model combines, over the $M$ data samples in the data sample set, the cross entropy loss of each sample with the information entropy regular loss of that sample scaled by $\lambda$.

Here, $L_{total}$ represents the loss function of the piracy identification model; $L_{CE}^{(m)}$ represents the cross entropy loss of the m-th sample; $L_{H}^{(m)}$ represents the information entropy regular loss of the m-th sample; $y_m$ is the actual class label of the m-th sample, where a label value of 0 represents a non-pirated software sample and a label value of 1 represents a pirated software sample; $\hat{y}_m$ is the model's predicted piracy risk probability for the m-th sample; $C$ is the total number of classification categories, with $C = 2$; $M$ is the total number of data samples in the data sample set; $\alpha_q^{(m)}$ is the weight of the m-th sample at convolution kernel scale $k_q \times k_q$; and $\lambda$ represents the loss balance hyper-parameter.

Here, λ may be a preset or a learnable hyper-parameter. By adjusting λ, the model can be tuned flexibly between classification accuracy and the distribution of feature weights. For example, in a scenario where classification accuracy matters more, the influence of the cross entropy loss may be increased, while in a scenario where the distribution of feature weights has a large impact on performance, the effect of the regularization term may be strengthened.

It should be noted that, by introducing a regular term based on information entropy, the model can dynamically adjust weights of different convolution kernel scales according to the characteristics of each sample, so that the model can automatically identify which characteristic dimensions and time points are more important. The mechanism ensures that the model can adaptively adjust weights according to the performance of features in the face of complex behavior patterns (e.g., frequency of use of software, update response frequency, etc.). For example, in detecting pirated software, a particular behavioral pattern (e.g., a plug-in loading exception or a functional module usage frequency exception) may be given a higher weight, thereby enhancing recognition. Meanwhile, the weight adjustment process can dynamically reduce dependence on low information quantity or noise characteristics, and the model is prevented from being misled by irrelevant characteristics, so that the overall robustness and stability are improved.

With the loss function of the embodiment of the application, the cross entropy loss ensures the accuracy of the model in the classification task, that is, the model can accurately distinguish pirated software from genuine software. The information entropy regularization term keeps the weight distribution reasonable during multi-scale feature fusion and prevents the model from depending excessively on certain features. The balance between classification performance and feature weight distribution can therefore be adjusted flexibly; optimizing the two losses together improves classification accuracy and feature weight distribution at the same time, finds a suitable feature fusion scheme more quickly, and enhances the adaptability of the model.
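By way of non-limiting illustration, a hybrid loss of this kind may be sketched in Python as follows. The averaging over samples, the sign convention of the regular term (the negative entropy of the scale weights, which is lowest when the weights are uniform), the default value of `lam`, and the function names are assumptions made for this sketch and are not taken from the embodiment.

```python
import math


def cross_entropy(y_true, p_pirated, eps=1e-12):
    """Two-class cross entropy for one sample; y_true is 0 (genuine) or 1 (pirated)."""
    p = min(max(p_pirated, eps), 1.0 - eps)  # clamp to keep log() finite
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


def entropy_regularizer(alphas, eps=1e-12):
    """Negative entropy of the scale weights: lowest when the weights are uniform."""
    return sum(a * math.log(a + eps) for a in alphas)


def hybrid_loss(labels, probs, scale_weights, lam=0.1):
    """Mean over samples of cross entropy plus lam times the entropy regular term."""
    per_sample = [
        cross_entropy(y, p) + lam * entropy_regularizer(a)
        for y, p, a in zip(labels, probs, scale_weights)
    ]
    return sum(per_sample) / len(per_sample)
```

Increasing `lam` penalizes samples whose scale weights concentrate on a single convolution kernel scale, while `lam = 0` reduces the loss to plain cross entropy.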

It should be noted that, for simplicity of description, the foregoing method embodiments are all illustrated as a series of acts combined, but it should be understood and appreciated by those skilled in the art that the present application is not limited by the order of acts, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application. In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.

FIG. 5 illustrates a block diagram of an example of a genuine software detection and processing system based on big data analysis in accordance with an embodiment of the present application.

As shown in fig. 5, the genuine software detection processing system 500 based on big data analysis includes an installation source acquisition unit 510, a piracy risk initial recognition unit 520, a service log acquisition unit 530, and a piracy risk determination unit 540.

The installation source obtaining unit 510 is configured to obtain installation source information of each installation software in the client device to be detected.

The piracy risk initial identifying unit 520 is configured to verify the installation source information of each installation software against a software installation authorization library, so as to identify whether each installation software has a potential piracy risk, where the software installation authorization library includes a plurality of authorized and pirated software names and the corresponding authorized software installation channels.

The service log obtaining unit 530 is configured to obtain, for each identified risk installation software with a potential piracy risk, an online service log of the risk installation software, and extract multidimensional risk features from the online service log, where the multidimensional risk features include a software online update response frequency, a software function module usage frequency, software plug-in loading information, and software user behavior information.

The piracy risk determining unit 540 is configured to input the multidimensional risk features of each risk installation software to a piracy identification model respectively, so as to determine whether the piracy risk exists correspondingly, where the piracy identification model adopts a deep learning model.
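By way of non-limiting illustration, the cooperation of units 510 to 540 may be sketched in Python as follows. All names (`has_potential_risk`, `detect`, `fetch_features`, `model`), the dictionary layout of the authorization library, the treatment of software absent from the library as risky, and the 0.5 decision threshold are hypothetical and serve only to show the two-stage flow.

```python
def has_potential_risk(software_name, install_source, authorization_library):
    """Stage 1: flag software whose installation source is not an authorized channel.

    authorization_library maps a software name to its set of authorized channels.
    Software absent from the library is flagged as well (an assumption).
    """
    authorized = authorization_library.get(software_name)
    return authorized is None or install_source not in authorized


def detect(installed, authorization_library, fetch_features, model, threshold=0.5):
    """Run both stages; return {name: (probability, is_pirated)} for flagged software only."""
    results = {}
    for name, source in installed.items():
        if not has_potential_risk(name, source, authorization_library):
            continue  # verified installation source: no further analysis
        features = fetch_features(name)   # stands in for unit 530 (log and features)
        probability = model(features)     # stands in for unit 540 (deep learning model)
        results[name] = (probability, probability >= threshold)
    return results
```

Only software that fails the installation source check reaches the more expensive log analysis and model inference, which mirrors the order of the units described above.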

In some embodiments, embodiments of the present application provide a non-transitory computer readable storage medium storing one or more programs that include execution instructions, which can be read and executed by an electronic device (including, but not limited to, a computer, a server, or a network device) to perform the steps of any of the above-described genuine software detection and processing methods based on big data analysis of the present application.

In some embodiments, embodiments of the present application also provide a computer program product comprising a computer program stored on a non-volatile computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the steps of any of the above-described genuine software detection and processing methods based on big data analysis.

In some embodiments, the present application also provides an electronic device comprising at least one processor and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the genuine software detection and processing method based on big data analysis.

Fig. 6 is a schematic diagram of the hardware structure of an electronic device for executing the genuine software detection and processing method based on big data analysis according to another embodiment of the present application. As shown in fig. 6, the device includes:

one or more processors 610 and a memory 620; one processor 610 is taken as an example in fig. 6.

The device for performing the genuine software detection and processing method based on big data analysis may further include an input device 630 and an output device 640.

The processor 610, the memory 620, the input device 630, and the output device 640 may be connected by a bus or by other means; connection by a bus is taken as an example in fig. 6.

The memory 620, as a non-volatile computer readable storage medium, can be used to store non-volatile software programs, non-volatile computer executable programs, and modules, such as the program instructions/modules corresponding to the genuine software detection and processing method based on big data analysis in the embodiment of the present application. By running the non-volatile software programs, instructions, and modules stored in the memory 620, the processor 610 executes the various functional applications and data processing of the server, that is, implements the genuine software detection and processing method based on big data analysis of the above method embodiments.

The memory 620 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the electronic device, etc. In addition, memory 620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 620 optionally includes memory remotely located relative to processor 610, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 630 may receive input digital or character information and generate signals related to user settings and function control of the electronic device. The output device 640 may include a display device such as a display screen.

The one or more modules are stored in the memory 620 and, when executed by the one or more processors 610, perform the genuine software detection and processing method based on big data analysis of any of the method embodiments described above.

The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.

The electronic device of the embodiments of the present application exists in a variety of forms including, but not limited to:

(1) Mobile communication devices, which are characterized by mobile communication functionality and whose main purpose is to provide voice and data communication. Such terminals include smart phones, multimedia phones, feature phones, low-end phones, and the like.

(2) Ultra-mobile personal computer devices, which belong to the category of personal computers, have computing and processing functions, and generally also support mobile internet access. Such terminals include PDA, MID, and UMPC devices.

(3) Portable entertainment devices, which can display and play multimedia content. Such devices include audio players, video players, handheld game consoles, e-book readers, smart toys, and portable vehicle navigation devices.

(4) Other electronic devices with data interaction functions, such as on-board devices mounted on vehicles.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.

It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present application, and not for limiting the same, and although the present application has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present application.

Claims

1. A method for detecting and processing genuine software based on big data analysis, comprising the following steps:

Acquiring installation source information of each installation software in the client equipment to be detected;

Verifying the installation source information of each installation software based on a software installation authorization library respectively so as to identify whether each installation software has potential piracy risks;

For each identified risk installation software with a potential piracy risk, acquiring an online service log of the risk installation software and extracting multidimensional risk features from the online service log, wherein the multidimensional risk features comprise a software online update response frequency, a software functional module use frequency, software plug-in loading information and software user behavior information; the online update response frequency is used for judging whether the software is updated regularly and in line with the update cycle of the genuine software; the functional module use frequency is used for analyzing how often the core functions of the software are used, so as to identify whether the usage matches the normal use behavior of the genuine software; the plug-in loading information is used for analyzing whether the software has loaded an unauthorized plug-in or extension module; and the software user behavior information is used for recording user operation behaviors so as to evaluate whether abnormal operations or abnormal usage patterns exist;

Respectively inputting the multidimensional risk characteristics of each risk installation software into a piracy identification model to correspondingly determine whether piracy risks exist;

the piracy identification model comprises:

The CNN module is used for extracting the local dependency relationship between the software online updating response frequency in the multidimensional risk characteristics and the use frequency of the software functional module so as to generate corresponding time sequence mode characteristics;

The MLP module is used for extracting nonlinear relations between software plug-in loading information and software user behavior information in the multidimensional risk features so as to generate corresponding static mode features;

the feature fusion module is used for fusing the time sequence mode features and the static mode features to generate corresponding fusion features;

The classification module is used for classifying the fusion characteristics through the full connection layer so as to output a risk probability value of the risk installation software belonging to pirate software;

the extracting of the time sequence feature matrix comprises the following steps:

Performing time sequence analysis on the online service log according to the preset time window size to extract a time sequence feature group, wherein the time sequence feature group comprises time sequence features corresponding to a preset number of time windows, and the time sequence features comprise software online update response frequency and software function module use frequency;

calculating a standard deviation of each time sequence feature and the covariance between different time sequence features based on the time sequence feature group, so as to analyze the interdependence between different time sequence features:

$$\rho_{uv} = \frac{\mathrm{Cov}(f_u, f_v)}{\sigma_{f_u} \sigma_{f_v}}$$

where $\mathrm{Cov}(f_u, f_v)$ is the covariance between timing feature $f_u$ and timing feature $f_v$; $\sigma_{f_u}$ and $\sigma_{f_v}$ represent the standard deviation of $f_u$ and the standard deviation of $f_v$, respectively; and $\rho_{uv}$ represents the interdependence between $f_u$ and $f_v$;

calculating the correction weight of each time sequence feature according to the correlation between different time sequence features:

applying a weighted correction to each corresponding time sequence feature based on its correction weight, to obtain the time sequence feature matrix X_t:
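The windowing and interdependence steps of the extraction procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: the log schema (a list of `(timestamp, kind)` pairs with `kind` being `"update"` or `"module"`), the function names, and the use of per-window counts as the frequency are all hypothetical. The interdependence is computed as covariance divided by the product of the standard deviations. The correction-weight step is not sketched, since its formula is not reproduced in the text.

```python
import math

def window_features(events, window_size, num_windows):
    """Count update responses and function-module uses in each time window.
    events: list of (timestamp_seconds, kind), a hypothetical log schema."""
    rows = [[0.0, 0.0] for _ in range(num_windows)]
    for timestamp, kind in events:
        index = int(timestamp // window_size)
        if not 0 <= index < num_windows:
            continue  # ignore events outside the analyzed period
        if kind == "update":
            rows[index][0] += 1.0
        elif kind == "module":
            rows[index][1] += 1.0
    return rows

def interdependence(rows, u, v):
    """Covariance of columns u and v divided by the product of their
    standard deviations (population form)."""
    n = len(rows)
    f_u = [row[u] for row in rows]
    f_v = [row[v] for row in rows]
    mean_u, mean_v = sum(f_u) / n, sum(f_v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(f_u, f_v)) / n
    sd_u = math.sqrt(sum((a - mean_u) ** 2 for a in f_u) / n)
    sd_v = math.sqrt(sum((b - mean_v) ** 2 for b in f_v) / n)
    if sd_u == 0.0 or sd_v == 0.0:
        return 0.0  # a constant feature carries no linear dependence
    return cov / (sd_u * sd_v)

# Three 10-second windows of made-up log events.
events = [(1, "update"), (2, "module"), (3, "module"),
          (11, "update"), (12, "update"),
          (13, "module"), (14, "module"), (15, "module"), (16, "module")]
group = window_features(events, window_size=10, num_windows=3)
rho = interdependence(group, 0, 1)
```

Returning 0.0 for a constant feature is a design choice of this sketch; an implementation could equally treat that case as undefined.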

2. The method of claim 1, wherein the CNN module employs a multi-scale convolution network module having a plurality of different scale convolution kernels introduced therein to capture characteristic behavior patterns of both short-term fine granularity and long-term coarse granularity simultaneously:

where F_i is the feature matrix generated by the convolution kernel of size k_i×k_i using the convolution kernel weight matrix of that scale; X_t denotes the input time sequence feature matrix; Conv(·) denotes the convolution operation; N is the total number of convolution kernels; and α_i is the convolution kernel scale weight corresponding to the convolution kernel of size k_i×k_i.
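To make the multi-scale idea concrete, the sketch below convolves X_t with kernels of several sizes and combines the resulting feature matrices F_i using the scale weights α_i. It is a hedged illustration: combining the F_i by a weighted sum is an assumption suggested by the symbols N and α_i, and the zero padding, odd square kernels, and function names are choices of this sketch rather than details from the claim.

```python
def conv2d_same(x, kernel):
    """Zero-padded 2D cross-correlation whose output has the same shape as x.
    Assumes a square kernel with an odd side length."""
    k = len(kernel)
    pad = k // 2
    rows, cols = len(x), len(x[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += kernel[i][j] * x[rr][cc]
            out[r][c] = acc
    return out

def multi_scale_features(x_t, kernels, alphas):
    """Return the per-scale feature matrices F_i and their weighted sum.
    The weighted-sum combination is an assumption of this sketch."""
    feature_maps = [conv2d_same(x_t, w) for w in kernels]
    rows, cols = len(x_t), len(x_t[0])
    combined = [[sum(a * f[r][c] for a, f in zip(alphas, feature_maps))
                 for c in range(cols)] for r in range(rows)]
    return feature_maps, combined

# Toy input: 4 time windows x 2 features, with a 1x1 and a 3x3 kernel.
x_t = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
small = [[1.0]]                          # short-term, fine-grained scale
large = [[1.0] * 3 for _ in range(3)]    # longer-term, coarse-grained scale
maps, fused = multi_scale_features(x_t, [small, large], alphas=[0.5, 0.5])
```

Keeping every F_i the same shape as X_t (via padding) is what allows the element-wise weighted combination; without padding, the feature matrices of different scales would have to be resized before being combined.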

3. The method of claim 1, wherein the feature fusion module employs a self-attention mechanism based feature fusion module and is configured to generate the fused feature by:

Receiving the time sequence mode feature and the static mode feature corresponding to a first risk installation software, wherein the feature matrix of the time sequence mode feature is of size T×F₁, T being the total number of time steps and F₁ being the dimension of the time sequence feature, and the feature matrix of the static mode feature has feature dimension F₂, F₂ being the dimension of the static feature;

Here, linear transformation weight matrices produce the query, key, and value of the time sequence feature; Q_s, K_s, V_s denote the query vector, key vector, and value vector corresponding to the time sequence feature, respectively, and Q_p, K_p, V_p denote the query vector, key vector, and value vector corresponding to the static feature, respectively;

Z_s＝Attention(Q_s,K_s,V_s),

Z_p＝Attention(Q_p,K_p,V_p),

wherein Attention(·) denotes the dot-product attention mechanism, Attention(Q,K,V)＝softmax(Q·K^T/√d_k)·V, d_k being the dimension of the key vectors; √d_k is a scaling factor that prevents large dot-product values from unduly affecting the gradient computation; softmax denotes the Softmax activation function, which converts the correlation values into weight coefficients; the weight coefficients are used for a weighted summation over V to generate the fused feature representation; K^T denotes the transpose of the key vector K; Z_s is the self-attention representation of the time sequence feature, and Z_p is the self-attention representation of the static feature;

Z_sp＝Attention(Q_s,K_p,V_p),

Z_ps＝Attention(Q_p,K_s,V_s),

wherein Z_sp denotes a first interactive attention feature of the time sequence feature toward the static feature, and Z_ps denotes a second interactive attention feature of the static feature toward the time sequence feature;

Z_final＝Z_s+Z_p+Z_sp+Z_ps,
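The self-attention and interactive-attention terms above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the claimed implementation: it assumes the standard scaled dot-product form softmax(Q·K^T/√d_k)·V, hypothetical projection matrices `w_s` and `w_p`, and that both feature matrices have the same number of rows, since the element-wise sum Z_s+Z_p+Z_sp+Z_ps is only defined when all four terms share a shape.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention (assumed standard form)."""
    scale = math.sqrt(len(k[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def fuse(x_s, x_p, w_s, w_p):
    """x_s, x_p: time sequence and static feature matrices.
    w_s, w_p: (W_query, W_key, W_value) triples, hypothetical parameters."""
    if len(x_s) != len(x_p):
        raise ValueError("this sketch assumes equal row counts for both inputs")
    q_s, k_s, v_s = (matmul(x_s, w) for w in w_s)
    q_p, k_p, v_p = (matmul(x_p, w) for w in w_p)
    z_s = attention(q_s, k_s, v_s)    # time sequence self-attention
    z_p = attention(q_p, k_p, v_p)    # static self-attention
    z_sp = attention(q_s, k_p, v_p)   # time sequence attends to static
    z_ps = attention(q_p, k_s, v_s)   # static attends to time sequence
    return [[a + b + c + d for a, b, c, d in zip(r1, r2, r3, r4)]
            for r1, r2, r3, r4 in zip(z_s, z_p, z_sp, z_ps)]

identity = [[1.0, 0.0], [0.0, 1.0]]
x_s = [[1.0, 0.0], [0.0, 1.0]]
x_p = [[0.5, 0.5], [0.5, 0.5]]
z_final = fuse(x_s, x_p, (identity,) * 3, (identity,) * 3)
```

Because each attention output row is a convex combination of value rows, the cross terms let each branch borrow information from the other while the self terms preserve what each branch already encodes.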

4. The method of claim 1, wherein the convolution kernel scale weights are adaptively determined by way of weight calculation based on information entropy:

computing the feature values in the feature matrix of each scale to construct a probability distribution:

wherein the probability assigned to the feature value of the j-th element of the feature matrix F_i is the frequency with which that feature value occurs divided by the total frequency of all feature values in the feature matrix F_i;

calculating the information entropy of each scale feature matrix:

wherein H(F_i) denotes the information entropy of the feature matrix F_i;

Calculating the convolution kernel scale weight corresponding to each convolution kernel:

where α_i denotes the convolution kernel scale weight for scale k_i×k_i.
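The three steps of claim 4 can be illustrated with the sketch below. Several points are assumptions and should be read as such: feature values are rounded before counting so that frequencies are meaningful for real-valued data, the entropy uses the standard Shannon form H = -Σ p·log p, and the final weights are obtained by normalizing the entropies so they sum to 1. The normalization in particular is hypothetical, because the weight formula is not reproduced in the text.

```python
import math
from collections import Counter

def value_probabilities(feature_matrix, decimals=2):
    """Frequency of each (rounded) feature value divided by the total frequency.
    Rounding to `decimals` places is an assumption of this sketch."""
    values = [round(v, decimals) for row in feature_matrix for v in row]
    counts = Counter(values)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}

def information_entropy(feature_matrix, decimals=2):
    """Shannon entropy (natural log) of the feature-value distribution."""
    probabilities = value_probabilities(feature_matrix, decimals)
    return sum(-p * math.log(p) for p in probabilities.values())

def scale_weights(feature_matrices):
    """Hypothetical normalization: each weight is that scale's entropy
    divided by the sum of entropies over all scales."""
    entropies = [information_entropy(f) for f in feature_matrices]
    total = sum(entropies)
    if total == 0.0:
        # All scales are constant; fall back to uniform weights.
        return [1.0 / len(entropies)] * len(entropies)
    return [h / total for h in entropies]

flat = [[1.0, 1.0], [1.0, 1.0]]      # a single repeated value, zero entropy
varied = [[1.0, 2.0], [3.0, 4.0]]    # four distinct values, entropy log(4)
alphas = scale_weights([flat, varied])
```

Under this assumed normalization, scales whose feature matrices carry more varied values receive larger weights; an inverse relationship would be an equally plausible design and would only change `scale_weights`.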

5. The method of claim 4, wherein the loss function of the piracy identification model is:

6. A genuine software detection and processing system based on big data analysis, for implementing the method of any one of claims 1-5, comprising:

an installation source acquisition unit, used for acquiring installation source information of each installation software in the client device to be detected;

a piracy risk initial identification unit, used for respectively verifying the installation source information of each installation software based on a software installation authorization library, so as to identify whether each installation software has a potential piracy risk;

a service log acquisition unit, used for acquiring, for each identified risk installation software having a potential piracy risk, an online service log of that risk installation software, and extracting multidimensional risk features from the online service log, wherein the multidimensional risk features comprise a software online update response frequency, a software function module use frequency, software plug-in loading information, and software user behavior information;

and a piracy risk determination unit, used for respectively inputting the multidimensional risk features of each risk installation software into a piracy identification model, so as to correspondingly determine whether a piracy risk exists, wherein the piracy identification model adopts a deep learning model.
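The way the four units hand data to one another can be sketched as a small pipeline. Everything named here is hypothetical: the `InstalledSoftware` record, the shape of the authorization library (software name mapped to a set of authorized installation sources), the injected `load_features` and `model` callables, and the 0.5 decision threshold are assumptions chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set

@dataclass
class InstalledSoftware:
    name: str
    install_source: str  # as reported by the installation source acquisition unit

def initial_risk_screen(installed: Sequence[InstalledSoftware],
                        authorization_library: Dict[str, Set[str]]) -> List[InstalledSoftware]:
    """Piracy risk initial identification: flag software whose installation
    source is not listed as authorized for that software."""
    return [sw for sw in installed
            if sw.install_source not in authorization_library.get(sw.name, set())]

def detect_piracy(installed: Sequence[InstalledSoftware],
                  authorization_library: Dict[str, Set[str]],
                  load_features: Callable[[InstalledSoftware], List[float]],
                  model: Callable[[List[float]], float],
                  threshold: float = 0.5) -> Dict[str, bool]:
    """Run the second-stage model only on software flagged by the first stage."""
    verdicts = {}
    for sw in initial_risk_screen(installed, authorization_library):
        features = load_features(sw)   # service log acquisition unit
        verdicts[sw.name] = model(features) >= threshold  # risk determination unit
    return verdicts

# Made-up data and stub callables for demonstration.
installed = [InstalledSoftware("EditorX", "vendor-store"),
             InstalledSoftware("CadY", "unknown-mirror")]
library = {"EditorX": {"vendor-store"}, "CadY": {"vendor-portal"}}
verdicts = detect_piracy(installed, library,
                         load_features=lambda sw: [1.0],
                         model=lambda features: 0.9)
```

The two-stage design means the comparatively expensive log analysis and model inference run only for software that fails the cheap source check.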