
CN103473104A - Method for discriminating re-package of application based on keyword context frequency matrix - Google Patents

Info

Publication number
CN103473104A
CN103473104A CN2013104384449A CN201310438444A CN103473104A CN 103473104 A CN103473104 A CN 103473104A CN 2013104384449 A CN2013104384449 A CN 2013104384449A CN 201310438444 A CN201310438444 A CN 201310438444A CN 103473104 A CN103473104 A CN 103473104A
Authority
CN
China
Prior art keywords
application
keyword
smali
bag
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013104384449A
Other languages
Chinese (zh)
Other versions
CN103473104B (en)
Inventor
郭耀
吕骁博
王浩宇
刘梦馨
陈向群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201310438444.9A (CN103473104B)
Publication of CN103473104A
Application granted
Publication of CN103473104B
Legal status: Active
Anticipated expiration

Landscapes

  • Stored Programmes (AREA)

Abstract

An application repackaging identification method based on a keyword context frequency matrix, applied to the Android system. The application file is first processed to obtain smali code files. The smali code is then processed to extract the operator sequence and collect keyword statistics; for each keyword of a specified type, a context-dependent feature triple is constructed and a feature matrix based on context frequency is generated. The feature matrices of applications are compared pairwise, and the similarity of two applications is derived from the similarity of their matrices. Finally, author information and other data are used to decide whether a repackaging relationship exists between the applications. The technical solution provided by the invention can identify repackaged Android applications while avoiding the extra overhead of hashing the whole application as one giant string; it does not depend on the order of the binary code in the original file; it reduces space overhead by limiting the size of the feature matrix; and it improves the execution efficiency of Android repackaging identification.

Description

An Application Repackaging Identification Method Based on a Keyword Context Frequency Matrix

Technical Field

The present invention relates to an application repackaging identification method based on a keyword context frequency matrix, and in particular to a method that identifies repackaged applications on the Android platform by using a keyword frequency matrix built from application code.

Background

Android is a free and open-source operating system based on Linux, developed and promoted by Google and used mainly on mobile devices such as smartphones and tablets. It currently holds the largest share of the global mobile phone operating system market; according to official figures, more than 975,000 applications are available for Android.

Android applications are usually developed and released by third-party developers, which gives rise to the problem of application repackaging. Application repackaging means that a developer obtains applications released by other developers through various channels, modifies them using techniques such as decompilation and binary code instrumentation (for example, implanting malicious code, changing developer information, changing permissions, or cracking protected content), and then repackages and republishes them. This raises problems of copyright, security, authorship, privacy protection, and malicious code injection.

In the book Software Similarity and Classification, Silvio Cesare and Yang Xiang summarize several approaches to software similarity and cluster analysis: for example, comparing the similarity of code text through string analysis to derive software similarity, or counting code word frequencies and comparing software similarity on the basis of those statistics. These approaches, however, do not consider the particular circumstances of mobile platform applications.

In 2012, Wu Zhou, Yajin Zhou, and others at North Carolina State University proposed another approach (CODASPY '12 paper): it uses fuzzy hashing to extract application features and generate a fingerprint of the application, then uses rolling hashing to turn the fingerprint into a feature vector, and decides whether repackaging has occurred by comparing the similarity of two applications' feature vectors. This method must analyze all of the code, is complex and inefficient, and depends on the textual order of the code, so it cannot handle modifications such as inserted junk code, code obfuscation, function renaming, or reordered code.

Summary of the Invention

The object of the invention is to provide a new method that preprocesses a given set of Android applications at low cost and in a short time to obtain, for each application, a feature matrix based on keyword context frequency, and then computes matrix similarities and clusters the applications to determine which of them are repackaged.

The principle of the invention is as follows. First, the application file (apk file) is processed to obtain smali code files; smali code is an intermediate representation of the application's original binary code. The smali code is then processed to extract the operator sequence and collect keyword statistics. For each keyword of a specified type, a context-dependent feature triple <K1, i, K2> is constructed, and a feature matrix based on context frequency is generated. The feature matrices of applications are compared pairwise, and the similarity of two applications is derived from the similarity of their matrices. Finally, author information and other data are used to decide whether a repackaging relationship exists between the applications.

The technical solution provided by the invention is as follows:

An application repackaging identification method based on a keyword context frequency matrix, applied to the Android system, characterized by comprising the following steps (see Fig. 9 for the flow):

A. Preprocess the application file: convert the binary code into smali code files, extract the application's author signature information, and construct the keyword vector;

B. Process the smali code files to generate the smali operator sequence;

C. Generate the keyword context frequency matrix;

D. Compare the similarity of the applications' keyword context frequency matrices to determine whether the application is a repackaged application.

The application repackaging identification method, characterized in that step A comprises:

A1. Extract the Android application's binary code file and the author signature information file in the META-INF folder;

A2. Use an existing tool to convert the binary code into smali code files;

A3. Use an existing tool to extract the author signature content from the corresponding file;

A4. Construct the keyword vector.

The application repackaging identification method, characterized in that step B comprises:

B1. Process the smali code files obtained in step A and remove third-party library files (mainly advertising libraries, social platform libraries, and the like);

B2. Process the smali code files obtained in step B1 and strip everything except the operator from each statement, obtaining the smali operator sequence of the application.

The application repackaging identification method, characterized in that step C comprises:

C1. Construct the keyword context frequency matrix Max;

C2. Using the selected keyword vector, for every occurrence of each keyword in the smali operator sequence obtained in step B, apply a hash algorithm to map the K statements preceding the keyword and the K statements following it to the integers K1 and K2 respectively, where i is the subscript of the keyword in the keyword vector;

C3. Increment the count at the corresponding position Max[i][K1][K2] of the feature matrix.

The application repackaging identification method, characterized in that step D comprises:

D1. For the given Android applications, compute the similarity of their keyword context frequency matrices and use it as the criterion for comparing the applications pairwise;

D2. Cluster the Android applications whose keyword context frequency matrix similarity exceeds a specified threshold; applications in such a cluster are considered possibly repackaged;

D3. Use the author signature information obtained in step A to further screen the results of step D2.

The application repackaging identification method, characterized in that, in step D1, the similarity of the keyword context frequency matrices is computed as follows: let Max1 and Max2 be the keyword context frequency matrices of the original application (that is, the application that has not been repackaged) and of the application under test; let Max1[i][j][k] and Max2[i][j][k] denote corresponding elements of the two matrices and Min[i][j][k] the smaller of the two values; the similarity score is then score = 200 * ΣMin[i][j][k] / Σ(Max1[i][j][k] + Max2[i][j][k]).
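
For readability, the score defined above can be restated in LaTeX. The only assumptions are that both sums run over all index triples (i, j, k) and that the denominator is nonzero, which the source implies but does not state.

```latex
\[
\mathrm{score} \;=\; 200 \cdot
\frac{\sum_{i,j,k} \min\bigl(\mathit{Max1}[i][j][k],\ \mathit{Max2}[i][j][k]\bigr)}
     {\sum_{i,j,k} \bigl(\mathit{Max1}[i][j][k] + \mathit{Max2}[i][j][k]\bigr)}
\]
```

Because min(a, b) ≤ (a + b)/2 for non-negative counts, the score lies between 0 and 100: it is 100 when the two matrices are identical and 0 when no cell is nonzero in both.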

The application repackaging identification method, characterized in that, in step D2, the threshold is 70.
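
The source does not say how applications above the threshold are grouped into clusters. One plausible reading, sketched below purely as an assumption, is to treat every pair scoring above the threshold as linked and to take connected components with a small union-find. The function name `cluster_by_threshold` and the shape of the `scores` dictionary are hypothetical.

```python
def cluster_by_threshold(apps, scores, threshold=70):
    """Group apps linked by pairwise scores above the threshold.

    apps:   list of application names
    scores: dict mapping (app_a, app_b) -> similarity score in [0, 100]
    Returns only clusters with at least two members.
    """
    parent = {a: a for a in apps}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), s in scores.items():
        if s > threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for a in apps:
        groups.setdefault(find(a), []).append(a)
    return [g for g in groups.values() if len(g) > 1]
```

Connected components make the grouping transitive: two applications can land in the same cluster through a shared near-duplicate even if their own pairwise score is below the threshold. A stricter scheme would need a different clustering rule.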

The application repackaging identification method, characterized in that, in step A4, the operators corresponding to the following instructions are selected as keywords: branch instructions, function call instructions, comparison instructions, declaration instructions, arithmetic instructions, move instructions, and exception-throwing instructions.

The application repackaging identification method, characterized in that, in step A4, the following operators are selected as keywords: if, goto, invoke_virtual, invoke_static, invoke_direct, add-int/lit8, move_result_object, new-array, const/4, const/16, const-string, throw, new-instance, cmpl-float, and-int/lit1.

Beneficial effects of the invention: the technical solution provided by the invention can identify repackaged Android applications while avoiding the extra overhead of hashing the whole application as one giant string; it does not depend on the order of the binary code in the original file; it reduces space overhead by limiting the size of the feature matrix; and it improves the execution efficiency of Android repackaging identification.

Brief Description of the Drawings

Fig. 1 shows the application preprocessing flow of the invention.

Fig. 2 shows the flow for generating the smali operator sequence.

Fig. 3 shows the flow for generating the keyword context frequency matrix.

Fig. 4 shows the flow for judging the repackaging result.

Fig. 5 is a flowchart of application preprocessing in an embodiment of the invention.

Fig. 6 is a flowchart of generating the smali operator sequence in an embodiment of the invention.

Fig. 7 is a flowchart of generating the keyword context frequency matrix in an embodiment of the invention.

Fig. 8 is a flowchart of judging the repackaging result in an embodiment of the invention.

Fig. 9 is a flowchart of the method of the invention.

Detailed Description

The invention is carried out as follows:

A. When preprocessing the application file, perform the following operations:

A1. Extract the Android application's binary code file and the author signature information file in the META-INF folder;

A2. Use an existing tool, for example baksmali (https://code.google.com/p/smali/), to convert the binary code (.dex file) into smali code files;

A3. Use an existing tool, for example keytool (a tool shipped with the JDK, the Java Development Kit), to extract the author signature content from the corresponding file (CERT.RSA);

A4. Construct the keyword vector. Keywords are chosen according to the following criteria: statements that occur frequently are selected; the keywords have no obvious semantic overlap, so that the keyword set covers statements with different functions; and semantically important instructions are selected, such as arithmetic instructions and function call instructions.

B. When generating the smali operator sequence, perform the following operations:

B1. Process the smali code files obtained in step A2 and remove third-party library files. These are mainly advertising libraries such as Admob, AirPush, LeadBolt, and InMobi, social platform libraries such as Facebook, OpenFeint, and HeyZap, and other third-party libraries used in development;

B2. Process the smali code files obtained in step B1 and strip everything except the operator from each statement. The stripped information includes the operands and marker characters such as '#' and '.'. The result is the smali operator sequence of the application.

C. When generating the keyword context frequency matrix, perform the following operations:

C1. Construct the keyword context frequency matrix Max, a three-dimensional matrix of size sz_kv*sz_hash*sz_hash, where sz_kv is the size of the selected keyword vector and sz_hash is the size of the value range of the hash algorithm used in step C2; initialize every element of Max to 0;

C2. Using the selected keyword vector, for every occurrence of each keyword (with subscript i in the keyword vector) in the smali operator sequence obtained in step B2, apply a chosen hash algorithm to map the K statements preceding the keyword and the K statements following it to the integers K1 and K2 respectively (K is a user-defined integer);

C3. Increment the count at the corresponding position Max[i][K1][K2] of the feature matrix.

D. When judging the repackaging result, perform the following operations:

D1. For the given Android applications, compute the similarity of their keyword context frequency matrices and use it as the criterion for comparing the applications pairwise. The similarity is computed as follows: let the two matrices be Max1 and Max2; let Max1[i][j][k] and Max2[i][j][k] denote corresponding elements and Min[i][j][k] the smaller of the two values; the similarity score is then score = 200 * ΣMin[i][j][k] / Σ(Max1[i][j][k] + Max2[i][j][k]);

D2. Cluster the Android applications whose similarity result (the score from step D1) exceeds a threshold; empirically, a threshold of 70 works best. Applications in such a cluster are considered possibly repackaged.

D3. Use the author signature information obtained in step A3 to further screen the results of step D2. Similar applications with the same author information are generally different versions of the same application and are not repackaged applications, whereas similar applications with different author information are judged to be repackaged.
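
Steps D1 to D3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the patented implementation: the matrices are held as sparse dictionaries keyed by (i, K1, K2), the author information is reduced to a single comparable string (for example a certificate fingerprint), and the names `similarity` and `is_repackaged` are hypothetical.

```python
THRESHOLD = 70  # empirical threshold given in step D2

def similarity(m1, m2):
    """score = 200 * sum(min) / sum(m1 + m2) over sparse count dicts."""
    total = sum(m1.values()) + sum(m2.values())
    if total == 0:
        return 0.0  # assumed convention when both matrices are empty
    # Cells missing from m2 count as 0, so only keys of m1 can contribute.
    shared = sum(min(count, m2.get(key, 0)) for key, count in m1.items())
    return 200.0 * shared / total

def is_repackaged(m1, m2, author1, author2):
    """D2 + D3: similar enough, and signed by different authors."""
    return similarity(m1, m2) > THRESHOLD and author1 != author2
```

For example, two matrices with counts {(3, 17, 33): 2, (0, 1, 2): 1} and {(3, 17, 33): 1, (0, 1, 2): 1, (1, 5, 5): 1} share a minimum total of 2 out of 6 counts, giving a score of about 66.7, which falls below the threshold.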

The invention is further illustrated by the following example.

Example 1:

Consider an Android application whose Chinese name is "自动桌面照片滤镜" (Autodesk photo filter) and whose package is named AutodeskzhaopianlvjingPixlr_o_matic_V2.2.1_mumayi_ac32a.apk; the task is to determine whether it is a repackaged application.

A. Preprocessing comprises the following steps:

A1. Unzipping the apk package yields a number of files and folders; among them, classes.dex is the application's binary code file, and CERT.RSA under the META-INF folder holds the author's signature information;

A2. Use baksmali to convert classes.dex into smali code files; this produces a folder containing multiple .smali files;

A3. Use keytool to extract the author signature content from CERT.RSA. Running the command keytool -printcert -file CERT.RSA prints information about the application's certificate, including the owner, issuer, serial number, validity period, and certificate fingerprints.

A4. Construct the keyword vector. Based on experimental results, the following 15 operators are chosen as keywords: if, goto, invoke_virtual, invoke_static, invoke_direct, add-int/lit8, move_result_object, new-array, const/4, const/16, const-string, throw, new-instance, cmpl-float, and-int/lit1. These 15 operators represent several kinds of instructions, including branch instructions, function call instructions, comparison instructions, declaration instructions, arithmetic instructions, move instructions, and exception-throwing instructions. Each operator corresponds to one subscript of the keyword vector;
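
Steps A1 and A4 can be sketched with the Python standard library, since an apk is an ordinary zip archive. This is a minimal sketch under assumptions: the helper name `extract_inputs` is hypothetical, the signature folder is taken to be META-INF as in real apk files, and subscripts are assumed to start at 0, which matches the later example where invoke-static has subscript 3. Steps A2 and A3 rely on the external tools baksmali and keytool and are not shown.

```python
import io
import zipfile

# Keyword vector from step A4; the list position is the subscript i.
KEYWORDS = [
    "if", "goto", "invoke_virtual", "invoke_static", "invoke_direct",
    "add-int/lit8", "move_result_object", "new-array", "const/4",
    "const/16", "const-string", "throw", "new-instance", "cmpl-float",
    "and-int/lit1",
]
KEYWORD_INDEX = {kw: i for i, kw in enumerate(KEYWORDS)}

def extract_inputs(apk_bytes):
    """Return (dex_bytes, cert_bytes) from an apk held in memory.

    Picks the first .dex entry and the first .RSA entry under META-INF/;
    raises StopIteration if either is missing.
    """
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as apk:
        names = apk.namelist()
        dex = next(n for n in names if n.endswith(".dex"))
        cert = next(
            n for n in names
            if n.startswith("META-INF/") and n.endswith(".RSA")
        )
        return apk.read(dex), apk.read(cert)
```

The dex bytes would then be written to disk and handed to baksmali, and the certificate bytes to keytool, as described in steps A2 and A3.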

B. Generating the smali operator sequence comprises the following steps:

B1. Process the smali code files obtained from preprocessing and remove third-party library files; for example, if the apk contains the admob advertising library, delete all of the corresponding smali files;

B2. Process the smali code files obtained in B1 and strip everything except the operator from each statement. The stripped information includes the operands and marker characters such as '#' and '.'. In a smali file each statement has the form "operator operands", where the operands are the variable or register names the operation acts on, so this step can be automated with simple string processing. The result is a text file containing the smali operator sequence, one operator per line, some of which match the selected keywords.
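
The string processing in step B2 might look like the sketch below. It is an assumption about the details, not the patent's code: it keeps the first whitespace-delimited token of each line, strips leading '.' and ':' markers (so ".line 12" becomes "line" and ":try_start_0" becomes "try_start_0", which is consistent with the operator sequence shown in the example), and skips blank lines and '#' comment lines. The function name `operator_sequence` is hypothetical.

```python
def operator_sequence(smali_text):
    """Reduce smali source text to a list of operators, one per statement."""
    ops = []
    for raw in smali_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment: assumed to carry no operator
        # First token is the operator; drop directive/label markers.
        token = line.split()[0].lstrip(".:")
        if token:
            ops.append(token)
    return ops
```

Applied to the lines `.line 12`, `:try_start_0`, and `move-result-object v1`, this yields `["line", "try_start_0", "move-result-object"]`.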

C. Generating the keyword context frequency matrix comprises the following steps:

C1. Construct the keyword context frequency matrix Max, a three-dimensional matrix of size sz_kv*sz_hash*sz_hash, where sz_kv is the size of the selected keyword vector, already fixed at 15, and sz_hash is the size of the value range of the hash algorithm used in the next step, set to 67 on the basis of empirical results; initialize every element of Max to 0;

C2. Using the selected keyword vector, for every occurrence of each keyword (with subscript i in the keyword vector) in the smali operator sequence obtained in the previous step, apply the BKDRHash algorithm (see http://www.nocow.cn/index.php/BKDRHash; other string hash algorithms such as ELFHASH, SDBMHash, or RSHash may also be used) to map the K statements preceding the keyword and the K statements following it to the integers K1 and K2 respectively. For example, string matching detects one occurrence of the keyword invoke-static, which has subscript i=3 in the keyword vector, in the following instruction sequence:

line

try_start_0

iget_object

invoke-static

move-result-object

const-string

invoke-virtual

Applying BKDRHash to the preceding and following context separately gives K1=17 and K2=33;

C3. Increment the count at the corresponding position Max[i][K1][K2] of the feature matrix; in this example, perform Max[3][17][33]++.
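
Steps C1 to C3 can be sketched as follows. Several details are assumptions because the source leaves them open: K is set to 3 since the example shows three operators on each side of the keyword; the K operators are joined with newlines before hashing; BKDRHash uses the commonly cited seed 131 with a 31-bit mask; and the matrix is stored sparsely as a dictionary instead of a dense 15*67*67 array. With these choices the sketch is not expected to reproduce the specific values K1=17 and K2=33.

```python
from collections import defaultdict

SZ_HASH = 67  # hash range chosen in step C1
K = 3         # context width; assumed, the source leaves K user-defined

def bkdr_hash(text, seed=131):
    """BKDR string hash reduced to the range [0, SZ_HASH)."""
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0x7FFFFFFF
    return h % SZ_HASH

def build_matrix(ops, keywords, k=K):
    """Count (i, K1, K2) triples for every keyword occurrence in ops."""
    index = {kw: i for i, kw in enumerate(keywords)}
    counts = defaultdict(int)  # sparse stand-in for Max[i][K1][K2]
    for pos, op in enumerate(ops):
        i = index.get(op)
        if i is None:
            continue  # not a keyword
        above = ops[max(0, pos - k):pos]   # up to k operators before
        below = ops[pos + 1:pos + 1 + k]   # up to k operators after
        k1 = bkdr_hash("\n".join(above))
        k2 = bkdr_hash("\n".join(below))
        counts[(i, k1, k2)] += 1           # step C3
    return counts
```

Because only hashed contexts around keywords are counted, inserting or reordering code far from a keyword leaves most cells unchanged, which is what makes the matrix less sensitive to code order than whole-file hashing.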

D. Judging the repackaging result comprises the following operations:

D1. For the given Android applications, compute the similarity of their keyword context frequency matrices and use it as the criterion for comparing the applications pairwise. Suppose that, following the steps above, the keyword context frequency matrices have been obtained for the application Autodeskzhaopiantexiaochulihanhuaban_Pixlr_o_matic_V2.1.2_mumayi_1f341.apk and for the application AutodeskzhaopianlvjingPixlr_o_matic_V2.2.1_mumayi_ac32a.apk. Let the two matrices be Max1 and Max2; let Max1[i][j][k] and Max2[i][j][k] denote corresponding elements and Min[i][j][k] the smaller of the two values; the similarity score is score = 200 * ΣMin[i][j][k] / Σ(Max1[i][j][k] + Max2[i][j][k]). The computation gives score = 86.8871;

D2. Cluster the Android applications whose similarity result (the score from step D1) exceeds a threshold; empirically, a threshold of 70 works best. Here 86.8871 is greater than 70, so the two Android applications are judged to be possibly repackaged.

D3. Use the author signature information obtained earlier to further screen the result. The author information of the two applications differs, which confirms that the two applications are in a repackaging relationship.

Claims (10)

1. the bag discriminating conduct is beaten again in the application based on keyword context frequency matrix, is applied to Android system, it is characterized in that, comprises the steps:
A. the application programs file carries out pre-service, and binary code is converted to the smali code file, extracts author's signing messages of application program and construct the keyword vector;
B. the smali code file is processed, generated smali operational character sequence;
C. generate keyword context frequency matrix;
D. contrast the similarity of application program keyword context frequency matrix, judge whether the attach most importance to packing application of this application program.
2. the bag discriminating conduct is beaten again in application as claimed in claim 1, it is characterized in that, steps A comprises:
A1. extract the author's signing messages file in Android application binaries code file and META-INFO file;
A2. use existing instrument, binary code is converted to the smali code file;
A3. use existing instrument, from corresponding document, extract author's signature contents;
A4. construct the keyword vector.
3. the bag discriminating conduct is beaten again in application as claimed in claim 1, it is characterized in that, step B comprises:
B1. the smali code file obtained in steps A is processed, removed the third party library file;
B2. the smali code file obtained in step B1 is processed, other all information beyond the operational character in every statement are peeled off, obtain the smali operational character sequence of an application program.
4. the bag discriminating conduct is beaten again in application as claimed in claim 1, it is characterized in that, step C comprises:
C1. construct keyword context frequency matrix Max;
C2. according to selected keyword vector, to each keyword, appearance each time in the smali operational character sequence obtained at step B, adopt hash algorithm, by it, K bar statement above and K bar statement hereinafter are mapped as respectively integer K 1 and K2, wherein, described keyword is designated as i under correspondence in the keyword vector;
C3. increase eigenmatrix correspondence position Max[i] [K1] [K2] counting.
5. the bag discriminating conduct is beaten again in application as claimed in claim 1, it is characterized in that, step D comprises:
D1. to given Android application program, calculate the similarity of its keyword context frequency matrix, using this as standard, two Android application programs are compared in twos;
D2. the similarity result of calculation of keyword context frequency matrix is surpassed to the Android application program cluster of assign thresholds, think that may there be the bag problem of beating again in the Android application program of this class.
6. the bag discriminating conduct is beaten again in application as claimed in claim 5, it is characterized in that, step D further comprises:
D3. the author's signing messages obtained in integrating step A, further screen investigation to the result of determination of step D2.
7. The application repackaging discrimination method as claimed in claim 5, characterized in that, in step D1, the similarity of the keyword context frequency matrices is computed as follows: let Max1 and Max2 be the keyword context frequency matrices of the original application and of the application under test, with entries Max1[i][j][k] and Max2[i][j][k]; let Min[i][j][k] be the smaller of each pair of corresponding entries; the similarity score is then score = 200 × Σ Min[i][j][k] / Σ (Max1[i][j][k] + Max2[i][j][k]).
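The score formula of this claim can be sketched as follows. Since the smaller of two entries is never more than half their sum, the score lies between 0 (no shared counts) and 100 (identical matrices). The sparse dictionary representation keyed by (i, j, k), the function name, and the return value of 0.0 for two empty matrices are assumptions of this sketch.

```python
def similarity_score(max1, max2):
    """score = 200 * sum(Min[i][j][k]) / sum(Max1[i][j][k] + Max2[i][j][k]).

    max1 and max2 map (i, j, k) to a count; missing keys count as 0.
    """
    keys = set(max1) | set(max2)
    total = sum(max1.get(key, 0) + max2.get(key, 0) for key in keys)
    if total == 0:
        return 0.0  # assumption: two empty matrices are treated as dissimilar
    overlap = sum(min(max1.get(key, 0), max2.get(key, 0)) for key in keys)
    return 200.0 * overlap / total


a = {(0, 1, 2): 3, (1, 0, 0): 1}
b = {(0, 1, 2): 1, (2, 5, 5): 2}
score = similarity_score(a, b)  # overlap 1, total 7
```

Here only the cell (0, 1, 2) is shared, contributing min(3, 1) = 1 to the numerator, while the denominator is 3 + 1 + 1 + 2 = 7.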
8. The application repackaging discrimination method as claimed in claim 5, characterized in that, in step D2, said threshold is 70.
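Steps D2 and D3 with the threshold of 70 can be sketched as below. The claims do not spell out how the signature information is applied, so this sketch assumes that two applications signed by the same author are not a repackaging case; the function name and the input dictionaries are hypothetical, and the pairwise scores are taken as already computed.

```python
def flag_repackaging(scores, signers, threshold=70):
    """Return pairs whose score exceeds the threshold and whose signers differ.

    scores:  maps a pair (app_a, app_b) to its similarity score (0 to 100)
    signers: maps an application name to its author signature
    """
    suspects = []
    for (app_a, app_b), score in sorted(scores.items()):
        if score <= threshold:
            continue  # D2: only scores above the threshold are candidates
        # D3 (assumption): the same signature means the same author,
        # so the pair is not treated as repackaged.
        if signers[app_a] == signers[app_b]:
            continue
        suspects.append((app_a, app_b))
    return suspects


scores = {("a", "b"): 92.0, ("a", "c"): 85.0, ("b", "c"): 40.0, ("a", "d"): 70.0}
signers = {"a": "S1", "b": "S2", "c": "S1", "d": "S3"}
suspects = flag_repackaging(scores, signers)
```

Only the pair (a, b) is flagged: (a, c) shares a signature, (b, c) scores too low, and (a, d) equals the threshold without exceeding it.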
9. The application repackaging discrimination method as claimed in claim 2, characterized in that, in step A4, the operators corresponding to the following instructions are selected as keywords: jump instructions, function call instructions, comparison instructions, declaration instructions, arithmetic instructions, move instructions, and exception-throwing instructions.
10. The application repackaging discrimination method as claimed in claim 2, characterized in that, in step A4, the following operators are selected as keywords: if, goto, invoke_virtual, invoke_static, invoke_direct, add-int/lit8, move_result_object, new-array, const/4, const/16, const-string, throw, new-instance, cmpl-float, and-int/lit1.
CN201310438444.9A 2013-09-24 2013-09-24 Method for discriminating re-package of application based on keyword context frequency matrix Active CN103473104B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310438444.9A CN103473104B (en) 2013-09-24 2013-09-24 Method for discriminating re-package of application based on keyword context frequency matrix

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310438444.9A CN103473104B (en) 2013-09-24 2013-09-24 Method for discriminating re-package of application based on keyword context frequency matrix

Publications (2)

Publication Number Publication Date
CN103473104A true CN103473104A (en) 2013-12-25
CN103473104B CN103473104B (en) 2016-10-05

Family

ID=49797973

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310438444.9A Active CN103473104B (en) 2013-09-24 2013-09-24 Method for discriminating re-package of application based on keyword context frequency matrix

Country Status (1)

Country Link
CN (1) CN103473104B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123493A (en) * 2014-07-31 2014-10-29 百度在线网络技术(北京)有限公司 Method and device for detecting safety performance of application program
CN104317599A (en) * 2014-10-30 2015-01-28 北京奇虎科技有限公司 Method and device for detecting whether installation package is packaged repeatedly or not
CN105389508A (en) * 2015-11-10 2016-03-09 工业和信息化部电信研究院 Detection method and apparatus for re-packaged Android application
CN106469259A (en) * 2015-08-19 2017-03-01 北京金山安全软件有限公司 Method and device for determining whether application program is legal application program or not and electronic equipment
CN107480219A (en) * 2017-07-31 2017-12-15 北京微影时代科技有限公司 Information processing method, device, electronic equipment and computer-readable recording medium
CN108170664A (en) * 2017-11-29 2018-06-15 有米科技股份有限公司 Keyword expanding method and device based on emphasis keyword
CN109117164A (en) * 2018-06-22 2019-01-01 北京大学 Micro services update method and system based on key element difference analysis
CN110659064A (en) * 2019-09-11 2020-01-07 无锡江南计算技术研究所 Search pruning optimization method based on feature element information
CN110795530A (en) * 2019-09-11 2020-02-14 无锡江南计算技术研究所 Context-based value feature extraction system and method
CN111651193A (en) * 2020-06-03 2020-09-11 上海米哈游天命科技有限公司 Information packaging method, device, equipment and medium
CN113656810A (en) * 2021-07-16 2021-11-16 五八同城信息技术有限公司 Application program encryption method and device, electronic equipment and storage medium

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120151600A1 (en) * 2010-12-14 2012-06-14 Ta Chun Yun Method and system for protecting intellectual property in software

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SONG ZENGBIN;HUANG JIN;WU GUANGXU: "Matrix-based android UI development", 《IEEE:COMPUTER SCIENCE AND INFORMATION PROCESSING(CSIP),2012 INTERNATIONAL CONFERENCE》 *
WU ZHOU, ET AL.: "Detecting repackaged smartphone applications in third-party android marketplaces", 《ACM:PROCEEDINGS OF THE SECOND ACM CONFERENCE ON DATA AND APPLICATION SECURITY AND PRIVACY》 *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104123493A (en) * 2014-07-31 2014-10-29 百度在线网络技术(北京)有限公司 Method and device for detecting safety performance of application program
CN104123493B (en) * 2014-07-31 2017-09-26 百度在线网络技术(北京)有限公司 Safety detection method and device for application programs
CN104317599B (en) * 2014-10-30 2017-06-20 北京奇虎科技有限公司 Method and apparatus for detecting whether an installation package has been repackaged
CN104317599A (en) * 2014-10-30 2015-01-28 北京奇虎科技有限公司 Method and device for detecting whether installation package is packaged repeatedly or not
CN106469259B (en) * 2015-08-19 2019-07-23 北京金山安全软件有限公司 Method and device for determining whether application program is legal application program or not and electronic equipment
CN106469259A (en) * 2015-08-19 2017-03-01 北京金山安全软件有限公司 Method and device for determining whether application program is legal application program or not and electronic equipment
CN105389508B (en) * 2015-11-10 2018-02-16 工业和信息化部电信研究院 A detection method and device for an Android repackaged application
CN105389508A (en) * 2015-11-10 2016-03-09 工业和信息化部电信研究院 Detection method and apparatus for re-packaged Android application
CN107480219A (en) * 2017-07-31 2017-12-15 北京微影时代科技有限公司 Information processing method, device, electronic equipment and computer-readable recording medium
CN108170664A (en) * 2017-11-29 2018-06-15 有米科技股份有限公司 Keyword expanding method and device based on emphasis keyword
CN108170664B (en) * 2017-11-29 2021-04-09 有米科技股份有限公司 Key word expansion method and device based on key words
CN109117164B (en) * 2018-06-22 2020-08-25 北京大学 Microservice update method and system based on difference analysis of key elements
CN109117164A (en) * 2018-06-22 2019-01-01 北京大学 Micro services update method and system based on key element difference analysis
CN110659064A (en) * 2019-09-11 2020-01-07 无锡江南计算技术研究所 Search pruning optimization method based on feature element information
CN110795530A (en) * 2019-09-11 2020-02-14 无锡江南计算技术研究所 Context-based value feature extraction system and method
CN110659064B (en) * 2019-09-11 2022-09-13 无锡江南计算技术研究所 Search pruning optimization method based on feature element information
CN110795530B (en) * 2019-09-11 2022-10-04 无锡江南计算技术研究所 Context-based value feature extraction system and method
CN111651193A (en) * 2020-06-03 2020-09-11 上海米哈游天命科技有限公司 Information packaging method, device, equipment and medium
CN113656810A (en) * 2021-07-16 2021-11-16 五八同城信息技术有限公司 Application program encryption method and device, electronic equipment and storage medium
CN113656810B (en) * 2021-07-16 2024-07-12 五八同城信息技术有限公司 Application encryption method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN103473104B (en) 2016-10-05

Similar Documents

Publication Publication Date Title
CN103473104B (en) Method for discriminating re-package of application based on keyword context frequency matrix
CN103473346B (en) Method for detecting repackaged Android applications based on application programming interfaces
US11188650B2 (en) Detection of malware using feature hashing
CN112041815B (en) Malware detection
CN103853979B (en) Procedure identification method and device based on machine learning
CN103761475B (en) Method and device for detecting malicious code in intelligent terminal
RU2614557C2 (en) System and method for detecting malicious files on mobile devices
US9792433B2 (en) Method and device for detecting malicious code in an intelligent terminal
CN102567661B (en) Program identification method and device based on machine learning
WO2015101097A1 (en) Method and device for feature extraction
US20120317421A1 (en) Fingerprinting Executable Code
CN110825363B (en) Intelligent contract acquisition method and device, electronic equipment and storage medium
CN105868630A (en) Malicious PDF document detection method
CN104063318A (en) Rapid Android application similarity detection method
US20140150101A1 (en) Method for recognizing malicious file
CN111651768B (en) Method and device for recognizing link library function name of computer binary program
CN104680065A (en) Virus detection method, virus detection device and virus detection equipment
US12314390B2 (en) Malicious VBA detection using graph representation
CN106569860A (en) Application management method and terminal
Chen et al. Malware classification using static disassembly and machine learning
US20160134652A1 (en) Method for recognizing disguised malicious document
KR20220060203A (en) Method for Training Malware Detection Model And Method for Detecting Malware
CN107688744B (en) Malicious file classification method and device based on image feature matching
WO2019223094A1 (en) Block chain-based file protection method, and terminal device
CN104765986B (en) A kind of code protection and restoring method based on Steganography

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20131225

Assignee: LVXIN TECHNOLOGY DEVELOPMENT (BEIJING) CO.,LTD.

Assignor: Peking University

Contract record no.: 2017990000291

Denomination of invention: Method for discriminating re-package of application based on keyword context frequency matrix

Granted publication date: 20161005

License type: Common License

Record date: 20170719

OL01 Intention to license declared