CN103473104A - Method for discriminating re-package of application based on keyword context frequency matrix - Google Patents
- Publication number
- CN103473104A
- Authority
- CN
- China
- Prior art keywords
- application
- keyword
- smali
- bag
- file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Landscapes
- Stored Programmes (AREA)
Abstract
A method for identifying repackaged applications based on a keyword context frequency matrix, applied to the Android system. The application file is first processed to obtain smali code files. The smali code is then processed to extract the operator sequence and collect keyword statistics; for each keyword of a specified type, a context-dependent feature triple is constructed, and a feature matrix based on context frequency is generated. The feature matrices of the applications are compared pairwise, and the similarity of two applications is derived from the similarity of their matrices. Finally, author information and other data are used to decide whether a repackaging relationship exists between the applications. The technical solution provided by the invention can identify repackaged Android applications while avoiding the extra overhead of hashing the entire application as one giant string; it does not depend on the order of the binary code in the original file; it reduces space overhead by bounding the size of the feature matrix; and it improves the execution efficiency of Android repackaging detection.
Description
Technical Field
The present invention relates to a method for identifying repackaged applications based on a keyword context frequency matrix, and in particular to a processing method that, on the Android platform, uses a keyword frequency matrix of the application code to identify repackaged applications.
Background Art
Android is a free and open-source operating system based on Linux, developed and promoted by Google and used mainly on mobile devices such as smartphones and tablets. It currently has the largest global market share among mobile phone operating systems; according to official figures, more than 975,000 applications are available for it.
Android applications are usually developed and released by third-party developers, which gives rise to the problem of application repackaging. Application repackaging means that some developers obtain applications released by other developers through various channels, modify them using techniques such as decompilation and binary code instrumentation (for example implanting malicious code, altering developer information, changing permissions, or cracking protected content), and then repackage and release them. This raises many issues concerning copyright, security, authorship, privacy protection, malicious code implantation, and so on.
In the book "Software Similarity and Classification", Silvio Cesare and Yang Xiang summarize several approaches to software similarity and cluster analysis: for example, comparing the similarity of code text through string analysis to derive software similarity, or counting the frequency of code words and comparing software similarity on the basis of those statistics. These approaches, however, do not take into account the particular circumstances of mobile platform applications.
In 2012, Wu Zhou, Yajin Zhou et al. of the University of Southern California proposed another solution (CODASPY'12 paper): fuzzy hashing is used to extract application features and generate a fingerprint of the application; a rolling hash then turns the fingerprint into a feature vector, and the feature vectors of two applications are compared for similarity to judge whether software repackaging has occurred. This method has to analyze all of the code, which is complex and inefficient, and it depends on the order of the code text, so it cannot handle code that has been modified by inserting useless code, obfuscation, function renaming, reordering, and the like.
Summary of the Invention
The object of the present invention is to provide a new method that, with low overhead and in a short time, preprocesses a given set of Android applications to obtain, for each one, a feature matrix based on keyword context frequency, and then computes matrix similarities and clusters the applications to determine which of them are repackaged.
The principle of the invention is as follows. The application file (apk file) is first processed to obtain smali code files; smali code is an intermediate representation of the application's original binary code. The smali code is then processed to extract the operator sequence and collect keyword statistics; for each keyword of a specified type, a context-dependent feature triple <K1,i,K2> is constructed, and a feature matrix based on context frequency is generated. The feature matrices of the applications are compared pairwise, and the similarity of two applications is derived from the similarity of their matrices. Finally, author information and other data are used to decide whether a repackaging relationship exists between the applications.
The technical solution provided by the invention is as follows:
A method for identifying repackaged applications based on a keyword context frequency matrix, applied to the Android system, characterized by comprising the following steps (see Fig. 9 for the flow):
A. Preprocess the application file: convert the binary code into smali code files, extract the application's author signature information, and construct the keyword vector;
B. Process the smali code files to generate the smali operator sequence;
C. Generate the keyword context frequency matrix;
D. Compare the similarity of the applications' keyword context frequency matrices and judge whether the application is a repackaged application.
The method for identifying repackaged applications, characterized in that step A comprises:
A1. Extract the Android application's binary code file and the author signature information file under META-INFO;
A2. Use an existing tool to convert the binary code into smali code files;
A3. Use an existing tool to extract the author signature content from the corresponding file;
A4. Construct the keyword vector.
The method for identifying repackaged applications, characterized in that step B comprises:
B1. Process the smali code files obtained in step A and remove third-party library files (mainly advertising libraries, social platform libraries, and the like);
B2. Process the smali code files obtained in step B1 and strip from each statement everything other than the operator, yielding the smali operator sequence of the application.
The method for identifying repackaged applications, characterized in that step C comprises:
C1. Construct the keyword context frequency matrix Max;
C2. For each keyword in the selected keyword vector, whose index in the keyword vector is i, and for each of its occurrences in the smali operator sequence obtained in step B, use a hash algorithm to map the K statements preceding it and the K statements following it to the integers K1 and K2 respectively;
C3. Increment the count at the corresponding position Max[i][K1][K2] of the feature matrix.
The method for identifying repackaged applications, characterized in that step D comprises:
D1. For the given Android applications, compute the similarity of their keyword context frequency matrices and use it as the criterion for comparing the applications pairwise;
D2. Cluster the Android applications whose keyword context frequency matrix similarity exceeds a specified threshold; the applications in such a cluster are considered to possibly involve repackaging;
D3. Using the author signature information obtained in step A, further screen the results of step D2.
The method for identifying repackaged applications, characterized in that, in step D1, the similarity of the keyword context frequency matrices is computed as follows: let the keyword context frequency matrices of the original application (that is, the original application that has not been repackaged) and of the application under test be Max1 and Max2; for each pair of corresponding elements Max1[i][j][k] and Max2[i][j][k], take their minimum Min[i][j][k]; the similarity score is then score=200*ΣMin[i][j][k]/Σ(Max1[i][j][k]+Max2[i][j][k]).
The method for identifying repackaged applications, characterized in that, in step D2, the threshold is 70.
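For readability, the score of step D1 and the threshold test of step D2 can be restated in LaTeX. This is only a transcription of the formula given in the text; the symbols $M^{(1)}$ and $M^{(2)}$ stand for Max1 and Max2, and $\tau$ is a name introduced here for the threshold.

```latex
\[
\mathrm{score} \;=\; 200 \cdot
\frac{\sum_{i,j,k} \min\!\bigl(M^{(1)}_{ijk},\, M^{(2)}_{ijk}\bigr)}
     {\sum_{i,j,k} \bigl(M^{(1)}_{ijk} + M^{(2)}_{ijk}\bigr)},
\qquad
\text{flag the pair when } \mathrm{score} > \tau,\quad \tau = 70 .
\]
```

Because $\min(a,b) \le (a+b)/2$ for non-negative counts, the score lies between 0 and 100 and reaches 100 only when the two matrices are identical.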
The method for identifying repackaged applications, characterized in that, in step A4, the operators corresponding to the following kinds of instruction are selected as keywords: branch instructions, function call instructions, comparison instructions, declaration instructions, arithmetic instructions, move instructions, and throw-exception instructions.
The method for identifying repackaged applications, characterized in that, in step A4, the following operators are selected as keywords: if, goto, invoke_virtual, invoke_static, invoke_direct, add-int/lit8, move_result_object, new-array, const/4, const/16, const-string, throw, new-instance, cmpl-float, and-int/lit1.
Beneficial effects of the invention: the technical solution provided by the invention can identify repackaged Android applications while avoiding the extra overhead of hashing the entire application as one giant string; it does not depend on the order of the binary code in the original file; it reduces space overhead by bounding the size of the feature matrix; and it improves the execution efficiency of Android repackaging detection.
Brief Description of the Drawings
Fig. 1 shows the application preprocessing flow of the invention.
Fig. 2 shows the flow for generating the smali operator sequence.
Fig. 3 shows the flow for generating the keyword context frequency matrix.
Fig. 4 shows the flow for judging the repackaging result.
Fig. 5 is a flowchart of application preprocessing provided by an embodiment of the invention.
Fig. 6 is a flowchart of generating the smali operator sequence provided by an embodiment of the invention.
Fig. 7 is a flowchart of generating the keyword context frequency matrix provided by an embodiment of the invention.
Fig. 8 is a flowchart of judging the repackaging result provided by an embodiment of the invention.
Fig. 9 is a flowchart of the method of the invention.
Detailed Description
The invention is carried out as follows:
A. When preprocessing the application file, perform the following operations:
A1. Extract the Android application's binary code file and the author signature information file under META-INFO;
A2. Use an existing tool, for example backsmali (https://code.google.com/p/smali/), to convert the binary code (.dex file) into smali code files;
A3. Use an existing tool, for example keytool (a tool shipped with the JDK, the Java Development Kit), to extract the author signature content from the corresponding file (CERT.RSA);
A4. Construct the keyword vector. Keywords are chosen according to the following criteria: statements that occur relatively frequently are selected; the keywords do not overlap noticeably in meaning, so that the keyword set covers statements with different functions; and semantically important instructions are selected, such as arithmetic instructions and function call instructions.
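Step A1 can be sketched in Python, since an apk file is a zip archive. This is a minimal illustration, not the patent's implementation. It assumes the standard APK layout, in which the bytecode entry is named `classes.dex` and the signature block sits at `META-INF/CERT.RSA` (the text writes these as class.dex and META-INFO). The helper names `extract_inputs` and `make_demo_apk` are hypothetical.

```python
import io
import zipfile

# Standard APK entry names. Assumption: the signature block is named CERT.RSA,
# as in the text; other APKs may use a different *.RSA or *.DSA file name.
DEX_ENTRY = "classes.dex"
SIGNATURE_ENTRY = "META-INF/CERT.RSA"


def extract_inputs(apk_file):
    """Return (dex_bytes, signature_bytes) read from an APK (a zip archive).

    `apk_file` may be a path or a binary file-like object.
    Raises KeyError if either entry is missing from the archive.
    """
    with zipfile.ZipFile(apk_file) as apk:
        return apk.read(DEX_ENTRY), apk.read(SIGNATURE_ENTRY)


def make_demo_apk():
    """Build a tiny in-memory archive that only mimics the APK layout."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr(DEX_ENTRY, b"dex-placeholder")
        archive.writestr(SIGNATURE_ENTRY, b"sig-placeholder")
    buf.seek(0)
    return buf
```

The dex bytes would then go to the disassembler (step A2) and the signature file to keytool (step A3); both are external tools and are not invoked in this sketch.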
B. When generating the smali operator sequence, perform the following operations:
B1. Process the smali code files obtained in step A2 and remove third-party library files: mainly advertising libraries such as Admob, AirPush, LeadBolt and InMobi, social platform libraries such as Facebook, OpenFeint and HeyZap, and other third-party libraries used in development;
B2. Process the smali code files obtained in step B1 and strip from each statement everything other than the operator; the stripped information includes the operands and other markers such as '#' and '.'. The result is the smali operator sequence of the application.
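Steps B1 and B2 amount to path filtering followed by simple string processing. The sketch below is one possible reading, under stated assumptions: the library prefixes in `THIRD_PARTY_PREFIXES` are hypothetical placeholders, and directives and labels are kept with their leading marker removed (so `.line 12` becomes `line`), which is inferred from the sample context shown in Embodiment 1 rather than stated by the text.

```python
# Hypothetical package prefixes for third-party libraries (step B1).
THIRD_PARTY_PREFIXES = ("com/admob", "com/google/ads", "com/facebook")


def is_third_party(relative_path):
    """True if a .smali file path falls under a known library prefix."""
    path = relative_path.replace("\\", "/")
    return any(path.startswith(prefix + "/") for prefix in THIRD_PARTY_PREFIXES)


def operator_of(statement):
    """Return the operator of one smali statement, or None if there is none.

    The first whitespace-delimited token is kept and leading marker
    characters ('.', '#', ':') are dropped; all operands are discarded.
    """
    tokens = statement.split()
    if not tokens:
        return None
    operator = tokens[0].lstrip(".#:")
    return operator or None


def operator_sequence(smali_text):
    """Map smali source text to its list of operators (step B2)."""
    operators = (operator_of(line) for line in smali_text.splitlines())
    return [op for op in operators if op is not None]
```

Writing the returned list one operator per line gives the kind of operator-sequence text file that the later steps scan for keywords.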
C. When generating the keyword context frequency matrix, perform the following operations:
C1. Construct the keyword context frequency matrix Max, a three-dimensional matrix of size sz_kv*sz_hash*sz_hash, where sz_kv is the size of the selected keyword vector and sz_hash is the size of the value range of the hash algorithm used in step C2; initialize every element of Max to 0;
C2. For each keyword in the selected keyword vector (with index i in the keyword vector) and each of its occurrences in the smali operator sequence obtained in step B2, use a specific hash algorithm to map the K statements preceding it (K being a user-defined integer) and the K statements following it to the integers K1 and K2 respectively;
C3. Increment the count at the corresponding position Max[i][K1][K2] of the feature matrix.
D. When judging the repackaging result, perform the following operations:
D1. For the given Android applications, compute the similarity of their keyword context frequency matrices with a specific algorithm and use it as the criterion for comparing the applications pairwise. The similarity is computed as follows: let the two matrices be Max1 and Max2; for each pair of corresponding elements Max1[i][j][k] and Max2[i][j][k], take their minimum Min[i][j][k]; the similarity score is then score=200*ΣMin[i][j][k]/Σ(Max1[i][j][k]+Max2[i][j][k]);
D2. Cluster the Android applications whose keyword context frequency matrix similarity (the score from step D1) exceeds a certain threshold; empirically, a threshold of 70 works best. The applications in such a cluster are considered to possibly involve repackaging.
D3. Using the author signature information obtained in step A3, further screen the results of step D2: similar applications with the same author information are generally different versions of the same application and are not repackaged applications, whereas similar applications with different author information are judged to be repackaged applications.
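The pairwise test of steps D1 to D3 can be sketched as follows. This is an illustration under assumptions, not the patent's code: the matrices are stored sparsely as `Counter` objects keyed by (i, K1, K2) instead of a dense three-dimensional array, two empty matrices are given a score of 0, and the function names are hypothetical.

```python
from collections import Counter

THRESHOLD = 70  # empirical threshold from step D2


def similarity_score(max1, max2):
    """score = 200 * sum(min) / sum(Max1 + Max2), summed over all (i, K1, K2).

    Matrices are sparse mappings keyed by (i, K1, K2); absent keys count as 0.
    """
    total = sum(max1.values()) + sum(max2.values())
    if total == 0:
        return 0.0  # assumption: two empty matrices are treated as dissimilar
    shared = sum(min(count, max2.get(key, 0)) for key, count in max1.items())
    return 200.0 * shared / total


def is_repackaged_pair(max1, max2, author1, author2):
    """Steps D2 and D3 for one pair: similar code but different signer."""
    return similarity_score(max1, max2) > THRESHOLD and author1 != author2
```

Only keys present in the first matrix need to be visited for the numerator, because a key missing from either matrix contributes a minimum of 0.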
The invention is further illustrated below by way of an example.
Embodiment 1:
Suppose there is an Android application whose Chinese name is "自动桌面照片滤镜" ("Automatic Desktop Photo Filter") and whose package file is named AutodeskzhaopianlvjingPixlr_o_matic_V2.2.1_mumayi_ac32a.apk, and that we need to detect whether it is a repackaged application.
A. The preprocessing flow comprises the following steps:
A1. Unpacking the apk package yields a number of files and folders; among them, the class.dex file is the application's binary code file, and the CERT.RSA file under the META-INFO folder holds the author's signature information;
A2. Use the backsmali tool to convert the class.dex file into smali code files; this produces a folder containing multiple .smali files;
A3. Use the keytool tool to extract the author signature content from CERT.RSA: running the command keytool -printcert -file CERT.RSA prints several pieces of information about the application, including the owner, issuer, serial number, validity period, and certificate fingerprints.
A4. Construct the keyword vector. Based on empirical results from experiments, the following 15 operators are chosen as keywords: if, goto, invoke_virtual, invoke_static, invoke_direct, add-int/lit8, move_result_object, new-array, const/4, const/16, const-string, throw, new-instance, cmpl-float, and-int/lit1. These 15 operators represent several kinds of instruction, including branch instructions, function call instructions, comparison instructions, declaration instructions, arithmetic instructions, move instructions, and throw-exception instructions; each operator corresponds to one index of the keyword vector;
B. The flow for generating the smali operator sequence comprises the following steps:
B1. Process the smali code files obtained by preprocessing and remove third-party library files; for example, if the apk contains the admob advertising library, delete all of the corresponding smali files;
B2. Process the smali code files obtained in B1 and strip from each statement everything other than the operator; the stripped information includes the operands and other markers such as '#' and '.'. The result is the smali operator sequence of the application. In a smali file, each statement has the format "operator operand", where the operands are the names of the variables or registers being operated on, so this step can be done automatically by string processing. After this step we have a text file containing the smali operator sequence, one operator per line, some of which may match the selected keywords.
C. The flow for generating the keyword context frequency matrix comprises the following steps:
C1. Construct the keyword context frequency matrix Max, a three-dimensional matrix of size sz_kv*sz_hash*sz_hash, where sz_kv is the size of the selected keyword vector, already fixed at 15, and sz_hash is the size of the value range of the hash algorithm used in the next step, set to 67 on the basis of empirical results; initialize every element of Max to 0;
C2. For each keyword in the selected keyword vector (with index i in the keyword vector) and each of its occurrences in the smali operator sequence obtained in the previous step, use the BKDRHash algorithm (see http://www.nocow.cn/index.php/BKDRHash; other string hash algorithms such as ELFHASH, SDBMHash or RSHash may also be used) to map the K statements preceding it and the K statements following it to the integers K1 and K2 respectively. For example, string matching detects an occurrence of the keyword invoke-static, whose index in the keyword vector is i=3, in the following instruction sequence:
line
try_start_0
iget_object
invoke-static
move-result-object
const-string
invoke-virtual;
Applying the BKDRHash algorithm to the preceding and following context separately gives K1=17 and K2=33;
C3. Increment the count at the corresponding position Max[i][K1][K2] of the feature matrix; in this example, the operation Max[3][17][33]++ is performed.
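Steps C1 to C3 can be sketched in Python as below. Several details are assumptions rather than facts from the text: the seed 131 and the 31-bit mask follow the commonly circulated BKDRHash implementation; the K context operators are joined with newlines before hashing; the hash is reduced modulo sz_hash; K defaults to 3 because the sample context shows three statements on each side; and the matrix is stored sparsely. Because the text does not say how the K statements are combined, this sketch is not expected to reproduce K1=17 and K2=33.

```python
from collections import Counter


def bkdr_hash(text, seed=131):
    """BKDRHash over a string, using 32-bit wraparound arithmetic."""
    value = 0
    for char in text:
        value = (value * seed + ord(char)) & 0xFFFFFFFF
    return value & 0x7FFFFFFF


def build_matrix(operators, keywords, k=3, sz_hash=67):
    """Build a sparse keyword context frequency matrix.

    Returns a Counter keyed by (i, K1, K2), where i is the keyword's index
    in `keywords` and K1, K2 are the hashed contexts before and after it.
    Keyword spelling must match the spelling used in `operators`.
    """
    index_of = {keyword: i for i, keyword in enumerate(keywords)}
    matrix = Counter()
    for pos, op in enumerate(operators):
        i = index_of.get(op)
        if i is None:
            continue  # not a keyword
        before = operators[max(0, pos - k):pos]   # up to K operators above
        after = operators[pos + 1:pos + 1 + k]    # up to K operators below
        k1 = bkdr_hash("\n".join(before)) % sz_hash
        k2 = bkdr_hash("\n".join(after)) % sz_hash
        matrix[(i, k1, k2)] += 1                  # Max[i][K1][K2]++
    return matrix
```

With sz_kv=15 and sz_hash=67 a dense matrix would hold 15*67*67 = 67,335 counters; the sparse form stores only the cells that are actually hit.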
D. The flow for judging the repackaging result performs the following operations:
D1. For the given Android applications, compute the similarity of their keyword context frequency matrices with a specific algorithm and use it as the criterion for pairwise comparison. Suppose that, following steps similar to those above, we have obtained the keyword context frequency matrices of the application Autodeskzhaopiantexiaochulihanhuaban_Pixlr_o_matic_V2.1.2_mumayi_1f341.apk and of the application AutodeskzhaopianlvjingPixlr_o_matic_V2.2.1_mumayi_ac32a.apk. Let the two matrices be Max1 and Max2; for each pair of corresponding elements Max1[i][j][k] and Max2[i][j][k], take their minimum Min[i][j][k]; the similarity score is score=200*ΣMin[i][j][k]/Σ(Max1[i][j][k]+Max2[i][j][k]). The computation gives score=86.8871;
D2. Cluster the Android applications whose keyword context frequency matrix similarity (the score from step D1) exceeds a certain threshold; empirically, a threshold of 70 works best. Here 86.8871 is greater than 70, so the two Android applications are judged to possibly involve repackaging.
D3. Using the author signature information obtained earlier, the result is screened further: the author information of the two applications is found to differ, which confirms that the two applications are in a repackaging relationship.
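When more than two applications are examined, steps D2 and D3 can be applied to the whole set of pairwise scores. The text does not specify the clustering procedure, so the sketch below assumes single-linkage grouping implemented with a union-find structure; the function names, the application labels, and every score other than 86.8871 are hypothetical.

```python
def cluster_similar(apps, scores, threshold=70):
    """Group apps connected by a pairwise score above the threshold.

    `scores` maps frozenset({a, b}) (two distinct app names) to the
    similarity score of that pair. Returns the clusters, as sets of app
    names, that contain at least two members.
    """
    parent = {app: app for app in apps}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for pair, score in scores.items():
        if score > threshold:
            a, b = tuple(pair)
            parent[find(a)] = find(b)

    groups = {}
    for app in apps:
        groups.setdefault(find(app), set()).add(app)
    return [group for group in groups.values() if len(group) > 1]


def suspicious_clusters(clusters, author_of):
    """Keep clusters whose members were signed by more than one author (D3)."""
    return [c for c in clusters if len({author_of[app] for app in c}) > 1]
```

A cluster in which every member carries the same author information is dropped, matching the rule that such applications are usually different versions of one application.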
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310438444.9A CN103473104B (en) | 2013-09-24 | 2013-09-24 | Method for discriminating re-package of application based on keyword context frequency matrix |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310438444.9A CN103473104B (en) | 2013-09-24 | 2013-09-24 | Method for discriminating re-package of application based on keyword context frequency matrix |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103473104A true CN103473104A (en) | 2013-12-25 |
CN103473104B CN103473104B (en) | 2016-10-05 |
Family
ID=49797973
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310438444.9A Active CN103473104B (en) | Method for discriminating re-package of application based on keyword context frequency matrix | 2013-09-24 | 2013-09-24 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103473104B (en) |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104123493A (en) * | 2014-07-31 | 2014-10-29 | 百度在线网络技术(北京)有限公司 | Method and device for detecting safety performance of application program |
CN104317599A (en) * | 2014-10-30 | 2015-01-28 | 北京奇虎科技有限公司 | Method and device for detecting whether installation package is packaged repeatedly or not |
CN105389508A (en) * | 2015-11-10 | 2016-03-09 | 工业和信息化部电信研究院 | Detection method and apparatus for re-packaged Android application |
CN106469259A (en) * | 2015-08-19 | 2017-03-01 | 北京金山安全软件有限公司 | Method and device for determining whether application program is legal application program or not and electronic equipment |
CN107480219A (en) * | 2017-07-31 | 2017-12-15 | 北京微影时代科技有限公司 | Information processing method, device, electronic equipment and computer-readable recording medium |
CN108170664A (en) * | 2017-11-29 | 2018-06-15 | 有米科技股份有限公司 | Keyword expanding method and device based on emphasis keyword |
CN109117164A (en) * | 2018-06-22 | 2019-01-01 | 北京大学 | Micro services update method and system based on key element difference analysis |
CN110659064A (en) * | 2019-09-11 | 2020-01-07 | 无锡江南计算技术研究所 | Search pruning optimization method based on feature element information |
CN110795530A (en) * | 2019-09-11 | 2020-02-14 | 无锡江南计算技术研究所 | Context-based value feature extraction system and method |
CN111651193A (en) * | 2020-06-03 | 2020-09-11 | 上海米哈游天命科技有限公司 | Information packaging method, device, equipment and medium |
CN113656810A (en) * | 2021-07-16 | 2021-11-16 | 五八同城信息技术有限公司 | Application program encryption method and device, electronic equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120151600A1 (en) * | 2010-12-14 | 2012-06-14 | Ta Chun Yun | Method and system for protecting intellectual property in software |
2013-09-24: CN application CN201310438444.9A, patent CN103473104B, status Active
Non-Patent Citations (2)
Title |
---|
SONG ZENGBIN; HUANG JIN; WU GUANGXU: "Matrix-based android UI development", 《IEEE: COMPUTER SCIENCE AND INFORMATION PROCESSING (CSIP), 2012 INTERNATIONAL CONFERENCE》 * |
WU ZHOU ET AL.: "Detecting repackaged smartphone applications in third-party android marketplaces", 《ACM: PROCEEDINGS OF THE SECOND ACM CONFERENCE ON DATA AND APPLICATION SECURITY AND PRIVACY》 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104123493A (en) * | 2014-07-31 | 2014-10-29 | 百度在线网络技术(北京)有限公司 | Method and device for detecting safety performance of application program |
CN104123493B (en) * | 2014-07-31 | 2017-09-26 | 百度在线网络技术(北京)有限公司 | Method and device for detecting safety performance of application program |
CN104317599B (en) * | 2014-10-30 | 2017-06-20 | 北京奇虎科技有限公司 | Method and device for detecting whether installation package is packaged repeatedly or not |
CN104317599A (en) * | 2014-10-30 | 2015-01-28 | 北京奇虎科技有限公司 | Method and device for detecting whether installation package is packaged repeatedly or not |
CN106469259B (en) * | 2015-08-19 | 2019-07-23 | 北京金山安全软件有限公司 | Method and device for determining whether application program is legal application program or not and electronic equipment |
CN106469259A (en) * | 2015-08-19 | 2017-03-01 | 北京金山安全软件有限公司 | Method and device for determining whether application program is legal application program or not and electronic equipment |
CN105389508B (en) * | 2015-11-10 | 2018-02-16 | 工业和信息化部电信研究院 | A detection method and device for an Android repackaged application |
CN105389508A (en) * | 2015-11-10 | 2016-03-09 | 工业和信息化部电信研究院 | Detection method and apparatus for re-packaged Android application |
CN107480219A (en) * | 2017-07-31 | 2017-12-15 | 北京微影时代科技有限公司 | Information processing method, device, electronic equipment and computer-readable recording medium |
CN108170664A (en) * | 2017-11-29 | 2018-06-15 | 有米科技股份有限公司 | Keyword expanding method and device based on emphasis keyword |
CN108170664B (en) * | 2017-11-29 | 2021-04-09 | 有米科技股份有限公司 | Key word expansion method and device based on key words |
CN109117164B (en) * | 2018-06-22 | 2020-08-25 | 北京大学 | Microservice update method and system based on difference analysis of key elements |
CN109117164A (en) * | 2018-06-22 | 2019-01-01 | 北京大学 | Micro services update method and system based on key element difference analysis |
CN110659064A (en) * | 2019-09-11 | 2020-01-07 | 无锡江南计算技术研究所 | Search pruning optimization method based on feature element information |
CN110795530A (en) * | 2019-09-11 | 2020-02-14 | 无锡江南计算技术研究所 | Context-based value feature extraction system and method |
CN110659064B (en) * | 2019-09-11 | 2022-09-13 | 无锡江南计算技术研究所 | Search pruning optimization method based on feature element information |
CN110795530B (en) * | 2019-09-11 | 2022-10-04 | 无锡江南计算技术研究所 | Context-based value feature extraction system and method |
CN111651193A (en) * | 2020-06-03 | 2020-09-11 | 上海米哈游天命科技有限公司 | Information packaging method, device, equipment and medium |
CN113656810A (en) * | 2021-07-16 | 2021-11-16 | 五八同城信息技术有限公司 | Application program encryption method and device, electronic equipment and storage medium |
CN113656810B (en) * | 2021-07-16 | 2024-07-12 | 五八同城信息技术有限公司 | Application encryption method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN103473104B (en) | 2016-10-05 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN103473104B (en) | Method for discriminating re-package of application based on keyword context frequency matrix | |
CN103473346B (en) | Method for detecting repackaged Android applications based on application programming interfaces | |
US11188650B2 (en) | Detection of malware using feature hashing | |
CN112041815B (en) | Malware detection | |
CN103853979B (en) | Procedure identification method and device based on machine learning | |
CN103761475B (en) | Method and device for detecting malicious code in intelligent terminal | |
RU2614557C2 (en) | System and method for detecting malicious files on mobile devices | |
US9792433B2 (en) | Method and device for detecting malicious code in an intelligent terminal | |
CN102567661B (en) | Program identification method and device based on machine learning | |
WO2015101097A1 (en) | Method and device for feature extraction | |
US20120317421A1 (en) | Fingerprinting Executable Code | |
CN110825363B (en) | Intelligent contract acquisition method and device, electronic equipment and storage medium | |
CN105868630A (en) | Malicious PDF document detection method | |
CN104063318A (en) | Rapid Android application similarity detection method | |
US20140150101A1 (en) | Method for recognizing malicious file | |
CN111651768B (en) | Method and device for recognizing link library function name of computer binary program | |
CN104680065A (en) | Virus detection method, virus detection device and virus detection equipment | |
US12314390B2 (en) | Malicious VBA detection using graph representation | |
CN106569860A (en) | Application management method and terminal | |
Chen et al. | Malware classification using static disassembly and machine learning | |
US20160134652A1 (en) | Method for recognizing disguised malicious document | |
KR20220060203A (en) | Method for Training Malware Detection Model And Method for Detecting Malware | |
CN107688744B (en) | Malicious file classification method and device based on image feature matching | |
WO2019223094A1 (en) | Block chain-based file protection method, and terminal device | |
CN104765986B (en) | Code protection and restoration method based on steganography | |
Legal Events
Code | Title | Description |
---|---|---|
C06 | Publication | |
PB01 | Publication | |
C10 | Entry into substantive examination | |
SE01 | Entry into force of request for substantive examination | |
C14 | Grant of patent or utility model | |
GR01 | Patent grant | |
EE01 | Entry into force of recordation of patent licensing contract | |
EE01 | Entry into force of recordation of patent licensing contract | Application publication date: 2013-12-25; Assignee: LVXIN TECHNOLOGY DEVELOPMENT (BEIJING) CO.,LTD.; Assignor: Peking University; Contract record no.: 2017990000291; Denomination of invention: Method for discriminating re-package of application based on keyword context frequency matrix; Granted publication date: 2016-10-05; License type: Common License; Record date: 2017-07-19 |
OL01 | Intention to license declared | |
OL01 | Intention to license declared | |