IT202100012488A1 - Method of configuring neural networks and method of processing binary files

Info

Publication number: IT202100012488A1
Application number: IT102021000012488A
Authority: IT
Inventors: Daniele Canavese; Leonardo Regano; Cataldo Basile
Original assignee: Torino Politecnico
Priority date: 2021-05-14
Filing date: 2021-05-14
Publication date: 2022-11-14
Also published as: WO2022238851A1

Description

"Neural Network Configuration Method and Binary File Processing Method"

DESCRIPTION

TECHNICAL FIELD

The present invention relates to the field of software security and to the protections applied to software.

STATE OF THE ART

Software, because of its intrinsic characteristics, is highly sensitive from a security standpoint.

Consider, for example, that software often embeds or handles confidential or private data, or third-party intellectual property. It also makes explicit the know-how that made its creation possible. This information is available in the form of the instructions and data structures that make up a given piece of software or that are processed by it. It is therefore potentially available to anyone who has a copy of the software. Software can thus also be seen as a container of assets that are fundamental to the business of the company that developed it.

Software is exposed to numerous threats and risks. In a MATE (Man-At-The-End) scenario, an attacker who has a given piece of software and controls the environment in which it will be executed can analyze the behavior of said software, for example by means of a debugger, or can disassemble or decompile it to extract its logical structure. These reverse engineering operations can therefore make it possible to obtain some of the assets present in the software. MATE attacks are frequent and may consist of identifying and possibly reusing functionality considered strategic, compromising license checks, identifying flaws or contexts of use that can be exploited to compromise the functionality of the software itself or of the environments in which it is executed, and so on.

Given the severity of the risks to which software is exposed, adopting appropriate mitigations is unavoidable. In this context, mitigations are software protection methods and technologies, adopted both during software development and immediately before distribution. The software protection process combines the use of cryptographic functions, transformations of the software itself, and software engineering techniques in order to mitigate risks. Software protections can rely on security features available in the environment in which the software is executed, but they can also be integrated into the software itself, using specifically designed protection technologies.

Although there are no definitive solutions that can prevent attackers from obtaining the assets in the software, numerous protection solutions are known in the art, aimed at delaying the attacker as long as possible and thereby preserving the business model adopted by the owner of the software.

Consider, for example, the scenario of a software house (software houses represent an important part of the software protection market) that intends to market a new video game. A large share of the sales of this type of product is concentrated in the first days after its release. It therefore becomes essential to delay as long as possible the moment at which an attacker successfully breaks the software protections contained in the work, for example the license checks, by releasing cracks or circulating illegal copies of the video game.

The effectiveness of the protection techniques applied to software is often evaluated, before a given piece of software is released, by specialized personnel internal or external to the company that develops it. However, this evaluation process is often performed manually or by means of semi-automated tools, and therefore requires a considerable amount of time and resources. Moreover, time is very often limited because the business models adopted impose a stringent time-to-market in order to become established on the market before competitors.

A first step that attackers must perform before extracting assets from the software is to identify the protection techniques that have been applied to specific portions of the software, in order to disable or remove them. An evaluation of the effectiveness of the protections must therefore include an estimate of how easy it is to recognize which protections have been applied.

SUMMARY OF THE INVENTION

The tools currently known in the state of the art for identifying the protection techniques applied to software are not yet satisfactory, since they offer, when available at all, a limited level of automation and consequently depend heavily on manual activity, in a context where time plays a fundamental role in the success or failure of a software product on the market.

Furthermore, said tools are not optimized for identifying the protected areas within the software itself.

The purpose of the present invention is to improve the degree of automation in the phase of identifying the protection techniques applied to software.

The subject of the present invention is a method for automating the process of identifying the software protections applied to a binary file and of identifying the protected areas within said file, as described by claim 1 and by its preferred embodiments described by claims 2-12.

A further subject of the present invention is a file processing method as described by claim 13 and by its preferred embodiments described by claims 14-16.

BRIEF DESCRIPTION OF THE DRAWINGS

Preferred embodiments of the present invention will be described below, purely by way of example, with reference to the accompanying drawings, in which:

- Fig. 1 shows an example of a computer system usable for the purposes of the present invention;

- Fig. 2 shows a schematic example of a binary file in which the functions and the possible presence of protections are highlighted;

- Fig. 3 shows, by means of functional blocks, an example of a method of configuring a neural network capable of obtaining information about the protection techniques possibly present in a file to be analyzed;

- Fig. 4 shows an example of an encoded function;

- Fig. 5 shows, by means of functional blocks, a simplified architecture of a neural network based on LSTM (Long Short Term Memory) cells;

- Fig. 6 shows, by means of functional blocks, a simplified architecture of a transformer-type BERT (Bidirectional Encoder Representations from Transformers) neural network.

DETAILED DESCRIPTION OF THE INVENTION

The following detailed description of the preferred embodiments refers to the accompanying drawings, which form a part of it and show, by way of example, specific embodiments of the present invention. The following description is therefore not to be taken in a limiting sense, and the scope of the inventions is defined only by the appended claims.

Fig. 1 shows an example of a computer system 10 configured to provide information about any software protections contained in a file to be analyzed.

The system 10 comprises, for example, a general-purpose computing device 20, in the form of a conventional personal computer, which includes a processing unit 21, a system memory 22 and a system bus 23 that couples the system memory 22 and other system components to the processing unit 21.

The system bus 23 can be any of the various types of bus capable of providing a communication channel between different hardware devices for the exchange of information. The system memory 22 comprises, for example, a read-only memory (ROM) 24 and a random access memory (RAM) 25. A basic input/output system (BIOS) 26, stored in the ROM 24, contains the basic routines that transfer information between the components of the personal computer 20. The BIOS 26 also contains the system startup routines. The personal computer 20 further comprises a hard disk drive 27 for reading from and writing to at least one hard disk 29. The hard disk drive 27 is connected to the system bus 23 via a hard disk drive interface 32.

For example, the system 10 includes the hard disk 29, but it could comprise other types of storage media, such as memory cards, external hard disks, RAM, ROM and the like.

Program modules can be stored on the hard disk 29, on the ROM 24 and on the RAM 25.

A user can enter commands and information into the personal computer 20 through one or more input devices such as, for example, a keyboard 40 and an optical pointing device 42. These and other input devices are often connected to the processing unit 21 through a specific input interface 46, coupled to the system bus 23, which depends on the type of port used, such as a serial port, a parallel port, a USB port, etc. A monitor 47 or other display device is also connected to the system bus 23 via an interface such as a video adapter 48. In addition to the monitor, personal computers may also include other output peripherals (not shown) such as a printer.

The personal computer 20 can operate in a data exchange network, using logical connections to one or more remote computers such as the remote computer 49. The remote computer 49 can be another personal computer, a server, a router, a network PC or another network node. It typically comprises many or all of the components described above in relation to the personal computer 20; however, in the example of Fig. 1 only a storage device 50 is shown for simplicity. The logical connections shown in Fig. 1 may comprise a LAN and/or WAN network 51 of the kind common in offices, corporate computer networks, intranets and the Internet.

When in a LAN/WAN network environment, the personal computer 20 connects to the network 51 through a network interface or adapter 53, which can be a wired or wireless network card. In a network environment, program modules depicted as residing within the personal computer 20, or portions of them, may be stored in the remote storage device 50.

The program modules may comprise: the operating system 35, one or more application programs 36, at least one neural network NN(Pi) (processing module 33) and a training module MOD_TRAIN 34. In particular, a plurality of neural networks NN(Pi) is provided, each associated with a particular software protection.

Each of the neural networks NN(Pi) can be implemented in hardware, in software or in a combination of the two. The training module MOD_TRAIN 34 has the task of training each neural network NN(Pi) by means of a data set, that is, a collection of data used as samples in order to "teach" the neural network NN(Pi) how to react to specific input data.

As will be described in more detail below, each neural network NN(Pi) is trained to obtain information about a specific protection technique possibly present in a file to be analyzed.
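The arrangement of one network per protection can be pictured as a mapping from protection identifiers to independent binary detectors. The sketch below is only illustrative and is not taken from the patent: `ThresholdDetector` is a hypothetical stand-in for a neural network NN(Pi), the protection names and the "branch density" feature are assumptions, and `train_all` merely plays the role of the training module MOD_TRAIN.

```python
from dataclasses import dataclass


@dataclass
class ThresholdDetector:
    """Hypothetical stand-in for one network NN(Pi): a single learned threshold."""
    threshold: float = 0.0

    def fit(self, features, labels):
        # Put the threshold midway between the mean feature value of the
        # protected samples and that of the unprotected samples.
        pos = [f for f, y in zip(features, labels) if y]
        neg = [f for f, y in zip(features, labels) if not y]
        self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

    def predict(self, feature):
        return feature > self.threshold


def train_all(dataset, protections):
    """Plays the role of MOD_TRAIN: fit one detector per protection.

    `dataset` is a list of (feature, set_of_applied_protections) pairs.
    """
    detectors = {}
    for p in protections:
        features = [f for f, _ in dataset]
        labels = [p in applied for _, applied in dataset]
        detector = ThresholdDetector()
        detector.fit(features, labels)
        detectors[p] = detector
    return detectors


# Toy data: the feature is an assumed "branch density" of a function.
dataset = [
    (0.9, {"flattening"}),
    (0.8, {"flattening", "opaque_predicates"}),
    (0.2, set()),
    (0.1, set()),
]
detectors = train_all(dataset, ["flattening"])
```

Keeping the detectors separate means each one answers a single yes/no question about its own protection, and unprotected samples serve as negative examples for all of them.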

Figure 2 shows a schematic example of a binary file that may belong, by way of example, to an application or to a software library. The binary file is made up of a plurality of functions FNZ 1 to FNZ n, each of which is composed of a sequence of assembly instructions, also called lines of code or, more generally, code. One or more of said plurality of functions contained in the binary file may need software protection if their content represents an asset, that is, if it constitutes a value in economic and/or know-how terms.

The assets that can constitute a critical area within the binary file can be, for example but not exclusively, proprietary algorithms (or other intellectual property), cryptographic secrets, or security checks such as the license checks of commercial software.

In Figure 2, the function FNZ 1 and the function FNZ 6 are marked with a shield symbol to indicate that these two functions are the result of applying software protections, since they contain at least one asset. Note that particular lines of code or functions might have been protected in order to confuse attackers even though they are not assets.

By contrast, the function FNZ 4 is marked with an X symbol to indicate that it is not protected by any kind of software protection.

Software protections, once applied to the code, leave their own fingerprint, which is anomalous with respect to unprotected code.

Examples of fingerprints present in the code after a software protection has been applied could be particularly complex control flows or logical conditions.

Each software protection has a characteristic fingerprint that could make it possible to deduce some information, such as which peculiarities the protected assets have and which security properties it was decided to apply. Examples of software protections are: control flow flattening, opaque predicates, branch functions, encode arithmetic, conversion of data into functions (for example with Mealy machines), merging or splitting of variables, re-encoding of variables (for example, xor masking, residue number encoding, ...), white-box cryptography, virtualization through the use of virtual machines or JIT compilation, call stack checks, code guards, control flow tagging, anti-debugging, code mobility, client/server code splitting, anti-cloning and software attestation.
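To make one of the listed protections concrete, the following minimal sketch shows an opaque predicate. It is an illustration written for this purpose, not code from the patent: the function names and the license rule are hypothetical. The added condition is always true, yet it introduces an extra branch and a more complex logical condition, which is the kind of trace the text calls a fingerprint.

```python
def license_check(key: int) -> bool:
    """Original, unprotected logic (hypothetical example)."""
    return key % 97 == 42


def license_check_protected(key: int, x: int) -> bool:
    """Same logic guarded by an opaque predicate.

    x * (x + 1) is a product of two consecutive integers, so it is always
    even: the condition below always holds, but this is not evident from
    the code alone, and the else branch is dead code meant to mislead.
    """
    if (x * (x + 1)) % 2 == 0:
        return key % 97 == 42
    else:
        return key % 89 == 7  # never executed
```

Both functions return the same result for every `key`, whatever value of `x` is passed; only the shape of the code changes.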

Figure 3 shows, by means of a flowchart, a preferred embodiment of a neural network configuration method 100 that can be implemented, for example, by means of the system 10.

The method 100 makes it possible to configure one or more neural networks NN(Pi), each to be used to obtain information about a specific protection technique possibly present in a file to be analyzed.

After a start phase, the method 100 provides a first phase 110 in which one or more source files (that is, files expressed in a high-level language) are provided for use in training a neural network. For brevity, consider the use of a single source file. According to the example described, this source file is initially free of software protections.

Subsequently, according to an example, one or more software protections are applied to this source file. The example considered refers to the case in which several software protections P1, ..., Pn are applied. In particular, these protections P1, ..., Pn are applied to one or more functions of the source file.

As is also reiterated below, protections are not applied to all the functions of the source file; some of them remain unprotected. This will make it possible to train the neural networks NN(Pi) to recognize unprotected functions as well.
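One way to picture this selective application is a plan that assigns a list of protections to some functions and an empty list to the others. The sketch below is an assumption-laden illustration: the helper `plan_protections`, the random selection policy and the `protect_ratio` parameter are hypothetical and are not described in the source.

```python
import random


def plan_protections(functions, protections, protect_ratio=0.5, seed=0):
    """Return {function_name: protections}; an empty list means unprotected.

    The plan always keeps at least one protected and at least one
    unprotected function, so that both kinds of samples exist.
    """
    if len(functions) < 2:
        raise ValueError("need at least two functions")
    rng = random.Random(seed)
    n_protected = round(len(functions) * protect_ratio)
    n_protected = max(1, min(len(functions) - 1, n_protected))
    chosen = set(rng.sample(functions, n_protected))

    plan = {}
    for name in functions:
        if name in chosen:
            # Apply between one and all of the available protections.
            k = rng.randint(1, len(protections))
            plan[name] = sorted(rng.sample(protections, k))
        else:
            plan[name] = []
    return plan


functions = ["FNZ 1", "FNZ 2", "FNZ 3", "FNZ 4", "FNZ 5", "FNZ 6"]
protections = ["P1", "P2", "P3"]
plan = plan_protections(functions, protections)
```

The resulting plan doubles as the ground truth for later labeling, since it records exactly which protections each function received.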

According to the example described here, the source file provided with the software protections P1, ..., Pn is then compiled, obtaining a binary file.

Note that a binary file to which the protections P1, ..., Pn have already been applied may be provided directly, thereby avoiding the first phase 110. Note also that some protections are applied to the source file while others are applied directly to the binary file.

In a second phase 120, the binary file, obtainable for example from the preceding compilation, is disassembled in order to extract the plurality of functions it contains (that is, its portions of code).

The disassembly operation makes it possible to obtain the previously compiled file in the form of assembly code, by replacing each machine-language opcode with a sequence of characters that represents it in mnemonic form, that is, in a way that an operator can easily interpret. Data and memory addresses can also be rewritten in assembly in a numerical base, for example hexadecimal, or in symbolic form using text strings (identifiers). The program in assembly format is therefore relatively more readable than the corresponding binary.

Esempi di codici operativi in formato mnemonico sono la ADD per l?operazione di somma o la MOV ad indicare un?operazione di copia. Examples of operational codes in mnemonic format are the ADD for the addition operation or the MOV to indicate a copy operation.

Inoltre, ? effettuata una terza fase 130 che ha lo scopo di raccogliere in un data set, o collezione di dati, la pluralit? di funzioni protette estratte dal file assembly contestualmente all?indicazione delle protezioni presenti per ogni funzione. Tale indicazione ? un identificativo del tipo di protezione applicata P1,?,Pn. Il data set pu? avere una struttura matriciale. Furthermore, ? carried out a third phase 130 which has the purpose of collecting in a data set, or collection of data, the plurality? of protected functions extracted from the assembly file together with the indication of the protections present for each function. This indication? an identifier of the type of protection applied P1,?,Pn. The data set can have a matrix structure.

Si noti che il data set pu? essere anche ottenuto mediante una libreria di funzioni protette e associate ad una libreria di protezioni, senza effettuare l?applicazione delle protezioni alle funzioni del file binario e l?estrazione delle funzioni dal file binario indicata nella seconda fase 120. Note that the data set can also be obtained by means of a library of protected functions associated with a library of protections, without carrying out the application of the protections to the functions of the binary file and the extraction of the functions from the binary file indicated in the second phase 120.

The set of information comprising the protected function and the indication of the protections applied to that function defines a sample CHMP. Such a sample CHMP is, for example, a row of the data set in the following format:

(asmj, P1,…,Pn)

where asmj is the j-th protected function extracted from the assembly file and the identifiers P1,…,Pn represent the specific software protections applied to the function asmj.

In particular, the identifiers P1,…,Pn can be Boolean variables indicating whether the protection Pi has been applied to the function asmj. For functions to which no protection has been applied, all the identifiers P1,…,Pn take the value "false".
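As an illustration only, a sample CHMP with Boolean protection identifiers could be represented as in the following Python sketch. The names `PROTECTIONS`, `Sample` and `make_sample`, the three protection identifiers and the toy assembly lines are assumptions introduced here; they do not appear in the original description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple

# Hypothetical list of protection identifiers P1..Pn (here n = 3).
PROTECTIONS = ["P1", "P2", "P3"]


@dataclass
class Sample:
    """One row of the data set: (asm_j, P1, ..., Pn)."""
    asm: List[str]            # assembly instructions of the function
    labels: Tuple[bool, ...]  # one Boolean per protection, in PROTECTIONS order


def make_sample(asm: Iterable[str], applied: Set[str]) -> Sample:
    """Build a sample, setting each identifier to True only if it was applied."""
    return Sample(asm=list(asm), labels=tuple(p in applied for p in PROTECTIONS))


# A vanilla (unprotected) function: every identifier is False.
vanilla = make_sample(["mov r0, r1"], set())
# A function protected with P1 and P3.
protected = make_sample(["add r0, r2, 5"], {"P1", "P3"})
```

Stacking the `labels` tuples of all samples gives the Boolean part of the matrix structure mentioned for the data set.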

The samples CHMP are used in the subsequent phases of the method 100. Note that, for a neural network to be trained well, the data set must contain a sufficiently large number of samples CHMP.

According to a preferred embodiment of the method 100, the first phase 110 (compilation phase) and the second phase 120 (disassembly phase) can be performed several times using different combinations of protections, which provide for the application of one or more sequences of protections to each function.

As already mentioned, the data set is advantageously also built using compiled functions to which no software protection has been applied. Functions compiled without any software protection are called vanilla functions and serve to balance the data set. The purpose of this balancing is to prevent an unbalanced data set from negatively affecting the learning process of a neural network, described later in this document, by leading it to focus on prevalent events while neglecting rare ones. In particular, the vanilla functions let the neural network learn what unprotected functions look like, so that it can distinguish them from functions protected with the specific protection technique that the neural network is trained to identify.

In a fourth phase 140, each function (asmj) belonging to the plurality of functions is encoded, converting it into an encoded function CHMP_COD. This operation transforms into a sequence of numerical values the instructions contained in the functions of each sample CHMP, expressed in assembly language, in particular the instructions containing opcodes, data and addresses expressed in mnemonic form.

At the end of said encoding phase 140, each encoded function CHMP_COD belonging to the data set is expressed as a sequence of numerical values and is therefore suitable for use in a training phase of a neural network. The encoding phase 140 can optionally comprise two further sub-phases that allow a neural network to converge more quickly: the masking sub-phase and the scaling sub-phase.

In a fifth phase 150, one of the neural networks NN(Pi) is trained. For example, a first neural network NN(P1), associated with a first software protection P1, is trained.

In particular, the first neural network NN(P1) is trained to provide a first probability index PIi (in this case i = 1) indicating the probability that a given j-th function (asmj) is protected by the first protection P1.

Furthermore, when a function has been identified to which the first protection P1 is applied with a first probability index PIi,j above a given threshold, the first neural network NN(P1) can also provide a second index FAi,j,k. This second index FAi,j,k represents the possibility that the first protection P1 (Pi with i = 1) has been applied to the instructions of a specific area (generically denoted by the index k) of that function (asmj). For example, the higher the value of the second index, the more "suspect" the area.

For example, for each instruction of a function, the first neural network NN(P1) indicates the probability that the instruction exhibits the first protection P1. This makes it possible to identify the instructions of the function to which a given protection has been applied or which, alternatively, were introduced by applying the protection; in other words, to locate a protection within a function.

The first neural network NN(P1) is trained using the data set comprising the encoded functions CHMP_COD relating to the first protection P1 and the encoded functions CHMP_COD protected with every possible pair of protections Pi (for example, P1+P2, P1+P3, ..., P1+Pn). Note that combinations longer than those indicated above can also be used, for example triples (e.g. P1+P2+P3), quadruples (e.g. P1+P2+P3+P4), etc. This could improve accuracy in cases where particular longer combinations of protections (triples, quadruples, etc.) are to be identified, for example because they are known to be used in the state of the art.

According to the example described, the training is carried out by the training module MOD_TRAIN 34.

The training can be repeated for each neural network NN(Pi) relating to the other protections of interest P2-Pn.

Note that the neural network NN(Pi) can be chosen from among neural networks capable of handling sequences and/or neural networks having an attention mechanism.

One type of neural network capable of handling sequences is, for example, a recurrent neural network, i.e. a network containing feedback connections. This feedback creates a sort of "memory" of what happened in the recent past, making information processed at time T-1 or T-2 available at time T, so that the current output depends not only on the current input values but also on previous inputs. An example of a recurrent neural network is the LSTM (Long Short-Term Memory) network.

The idea behind the attention mechanism is to be able to define which parts of the input vector the neural network must focus on in order to generate the appropriate output. In other words, an attention mechanism makes it possible to process some input data while also attending to the relevant information contained in other input data. The attention mechanism also makes it possible to mask data that contain no relevant information. Examples of neural networks that use the attention mechanism are recurrent neural networks, such as the aforementioned LSTM, or neural networks such as BERT (Bidirectional Encoder Representations from Transformers).

The neural networks NN(Pi), trained as described above, can be used in a classification method applied to a binary file to be analyzed (that is, a file distinct from the one used for training in the configuration method 100).

In this case, the binary file to be analyzed is disassembled, and the functions (asm) to be analyzed are extracted from the resulting assembly file. This can be done with a conventional disassembler.

Each function (asm) is then processed by each neural network NN(Pi).

Each of these neural networks NN(Pi) returns a respective first probability index PIi,j associated with a specific protection Pi and also, preferably, the second index FAi,j,k for each function.

The set of values of the first probability index PIi,j makes it possible to classify the protections P1,…,Pn possibly present in the analyzed binary file.

The values of the second indices FAi,j,k are associated with further indications identifying the position, within each function, of the instructions having those values of the second index.

The classification method therefore makes it possible to assess the security quality of the protections applied to the analyzed binary file, because the detection of such protections by the neural networks NN(Pi) indicates that the protection is quickly identifiable, and hence that an attacker is 'delayed' less.
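The classification loop described above, in which every extracted function is processed by every network NN(Pi), might be sketched as follows. This is a minimal sketch under stated assumptions: the "networks" are stand-in callables returning a probability index, and the function name `classify`, the 0.5 threshold and the toy data are hypothetical, not taken from the description.

```python
from typing import Callable, Dict, List, Sequence

# Stand-in for a trained NN(Pi): maps a function's instructions to PI_{i,j}.
Network = Callable[[Sequence[str]], float]


def classify(
    functions: Dict[str, Sequence[str]],
    networks: Dict[str, Network],
    threshold: float = 0.5,  # hypothetical decision threshold
) -> Dict[str, List[str]]:
    """Run every function through every network and keep the protections
    whose probability index exceeds the threshold."""
    detected: Dict[str, List[str]] = {}
    for name, asm in functions.items():
        detected[name] = [p for p, nn in networks.items() if nn(asm) > threshold]
    return detected


# Toy stand-ins for trained networks (NOT real models).
toy_networks: Dict[str, Network] = {
    "P1": lambda asm: 0.9 if any("cmp" in ins for ins in asm) else 0.1,
    "P2": lambda asm: 0.2,
}
toy_functions = {"f": ["cmp r0, 0", "bne 0x10"], "g": ["mov r0, r1"]}
result = classify(toy_functions, toy_networks)
```

With these toy inputs, a protection that is flagged for a function is one that the stand-in network identifies readily, which is the sense in which detection indicates an easily identifiable protection.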

With reference to examples of practical applications, note that companies specializing in software protection typically operate with two distinct teams: the first (protection team) actually protects the software, while the second (reverse engineering team) emulates the behavior of possible attackers, trying to identify the assets present within the application and the protections used, and then to remove or circumvent the latter, compromising the security of the assets. The protection team proposes an initial solution, whose quality is evaluated by the reverse engineering team. These operations are then performed iteratively until a sufficient level of protection has been reached (or the available time has run out).

The classification method described, based on the configuration method 100, can therefore be used by companies specializing in software protection in two different ways. The protection team can obtain a rapid assessment of the identifiability of the chosen protections (without waiting for the results of the reverse engineering activities). At the same time, the classification method described can also be used by the reverse engineering team to automate and speed up the identification of assets, an essential first step in their activities.

The configuration and classification methods described above are applicable to every type of protection listed above and, with particular effectiveness, to the following types of protection, as emerged from tests carried out by the Applicant: control flow flattening, opaque predicates, branch functions and encode arithmetic.

Particular embodiments of some of the phases of the method 100 described above are described below.

Encoding

The encoding phase 140, as stated above, transforms the instructions contained in the functions of each sample CHMP into sequences of numerical values. The aim of this transformation is to encode each function (asmj) of the samples CHMP belonging to the data set as a matrix of numerical values whose rows are the encodings of the instructions making up the function.

The encoded function, as shown in figure 4a, has two dummy pseudo-instructions (also encoded together with the rest of the function), <begin> and <end>, added respectively at the beginning and at the end of the function (asmj) to make its boundaries explicit. These dummy pseudo-instructions are inserted in a preliminary stage of the encoding phase 140.

Optionally, still in said preliminary stage, the function can be truncated to its first n instructions. Purely by way of example, figure 4c shows the encoded function of figure 4a truncated to a predetermined maximum size of 4. Truncation becomes necessary when the type of neural network chosen to be trained as a classifier requires that the input sequence not exceed a maximum size.
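The preliminary stage described above (adding the boundary pseudo-instructions and optionally truncating) can be sketched as follows. The function name `frame_and_truncate` is hypothetical. The sketch also assumes that truncation is applied after the markers are added and that the markers count toward the maximum size, so a truncated function loses its <end> marker; the text does not state this, and figure 4c may show a different convention.

```python
from typing import List, Optional, Sequence

BEGIN, END = "<begin>", "<end>"


def frame_and_truncate(
    instructions: Sequence[str], max_len: Optional[int] = None
) -> List[str]:
    """Add the boundary pseudo-instructions, then keep at most max_len rows.

    Assumption: the markers count toward max_len, so truncation may drop <end>.
    """
    framed = [BEGIN, *instructions, END]
    if max_len is not None:
        framed = framed[:max_len]
    return framed


short = frame_and_truncate(["mov r0, r1"])
cut = frame_and_truncate(["i1", "i2", "i3", "i4"], max_len=4)
```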

Figure 4b instead shows the detail of an i-th instruction belonging to the function of figure 4a after it has been encoded. According to the example, the encoded instruction is composed of:

- an opcode representing the numerical encoding of the opcode of the instruction;

- an address representing the address of the instruction;

- a series of parameters op1-op6 representing the encoding of any operands of the instruction.

The number of parameters can vary according to the hardware architecture chosen.

Figure 4b represents the generalization of an instruction based on an ARM-type architecture, where the most complex instructions can have up to six operands. However, the encoding phase 140 places no limit on the maximum number of operands, so that it can easily be adapted to the instructions of different hardware architectures.

An example of the encoding of an instruction carried out by the present method is described below; the numerical values reported for the encoding are purely illustrative, since they can change according to the hardware architecture and the neural network chosen.

An example of an assembly instruction is "0x1234 add r0, r2, 5", which expresses the assignment r0 = r2 + 5 and can be divided into five different parts:

- 0x1234: the numerical value indicating the address of the instruction, which in the example considered is an integer expressed in hexadecimal, equivalent to 4660 in decimal, and indicates the position of the instruction in memory;

- add: the type of operation (opcode) of the instruction, which in this case is an addition;

- r0: the first operand of the instruction, which refers to the memory register r0;

- r2: the second operand of the instruction, which refers to the memory register r2;

- 5: the third operand of the instruction, which indicates the integer 5.

In this example, the instruction uses only three operands; the fourth, fifth and sixth operands are absent.
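The decomposition of the example instruction into its parts can be sketched in Python as follows. The function name `split_instruction` is hypothetical, and the sketch assumes the simple "address opcode op1, op2, ..." layout of the example; it is not the parser used by the method, and it would not handle operands that themselves contain commas.

```python
from typing import Dict, List, Union


def split_instruction(line: str) -> Dict[str, Union[int, str, List[str]]]:
    """Split 'ADDRESS OPCODE OP1, OP2, ...' into address, opcode and operands."""
    address_text, rest = line.split(maxsplit=1)
    opcode, _, operand_text = rest.partition(" ")
    # Naive comma split: adequate for the example, not for bracketed operands.
    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    return {
        "address": int(address_text, 16),  # hexadecimal address as an integer
        "opcode": opcode,
        "operands": operands,
    }


parts = split_instruction("0x1234 add r0, r2, 5")
```

For the example instruction this yields the address 4660, the opcode `add` and the three operands `r0`, `r2` and `5`; the remaining operand slots are simply absent.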

For example, the encoding phase 140 transforms each instruction row of the functions of each sample CHMP into a sequence of 1250 values divided as follows:

- 236 dedicated to the encoding of the opcode;

- 4 dedicated to the encoding of the address of the instruction;

- 273 dedicated to the encoding of the first operand;

- 239 dedicated to the encoding of the second operand;

- 239 dedicated to the encoding of the third operand;

- 239 dedicated to the encoding of the fourth operand;

- 16 dedicated to the encoding of the fifth operand;

- 4 dedicated to the encoding of the sixth operand.
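The widths listed above add up to 236 + 4 + 273 + 239 + 239 + 239 + 16 + 4 = 1250. The following Python sketch checks this and derives the position of each field within an instruction row. The names `FIELD_WIDTHS` and `field_slices`, and the assumption that the fields are laid out contiguously in the order listed, are introduced here for illustration.

```python
from typing import Dict, List, Tuple

# Field widths of the example, assumed to be laid out in the order listed.
FIELD_WIDTHS: List[Tuple[str, int]] = [
    ("opcode", 236),
    ("address", 4),
    ("op1", 273),
    ("op2", 239),
    ("op3", 239),
    ("op4", 239),
    ("op5", 16),
    ("op6", 4),
]


def field_slices(widths: List[Tuple[str, int]]) -> Tuple[Dict[str, slice], int]:
    """Return the slice occupied by each field and the total row length."""
    slices: Dict[str, slice] = {}
    start = 0
    for name, width in widths:
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


SLICES, ROW_LENGTH = field_slices(FIELD_WIDTHS)
```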

Note that the encoding of the instruction must always be a sequence of 1250 values, even when the instruction has fewer operands than the maximum, which in the example is 6. Since the ADD instruction considered has only three operands, the absence of the missing operands is also encoded so as not to alter the length of the sequence of values.

The opcode is encoded as a sequence of 236 numerical values representing the embedding of the opcode itself. Embedding is a standard modelling technique in which words or numbers are mapped to numerical sequences. Said values were pre-computed by means of a neural network using standard algorithms, such as the CBOW (Continuous Bag Of Words) and Skip-gram algorithms, described in the document

"Distributed Representations of Words and Phrases and their Compositionality", in Proceedings of the 26th International Conference on Neural Information Processing Systems, Volume 2 (NIPS'13), pp. 3111-3119, available at the following address: https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf

The address is encoded as a sequence of 4 numerical values, each determined according to a specific criterion.

The first value of the sequence is set to 1 when the instruction contains an address, or to 0 when the address is missing, as in the case of some special instructions. The <begin> and <end> instructions are examples of special instructions.

The second value of the sequence is set to 0 when the instruction has no address, or when the value results from a mathematical operation performed on invalid operands (NaN) or is infinite. In all other cases the second value of the sequence is set to 1.

The third value of the sequence contains the value of the address expressed in decimal when present; otherwise it is set to 0.

The fourth value of the sequence is set to 1 when the address of the instruction contains an exclamation mark (!), as happens for some special addresses; otherwise it is set to 0. The exclamation mark is used in certain (very rare) cases in ARM to indicate the write-back operation, i.e. that the result of an operation must be written to a certain address. For example, if an instruction contains the address "1234!", the result of the instruction must be written to memory cell 1234. If the exclamation mark is absent, the result of the operation is not written to memory cell 1234 (which will typically only be read).

According to these criteria, the address 0x1234 contained in the example instruction is encoded as (1, 1, 4660, 0).
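The four criteria for the address can be summarized in a short Python sketch. The function name, its parameters and the choice of storing 0 as the third value for a NaN or infinite address are assumptions made for illustration; the description does not prescribe an implementation.

```python
import math

def encode_address(value=None, write_back=False):
    """Return the 4-value address encoding (illustrative helper).

    value: the numeric address, or None when the instruction has no
           address (e.g. the <begin> and <end> special instructions).
    write_back: True when the address is written with a trailing "!".
    """
    if value is None:
        # No address: the first, second and third values are all 0.
        return [0, 0, 0, 0]
    flag = 1 if write_back else 0
    if math.isnan(value) or math.isinf(value):
        # Address present but not a valid finite number.
        # Storing 0 as the third value here is an assumption.
        return [1, 0, 0, flag]
    return [1, 1, value, flag]

example = encode_address(0x1234)  # [1, 1, 4660, 0]
```

The hexadecimal literal 0x1234 is simply the integer 4660, which is why the third value of the example is 4660.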

As previously mentioned, the first operand is encoded as a sequence of 273 values, divided into 9 sections in order to represent the different types of operand.

A first section, called "string", is used to encode a string-type operand in a single value of the sequence. A second section, called "number", is used to encode a numeric operand in a sequence of 4 values. A third, fourth and fifth section, called "endian", "cond" and "CPU state", are used to encode the internal states of the processor in sequences of 2, 15 and 4 values respectively. The sixth section, called "registers", is used to encode the registers in a sequence of 154 values. A seventh section, called "barrier", is used to encode memory barriers in a sequence of 12 values. A memory barrier is a type of operation that allows the CPU to impose a constraint on the ordering of operations, preventing the out-of-order execution caused by the performance optimizations of modern CPUs.

An eighth section, called "address", is used to encode memory addresses in a sequence of 65 values. Finally, the ninth section, called "coproc", is used to encode a math coprocessor in a sequence of 16 values.

In the example instruction the first operand is a register, r0, and will therefore be encoded in the sixth section, "registers", while the values of all the other sections will be set to 0.

Of the 154 values of this section, the first 123 are associated with the registers: the first value with register r0, the second with register r1, the third with register r2, and so on. During encoding, the value associated with the register to be encoded is set to 1, while the values associated with the other registers are set to 0. The remaining 31 values are used to encode special mathematical operations on the registers, such as scalings. Since the first operand of our example, r0, does not require any special mathematical operation, the sequence of 31 values is set entirely to 0.
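As a rough illustration, the 154-value "registers" section can be built as a one-hot vector over the 123 registers followed by the 31 special-operation values. The helper name and its parameters are hypothetical, and because the internal layout of the 31 values is not detailed here, the sketch passes them through unchanged.

```python
N_REGISTERS = 123    # one slot per register: r0, r1, r2, ...
N_SPECIAL_OPS = 31   # slots reserved for special operations on registers

def encode_register_section(register_index=None, special_ops=None):
    """Return the 154-value 'registers' section (illustrative helper).

    register_index: 0 for r0, 1 for r1, ..., or None when the operand
                    is not a register.
    special_ops: optional list of 31 values; defaults to all zeros.
    """
    one_hot = [0] * N_REGISTERS
    if register_index is not None:
        one_hot[register_index] = 1
    ops = [0] * N_SPECIAL_OPS if special_ops is None else list(special_ops)
    if len(ops) != N_SPECIAL_OPS:
        raise ValueError("expected 31 special-operation values")
    return one_hot + ops

section_r0 = encode_register_section(0)  # register r0, no special operation
```

For r0 the only non-zero entry is the first one; for r2 it would be the third.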

The second operand, similarly to the first, is encoded as a sequence of 239 values, divided into 4 sections in order to represent the different types of operand.

A first section, called "number", is used to encode a numeric operand in a sequence of 4 values. A second section, called "registers", is used to encode the registers in a sequence of 154 values. A third section, called "address", is used to encode memory addresses in a sequence of 65 values. A fourth section, called "reg coproc", is used to encode a register of a math coprocessor in a sequence of 16 values.

In the example instruction the second operand is again a register, r2, and will therefore be encoded in the second section, "registers", while the values of all the other sections will be set to 0.

Of the 154 values of this section, the first 123 are associated with the registers: the first value with register r0, the second with register r1, the third with register r2, and so on. During encoding, the value associated with the register to be encoded is set to 1, while the values associated with the other registers are set to 0. The remaining 31 values are used to encode special mathematical operations on the registers, such as scalings. Since the second operand of our example, r2, does not require any special mathematical operation, the sequence of 31 values is set entirely to 0.

The third operand is encoded as a sequence of 239 values, divided into 4 sections as for the second operand, in order to represent the different types of operand. A first section, called "number", is used to encode a numeric operand in a sequence of 4 values. A second section, called "registers", is used to encode the registers in a sequence of 154 values. A third section, called "address", is used to encode memory addresses in a sequence of 65 values. A fourth section, called "reg coproc", is used to encode a register of a math coprocessor in a sequence of 16 values.

In the example instruction the third operand is a number, 5, and will therefore be encoded in the first section, "number", while the values of all the other sections will be set to 0.

The first value of the "number" section is set to 1 when the operand contains a numeric value; otherwise it is set to 0. The second value of the "number" section is set to 1 when the numeric value is neither NaN nor infinite; otherwise it is set to 0. The third value of the "number" section is set to the numeric value of the operand, which in the example is 5. The fourth value of the "number" section is set to 1 when a notation with the exclamation mark (!) is used; otherwise it is set to 0.

According to the criteria defined for the "number" section, the third operand of the example, the number 5, is encoded as (1, 1, 5, 0).

The fourth operand is encoded as a sequence of 239 values, divided into 4 sections as for the second and third operands, in order to represent the different types of operand. A first section, called "number", is used to encode a numeric operand in a sequence of 4 values. A second section, called "registers", is used to encode the registers in a sequence of 154 values. A third section, called "address", is used to encode memory addresses in a sequence of 65 values. A fourth section, called "reg coproc", is used to encode a register of a math coprocessor in a sequence of 16 values.

The example instruction has no fourth operand, but the operand must still be encoded in order to keep the length of the sequence of values consistent. In this case the values of all the sections are set to 0.
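The 239-value layout shared by the second, third and fourth operands can be sketched as follows. The offsets follow from the section widths given above (4 + 154 + 65 + 16 = 239); the helper name, the `kind` labels and the restriction to finite numbers without "!" are simplifying assumptions for this illustration.

```python
# Start offsets of the four sections inside a 239-value operand vector.
NUMBER_AT, REGISTERS_AT, ADDRESS_AT, COPROC_REG_AT = 0, 4, 158, 223
OPERAND_WIDTH = 239

def encode_operand(kind=None, payload=None):
    """Encode a second, third or fourth operand (illustrative helper).

    kind is None for an absent operand, "number" for a finite numeric
    operand written without "!", or "register" with payload equal to
    the register index (0 for r0, 1 for r1, ...). Address and
    coprocessor-register operands are left out of this sketch.
    """
    vec = [0] * OPERAND_WIDTH
    if kind == "number":
        vec[NUMBER_AT:NUMBER_AT + 4] = [1, 1, payload, 0]
    elif kind == "register":
        vec[REGISTERS_AT + payload] = 1
    elif kind is not None:
        raise ValueError(f"unsupported operand kind: {kind}")
    return vec

second = encode_operand("register", 2)  # r2
third = encode_operand("number", 5)     # 5
fourth = encode_operand()               # absent operand, all zeros
```

An absent operand still occupies its 239 positions, which is what keeps every instruction vector the same length.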

The fifth operand is encoded as a sequence of 16 values forming a single section, called "reg coproc", which is used to encode a register of a math coprocessor.

In this case too, the example instruction has no fifth operand, but the operand must still be encoded in order to keep the length of the sequence of values consistent. As in the previous case, since the operand is absent, all the values of the section are set to 0.

The sixth operand is encoded as a sequence of 4 values forming a single section, called "number", which is used to encode a numeric operand.

In this last case too, the example instruction has no sixth operand, but the operand must still be encoded in order to keep the length of the sequence of values consistent. As in the previous cases, since the operand is absent, all the values of the section are set to 0.

In conclusion, at the end of the encoding step the example instruction "0x1234 add r0, r2, 5" is represented as a numerical sequence of 1250 values composed as follows:

- the first 236 values represent the opcode add and are encoded with the sequence (0.5741776823997498, 0.5895169377326965, 0.44707465171813965, 0.5283305644989014, …);

- the next 4 values represent the address 0x1234 and are encoded with the sequence (1, 1, 4660, 0);

- the next 273 values represent the first operand r0 and are encoded with the sequence (0, …, 0, 1, 0, …);

- the next 239 values represent the second operand r2 and are encoded with the sequence (0, …, 0, 0, 0, 1, 0, …);

- the next 239 values represent the third operand 5 and are encoded with the sequence (1, 1, 5, 0, …);

- the next 239 values represent the absent fourth operand and are encoded with the sequence (0, 0, 0, 0, 0, 0, 0, 0, 0, …);

- the next 16 values represent the absent fifth operand and are encoded with the sequence (0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);

- the next 4 values represent the absent sixth operand and are encoded with the sequence (0, 0, 0, 0).
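The composition listed above can be checked with a small Python sketch that derives the position of each field from its width and confirms that the widths add up to 1250. The field names and the `assemble` helper are illustrative assumptions, not part of the described method.

```python
# Field widths of one encoded instruction, in order of appearance.
FIELDS = [
    ("opcode", 236), ("address", 4), ("operand_1", 273),
    ("operand_2", 239), ("operand_3", 239), ("operand_4", 239),
    ("operand_5", 16), ("operand_6", 4),
]

def field_slices(fields):
    """Map each field name to its slice in the full vector."""
    slices, start = {}, 0
    for name, width in fields:
        slices[name] = slice(start, start + width)
        start += width
    return slices, start

SLICES, TOTAL = field_slices(FIELDS)

def assemble(parts):
    """Build the full vector; fields missing from `parts` stay at 0."""
    vec = [0] * TOTAL
    for name, values in parts.items():
        s = SLICES[name]
        if len(values) != s.stop - s.start:
            raise ValueError(f"wrong width for field {name}")
        vec[s] = values
    return vec

vec = assemble({"address": [1, 1, 4660, 0]})
```

With these widths the address occupies positions 236 to 239 (counting from 0) and the first operand starts at position 240.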

As previously stated, the encoding step 140 can optionally comprise two further sub-steps that allow a neural network to converge more quickly: the masking sub-step and the rescaling sub-step.

The masking sub-step removes from the encoded samples of the dataset every column that has the same value in all samples. These columns can be removed because the encoding step 140 is able to represent more instructions than a processor can actually execute, and moreover some of the values are never used in practice. The masking sub-step reduces the length of the encoded sequences, so that during training the neural network does not spend time on data that is not actually important. In the example case, the masking sub-step reduces the length of the sequence of values from 1250 to 768.
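A minimal Python sketch of the masking idea follows. It assumes that the dataset has been flattened into a list of equally long encoded instruction vectors (rows); the function names are hypothetical.

```python
def fit_mask(rows):
    """Return the indices of the columns that take more than one value."""
    keep = []
    for col in range(len(rows[0])):
        first = rows[0][col]
        if any(row[col] != first for row in rows):
            keep.append(col)
    return keep

def apply_mask(rows, keep):
    """Drop every column whose index is not listed in `keep`."""
    return [[row[c] for c in keep] for row in rows]

rows = [
    [0, 1, 5, 0],
    [0, 0, 5, 1],
    [0, 1, 5, 1],
]
keep = fit_mask(rows)            # columns 0 and 2 are constant
masked = apply_mask(rows, keep)  # [[1, 0], [0, 1], [1, 1]]
```

The mask would be computed once on the training dataset and then applied unchanged to any later sample, so that all vectors keep the same reduced length.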

The rescaling sub-step rescales the values to a range between 0 and 1. Although the encoded samples of the dataset mostly contain the values 0 and 1, in some cases they contain much larger values. For example, in the 4-value encoding of the address, the third value of the sequence contains the address expressed in decimal, which in the example described above was 4660. Values other than 0 and 1 also occur in the "number" section of the encoding of the first, second, third, fourth and sixth operands. In the previous example the third operand was the number 5, encoded as the third value of the "number" section of the third operand. The rescaling operation prevents larger numbers from being given a greater weight during training, a weight that does not always correspond to greater importance. A greater weight attributed to a given value in the initial phase of training a neural network slows down the learning process, since the network would take longer to learn that the greater weight attributed to such values was unfounded.
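The description only states that values are brought into the range between 0 and 1. One common way to do this, assumed here purely for illustration, is per-column min-max scaling; the function names are hypothetical.

```python
def fit_min_max(rows):
    """Return the (minimum, maximum) pair of every column."""
    return [(min(col), max(col)) for col in zip(*rows)]

def rescale(rows, bounds):
    """Map each value to [0, 1] using its column's minimum and maximum."""
    scaled_rows = []
    for row in rows:
        scaled = []
        for value, (lo, hi) in zip(row, bounds):
            # A constant column carries no information: map it to 0.
            scaled.append(0.0 if hi == lo else (value - lo) / (hi - lo))
        scaled_rows.append(scaled)
    return scaled_rows

rows = [[1, 0], [1, 4660], [0, 2330]]
bounds = fit_min_max(rows)      # [(0, 1), (0, 4660)]
scaled = rescale(rows, bounds)  # [[1.0, 0.0], [1.0, 1.0], [0.0, 0.5]]
```

After rescaling, an address such as 4660 no longer dominates the flags that are already 0 or 1.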

Training

Once the encoding step 140 is complete, the encoded samples CHMP_COD can be used in the training step 150 by feeding them to the neural network NN(Pi).

As previously mentioned, any neural network capable of handling sequences and provided with an attention mechanism can be used in this step. In particular, two possible alternative training steps are described, using two different architectures of such neural networks.

Figure 5 shows a first simplified architecture based on LSTM cells, a type of recurrent neural network with a long-term memory mechanism that allows sequences of data to be processed. The information in these sequences is stored so that, thanks to the presence of cycles, as the sequence advances the information stored in the cells assists the processing of new data. In this way the neural network is able to interpret, in order, the assembly instructions contained in the encoded samples CHMP_COD. A semantic encoding of the function contained in the encoded sample CHMP_COD is used as input to the neural network, where a first layer 1S_LSTM of LSTM cells analyzes the assembly instructions contained in the encoded sample CHMP_COD. These LSTM cells are bidirectional: they analyze the assembly instructions from the first to the last and, in the opposite direction, from the last to the first.

An attention mechanism ATT receives the output of the first layer of cells 1S_LSTM, from which it extracts the attention levels LIV_ATT of the individual instructions, indicating which assembly instructions contain the protection. The higher the attention level, the more likely it is that the assembly instruction is part of a protection. This attention level LIV_ATT corresponds to the second index FAi,j,k described above.

The output of the attention mechanism ATT is added to the output of the first layer of cells 1S_LSTM, so that the information on the attention levels LIV_ATT is included, and the result is then fed to a second layer 2S_LSTM of LSTM cells, which analyzes the assembly instructions contained in the encoded sample.
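The exact form of the attention mechanism ATT is not specified here. Purely as an illustration, the sketch below assumes a simple dot-product attention with a hypothetical learned vector `query`, normalizes the scores with a softmax to obtain one attention level per instruction, and adds the attention-weighted output to the output of the first layer.

```python
import math

def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend_and_add(hidden, query):
    """Compute attention levels and add the attended output to `hidden`.

    hidden: one output vector per instruction (stand-in for the output
            of the first LSTM layer).
    query:  hypothetical learned vector used to score each instruction.
    """
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
    levels = softmax(scores)  # one attention level per instruction
    combined = [[h + a * h for h in vec] for vec, a in zip(hidden, levels)]
    return levels, combined

hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
levels, combined = attend_and_add(hidden, query=[1.0, 1.0])
```

In this toy input the third instruction receives the highest attention level, which under the interpretation given above would mark it as the most likely to belong to a protection.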

The output of the second layer of cells 2S_LSTM is fed to a final layer TRAS_LIN, where a linear transformation and a sigmoid function are used to compute the values of the first index PIi,j for each assembly instruction, in a range between 0 and 1. The final value (or score) CLASS is the score obtained by the last instruction contained in the encoded sample CHMP_COD, i.e. the pseudo-instruction <end>. The closer this score is to 1, the more likely it is that a protection is present. This score CLASS corresponds to the first probability index PIi,j.

A second architecture, shown in Figure 6, is based on the transformer-type system BERT (Bidirectional Encoder Representations from Transformers), i.e. a non-recurrent architecture with an attention mechanism. Since this type of neural network requires the input function to have a maximum length known in advance, the preceding encoding step 140 truncates the function as shown in Figure 4c. For this reason, in this case the last instruction of the function may not be the pseudo-instruction <end>.

The person skilled in the art knows the particular features of this neural network architecture and the limitations of recurrent neural network architectures that it overcomes.

Figure 6 shows a simplified BERT-based architecture in which a semantic encoding of the function contained in the encoded sample CHMP_COD is added to a positional encoding of the function, in order to add position information for the assembly instructions contained in the encoded sample CHMP_COD. This first operation is needed to make the position of each instruction explicit, since a transformer-type neural network, having no recurrence, has no notion of the position of an element within a sequence. The positional encoding is a prerequisite for the training step and consists of coefficients, computed according to standard formulas, stored in a matrix.
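The "standard formulas" are not named here. A widely used choice for transformer models, assumed in this sketch, is the sinusoidal positional encoding, in which even columns hold a sine and odd columns a cosine of the position divided by a column-dependent wavelength. The function names are illustrative.

```python
import math

def positional_encoding(seq_len, dim):
    """Build a seq_len x dim matrix of sinusoidal position coefficients."""
    pe = [[0.0] * dim for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, dim, 2):
            angle = pos / (10000 ** (i / dim))
            pe[pos][i] = math.sin(angle)
            if i + 1 < dim:
                pe[pos][i + 1] = math.cos(angle)
    return pe

def add_position(encoded, pe):
    """Add the positional matrix element-wise to the encoded instructions."""
    return [[x + p for x, p in zip(row, pe_row)]
            for row, pe_row in zip(encoded, pe)]

pe = positional_encoding(seq_len=3, dim=4)
with_position = add_position([[0.5] * 4 for _ in range(3)], pe)
```

Because the matrix depends only on the sequence length and the vector width, it can be computed once before training, which matches its role as a prerequisite of the training step.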

The output of the previous step is fed to a series of encoding layers S_ENCOD that analyze its content. Each encoding layer has an attention mechanism that is used to evaluate the attention levels LIV_ATT of the assembly instructions, in the same way as described for the LSTM-type neural network architecture. These attention levels LIV_ATT correspond to values of the second index FAi,j,k described above.

The output of the encoding layers S_ENCOD is fed to a final layer TRAS_LIN, where a linear transformation and a sigmoid function are used to compute the classification scores for each assembly instruction, in a range between 0 and 1. Unlike in the TRAS_LIN layer of an LSTM network, the final score CLASS is the one for the first instruction contained in the encoded sample CHMP_COD, i.e. the pseudo-instruction <begin>. In this case too, the final score CLASS corresponds to the first probability index PIi,j introduced above. Note that, in addition to the two examples described, any architecture capable of handling sequences and/or having an attention mechanism can be trained, using any standard training algorithm.
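The final layer TRAS_LIN and the choice of the CLASS score in the two architectures can be sketched as follows. The weights, the bias and the function names are placeholders chosen for illustration; in practice the parameters would be learned during training.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def instruction_scores(hidden, weights, bias):
    """Linear transformation followed by a sigmoid, one score per instruction."""
    return [sigmoid(sum(w * h for w, h in zip(weights, vec)) + bias)
            for vec in hidden]

def class_score(scores, architecture):
    """Select the score used as the final CLASS value."""
    if architecture == "lstm":
        return scores[-1]  # last instruction, the <end> pseudo-instruction
    if architecture == "bert":
        return scores[0]   # first instruction, the <begin> pseudo-instruction
    raise ValueError(f"unknown architecture: {architecture}")

# Toy per-instruction outputs of the preceding layer.
hidden = [[0.0, 0.0], [2.0, 0.0]]
scores = instruction_scores(hidden, weights=[1.0, 1.0], bias=0.0)
```

With these toy values the BERT-style selection returns 0.5, while the LSTM-style selection returns the score of the last element, which lies closer to 1.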

The solution described produces a qualitative metric of the identification of protection techniques, which can be used as an estimate of the quality of the chosen protection solution. It also introduces a high degree of automation in identifying the protection techniques applied to a file and the protected areas within the file itself.

Note that the results obtained by applying the method of the present invention can be used as a tool for validating the risk exposure of assets, for validating the invisibility of developed protection techniques, and for identifying the methods by which viruses and malware are obfuscated, thereby contributing to the updating of antivirus tools.

The solution described therefore makes it possible to reach a higher level of software protection for the same time spent, or a level of protection equivalent to that obtainable with known techniques in a considerably shorter time.

Claims

1. A method (100) of configuring neural networks, comprising the steps of:

a) defining (110; 120) a plurality of functions (asm) and applying a plurality of software protections (P1,..,Pn) to said functions;

b) constructing (130) a data set comprising a plurality of samples, each including a function (asmj) of the plurality and at least one of said software protections (P1,..,Pn) applied to the respective function;

c) coding (140) each function (asmj) of the data set to obtain a plurality of coded samples (CHMP_COD), each expressed as a sequence of numerical values;

d) training (150) a neural network (NN(Pi)) using the plurality of coded samples (CHMP_COD) so that it is able to process a file to be analyzed and provide information relating to software protections applied to said file to be analyzed.

2. The method (100) of claim 1, wherein:

said step of defining (110, 120) the plurality of functions also includes defining a plurality of vanilla functions to which no software protections are applied;

said data set includes a further plurality of samples, each including a function (asmj) of the plurality of vanilla functions.

3. The method (100) of claim 1, wherein the step of defining (110, 120) the plurality of functions includes the steps of:

providing a source file including said plurality of functions (asm);

applying (110) the plurality of software protections (P1,..,Pn) to the plurality of functions (asm) of the source file and compiling the source file provided with the software protections, obtaining a compiled binary file;

disassembling (120) the compiled binary file, obtaining a file in assembly format, and extracting the plurality of functions (asm) from the file in assembly format.

4. The method (100) of claim 1, wherein the step of defining (110, 120) the plurality of functions includes the steps of:

providing a binary file including said plurality of functions (asm) to which a plurality of software protections (P1,..,Pn) are applied;

disassembling (120) the binary file, obtaining a file in assembly format, and extracting the plurality of functions (asm) from the file in assembly format.

5. The method (100) of claim 1, wherein said neural network (NN(P1)) is associated with a single type of software protection (P1).

6. The method (100) of claim 2, wherein said neural network (NN(P1)) is configured so that said information includes a first probability index (PI1,j) indicative of a probability that a first function (asmj) of the plurality of functions has been protected by the first protection (P1).

7. The method (100) of claim 3, wherein said neural network (NN(P1)) is configured so that said information includes a second index (FAi,j,k) which represents a possibility that the first protection (P1) has been applied to the instructions of a specific area of the first function.

8. The method (100) of claim 3, wherein the plurality of software protections (P1,..,Pn) includes at least one of the following protections: control flow flattening, opaque predicates, branch functions, encode arithmetic, conversion of data into functions, merging or splitting of variables, recoding of variables, white-box cryptography, virtualization using virtual machines or JIT compilation, call stack controls, code guards, control flow tagging, anti-debugging, code mobility, client/server code splitting, anti-cloning and software attestation.

9. The method of claim 1, wherein said method is carried out to configure a plurality of neural networks (NN(Pi)), each associated with a respective software protection.

10. The method (100) of claim 1, wherein said neural network (NN(Pi)) is implemented according to at least one of the following types of neural network: network capable of handling sequences, network having an attention mechanism.

11. The method (100) of claim 9, wherein said neural network (NN(Pi)) is built according to at least one of the following types of neural network: LSTM network, BERT network, GRU network, transformer-XL network.

12. The method (100) of claim 1, wherein the plurality of coded samples includes a plurality of sequences of numerical values and said coding step further comprises:

a masking phase in which values repeated in each sequence of the plurality are eliminated from the plurality of sequences of numerical values;

a rescaling phase in which said numerical values are rescaled within a pre-established interval.

13. A file processing method, including the steps of:

- providing a binary file to be analyzed including a plurality of functions to be analyzed;

- disassembling the binary file to be analyzed to obtain an assembly file;

- extracting from the assembly file the plurality of functions to be analyzed (asm);

- coding each function to be analyzed by expressing it as a respective sequence of numerical values;

- providing a plurality of neural networks (NN(Pi)), each associated with a respective software protection (P1,..,Pn), configured according to the configuration method (100) of at least one of the preceding claims;

- processing the plurality of functions to be analyzed (asm) through the plurality of neural networks (NN(Pi)) to search for information relating to software protections within the plurality of functions to be analyzed.

14. The method of claim 13, wherein each of said neural networks (NN(Pi)) is associated with a respective type of software protection (Pi).

15. The method of claim 14, wherein processing the plurality of functions to be analyzed (asm) through the plurality of neural networks (NN(Pi)) returns classification information including a plurality of probability indices (IPi), each indicative of a probability that a respective function (asmj) of the plurality of functions is protected by one of said protections (Pi).

16. The method of claim 14, wherein processing the plurality of functions to be analyzed (asm) through the plurality of neural networks (NN(Pi)) returns position information including a plurality of second indices (FAi,j,k), each indicative of a possibility that a corresponding protection (Pi) has been applied to instructions of a specific area of a function.
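For illustration only, the data flow of the file processing method (coding each function, then querying one network per protection for a probability index and per-instruction second indices) might be sketched as below. Everything named here is hypothetical: the opcode vocabulary, the min-max choice for rescaling into [0, 1], and `stub_network`, which is a placeholder standing in for a trained neural network and performs no real detection.

```python
from typing import Callable, Dict, List, Tuple

# A "network" is modelled as any callable that maps a coded function to
# (probability index, list of per-instruction second indices).
Network = Callable[[List[float]], Tuple[float, List[float]]]


def rescale(values: List[float], low: float = 0.0, high: float = 1.0) -> List[float]:
    """Min-max rescaling into [low, high] (an assumed rescaling rule)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        return [low for _ in values]
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


def codify(opcodes: List[str], vocabulary: Dict[str, int]) -> List[float]:
    """Express a function as a rescaled sequence of numerical values."""
    return rescale([float(vocabulary[op]) for op in opcodes])


def analyze(
    functions: Dict[str, List[str]],
    vocabulary: Dict[str, int],
    networks: Dict[str, Network],
) -> Dict[str, Dict[str, Tuple[float, List[float]]]]:
    """Run every per-protection network on every coded function."""
    report = {}
    for name, opcodes in functions.items():
        coded = codify(opcodes, vocabulary)
        report[name] = {prot: net(coded) for prot, net in networks.items()}
    return report


def stub_network(coded: List[float]) -> Tuple[float, List[float]]:
    """Placeholder for a trained network: NOT a real detector."""
    return sum(coded) / len(coded), list(coded)


vocabulary = {"mov": 0, "add": 1, "jmp": 2}  # hypothetical opcode ids
functions = {"f": ["mov", "add", "jmp"]}     # hypothetical extracted function
report = analyze(functions, vocabulary, {"P1": stub_network})
```

Keeping one callable per protection mirrors the claimed arrangement in which each network is associated with a single protection, so networks can be added or retrained independently.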