CN116451186A

CN116451186A - Sensitive data security protection method and system

Info

Publication number: CN116451186A
Application number: CN202310431940.5A
Authority: CN
Inventors: 徐浩; 罗剑芳; 罗维佳; 吴勇; 丁卓; 朱凯
Original assignee: Guangzhou Zhangdong Intelligent Technology Co ltd
Current assignee: Guangzhou Zhangdong Intelligent Technology Co ltd
Priority date: 2023-04-21
Filing date: 2023-04-21
Publication date: 2023-07-18
Anticipated expiration: 2043-04-21
Also published as: CN116451186B

Abstract

The application discloses a sensitive data security protection method and system. The method includes: monitoring a browser and the clipboard, and, when content copied or cut to the clipboard is detected to originate from a preset IP address or domain name in the browser, applying a first mark to the content in the clipboard; and, when the marked clipboard content is pasted into a code editing tool, applying a second mark to the pasted content in the code editing tool. The application helps guard against security risks.

Description

Sensitive data security protection method and system

Technical Field

The application relates to software technology, in particular to a sensitive data security protection method and system.

Background

With the development of AI technology, writing code with AI has become one way of developing software. However, this approach carries certain security hazards: on the one hand, some AI-generated code may contain defects or vulnerabilities; on the other hand, when data is processed by AI, the AI model may learn sensitive data, giving rise to security problems.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art. To this end, the invention provides a sensitive data security protection method and system for preventing the leakage of sensitive data and guarding against code risks.

In one aspect, an embodiment of the present application provides a method for protecting sensitive data security, including:

monitoring a browser and the clipboard, and, when content copied or cut to the clipboard is detected to originate from a preset IP address or domain name in the browser, applying a first mark to the content in the clipboard;

when the marked clipboard content is pasted into a code editing tool, applying a second mark to the pasted content in the code editing tool.

In some embodiments, the method further comprises the steps of:

prohibiting the clipboard content from being pasted into the browser when the content copied or cut to the clipboard is detected to originate from a local file marked as sensitive data;

wherein the sensitive data includes user information and code.

In some embodiments, the method further comprises the steps of:

when content carrying the first mark is pasted into a local file, applying a third mark to that local file;

when content from a file carrying the third mark is pasted into the code editing tool, applying the second mark to the pasted content in the code editing tool.

In some embodiments, the method further comprises the steps of:

when part or all of the content of a file carrying the third mark is copied and pasted into a second local file, applying the third mark to the second local file.

In some embodiments, applying the second mark specifically comprises:

marking the marked code segment by highlighting or bolding;

wherein every character is marked independently.

In some embodiments, the marked code segments are configured to be visible to users with a preset permission level.

In some embodiments, the method further comprises the steps of:

recording the event when the content copied or cut to the clipboard is detected to originate from a local file marked as sensitive data.

In some embodiments, when the third mark is applied, the path of the marked file is written into a marking document.

In some embodiments, the preconditions for applying the first mark to the content in the clipboard include:

detecting the composition of the content in the clipboard, and determining that the content is code when the proportion of specific punctuation marks it contains is greater than a threshold.

In another aspect, an embodiment of the present application provides a sensitive data security protection system, including:

a memory for storing a program;

and the processor is used for loading the program to execute the sensitive data security protection method.

According to the embodiments of the application, a browser and the clipboard are monitored; when content copied or cut to the clipboard is detected to originate from a preset IP address or domain name in the browser, a first mark is applied to the content in the clipboard; and when the marked clipboard content is pasted into a code editing tool, a second mark is applied to the pasted content in the code editing tool. In this way, AI-generated code can be marked, which reduces the risks associated with its entry into a software project; when developers copy such code into the code base it can be discovered in time, which facilitates internal inspection of code quality and assessment of security risks.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the description of the embodiments will be briefly described.

FIG. 1 is a flow chart of a method for secure protection of sensitive data according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a marking process provided in an embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the present application clearer, the technical solutions of the present application are described below clearly and completely through embodiments, with reference to the accompanying drawings. The described embodiments are evidently some, but not all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without creative effort fall within the scope of the present disclosure.

Large language models (LLMs) are deep learning models trained on large amounts of text data that can generate natural language text or understand the meaning of language text. A large language model can handle various natural language tasks, such as text classification, question answering and dialogue, and is regarded as an important path toward artificial intelligence. As the scale of these models has grown, with parameter counts rising from tens of millions to billions, such tools have become able to generate code with specific functionality according to the user's instructions. However, since those who use a model have no control over its training data, it is uncertain whether a solution generated by the model carries risks. Especially when the solution is sufficiently complex, it is uncertain whether it contains a vulnerability.

With the development of technology, these large language models have become more intelligent: they can not only reply to the user but also go on to execute instructions. Most AI models also allow users to submit data to them for learning. If the user is careless, the submitted data may be learned by the AI model, and some of the data content may even be stolen, creating a significant security risk.

Referring to FIG. 1, an embodiment of the present application discloses a sensitive data security protection method. It is mainly applied in developers' development environments, including their personal computers and servers, and focuses on preventing the risks that arise from using AI models; it may be embedded into an existing security management system as a detection branch for AI models. The method comprises the following steps:

S1. Monitor a browser and the clipboard, and, when content copied or cut to the clipboard is detected to originate from a preset IP address or domain name in the browser, apply a first mark to the content in the clipboard. It will be appreciated that a program launched at system startup can monitor the browser and the clipboard in the background: while the user is on a particular website (identified by its IP address or domain name), the clipboard is monitored, and content the user copies from that website is marked. This step mainly detects whether the user is generating code through an AI generation website. Such code may be of poor quality, contain vulnerabilities, or carry other risks, all of which affect the quality and security of the software project and leave the software prone to vulnerabilities. Particularly now that AI can write fairly complex code, such problems are not easily found by manual inspection. The purpose of this step is to monitor this copying behavior.
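The source check and first mark in step S1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the preset lists, the `ClipboardEntry` type and the `on_copy` hook are hypothetical names, and the sketch assumes the listener already knows the URL of the page the copy came from.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Optional
from urllib.parse import urlsplit

# Hypothetical preset lists of AI code-generation sites (placeholders only).
PRESET_DOMAINS = {"ai.example.com"}
PRESET_IPS = {"203.0.113.10"}


@dataclass
class ClipboardEntry:
    text: str
    source_url: Optional[str] = None
    first_mark: bool = False  # set when the content came from a preset site


def is_preset_source(url: str) -> bool:
    """Return True if the URL's host is a preset IP or (sub)domain."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        # Host is a literal IP address: compare against the preset IPs.
        return str(ip_address(host)) in PRESET_IPS
    except ValueError:
        pass  # not an IP address, fall through to the domain check
    return any(host == d or host.endswith("." + d) for d in PRESET_DOMAINS)


def on_copy(text: str, source_url: Optional[str]) -> ClipboardEntry:
    """Hypothetical hook called by the listener on each copy or cut."""
    entry = ClipboardEntry(text=text, source_url=source_url)
    if source_url is not None and is_preset_source(source_url):
        entry.first_mark = True
    return entry
```

Matching on a full domain suffix (with the leading dot) rather than a plain substring avoids treating look-alike hosts such as `evilai.example.com` as preset sources.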

In this embodiment, the main object of monitoring is code, so the content to be marked can be narrowed further: if the clipboard is determined to hold software code, it is monitored further; if not, it may be left unmarked, i.e. not monitored.

Whether the clipboard content is code can be determined as follows. Preferably, the composition of the clipboard content is examined, and the content is determined to be code when the proportion of specific symbols it contains is greater than a threshold. The principle is that code typically consists of English words together with the symbols and punctuation marks of a particular programming language. Ordinary sentences, for example, contain relatively few brackets and semicolons, and the ratio of punctuation marks to total characters in them is relatively low. Therefore, by measuring the proportion of the whole content made up of punctuation marks that fit common code syntax, it can be recognized fairly accurately and simply whether the content is code.
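The punctuation-ratio heuristic described above can be sketched as follows. The symbol set and the threshold value are illustrative assumptions; the source specifies neither.

```python
# Assumed set of symbols common in code syntax (not specified by the source).
CODE_PUNCTUATION = set("(){}[];=<>+*/&|!_#")

# Assumed threshold; a real deployment would tune this value.
DEFAULT_THRESHOLD = 0.10


def code_punctuation_ratio(text: str) -> float:
    """Share of non-whitespace characters that are code-like symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if c in CODE_PUNCTUATION)
    return hits / len(chars)


def looks_like_code(text: str, threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Classify text as code when the symbol ratio exceeds the threshold."""
    return code_punctuation_ratio(text) > threshold
```

For instance, `looks_like_code("for (int i = 0; i < n; i++) { sum += a[i]; }")` is true under these assumptions, while an ordinary English sentence without such symbols is not classified as code.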

Alternatively, the pasted content may be encoded and input into a trained model (such as a language model or an SVM model) for classification, thereby determining whether the content is code. This approach, however, is relatively costly.

S2. When the marked clipboard content is pasted into the code editing tool, apply a second mark to the pasted content in the code editing tool. It will be appreciated that, once the clipboard content carries a mark, the listener sends a notification to the code editing tool when it detects that content being pasted there, and a marking plug-in in the code editing tool marks the pasted code. Each character is marked. The marks may be visible to users with specific permissions or to everyone, and may be rendered by highlighting, bolding, and the like.

S3. When the content copied or cut to the clipboard is detected to originate from a local file marked as sensitive data, prohibit the clipboard content from being pasted into the browser.

The sensitive data includes user information and code.

It will be appreciated that local code files and files related to user privacy may be marked as sensitive files; the listener loads the list of sensitive files when it starts. The listener watches each foreground program (i.e., the program the user is currently operating) and, if the object that program is operating on is sensitive data, monitors the user's copy behavior. If such data were copied into the browser and transmitted to an AI model, leakage could occur. Compared with the prior art, the main focus of the present application is on guarding against AI models. This covers both the adverse effects that AI-generated code may have once merged into a software project and the security problems, such as data leakage, that arise when code is fed into a third-party AI model.
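The paste restriction in step S3 can be sketched as a small guard object. The class and field names are hypothetical, and the sketch assumes the listener can report which file a copy came from and which application a paste targets.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Iterable, Optional


@dataclass(frozen=True)
class CopyEvent:
    text: str
    source_file: Optional[str]  # path of the file the copy came from, if known


class PasteGuard:
    """Blocks browser pastes of content copied from sensitive files."""

    def __init__(self, sensitive_files: Iterable[str]):
        # The sensitive-file list is loaded once, when the listener starts.
        self._sensitive = {str(PurePosixPath(p)) for p in sensitive_files}
        self._blocked = False

    def on_copy(self, event: CopyEvent) -> None:
        """Remember whether the current clipboard content is sensitive."""
        self._blocked = (
            event.source_file is not None
            and str(PurePosixPath(event.source_file)) in self._sensitive
        )

    def allow_paste(self, target_app: str) -> bool:
        """Only pastes into the browser are restricted."""
        return not (target_app == "browser" and self._blocked)
```

A guard built with `PasteGuard(["/srv/app/users.csv"])` refuses a browser paste after a copy from that file, but still allows pasting the same content into other applications.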

In some embodiments, the method further comprises the steps of:

S4. When content carrying the first mark is pasted into a local file, apply a third mark to that local file.

S5. When content from a file carrying the third mark is pasted into the code editing tool, apply the second mark to the pasted content in the code editing tool.

It will be appreciated that when marked content is copied into a local file, the local file can itself be marked; the listener keeps track of files that may contain AI-generated code by maintaining a list. Thus, if the user copies AI-generated code into a local file first and then into the code editing tool, the AI-generated code will still be marked.

S6. When part or all of the content of a file carrying the third mark is copied and pasted into a second local file, apply the third mark to the second local file. When the third mark is applied, the path of the marked file is written into the marking document.

It will be appreciated that this embodiment employs a taint (contamination) mechanism: once a file is marked, both its copies and any document containing part of its content are marked as well. In this way, the user is prevented from bypassing the mechanism. The design aims to standardize the developer's working process. Although AI-generated code cannot be completely kept out of a software project (for example, the user may retype the code by hand instead of copying it), developers can be reminded, or forced, to examine such code more carefully, thereby reducing security risk. The marks left behind also make it easier to evaluate this code carefully during code review.
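The taint propagation of steps S4 and S6 can be sketched with a registry backed by a plain-text marking document holding one path per line. The file format and all names here are assumptions made for illustration; the source only states that marked paths are written into a marking document.

```python
from pathlib import Path
from typing import Optional, Set


class TaintRegistry:
    """Tracks third-marked files in a marking document (one path per line)."""

    def __init__(self, marking_document: Path):
        self._doc = marking_document
        self._doc.touch(exist_ok=True)

    def marked_files(self) -> Set[str]:
        return set(self._doc.read_text(encoding="utf-8").splitlines())

    def is_marked(self, path: str) -> bool:
        return path in self.marked_files()

    def mark(self, path: str) -> None:
        """Apply the third mark by appending the path to the document."""
        if not self.is_marked(path):
            with self._doc.open("a", encoding="utf-8") as f:
                f.write(path + "\n")

    def on_paste_to_file(
        self,
        target: str,
        clipboard_first_marked: bool,
        clipboard_source_file: Optional[str] = None,
    ) -> None:
        """Propagate the mark when tainted content lands in a file."""
        from_marked_file = (
            clipboard_source_file is not None
            and self.is_marked(clipboard_source_file)
        )
        if clipboard_first_marked or from_marked_file:
            self.mark(target)
```

Because a paste from an already marked file marks the target as well, the mark follows the content through any number of intermediate files, which is what closes the bypass route described above.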

In some embodiments, applying the second mark specifically comprises:

marking the marked code segment by highlighting or bolding;

wherein every character is marked independently. Independent marking means that, during code editing, a trace remains as long as at least one marked character has not been deleted.
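Independent per-character marking can be sketched with a buffer that stores a flag alongside every character. `MarkedBuffer` is a hypothetical structure for illustration; a real editor plug-in would use the editor's own decoration mechanism.

```python
from typing import List, Tuple


class MarkedBuffer:
    """Text buffer in which each character carries its own mark flag."""

    def __init__(self, text: str = ""):
        self._chars: List[Tuple[str, bool]] = [(c, False) for c in text]

    def insert(self, pos: int, text: str, marked: bool = False) -> None:
        """Insert typed text (unmarked) or pasted marked text at pos."""
        self._chars[pos:pos] = [(c, marked) for c in text]

    def delete(self, start: int, end: int) -> None:
        """Delete the characters in the half-open range [start, end)."""
        del self._chars[start:end]

    @property
    def text(self) -> str:
        return "".join(c for c, _ in self._chars)

    def marked_spans(self) -> List[Tuple[int, int]]:
        """Half-open ranges of consecutive marked characters."""
        spans, start = [], None
        for i, (_, marked) in enumerate(self._chars):
            if marked and start is None:
                start = i
            elif not marked and start is not None:
                spans.append((start, i))
                start = None
        if start is not None:
            spans.append((start, len(self._chars)))
        return spans
```

Typing inside a pasted segment splits it into two marked spans, and deleting most of it still leaves a span for each surviving marked character, so the trace disappears only when every pasted character is gone.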

In some embodiments, the marked code segments are configured to be visible to users with a preset permission level. For example, the marks may be visible only to developers in senior positions, which allows them to review how other developers program with AI.
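Permission-based visibility can be sketched as a filter applied before marks are rendered. The role names and the minimum role are assumptions; the source only says that visibility depends on a preset permission level.

```python
from enum import IntEnum
from typing import List, Tuple


class Role(IntEnum):
    # Hypothetical permission levels, ordered from lowest to highest.
    DEVELOPER = 1
    SENIOR_DEVELOPER = 2
    REVIEWER = 3


# Assumed preset: marks are shown to senior developers and above.
MIN_ROLE_TO_SEE_MARKS = Role.SENIOR_DEVELOPER


def visible_spans(
    spans: List[Tuple[int, int]], viewer_role: Role
) -> List[Tuple[int, int]]:
    """Return the marked spans to render for this viewer, or none."""
    if viewer_role >= MIN_ROLE_TO_SEE_MARKS:
        return list(spans)
    return []
```

The marks themselves stay stored with the code in either case; only the rendering differs by viewer.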

In some embodiments, the event is recorded when the content copied or cut to the clipboard is detected to originate from a local file marked as sensitive data. It will be appreciated that, with this strategy, situations that may lead to data risk can be discovered and prevented in time.

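An embodiment of the present application further provides a sensitive data security protection system, comprising:

a memory for storing a program;

and a processor for loading the program to execute the sensitive data security protection method.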

Note that the above is only a preferred embodiment of the present application and the technical principle applied. Those skilled in the art will appreciate that the present application is not limited to the particular embodiments described herein, but is capable of numerous obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the present application. Therefore, while the present application has been described in connection with the above embodiments, the present application is not limited to the above embodiments, but may include many other equivalent embodiments without departing from the spirit of the present application, the scope of which is defined by the scope of the appended claims.

Claims

1. A method for securing sensitive data, comprising:

2. The method of claim 1, further comprising the steps of:

the sensitive data includes user information and code.

3. The method of claim 1, further comprising the steps of:

4. The method for protecting sensitive data security according to claim 1, further comprising the steps of:

5. The method for protecting sensitive data according to claim 1, wherein applying the second mark specifically comprises:

marking the marked code segment by highlighting or bolding;

wherein every character is marked independently.

6. The sensitive data security method of claim 1, wherein the marked code segments are configured to be visible to users with a preset permission level.

7. The method of claim 1, further comprising the steps of:

8. The method of securing sensitive data as claimed in claim 3, wherein, when the third mark is applied, the path of the marked file is written into a marking document.

9. The method for protecting sensitive data security according to claim 1, wherein the preconditions for applying the first mark to the content in the clipboard include:

10. A sensitive data security system, comprising:

a memory for storing a program;

a processor for loading the program to perform the sensitive data security method of any one of claims 1-9.