CN115632803A

CN115632803A - Multi-dimensional Internet sensitive information detection method, system and equipment

Info

Publication number: CN115632803A
Application number: CN202210895595.6A
Authority: CN
Inventors: 刘阳; 吕绪银; 闫继文; 刘杰; 戴军; 崔晓鑫; 曹瑞
Original assignee: Shandong Xingwei Jiuzhou Safety Technology Co ltd
Current assignee: Shandong Xingwei Jiuzhou Safety Technology Co ltd
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2023-01-20

Abstract

This application discloses a multi-dimensional Internet sensitive information detection method, system and equipment, relating to the technical field of network information security. An information detection task is set according to sensitive information identifiers, where different types of sensitive information have different identifiers; a detection frequency is set for the information detection task and detection channels are allocated to it; task detection is performed according to the detection frequency and detection channels to obtain first detection data; and the first detection data is screened to obtain second detection data, which is used to evaluate the security level of the sensitive information. Sensitive information is detected through multiple detection channels using its identifiers, and the detected data is analyzed to ensure its accuracy. Finally, the detected data is compared with the original sensitive information identifiers to obtain the target relevance, from which the security level of the sensitive information is evaluated. Compared with traditional methods, this improves the accuracy of external investigation of sensitive information.

Description

A multi-dimensional Internet sensitive information detection method, system and equipment

Technical field

This application relates to the technical field of network information security, and in particular to a multi-dimensional Internet sensitive information detection method, system and equipment.

Background

In the information age, to curb the frequent leakage of sensitive corporate information, the state has successively promulgated the Data Security Law and the Cybersecurity Law to strengthen regulation, further prompting enterprises to increase their security investment against such risks. However, in a complex external attack environment, hackers steal sensitive data from enterprises and public institutions and distribute or sell it online. In addition, poor internal management and employees' lack of security awareness lead to sensitive files and core project code being uploaded to the Internet, which easily causes uncontrollable data leakage; large-scale leaks may even have a major impact on national security.

How to avoid, as far as possible, the losses caused by leakage of sensitive information has therefore become an important part of enterprise security construction. Traditionally, enterprises prevent sensitive information leakage through internal defense and external investigation. Internal defense strengthens data security so that sensitive data is less likely to leak. External investigation screens publicly available data to determine whether the enterprise's sensitive data has been leaked.

However, traditional external investigation mostly monitors designated network locations and often relies on a single channel, so it cannot screen external information accurately.

Summary of the invention

To solve the above technical problems, this application proposes the following technical solutions:

In a first aspect, an embodiment of this application provides a multi-dimensional Internet sensitive information detection method, the method including: setting an information detection task according to sensitive information identifiers, where different types of sensitive information have different identifiers; setting a detection frequency for the information detection task and allocating detection channels to it; performing task detection according to the detection frequency and detection channels to obtain first detection data; and screening the first detection data to obtain second detection data, the second detection data being used to evaluate the security level of the sensitive information.

With this implementation, sensitive information is detected through multiple detection channels using its identifiers, and the detected data is analyzed to ensure its accuracy. Finally, the detected data is compared with the original sensitive information to obtain the target relevance, from which the security level of the sensitive information is evaluated. Compared with traditional approaches, this improves the accuracy of external investigation of sensitive information.

With reference to the first aspect, in a first possible implementation of the first aspect, setting the information detection task according to the sensitive information identifiers includes: parsing the sensitive information to determine its information identifiers; determining a detection list for the sensitive information according to the information identifiers; and assigning a target ID to the detection list to establish a target task entry.

With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, performing task detection according to the detection frequency and detection channels to obtain the first detection data includes: generating, from the target ID, a task ID associated with the target ID; establishing an independent data table in a database for each detection channel, where the data tables of the detection channels determined for the same detection task use the same task ID; performing information detection through the detection channels, using the information identifiers and at the detection frequency, and storing the detected information in the corresponding data tables; and, after detection is complete, reading the data tables in the database that carry the same task ID to obtain the temporary detection result of that detection task.

With reference to the second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing information detection through the detection channels, using the information identifiers and at the detection frequency, includes: performing a preliminary sensitive information detection test on a first detection channel at the detection frequency, the first detection channel being any detection channel of the same detection task; obtaining the feedback response of the first detection channel; and, if the feedback response indicates an anti-crawling restriction, invoking an IP proxy pool to perform sensitive information detection through the first detection channel.

With reference to the third possible implementation of the first aspect, in a fourth possible implementation of the first aspect, invoking the IP proxy pool to perform sensitive information detection through the first detection channel includes: determining the status of the IP proxies in the IP proxy pool; if a first IP proxy is an expired proxy, removing the first IP proxy from the IP proxy pool; or, if the first IP proxy is an invalid IP proxy for the first detection channel, marking the first IP proxy and screening it through other detection channels.

With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, screening the first IP proxy through other detection channels includes: invoking the first IP proxy to perform sensitive information detection through a second detection channel, the second detection channel being any detection channel other than the first detection channel; and, if sensitive information detection can be performed on the second detection channel using the first IP proxy, retaining the first IP proxy, and otherwise removing it from the IP proxy pool.

With reference to any of the second to fifth possible implementations of the first aspect, in a sixth possible implementation of the first aspect, screening the first detection data to obtain the second detection data includes: deduplicating the temporary detection result on the uniqueness field used by each detection channel to obtain first data; filtering interfering content out of the first data using preset filter entries to obtain second data; determining the target relevance of the second data according to the information identifiers of the detection task; and scoring the second data according to the determined degree of target relevance to obtain the second detection data.

With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, detection data is selected from the second detection data according to a preset score range and compared with the original sensitive information identifiers, and sensitive information alerts are issued based on the comparison result.

In a second aspect, an embodiment of this application provides a multi-dimensional Internet sensitive information detection system, the system including: a task setting module, configured to set an information detection task according to sensitive information identifiers, where different types of sensitive information have different identifiers; a detection channel allocation module, configured to set a detection frequency for the information detection task and allocate detection channels to it; an information detection module, configured to perform task detection according to the detection frequency and detection channels to obtain first detection data; and a data processing module, configured to screen the first detection data to obtain second detection data, the second detection data being used to evaluate the security level of the sensitive information.

In a third aspect, an embodiment of this application provides a device including a memory, a processor, and a computer program stored in the memory. When sensitive information detection is performed, the processor reads the computer program and executes the multi-dimensional Internet sensitive information detection method of the first aspect or any possible implementation of the first aspect, performing sensitive information detection according to the information detection task.

Brief description of the drawings

FIG. 1 is a schematic flowchart of a multi-dimensional Internet sensitive information detection method provided by an embodiment of this application;

FIG. 2 is a schematic diagram of a multi-dimensional Internet sensitive information detection system provided by an embodiment of this application;

FIG. 3 is a schematic diagram of a device provided by an embodiment of this application.

Detailed description

The solution is described below with reference to the accompanying drawings and specific embodiments.

FIG. 1 is a schematic flowchart of a multi-dimensional Internet sensitive information detection method provided by an embodiment of this application. Referring to FIG. 1, the method in this embodiment includes:

S101. Set an information detection task according to sensitive information identifiers, where different types of sensitive information have different identifiers.

In this embodiment, the sensitive information is first parsed to determine its information identifiers; a detection list for the sensitive information is determined according to the information identifiers; and a target ID is assigned to the detection list to establish a target task entry.

A single detection run may cover one piece of sensitive information or several, and the pieces may be of different types. If the sensitive information is text content, its information identifier is generally a keyword. If it is a source code package, the identifier is the source code package name. If it is a link to an internal enterprise website that has not been made public, the identifier is the domain name. These are only simple examples of a few types; in practice, sensitive information may take other forms.

In this embodiment, original sensitive information identifiers of the target organization, such as keywords and domain names, are provided; a corresponding target ID is generated in the client table of the database and a target entry is established. A target entry should contain the following fields: the target name; the target ID, generated by applying the md5 algorithm to a combination of a random number and a timestamp (a 32-character string of mixed letters and digits); the original sensitive information identifiers; and the list of executable tasks.
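The target entry described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the names `new_target_id` and `TargetEntry`, the field names, and the way the random number and timestamp are concatenated are all assumptions, since the text only states that the ID is the md5 of a random number combined with a timestamp.

```python
import hashlib
import secrets
import time
from dataclasses import dataclass, field


def new_target_id() -> str:
    """MD5 of a random number plus a timestamp, as a 32-character hex string."""
    # Assumption: the two values are simply concatenated as text before hashing.
    seed = f"{secrets.randbits(64)}{time.time_ns()}"
    return hashlib.md5(seed.encode("utf-8")).hexdigest()


@dataclass
class TargetEntry:
    """One row of the hypothetical 'client' table."""
    target_name: str
    identifiers: dict                      # original sensitive information identifiers
    task_list: list = field(default_factory=list)  # executable tasks
    target_id: str = field(default_factory=new_target_id)


entry = TargetEntry(
    target_name="Example Co",
    identifiers={"keyword": ["project-x"], "domain": ["intranet.example.com"]},
    task_list=["search_engine", "code_hosting"],
)
print(entry.target_id)  # 32 hexadecimal characters
```

MD5 serves here only as a compact identifier, as in the text, not as a security primitive.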

Taking keywords, source code package names and domain names as examples, the corresponding keyword, source code package name or domain name is parsed from each piece of sensitive information according to its type. A detection list is then built, with the keywords, source code package names and domain names filled in by row or by column, and a target ID is finally assigned to each row or column. Note that the information identifier of a piece of sensitive information is not necessarily a single keyword, source code package name or domain name; it may be any combination of them, for example a keyword together with a domain name.

S102. Set a detection frequency for the information detection task and allocate detection channels to it.

The detection channels in this embodiment may include Internet search engines (the Baidu search engine, the Google search engine, the fofa cyberspace search engine), online document libraries, cloud drives, WeChat official accounts, email, subdomains, source code hosting platforms, third-party cyberspace search engines, dark web resource sites, and other monitoring channels. For each type of sensitive information, the range of places where it may appear after being leaked is determined. For example, a source code package is very likely to appear on cloud drives, source code hosting platforms, third-party cyberspace search engines and dark web resource sites, and less likely to appear in online document libraries. Text and website links, by contrast, are more likely to be leaked in document libraries. Determining the detection channels for each type of sensitive information before detection is therefore particularly important.

Channels in which the information is unlikely to appear can of course also be selected, but doing so wastes detection cost and time unnecessarily. When it is uncertain whether a channel is relevant, the channel should be included, so that detection is more comprehensive and accurate.
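The channel allocation rule above can be sketched as a lookup from identifier type to likely channels. The mapping for source code packages follows the example in the text; the other entries, the channel names, and the function `allocate_channels` are illustrative assumptions.

```python
# Hypothetical mapping from identifier type to the channels where a leak is likely.
CHANNELS_BY_TYPE = {
    # From the text: source packages tend to appear on these four channels.
    "source_package": {"cloud_drive", "code_hosting", "cyberspace_search", "dark_web"},
    # Assumed: text and links are likely in document libraries and search engines.
    "keyword": {"search_engine", "document_library"},
    "domain": {"search_engine", "document_library", "subdomain"},
}


def allocate_channels(identifier_types, uncertain=()):
    """Union of likely channels, plus any channel the operator is unsure about."""
    channels = set()
    for id_type in identifier_types:
        channels |= CHANNELS_BY_TYPE.get(id_type, set())
    # When relevance is uncertain, the channel is included rather than skipped.
    channels |= set(uncertain)
    return sorted(channels)


print(allocate_channels(["source_package"]))
print(allocate_channels(["keyword"], uncertain=["cloud_drive"]))
```

Unlikely channels are left out by default to save cost and time, while uncertain ones are added explicitly.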

Different detection frequencies are set for detection tasks according to specific needs. A task can be executed once at a specified time, at a fixed interval (in years, months, hours, minutes or seconds), or periodically on a schedule (for example monthly, daily, on a given day of each month, at a given time each day, or even at a given minute and second of each hour). The detection frequency is chosen according to the importance of the sensitive information and the points in time at which it may be leaked.
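The three scheduling modes above (one-off, fixed interval, periodic) can be illustrated with a small next-run calculation. The `Frequency` class and `next_run` function are hypothetical names, and only a daily schedule is shown for the periodic case.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Frequency:
    mode: str                          # "once", "interval" or "daily"
    at: Optional[datetime] = None      # run time for "once"
    every: Optional[timedelta] = None  # gap between runs for "interval"
    hour: int = 0                      # time of day for "daily"
    minute: int = 0


def next_run(freq: Frequency, now: datetime,
             last_run: Optional[datetime] = None) -> Optional[datetime]:
    """Return when the task should next run, or None if it is finished."""
    if freq.mode == "once":
        return freq.at if last_run is None else None
    if freq.mode == "interval":
        return now if last_run is None else last_run + freq.every
    if freq.mode == "daily":
        slot = now.replace(hour=freq.hour, minute=freq.minute,
                           second=0, microsecond=0)
        # If today's slot has already passed, schedule tomorrow's.
        return slot if slot > now else slot + timedelta(days=1)
    raise ValueError(f"unknown mode: {freq.mode}")


now = datetime(2022, 7, 28, 10, 0)
print(next_run(Frequency("daily", hour=9), now))
print(next_run(Frequency("interval", every=timedelta(hours=6)), now, last_run=now))
```

A production system would more likely delegate this to an existing scheduler; the sketch only shows how the three modes differ.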

S103. Perform task detection according to the detection frequency and detection channels to obtain first detection data.

In this embodiment, a task ID associated with the target ID is generated from the target ID. An independent data table is established in the database for each detection channel, and the data tables of the detection channels determined for the same detection task use the same task ID. Information detection is performed through the detection channels, using the information identifiers and at the detection frequency, and the detected information is stored in the corresponding data tables. After detection is complete, the data tables in the database that carry the same task ID are read to obtain the temporary detection result of that detection task.
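One possible reading of this storage layout is sketched below with Python's built-in `sqlite3`: each channel gets its own table, rows carry the task ID, and the temporary result is the union of rows with that task ID across channel tables. The table naming, the column set, and the way `make_task_id` derives a task ID from the target ID are assumptions; the text does not specify them.

```python
import hashlib
import sqlite3


def make_task_id(target_id: str, seq: int) -> str:
    # Assumption: the task ID is derived by hashing the target ID with a counter.
    return hashlib.md5(f"{target_id}:{seq}".encode("utf-8")).hexdigest()


def channel_table(channel: str) -> str:
    # Table names cannot be bound as SQL parameters, so validate them strictly.
    if not channel.isidentifier():
        raise ValueError(f"invalid channel name: {channel!r}")
    return f"probe_{channel}"


def ensure_table(conn, channel):
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {channel_table(channel)} "
        "(task_id TEXT NOT NULL, unique_key TEXT, content TEXT)"
    )


def store(conn, channel, task_id, unique_key, content):
    conn.execute(
        f"INSERT INTO {channel_table(channel)} VALUES (?, ?, ?)",
        (task_id, unique_key, content),
    )


def temporary_results(conn, channels, task_id):
    """Collect rows with the given task ID from every channel table."""
    rows = []
    for ch in channels:
        cur = conn.execute(
            f"SELECT unique_key, content FROM {channel_table(ch)} WHERE task_id = ?",
            (task_id,),
        )
        rows.extend((ch, key, content) for key, content in cur)
    return rows


conn = sqlite3.connect(":memory:")
task_id = make_task_id("0123456789abcdef0123456789abcdef", 1)
for ch in ("search_engine", "cloud_drive"):
    ensure_table(conn, ch)
store(conn, "search_engine", task_id, "http://a.example/leak", "snippet")
store(conn, "cloud_drive", task_id, "project-x.zip", "shared file")
store(conn, "cloud_drive", "other-task", "unrelated.zip", "shared file")
results = temporary_results(conn, ("search_engine", "cloud_drive"), task_id)
print(results)
```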

During sensitive information detection, the monitoring function modules of the different channels are started one after another according to the detection channels preset for the detection task. Some detection channels apply anti-crawling limits on request frequency, so an IP proxy pool may be needed. In this embodiment, performing information detection through the detection channels, using the information identifiers and at the detection frequency, therefore specifically includes:

A preliminary sensitive information detection test is performed on a first detection channel at the detection frequency, the first detection channel being any detection channel of the same detection task. The feedback response of the first detection channel is obtained; if the feedback response indicates an anti-crawling restriction, the IP proxy pool is invoked to perform sensitive information detection through the first detection channel.
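The test-then-fall-back logic above can be sketched as follows. The transport is injected as a `fetch` callable so the example runs without network access. Treating HTTP 403 or 429, or a captcha page, as the anti-crawling signal is an assumption; the text does not say how the restriction is recognized.

```python
from typing import Callable, Iterable, Optional, Tuple

# fetch(query, proxy) -> (status_code, body); proxy is None for a direct request.
Fetch = Callable[[str, Optional[str]], Tuple[int, str]]

ANTI_CRAWL_STATUS = {403, 429}  # assumed signals of an anti-crawling restriction


def is_rate_limited(status: int, body: str) -> bool:
    return status in ANTI_CRAWL_STATUS or "captcha" in body.lower()


def probe(fetch: Fetch, query: str, proxies: Iterable[str]) -> Optional[str]:
    """Try a direct request first; fall back to pool proxies if restricted."""
    status, body = fetch(query, None)
    if not is_rate_limited(status, body):
        return body
    for proxy in proxies:
        status, body = fetch(query, proxy)
        if not is_rate_limited(status, body):
            return body
    return None  # every route was restricted


def fake_fetch(query, proxy):
    # Stand-in channel: direct requests and one proxy are blocked.
    if proxy in (None, "10.0.0.1:8080"):
        return 429, "Too Many Requests"
    return 200, f"results for {query}"


print(probe(fake_fetch, "project-x", ["10.0.0.1:8080", "10.0.0.2:8080"]))
```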

To keep the IP proxy pool healthy, this embodiment starts a separate, independent process to maintain it. Because IP proxies do not need to be persisted, they are stored in a redis database, which makes them easy for multiple systems to use. Proxy status must be monitored continuously, and invalid and expired proxies removed promptly, to keep the function available.

In an exemplary embodiment, before the IP proxy pool is invoked to perform sensitive information detection through the first detection channel, the status of the IP proxies in the pool is determined. If a first IP proxy is an expired proxy, it is removed from the IP proxy pool. If the first IP proxy is an invalid IP proxy for the first detection channel, it is marked and then screened through other detection channels.

An invalid IP proxy identified through the first detection channel is not removed directly because, in practice, a proxy that is invalid for one detection channel is often still usable on another, since the channels differ. Such an IP proxy is marked unavailable only for that particular channel-proxy combination in the IP proxy pool rather than being removed from the whole pool, and is deleted completely only after it expires. This makes maximal use of proxy resources and reduces operating costs.

This embodiment therefore further verifies whether the first IP proxy is truly invalid. Specifically, the first IP proxy is invoked to perform sensitive information detection through a second detection channel, the second detection channel being any detection channel other than the first. If sensitive information detection can be performed on the second detection channel using the first IP proxy, the first IP proxy is retained; otherwise, it is removed from the IP proxy pool.
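The pool maintenance rules above (drop expired proxies, mark a proxy invalid per channel, remove it only if it also fails on a second channel) can be sketched with an in-memory class. The text stores proxies in redis; a plain dictionary is swapped in here so the example is self-contained, and `ProxyPool` and its method names are hypothetical.

```python
import time


class ProxyPool:
    """In-memory stand-in for the proxy pool described above."""

    def __init__(self):
        self._expires = {}  # proxy -> expiry timestamp
        self._blocked = {}  # proxy -> channels on which it is marked invalid

    def add(self, proxy, ttl, now=None):
        now = time.time() if now is None else now
        self._expires[proxy] = now + ttl
        self._blocked[proxy] = set()

    def remove(self, proxy):
        self._expires.pop(proxy, None)
        self._blocked.pop(proxy, None)

    def purge_expired(self, now=None):
        now = time.time() if now is None else now
        for proxy in [p for p, t in self._expires.items() if t <= now]:
            self.remove(proxy)

    def report_invalid(self, proxy, channel, works_on_other_channel):
        """Mark proxy invalid for one channel; drop it if a second channel also fails.

        works_on_other_channel is a callable(proxy) -> bool that runs the
        second-channel check.
        """
        if proxy not in self._expires:
            return
        self._blocked[proxy].add(channel)
        if not works_on_other_channel(proxy):
            self.remove(proxy)

    def available(self, channel, now=None):
        self.purge_expired(now)
        return [p for p in self._expires if channel not in self._blocked[p]]


pool = ProxyPool()
pool.add("a", ttl=60, now=0)
pool.add("b", ttl=60, now=0)
pool.add("c", ttl=10, now=0)
pool.report_invalid("a", "search_engine", lambda p: True)   # kept, marked for one channel
pool.report_invalid("b", "search_engine", lambda p: False)  # fails elsewhere too, removed
print(pool.available("search_engine", now=5))
print(pool.available("cloud_drive", now=5))
```

With redis, expiry could instead be delegated to key time-to-live, which fits the remark that proxies need no persistence.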

S104. Screen the first detection data to obtain second detection data, the second detection data being used to evaluate the security level of the sensitive information.

The temporary detection results obtained in S103 are all stored in data tables marked with the task ID. They are retrieved from these tables, and the suspicious sensitive information detected through all channels is cleaned and analyzed.

Deduplication is performed on the uniqueness field used by each detection channel: for the search engine channel the URL is unique, while for the cloud drive channel the title can serve as the unique field. Detected information inevitably contains interfering content, so filter entries can be preset to remove as much of it as possible, based on the uniqueness field above. Retrieved information differs in its relevance to the target. Some channels, such as source code hosting platforms, provide a relevance coefficient, but in most cases relevance has to be computed locally: results that match more of the preset keywords and domain names should receive higher scores.
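The cleaning steps above (deduplicate on the channel's uniqueness field, filter interfering content, score by relevance) can be sketched as three small functions. The record layout, the blocked-term filter, and the scoring formula (fraction of identifiers matched) are assumptions; the text only requires that closer matches score higher.

```python
# Uniqueness field per channel, following the two examples in the text.
UNIQUE_FIELD = {"search_engine": "url", "cloud_drive": "title"}


def deduplicate(records):
    """Keep the first record for each (channel, unique field value) pair."""
    seen, kept = set(), []
    for rec in records:
        key = (rec["channel"], rec[UNIQUE_FIELD[rec["channel"]]])
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


def filter_noise(records, blocked_terms):
    """Drop records whose unique field contains any preset filter entry."""
    terms = [t.lower() for t in blocked_terms]

    def noisy(rec):
        value = rec[UNIQUE_FIELD[rec["channel"]]].lower()
        return any(t in value for t in terms)

    return [rec for rec in records if not noisy(rec)]


def score(record, identifiers):
    """Use the channel's own coefficient if present, else the match fraction."""
    if "relevance" in record:
        return float(record["relevance"])
    text = " ".join(str(v) for v in record.values()).lower()
    hits = sum(1 for ident in identifiers if ident.lower() in text)
    return hits / len(identifiers) if identifiers else 0.0


records = [
    {"channel": "search_engine", "url": "http://a.example/leak",
     "snippet": "project-x design on intranet.example.com"},
    {"channel": "search_engine", "url": "http://a.example/leak", "snippet": "duplicate"},
    {"channel": "search_engine", "url": "http://ads.example/promo", "snippet": "project-x"},
    {"channel": "cloud_drive", "title": "project-x source.zip"},
]
identifiers = ["project-x", "intranet.example.com"]
cleaned = filter_noise(deduplicate(records), ["ads.example"])
scored = [(rec, score(rec, identifiers)) for rec in cleaned]
print([s for _, s in scored])
```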

Because some of the data processed as above may be only fuzzy matches, this embodiment further improves accuracy by selecting detection data from the second detection data according to a preset score range and comparing it with the original sensitive information identifiers. When the analysis is complete, the results are archived in the database again. On this basis, supplemented by manual review by security experts, as many sensitive information alerts as possible are provided.
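A minimal sketch of this final selection step follows, assuming scored records are `(record, score)` pairs and that comparison with the original identifiers is a case-insensitive substring match. The function name `select_for_alert`, the default score range, and the alert structure are all illustrative; the text does not define them.

```python
def select_for_alert(scored, identifiers, low=0.5, high=1.0):
    """Pick records in the preset score range that match an original identifier."""
    alerts = []
    for record, value in scored:
        if not (low <= value <= high):
            continue
        text = " ".join(str(v) for v in record.values()).lower()
        matched = [i for i in identifiers if i.lower() in text]
        if matched:  # exact comparison weeds out purely fuzzy hits
            alerts.append({"record": record, "score": value, "matched": matched})
    return alerts


scored_records = [
    ({"url": "http://a.example/leak", "snippet": "Project-X design"}, 0.9),
    ({"url": "http://b.example/news", "snippet": "projects and x-rays"}, 0.6),
    ({"url": "http://c.example/old", "snippet": "project-x"}, 0.2),
]
alerts = select_for_alert(scored_records, ["project-x"])
print([a["record"]["url"] for a in alerts])
```

The resulting alerts would then go to a security expert for manual review, as the text describes.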

Corresponding to the multi-dimensional Internet sensitive information detection method provided in the above embodiment, this application also provides an embodiment of a multi-dimensional Internet sensitive information detection system.

Referring to FIG. 2, the multi-dimensional Internet sensitive information detection system 20 provided by this embodiment includes a task setting module 201, a detection channel allocation module 202, an information detection module 203 and a data processing module 204.

The task setting module 201 is configured to set an information detection task according to sensitive information identifiers, where different types of sensitive information have different identifiers. The detection channel allocation module 202 is configured to set a detection frequency for the information detection task and allocate detection channels to it. The information detection module 203 is configured to perform task detection according to the detection frequency and detection channels to obtain first detection data. The data processing module 204 is configured to screen the first detection data to obtain second detection data, the second detection data being used to evaluate the security level of the sensitive information.
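The data flow between the four modules can be sketched as a simple pipeline. Each module is represented by an injected callable; the class `DetectionSystem`, its attribute names, and the stub behaviors in the usage example are hypothetical and only show the order in which the modules hand data to one another.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class DetectionSystem:
    set_task: Callable[[Any], Any]           # task setting module 201
    allocate_channels: Callable[[Any], Any]  # detection channel allocation module 202
    detect: Callable[[Any], Any]             # information detection module 203
    process: Callable[[Any], Any]            # data processing module 204

    def run(self, sensitive_info):
        task = self.set_task(sensitive_info)
        plan = self.allocate_channels(task)
        first_detection_data = self.detect(plan)
        return self.process(first_detection_data)  # second detection data


# Stub modules, for illustration only.
system = DetectionSystem(
    set_task=lambda info: {"identifiers": info},
    allocate_channels=lambda task: {**task, "channels": ["search_engine"]},
    detect=lambda plan: [f"{ch}:{i}" for ch in plan["channels"] for i in plan["identifiers"]],
    process=lambda data: sorted(set(data)),
)
print(system.run(["project-x", "project-x"]))
```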

This application also provides a device. Referring to FIG. 3, the device 30 in this embodiment includes a processor 301, a memory 302 and a communication interface 303.

In FIG. 3, the processor 301, the memory 302 and the communication interface 303 may be connected to one another through a bus; the bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn in FIG. 3, but this does not mean that there is only one bus or one type of bus.

The processor 301 generally controls the overall functions of the device 30, such as starting the device and, once the device has started and received sensitive information data to be detected, performing the following: setting an information detection task according to sensitive information identifiers, where different types of sensitive information have different identifiers; setting a detection frequency for the information detection task and allocating detection channels to it; performing task detection according to the detection frequency and detection channels to obtain first detection data; and screening the first detection data to obtain second detection data, the second detection data being used to evaluate the security level of the sensitive information.

In addition, the processor 301 may be a microprocessor (MCU). The processor may also include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or the like.

The memory 302 is configured to store computer-executable instructions to support the operation of the device 30. The memory 302 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.

The communication interface 303 is used by the device 30 to transmit data, for example to communicate with the enterprise's database. The communication interface 303 includes a wired communication interface and may also include a wireless communication interface. The wired communication interface includes a USB interface and a MicroUSB interface, and may also include an Ethernet interface. The wireless communication interface may be a WLAN interface, a cellular network communication interface, or a combination thereof.

In an exemplary embodiment, the device 30 provided by this embodiment further includes a power supply component, which supplies power to the various components of the device 30. The power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the device 30.

The communication component is configured to facilitate wired or wireless communication between the device 30 and external devices. The device 30 can access a wireless network based on a communication standard, such as WiFi, 4G, or 5G, or a combination thereof. The communication component receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. The communication component also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 30 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), terminals, micro-terminals, processors, or other electronic components.

Identical or similar parts of the embodiments in this specification may be cross-referenced. In particular, because the methods in the system and device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Furthermore, the terms "comprise", "include", and any other variants thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.

Claims

1. A multi-dimensional Internet sensitive information detection method, characterized by comprising the following steps:

setting an information detection task according to a sensitive information identifier, wherein different types of sensitive information have different identifiers;

setting a detection frequency for the information detection task and allocating detection channels;

performing task detection according to the detection frequency and the detection channels to obtain first detection data;

and screening the first detection data to obtain second detection data, wherein the second detection data is used for evaluating the security degree of the sensitive information.

2. The multi-dimensional Internet sensitive information detection method according to claim 1, wherein setting the information detection task according to the sensitive information identifier comprises:

analyzing the sensitive information to determine an information identifier of the sensitive information;

determining a detection list of the sensitive information according to the information identifier;

and allocating target IDs to the detection lists to establish target task entries.

3. The multi-dimensional Internet sensitive information detection method according to claim 2, wherein performing task detection according to the detection frequency and the detection channels to obtain the first detection data comprises:

generating a task ID associated with the target ID according to the target ID;

establishing an independent data table for each detection channel in a database, wherein the data tables of the detection channels determined by the same detection task use the same task ID;

detecting information through the detection channels at the detection frequency according to the information identifier, and storing the detected information in the corresponding data table;

and after the detection is finished, reading the data tables with the same task ID in the database to obtain a temporary detection result of the same detection task.
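The storage scheme in claims 2 and 3 can be illustrated with a minimal Python sketch. It assumes an SQLite database, one table per detection channel, and a `task_id` column shared by all tables of the same detection task. The channel names, column names, and the way the task ID is derived from the target ID are hypothetical and are not specified in the claims.

```python
import sqlite3
import uuid

# Hypothetical channel names; the claims do not enumerate the channels.
CHANNELS = ["search_engine", "code_hosting", "net_disk"]


def new_task_id(target_id: str) -> str:
    """Derive a task ID that stays associated with the target ID (assumed format)."""
    return f"{target_id}-{uuid.uuid4().hex[:8]}"


def create_channel_tables(conn: sqlite3.Connection, channels) -> None:
    """Create one independent table per detection channel."""
    for ch in channels:  # names come from the fixed list above, not from user input
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{ch}" '
            "(task_id TEXT NOT NULL, url TEXT, content TEXT)"
        )


def store(conn, channel: str, task_id: str, url: str, content: str) -> None:
    """Store one detected item in the table of the channel that found it."""
    conn.execute(f'INSERT INTO "{channel}" VALUES (?, ?, ?)', (task_id, url, content))


def read_temporary_result(conn, channels, task_id: str) -> list:
    """Collect rows with the same task ID from every channel table."""
    rows = []
    for ch in channels:
        cur = conn.execute(
            f'SELECT url, content FROM "{ch}" WHERE task_id = ?', (task_id,)
        )
        rows.extend({"channel": ch, "url": u, "content": c} for u, c in cur.fetchall())
    return rows
```

Keeping the channels in separate tables lets each channel write independently, while the shared task ID is what allows the results of one detection task to be merged afterwards.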

4. The multi-dimensional Internet sensitive information detection method according to claim 3, wherein detecting information through the detection channels at the detection frequency according to the information identifier comprises the following steps:

performing a preliminary sensitive information detection test on a first detection channel according to the detection frequency, wherein the first detection channel is any one of the detection channels in the same detection task;

acquiring a feedback response of the first detection channel;

and if the feedback response indicates an anti-crawling restriction, calling an IP proxy pool to detect the sensitive information through the first detection channel.

5. The multi-dimensional Internet sensitive information detection method according to claim 4, wherein calling the IP proxy pool to detect the sensitive information through the first detection channel comprises:

determining the state of the IP proxies in the IP proxy pool;

if a first IP proxy is an expired proxy, removing the first IP proxy from the IP proxy pool;

or,

if the first IP proxy is invalid for the first detection channel, marking the first IP proxy and screening the first IP proxy through other detection channels.

6. The method according to claim 5, wherein screening the first IP proxy through other detection channels comprises:

calling the first IP proxy to detect the sensitive information through a second detection channel, wherein the second detection channel is any detection channel other than the first detection channel;

if the second detection channel can use the first IP proxy to detect the sensitive information, retaining the first IP proxy; otherwise, removing the first IP proxy from the IP proxy pool.
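The proxy screening logic of claims 5 and 6 can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Proxy` class, the `probe` callable (which reports whether a channel can be reached through a proxy), and the returned status strings are all hypothetical. The sketch also retains a proxy if any of the other channels can use it, which is one possible reading of "any detection channel other than the first".

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Set


@dataclass
class Proxy:
    address: str
    expires_at: float                                    # expiry time, in seconds
    invalid_for: Set[str] = field(default_factory=set)   # channels where it failed


def screen_proxy(
    pool: Dict[str, Proxy],
    proxy: Proxy,
    first_channel: str,
    other_channels: Iterable[str],
    probe: Callable[[str, Proxy], bool],   # hypothetical reachability check
    now: float,
) -> str:
    """Apply the expired / invalid-for-channel screening to one proxy."""
    if proxy.expires_at <= now:
        pool.pop(proxy.address, None)       # expired: remove from the pool
        return "removed_expired"
    if probe(first_channel, proxy):
        return "usable"                     # works on the first channel
    proxy.invalid_for.add(first_channel)    # mark as invalid for this channel
    if any(probe(ch, proxy) for ch in other_channels):
        return "retained"                   # another channel can still use it
    pool.pop(proxy.address, None)           # no channel can use it: remove
    return "removed_invalid"
```

Marking rather than deleting a proxy that fails on one channel avoids discarding proxies that are blocked by a single site but still work elsewhere.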

7. The multi-dimensional Internet sensitive information detection method according to any one of claims 3 to 6, wherein screening the first detection data to obtain the second detection data comprises:

de-duplicating the temporary detection result according to the unique fields used by the different detection channels to obtain first data;

filtering interference content out of the first data using preset filter items to obtain second data;

determining the target relevance of the second data according to the information identifier of the detection task;

and scoring the second data according to the determined target relevance to obtain the second detection data.

8. The method according to claim 7, wherein second detection data within a preset score interval is selected and compared with the original sensitive information identifier, and an alarm is raised for the sensitive information according to the comparison result.
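The screening steps of claims 7 and 8 (de-duplicate, filter, score, select) can be sketched in Python as below. The claims do not specify how target relevance is computed. Here it is assumed, purely for illustration, to be the fraction of identifier terms that appear in the detected content, scaled to a 0 to 100 score. The field names, the `UNIQUE_FIELD` mapping, and the score interval are likewise hypothetical.

```python
# Hypothetical mapping from channel to the field that uniquely identifies a record.
UNIQUE_FIELD = {"search_engine": "url", "code_hosting": "repo_path"}


def deduplicate(records):
    """Drop records that repeat the same unique field within a channel."""
    seen, out = set(), []
    for r in records:
        key = (r["channel"], r[UNIQUE_FIELD[r["channel"]]])
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out


def filter_interference(records, filter_terms):
    """Remove records whose content contains any preset filter term."""
    return [r for r in records if not any(t in r["content"] for t in filter_terms)]


def relevance(content, identifier_terms):
    """Assumed relevance measure: share of identifier terms found in the content."""
    if not identifier_terms:
        return 0.0
    return sum(1 for t in identifier_terms if t in content) / len(identifier_terms)


def score(records, identifier_terms):
    """Attach a 0-100 score derived from the relevance to each record."""
    return [
        {**r, "score": round(100 * relevance(r["content"], identifier_terms))}
        for r in records
    ]


def select_for_alarm(scored, low, high):
    """Pick records whose score falls inside the preset interval."""
    return [r for r in scored if low <= r["score"] <= high]
```

In practice the selected records would then be compared against the original sensitive information identifier before an alarm is raised; that comparison is left out here because the claims do not describe its form.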

9. A multi-dimensional Internet sensitive information detection system, the system comprising:

a task setting module, configured to set an information detection task according to a sensitive information identifier, wherein different types of sensitive information have different identifiers;

a detection channel allocation module, configured to set a detection frequency for the information detection task and allocate detection channels;

an information detection module, configured to perform task detection according to the detection frequency and the detection channels to obtain first detection data;

and a data processing module, configured to screen the first detection data to obtain second detection data, wherein the second detection data is used for evaluating the security degree of the sensitive information.

10. A device, comprising a memory, a processor, and a computer program stored in the memory, characterized in that, when sensitive information detection is carried out, the processor reads the computer program, executes the multi-dimensional Internet sensitive information detection method according to any one of claims 1 to 8, and carries out sensitive information detection according to an information detection task.