CN111488556A - Nested document extraction method and device, electronic equipment and storage medium

Info

Publication number: CN111488556A
Application number: CN202010273216.0A
Authority: CN
Inventors: 蔡家坡; 关守兵
Original assignee: Sangfor Technologies Co Ltd
Current assignee: Sangfor Technologies Co Ltd
Priority date: 2020-04-09
Filing date: 2020-04-09
Publication date: 2020-08-04

Abstract

The application discloses a nested document extraction method and apparatus, an electronic device, and a computer-readable storage medium. The method comprises: acquiring a target document to be extracted and reading the document directory corresponding to the target document; extracting all subdocuments nested in the target document from the document directory; and identifying the document content of each subdocument to determine whether it contains confidential information. Because all subdocuments nested in the target document are extracted on the basis of the target document's document directory, nested documents are extracted and their content is identified, so that confidential information hidden in subdocuments is detected and information leakage can be effectively avoided.

Description

A nested document extraction method, device, electronic device and storage medium

Technical Field

The present application relates to the field of computer technology and, more particularly, to a nested document extraction method and apparatus, an electronic device, and a computer-readable storage medium.

Background

A nested document is a document added to an Office-series document, an RTF document, or another office document by inserting it as an attachment object. DLP (data leakage prevention) refers to systems that prevent data from leaking. Current DLP products commonly need to extract document content, but most of them only extract the content of the current document and do not support extracting nested documents, let alone the content of multi-level nested documents. As a result, if a user embeds a file containing confidential information in a document, the DLP system may fail to analyze that file, and the confidential information may leak. How to solve this problem is therefore a focus for those skilled in the art.

Summary of the Invention

The purpose of the present application is to provide a nested document extraction method and apparatus, an electronic device, and a computer-readable storage medium that can extract nested documents and thereby effectively avoid information leakage.

To achieve the above purpose, the present application provides a nested document extraction method, including:

acquiring a target document to be extracted, and reading the document directory corresponding to the target document;

extracting all subdocuments nested in the target document from the document directory;

identifying the document content of each subdocument to determine whether confidential information exists in the subdocument.

Optionally, after the acquiring of the target document to be extracted, the method further includes:

determining the document type corresponding to the target document;

the reading of the document directory corresponding to the target document includes:

reading the document directory corresponding to the target document according to the document type.

Optionally, the reading of the document directory corresponding to the target document according to the document type includes:

if the document type belongs to a first class of version formats, reading the compound document directory corresponding to the target document, where the first class of version formats includes any one of the doc, xls, and ppt formats;

if the document type belongs to a second class of version formats, decompressing the target document and, after decompression, reading the multi-level document directory corresponding to the target document, where the second class of version formats includes any one of the docx, xlsx, and pptx formats.

Optionally, the extracting of all subdocuments nested in the target document from the document directory includes:

extracting all subdocuments nested in the target document by reading all folders of a preset file under the compound document directory.

reading a preset subdirectory under the multi-level document directory corresponding to the target document;

extracting all subdocuments stored in the preset subdirectory.

Optionally, before the identifying of the document content of each subdocument, the method further includes:

determining whether the subdocument is a single document or a nested document;

if the subdocument is a nested document, taking the subdocument as the target document and returning to the step of extracting all subdocuments nested in the target document from the document directory, so that extraction proceeds iteratively.

Optionally, the identifying of the document content of each subdocument includes:

if the subdocument is a single document, directly extracting and identifying the content of the subdocument according to the document format corresponding to the subdocument.

To achieve the above purpose, the present application provides a nested document extraction apparatus, including:

a directory reading module, configured to acquire a target document to be extracted and read the document directory corresponding to the target document;

a document extraction module, configured to extract all subdocuments nested in the target document from the document directory;

a content identification module, configured to identify the document content of each subdocument to determine whether confidential information exists in the subdocument.

To achieve the above purpose, the present application provides an electronic device, including:

a memory for storing a computer program;

a processor that, when executing the computer program, implements the steps of any one of the nested document extraction methods disclosed above.

To achieve the above purpose, the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of any one of the nested document extraction methods disclosed above are implemented.

As can be seen from the above solutions, the nested document extraction method provided by the present application includes: acquiring a target document to be extracted, and reading the document directory corresponding to the target document; extracting all subdocuments nested in the target document from the document directory; and identifying the document content of each subdocument to determine whether confidential information exists in the subdocument. The present application can thus extract all nested subdocuments on the basis of the document directory of the target document, which realizes the extraction of nested documents, and can then identify the content of the subdocuments to determine whether they contain confidential information, which effectively avoids information leakage.

The present application also discloses a nested document extraction apparatus, an electronic device, and a computer-readable storage medium, which achieve the same technical effects.

It should be understood that the foregoing general description and the following detailed description are exemplary only and do not limit the present application.

Brief Description of the Drawings

To explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an architecture diagram of a nested document extraction system in a specific implementation disclosed in an embodiment of the present application;

FIG. 2 is a flowchart of a nested document extraction method disclosed in an embodiment of the present application;

FIG. 3 is a flowchart of another nested document extraction method disclosed in an embodiment of the present application;

FIG. 4, FIG. 5, and FIG. 6 are schematic diagrams of the directories of documents of different versions in the other nested document extraction method disclosed in an embodiment of the present application;

FIG. 7 is a flowchart of yet another nested document extraction method disclosed in an embodiment of the present application;

FIG. 8, FIG. 9, and FIG. 10 are schematic diagrams of the directories of documents of different versions in the yet another nested document extraction method disclosed in an embodiment of the present application;

FIG. 11 is a structural diagram of a nested document extraction apparatus disclosed in an embodiment of the present application;

FIG. 12 is a structural diagram of an electronic device disclosed in an embodiment of the present application;

FIG. 13 is a structural diagram of another electronic device disclosed in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present application.

In the prior art, DLP products commonly need to extract document content, but most of them only extract the content of the current document and do not support extracting nested documents, let alone the content of multi-level nested documents. As a result, if a user embeds a file containing confidential information in a document, the DLP system may fail to analyze that file, and the confidential information may leak.

Therefore, an embodiment of the present application discloses a nested document extraction method that can extract nested documents and thereby effectively avoid information leakage.

For ease of understanding, the system architecture to which the technical solutions of the present application apply is introduced below. FIG. 1 shows the architecture of a nested document extraction system of the present application. As shown in FIG. 1, the system may include a user terminal 11 and a server 12, which communicate with each other through a network 13. The user terminal 11 and the server 12 may each further include a processor, a memory, a communication interface, an input unit, a display, and a communication bus, with the processor, memory, communication interface, input unit, and display communicating with one another over the communication bus.

In a specific implementation, the user can transmit files through the user terminal 11, for example by uploading a file to the server 12 or by sending a file to other communication terminals via the server 12. The user terminal 11 may include, but is not limited to, data processing devices such as smartphones, tablet computers, wearable devices, and desktop computers.

The server 12 is used to identify the document content of a file after acquiring the file transmitted by the user. If the file is a nested document, the server extracts the nested documents in order to determine whether the transmitted file contains confidential information, and sends a risk alert when confidential information is present, so as to avoid information leakage. The server 12 may include, but is not limited to, a cloud server, a physical server, or a virtual server.

It should be noted that the network 13 in the present application can be chosen according to the network conditions and application requirements in actual use. It may be a wireless communication network, such as a mobile communication network or a WiFi network, or a wired communication network; it may be a wide area network or, where circumstances permit, a local area network.

Referring to FIG. 2, a nested document extraction method disclosed in an embodiment of the present application includes:

S101: Acquire a target document to be extracted, and read the document directory corresponding to the target document;

In this embodiment of the present application, the target document to be extracted can be acquired through an import interface, and the document directory corresponding to the target document can then be read.

As one feasible implementation, after acquiring the target document to be extracted, this embodiment may further determine the document type corresponding to the target document, so that the document directory is read according to the document type.
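The type determination described above can be sketched as a check of the file's leading bytes. This is a minimal illustration, not the method claimed in the application: the function name `detect_container` and its return labels are hypothetical, and the check relies on the well-known signatures of the compound file format and of ZIP archives.

```python
import io
import zipfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # compound file signature (doc/xls/ppt)
ZIP_MAGIC = b"PK\x03\x04"                        # ZIP local file header (docx/xlsx/pptx)


def detect_container(data: bytes) -> str:
    """Return 'compound', 'zip', or 'other' for the raw bytes of a document."""
    if data.startswith(OLE2_MAGIC):
        return "compound"
    # The magic alone is not enough; also require a readable central directory.
    if data.startswith(ZIP_MAGIC) and zipfile.is_zipfile(io.BytesIO(data)):
        return "zip"
    return "other"
```

Checking content rather than the file extension matters here because an embedded object usually carries no reliable extension.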

S102: Extract all subdocuments nested in the target document from the document directory;

In this step, the subdocuments nested in the target document can be determined on the basis of the document directory, and all of them are extracted from the document directory.

S103: Identify the document content of each subdocument to determine whether confidential information exists in the subdocument.

In a specific implementation, once all subdocuments have been extracted, the document content of each subdocument can be identified to determine whether it contains confidential information. If confidential information is present, a corresponding risk alert can be returned to avoid information leakage.

It should be noted that the confidential information may include, but is not limited to, sensitive information such as financial data, customer information, technical data, source code, office correspondence, and business plans. To determine whether a document contains confidential information, basic detection methods such as regular expressions and keywords can be used to search and match content; these basic methods detect explicitly defined sensitive content. In addition, exact data comparison detection, document fingerprint comparison detection, and vector classification comparison detection can be used to further improve detection accuracy. Exact data comparison detection is used for structured data, such as a user's name, ID number, or bank account number; document fingerprint comparison detection is used for unstructured data, such as office correspondence and business plans; vector classification comparison detection is suited to data with distinctive features, such as financial data and source code.
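The basic keyword and regular-expression detection mentioned above might look like the following sketch. The keyword list, the two patterns, and the function name `find_sensitive` are all illustrative assumptions; a real deployment would use rules supplied by the organization, and the patterns here check format only.

```python
import re

KEYWORDS = ("confidential", "internal use only")   # hypothetical keyword list
PATTERNS = {
    # 18-character ID number: 17 digits plus a digit or X (format check only)
    "id_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
    # 16 to 19 consecutive digits, a rough stand-in for a bank account number
    "bank_account": re.compile(r"(?<!\d)\d{16,19}(?!\d)"),
}


def find_sensitive(text: str) -> list[str]:
    """Return a label for every keyword or pattern that occurs in the text."""
    lowered = text.lower()
    hits = [f"keyword:{k}" for k in KEYWORDS if k in lowered]
    hits += [f"pattern:{name}" for name, rx in PATTERNS.items() if rx.search(text)]
    return hits
```

An empty result means none of the basic rules matched; it does not show that the text is free of confidential information, which is why the text above also lists comparison-based methods.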

Before the document content of each subdocument is identified, it can first be determined whether the subdocument is a single document or a nested document. If the subdocument is a single document, its content can be extracted and identified directly according to its document format. If the subdocument is a nested document, it is taken as the new target document and extracted in turn; that is, the method returns to the step of extracting all subdocuments nested in the target document from the document directory, so that extraction proceeds iteratively.
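The iterative extraction just described can be sketched as a recursive walk. The `extract` callable is a hypothetical stand-in for the directory-based extraction of steps S101 and S102; the `max_depth` cap is a safeguard added for this sketch and is not part of the source.

```python
from typing import Callable, Iterator

# Hypothetical extractor type: returns the embedded subdocuments of a document
# as raw bytes, or an empty list when the document is a single document.
Extractor = Callable[[bytes], list]


def iter_single_documents(doc: bytes, extract: Extractor,
                          max_depth: int = 16) -> Iterator[bytes]:
    """Yield every single (non-nested) document reachable from `doc`."""
    children = extract(doc) if max_depth > 0 else []
    if not children:
        yield doc            # single document: hand it to content identification
        return
    for child in children:   # nested document: each child becomes a new target
        yield from iter_single_documents(child, extract, max_depth - 1)
```

Each yielded document would then be passed to content identification. The depth cap keeps a maliciously deep nesting chain from exhausting the stack.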

An embodiment of the present application discloses another nested document extraction method. Compared with the previous embodiment, this embodiment further describes and optimizes the technical solution. Referring to FIG. 3, specifically:

S201: Acquire a target document to be extracted, and determine the document type corresponding to the target document;

S202: If the document type belongs to the first class of version formats, read the compound document directory corresponding to the target document; the first class of version formats includes any one of the doc, xls, and ppt formats;

In this embodiment of the present application, after the target document to be extracted is acquired, its document type is determined first. If the document type belongs to the first class of version formats, that is, the document is an Office 2003 document, the compound document directory corresponding to the target document is read. The first class of version formats may include, but is not limited to, the doc, xls, and ppt formats.

It should be noted that Office 2003 documents are stored in the compound document format. A compound document is a document that contains not only text but also graphics, spreadsheet data, sound, video, and other information. A compound document divides its data into many streams, which are stored in different storages, and every stream is in turn divided into smaller data blocks called sectors. The whole file consists of a file header structure followed by all the sectors; the sector size is specified in the header structure, and all sectors have the same size. The directory is an internal control stream made up of a series of directory entries. Each directory entry points to a storage or stream of the compound document, and the entries are enumerated in the order in which they appear in the directory stream. A nested document in an Office 2003 file likewise exists in the form of a directory entry.
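As an illustration of the header structure described above, the following sketch reads the sector size and the location of the directory from the first 512 bytes of a compound file. The field offsets follow the published compound file binary format and are an assumption of this sketch rather than something stated in the application; the function name `read_cfb_header` is hypothetical.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> dict:
    """Parse the header fields that locate sectors and the directory stream."""
    if len(data) < 512 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound file")
    # Sector sizes are stored as powers of two at offsets 30 and 32.
    sector_shift, mini_shift = struct.unpack_from("<HH", data, 30)
    # Number of the first sector of the directory stream, at offset 48.
    first_dir_sector, = struct.unpack_from("<I", data, 48)
    sector_size = 1 << sector_shift
    return {
        "sector_size": sector_size,
        "mini_sector_size": 1 << mini_shift,
        "first_dir_sector": first_dir_sector,
        # Sector n begins after the header, which occupies one sector-sized slot.
        "first_dir_offset": (first_dir_sector + 1) * sector_size,
    }
```

The sketch stops at locating the first directory sector; following the sector chain through the allocation table is left out for brevity.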

S203: Extract all subdocuments nested in the target document by reading all folders of the preset file under the compound document directory;

In this step, all folders of the preset file under the compound document directory can be read in order to extract all subdocuments stored in those folders.

Specifically, as shown in FIG. 4, for a Word file in doc format, the nested files are found, after unpacking, under the ObjectPool file, in folders each named with an underscore followed by 10 digits. Each folder holds one nested subdocument, for example an embedded subdocument in docx, xlsx, or pptx format, and each such subdocument is stored in a Package file kept in OLE format. In addition, the extracted redundant data other than these folders needs to be processed again. As shown in FIG. 5, for an Excel file in xls format, the nested files are found, after unpacking, in folders named MBD followed by 8 hexadecimal digits, and each folder holds one nested subdocument. As shown in FIG. 6, the nested files of a ppt file are stored in the PowerPoint Document file and can be extracted by looking up the rectype value: the nested files are in records whose rectype is RT_ExternalOleObjectStg.
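The naming rules and the rectype lookup described above can be sketched as follows. The regular expressions encode the folder names given in the text. The 8-byte record header layout and the numeric value `0x1011` for RT_ExternalOleObjectStg are assumptions taken from the published PowerPoint binary format, and both function names are hypothetical.

```python
import re
import struct

DOC_EMBED = re.compile(r"_\d{10}")              # ObjectPool child folder in a doc file
XLS_EMBED = re.compile(r"MBD[0-9A-Fa-f]{8}")    # embedded-object folder in an xls file
RT_EXTERNAL_OLE_OBJECT_STG = 0x1011             # assumed record type value


def is_embedded_storage(name: str, fmt: str) -> bool:
    """Check a folder name against the naming rule for `fmt` ('doc' or 'xls')."""
    rule = {"doc": DOC_EMBED, "xls": XLS_EMBED}[fmt]
    return rule.fullmatch(name) is not None


def iter_ole_storage_records(stream: bytes):
    """Walk the top-level records of a PowerPoint Document stream and yield the
    payload of each record whose type is RT_ExternalOleObjectStg."""
    pos = 0
    while pos + 8 <= len(stream):
        # Record header: version/instance (2 bytes), type (2), payload length (4)
        _ver_inst, rec_type, rec_len = struct.unpack_from("<HHI", stream, pos)
        if rec_type == RT_EXTERNAL_OLE_OBJECT_STG:
            yield stream[pos + 8: pos + 8 + rec_len]
        pos += 8 + rec_len   # skip the payload, including any child records
```

In real files the payload of such a record may be stored compressed, in which case it has to be decompressed before it can be treated as a nested document; that step is omitted here.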

S204: Identify the document content of each subdocument to determine whether confidential information exists in the subdocument.

When each subdocument is identified, a subdocument that is a single document can be extracted according to its specific document format. For example, if the name of the directory entry object is WordDocument, extraction follows the doc document format; if the name is Workbook, extraction follows the xls document format; if the name is PowerPoint Document, extraction follows the ppt document format. During extraction, the start and length specified in the directory entry structure indicate, respectively, where the document's content begins and how long it is.
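To make the role of the directory entry concrete, the sketch below decodes one 128-byte entry and maps its name to a format. The field offsets follow the published compound file binary format and are assumptions of this sketch; `parse_dir_entry` is a hypothetical name. Names are compared without regard to case or spaces so that either spelling of the stream names is accepted.

```python
import struct

# Main stream name (lower-cased, spaces removed) -> legacy document format
STREAM_TO_FORMAT = {
    "worddocument": "doc",
    "workbook": "xls",
    "powerpointdocument": "ppt",
}


def parse_dir_entry(entry: bytes) -> dict:
    """Decode name, type, start sector, and size from one directory entry."""
    if len(entry) != 128:
        raise ValueError("a directory entry is exactly 128 bytes")
    # Name length is in bytes and includes the 2-byte terminating NUL.
    name_len, = struct.unpack_from("<H", entry, 64)
    name = entry[:max(name_len - 2, 0)].decode("utf-16-le")
    obj_type = entry[66]                       # 1 = storage, 2 = stream, 5 = root
    start_sector, size = struct.unpack_from("<IQ", entry, 116)
    key = name.replace(" ", "").lower()
    return {"name": name, "type": obj_type, "start": start_sector,
            "size": size, "format": STREAM_TO_FORMAT.get(key)}
```

The `start` and `size` values correspond to the start and length mentioned above: together they tell the reader which sector chain to follow and how many bytes of it belong to the stream.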

An embodiment of the present application discloses yet another nested document extraction method. Compared with the previous embodiment, this embodiment further describes and optimizes the technical solution. Referring to FIG. 7, specifically:

S301: Acquire a target document to be extracted, and determine the document type corresponding to the target document;

S302: If the document type belongs to the second class of version formats, decompress the target document and, after decompression, read the multi-level document directory corresponding to the target document; the second class of version formats includes any one of the docx, xlsx, and pptx formats;

In this embodiment of the present application, after the target document to be extracted is acquired, its document type is determined first. If the document type belongs to the second class of version formats, that is, the document is an Office 2007 document, the target document is decompressed, and the multi-level document directory corresponding to the target document is read after decompression. The second class of version formats may include, but is not limited to, the docx, xlsx, and pptx formats.

S303: Read the preset subdirectory under the multi-level document directory corresponding to the target document;

S304: Extract all subdocuments stored in the preset subdirectory;

It should be noted that, once the multi-level document directory corresponding to the target document has been obtained, the preset subdirectory can be read and all subdocuments stored in it can be extracted.

In a specific implementation, as shown in FIG. 8, a docx document is stored as a compressed package. After decompression, the nested documents can be seen in the embeddings directory under the word directory. For example, if the docx document on the left contains two nested documents, the corresponding embeddings directory contains two files, oleObject1.bin and oleObject2.bin. An xlsx document is likewise stored as a compressed package; after decompression, the nested documents can be seen in the embeddings directory under the xl directory. As shown in FIG. 9, if the document on the left contains two nested documents, the corresponding embeddings directory contains two files, oleObject1.bin and oleObject2.bin. As shown in FIG. 10, nested documents in a pptx document are extracted in the same way.
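Reading the embeddings directories described above can be sketched with Python's standard `zipfile` module. The function name `extract_embeddings` is hypothetical, and the `ppt/embeddings/` prefix is an assumption made by analogy with the word and xl directories named in the text.

```python
import io
import zipfile

# Preset subdirectories; the ppt entry is assumed by analogy with word and xl.
EMBED_DIRS = ("word/embeddings/", "xl/embeddings/", "ppt/embeddings/")


def extract_embeddings(package: bytes) -> dict:
    """Return {archive path: raw bytes} for every file in an embeddings directory."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue                      # skip directory entries themselves
            if info.filename.startswith(EMBED_DIRS):
                found[info.filename] = zf.read(info)
    return found
```

Working on an in-memory archive avoids writing untrusted embedded files to disk before they have been inspected.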

S305: Identify the document content of each subdocument to determine whether confidential information exists in the subdocument.

Because Office 2007 series documents are stored in the ZIP compressed format, the ZIP package can be decompressed in a loop to read nested objects. Specifically, if a word/document.xml file exists after decompression, the content of a docx document is extracted; if an xl/sharedStrings.xml file exists after decompression, the content of an xlsx document is extracted; if a ppt/slides/slideX.xml file exists after decompression, the content of a pptx document is extracted; if decompression yields any other file, that file is taken as the target document to be extracted, and the method returns to step S301 to run the file type identification process.
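The marker-file check described above can be sketched as follows. The function name `classify_package` is hypothetical; the spreadsheet marker is accepted under both the spelling `xl/sharedStrings.xml` and `xl/sharedString.xml`, which is a choice made for this sketch.

```python
import io
import re
import zipfile

SLIDE = re.compile(r"ppt/slides/slide\d+\.xml")
# Both spellings of the shared-strings part are accepted in this sketch.
XLSX_MARKERS = {"xl/sharedStrings.xml", "xl/sharedString.xml"}


def classify_package(package: bytes) -> str:
    """Return 'docx', 'xlsx', 'pptx', or 'other' based on the parts in the ZIP."""
    if not zipfile.is_zipfile(io.BytesIO(package)):
        return "other"
    with zipfile.ZipFile(io.BytesIO(package)) as zf:
        names = set(zf.namelist())
    if "word/document.xml" in names:
        return "docx"
    if names & XLSX_MARKERS:
        return "xlsx"
    if any(SLIDE.fullmatch(n) for n in names):
        return "pptx"
    return "other"   # caller treats the file as a new target and re-identifies it
```

A result of `other` corresponds to the last branch in the text: the file is fed back into type identification instead of being parsed as an Office package.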

A nested document extraction apparatus provided by an embodiment of the present application is introduced below. The apparatus described below and the nested document extraction method described above may be read with reference to each other.

Referring to FIG. 11, a nested document extraction apparatus provided by an embodiment of the present application includes:

a directory reading module 401, configured to acquire a target document to be extracted and read the document directory corresponding to the target document;

a document extraction module 402, configured to extract all subdocuments nested in the target document from the document directory;

a content identification module 403, configured to identify the document content of each subdocument to determine whether confidential information exists in the subdocument.

For the specific implementation of modules 401 to 403, refer to the corresponding content disclosed in the foregoing embodiments; it is not repeated here.

The present application also provides an electronic device. Referring to FIG. 12, an electronic device provided by an embodiment of the present application includes:

a memory 100 for storing a computer program;

a processor 200 that, when executing the computer program, can implement the steps provided in the above embodiments.

Specifically, the memory 100 includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer-readable instructions, and the internal memory provides an environment for running the operating system and the computer-readable instructions held in the non-volatile storage medium. In some embodiments the processor 200 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. It provides computing and control capabilities for the electronic device and, when executing the computer program stored in the memory 100, can implement the steps of the nested document extraction method disclosed in any of the foregoing embodiments.

On the basis of the foregoing embodiment, as a preferred implementation, and referring to FIG. 13, the electronic device further includes:

an input interface 300, connected to the processor 200 and configured to acquire externally imported computer programs, parameters and instructions, which are stored in the memory 100 under the control of the processor 200. The input interface 300 may be connected to an input device to receive parameters or instructions entered manually by the user. The input device may be a touch layer covering a display screen; a key, trackball or touchpad provided on the terminal housing; or a keyboard, touchpad, mouse or the like.

a display unit 400, connected to the processor 200 and configured to display data processed by the processor 200 and to display a visual user interface. The display unit 400 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like.

a network port 500, connected to the processor 200 and configured to establish communication connections with external terminal devices. The communication technology used for the connection may be a wired or a wireless communication technology, such as Mobile High-Definition Link (MHL), Universal Serial Bus (USB), High-Definition Multimedia Interface (HDMI), Wireless Fidelity (WiFi), Bluetooth, Bluetooth Low Energy, or a communication technology based on IEEE 802.11s.

FIG. 13 shows only an electronic device having components 100 to 500. Those skilled in the art will understand that the structure shown in FIG. 13 does not limit the electronic device, which may include fewer or more components than shown, combine certain components, or use a different arrangement of components.

The present application also provides a computer-readable storage medium, which may include a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code. A computer program is stored on the storage medium, and when executed by a processor, the computer program implements the steps of the nested document extraction method disclosed in any of the foregoing embodiments.

Based on the document directory of the target document to be extracted, the present application can extract all sub-documents nested in the target document, thereby realizing the extraction of nested documents. It then identifies the content of each sub-document to determine whether confidential information exists in it, which can effectively prevent information leakage.
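The iterative extraction of multi-level nested documents can be illustrated with the following sketch. It rests on several assumptions that are not taken from the embodiments: the container type is guessed from the leading magic bytes, only ZIP-based documents are descended into (compound-file documents such as doc, xls and ppt are detected but treated as leaves, since the Python standard library has no reader for them), the embeddings subdirectories in EMBED_DIRS are assumed locations, and the names container_kind, walk_nested and max_depth are hypothetical.

```python
import io
import zipfile
from typing import Iterator, Tuple

ZIP_MAGIC = b"PK\x03\x04"                          # docx/xlsx/pptx are ZIP packages
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"    # doc/xls/ppt compound files
EMBED_DIRS = ("word/embeddings/", "xl/embeddings/", "ppt/embeddings/")  # assumed


def container_kind(data: bytes) -> str:
    """Classify a document by its leading bytes."""
    if data.startswith(ZIP_MAGIC):
        return "zip"
    if data.startswith(OLE_MAGIC):
        return "ole"
    return "single"


def walk_nested(data: bytes, path: str = "root",
                max_depth: int = 8) -> Iterator[Tuple[str, bytes]]:
    """Yield (path, bytes) for every sub-document nested at any level of `data`."""
    # Stop at the depth limit, and do not descend into non-ZIP documents.
    if max_depth == 0 or container_kind(data) != "zip":
        return
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for name in archive.namelist():
            if not name.startswith(EMBED_DIRS) or name.endswith("/"):
                continue
            child = archive.read(name)
            child_path = f"{path}/{name}"
            yield child_path, child
            # A nested document becomes the new target document.
            yield from walk_nested(child, child_path, max_depth - 1)
```

Each yielded sub-document can then be handed to content identification. The depth limit is a defensive choice of this sketch, intended to bound the work done on deliberately deep nesting.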

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced. Since the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points may be found in the description of the method. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the present application without departing from its principles, and these improvements and modifications also fall within the scope of protection of the claims of the present application.

It should also be noted that, in this specification, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "comprising", "including" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article or device that includes the element.

Claims

1. A nested document extraction method, comprising:

acquiring a target document to be extracted, and reading a document directory corresponding to the target document;

extracting all sub-documents nested in the target document from the document directory;

and respectively identifying the document content of each subdocument to identify whether confidential information exists in the subdocument.

2. The method for extracting the nested document according to claim 1, wherein after the target document to be extracted is obtained, the method further comprises:

determining a document type corresponding to the target document;

the reading of the document directory corresponding to the target document includes:

and reading a document directory corresponding to the target document according to the document type.

3. The method for extracting the nested document according to claim 2, wherein the reading the document directory corresponding to the target document according to the document type includes:

if the document type is a first-type version format, reading a compound document directory corresponding to the target document, wherein the first-type version format comprises any one of the doc format, the xls format and the ppt format;

if the document type is a second-type version format, decompressing the target document and reading a multi-level document directory corresponding to the decompressed target document, wherein the second-type version format comprises any one of the docx format, the xlsx format and the pptx format.

4. A nested document extraction method according to claim 3, wherein extracting all sub-documents nested in the target document from the document directory comprises:

and extracting all the sub-documents nested in the target document by reading all the folders of the preset file in the compound document directory.

5. A nested document extraction method according to claim 3, wherein extracting all sub-documents nested in the target document from the document directory comprises:

reading preset subdirectories under a multilevel document directory corresponding to the target document;

and extracting all the subdocuments stored in the preset subdirectory.

6. A nested document extraction method according to any one of claims 1 to 5, wherein before identifying the document content of each subdocument separately, the method further comprises:

judging whether the subdocuments are single documents or nested documents;

and if a subdocument is a nested document, taking that subdocument as the target document and iteratively performing the step of extracting all the subdocuments nested in the target document from the document directory.

7. A nested document extraction method according to claim 6, wherein the identifying the document content of each subdocument separately comprises:

and if the subdocuments are single documents, directly extracting and identifying the contents of the subdocuments according to the document formats corresponding to the subdocuments.

8. A nested document extraction apparatus, comprising:

a directory reading module, configured to acquire a target document to be extracted and to read a document directory corresponding to the target document;

a document extraction module, configured to extract all the sub-documents nested in the target document from the document directory;

and a content identification module, configured to identify the document content of each subdocument separately, so as to determine whether confidential information exists in the subdocument.

9. An electronic device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the nested document extraction method of any one of claims 1 to 7 when executing said computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the steps of the nested document extraction method of any one of claims 1 to 7.