WO2011020360A1

WO2011020360A1 - Document storage method

Info

Publication number: WO2011020360A1
Application number: PCT/CN2010/073412
Authority: WO
Inventors: 王东临; 郭旭; 刘宁胜
Original assignee: Sursen Corp
Current assignee: Sursen Corp
Priority date: 2009-08-19
Filing date: 2010-06-01
Publication date: 2011-02-24
Anticipated expiration: 2012-02-19

Abstract

A document storage method is disclosed. The method discloses: mapping the content of a document to the tree structure according to the designated document types, in which the type information of each node is defined in the document type; storing the content of each node for constructing the tree structure and the information of the relationship between nodes; recording the storage location of the node data in the node index table; in which the node data comprises the node content; storing the node index table. With the storage method, it can improve the access performance of the document.

Description

Method for storing documents

Technical field

本发明涉及计算机存储技术，具体涉及一种文档的存储方法。 The present invention relates to computer storage technology, and in particular to a method for storing a document.

Background technique

可扩展标记语言（XML）是一种文档描述语言，利用该语言描述的文档称为XML文档。Extensible Markup Language (XML) is a document description language in which documents described in this language are called XML documents.

XML文档有很多优势。XML是一种元标记语言，开发者可以根据自己的需要定义自己的标记，XML的文档是有明确语义并且是结构化的，抗破坏能力很强， XML表示的信息独立于平台的，这里的平台即可以理解为不同的应用程序也可以理解为不同的操作系统，对于大型复杂的文档，XML不仅允许指定文档中的词汇，还允许指定元素之间的关系。XML文档由文档类型定义（DTD）/模式（Schema）和XML文本组成，DTD/ Schema就是一组标记符的语法规则，表明XML文本是怎么样组织的，这样的组成方式使XML文档实现了内容与形式的分离，成就了上述众多优点。XML documents have many advantages. XML is a meta-markup language. Developers can define their own tags according to their needs. XML documents are well-defined and structured, and have strong anti-destructive capabilities. The information represented by XML is platform-independent. The platform here can be understood as different applications and can be understood as different operating systems. For large and complex documents, XML not only allows the vocabulary in the specified document, but also allows the specified elements to be between Relationship. An XML document consists of a document type definition (DTD)/schema (Schema) and XML text, DTD/ Schema is a grammar rule for a set of tags, indicating how the XML text is organized. This way of composition makes the XML document realize the separation of content and form, which achieves many of the above advantages.

但是，XML文档的缺点在于，在将文档存储为XML文档后，如果要访问某个对象，需要首先对整个文档进行解析，将其转换为树型结构的组织方式，再搜索要访问的对象，进行访问。可以看出，应用上述存储方式存储某文档后，当用户对文档的部分内容进行访问时，系统需要首先耗费资源对整个文档进行解析，然后才能选择用户感兴趣的内容进行显示，延长了处理时间，进而降低了访问性能，浪费了系统资源。However, the disadvantage of XML documents is that after storing a document as an XML document, if you want to access an object, you need to first parse the entire document, convert it into a tree structure, and then search for the object to be accessed. Make an access. It can be seen that after the storage method is used to store a certain document, when the user accesses part of the content of the document, the system needs to firstly analyze the entire document by using resources, and then select the content that the user is interested in for display, thereby prolonging the processing time. , which reduces access performance and wastes system resources.

technical problem

有鉴于此，本发明提供一种文档的存储方法，能够提高访问性能。 In view of this, the present invention provides a method for storing a document, which can improve access performance.

Technical solution

为实现上述目的，本发明采用如下的方案：To achieve the above object, the present invention adopts the following scheme:

一种文档的存储方法，包括：A method of storing a document, comprising:

根据指定的文档类型，将文档的内容映射为树形结构；其中，各结点的类型信息在文档类型中定义；Maps the content of the document to a tree structure according to the specified document type; wherein the type information of each node is defined in the document type;

存储构成该树形结构的各结点的内容以及结点之间的关系信息；Storing the contents of the nodes constituting the tree structure and the relationship information between the nodes;

在结点索引表中记录结点数据的存储位置；其中，所述结点数据包括结点内容；Recording a storage location of the node data in the node index table; wherein the node data includes a node content;

存储结点索引表。Store the node index table.

Beneficial effect

利用上述技术方案在存储某文档时，首先根据该文档的文档类型，将文档内容映射为对应树型结构中的各个结点及其数据。通过此种方式，使文档由结点构成，各个结点间通过树型结构组织起来。然后，存储文档对应的各个结点数据，并且将结点数据的存储位置记录在了结点索引表中。通过上述方式，就可以将一个文档按照树型结构在存储介质中进行存储，同时采用带结点索引的存储格式。对于此类采用树型结构和结点索引方式存储后的文档，当用户对文档的部分内容进行访问时，可以直接对该内容对应的结点进行访问，而不需要预先对整个文档进行处理，节省了系统资源，提高了访问性能，实现结点的快速搜索。When storing a certain document by using the above technical solution, first, according to the document type of the document, the document content is mapped to each node in the corresponding tree structure and the data thereof. In this way, the document is made up of nodes, and the nodes are organized by a tree structure. Then, each node data corresponding to the document is stored, and the storage location of the node data is recorded in the node index table. In the above manner, a document can be stored in a storage medium in a tree structure, and a storage format with a node index is used. For such a document stored in a tree structure and a node index manner, when the user accesses part of the content of the document, the node corresponding to the content can be directly accessed without the need to process the entire document in advance. It saves system resources, improves access performance, and enables fast search of nodes.

另外，本发明的方法依然保留文档类型的配置，继承了XML文档的优势。利用本发明的文档存储方法可以支持复杂树形结构数据的存储、存储空间小、扩展性好、可增量修改、易于提高存储安全性。In addition, the method of the present invention still retains the configuration of the document type, inheriting the advantages of the XML document. The document storage method of the invention can support the storage of complex tree structure data, small storage space, good expansibility, incremental modification, and easy to improve storage security.

DRAWINGS

图1为本发明的文档存储方法总体流程图。FIG. 1 is a general flow chart of a document storage method of the present invention.

图2为本发明实施例中文档存储方法具体流程图。FIG. 2 is a specific flowchart of a document storage method according to an embodiment of the present invention.

图3为块号数组及数组内容对应数据块的示意图。FIG. 3 is a schematic diagram of a block number array and an array block corresponding data block.

图4为一个空闲块表的示意图。Figure 4 is a schematic diagram of a free block table.

图5为收缩文档存储空间的流程图。Figure 5 is a flow chart of shrinking the document storage space.

BEST MODE FOR CARRYING OUT THE INVENTION

为使本发明的目的、技术手段和优点更加清楚明白，以下结合附图对本发明作进一步详细说明。In order to make the objects, technical means and advantages of the present invention more comprehensible, the present invention will be further described in detail below with reference to the accompanying drawings.

本发明的基本思想是：预先配置由至少一种结点类型定义的具有树型结构的文档类型，定义包括结点索引表和结点数据区的文档存储格式。在存储某文档时，首先根据该文档的文档类型，将文档内容映射为对应树型结构中的各个结点及其数据。然后，根据文档长度，在存储介质中为文档分配一定的存储空间，在该存储空间中进一步为结点索引表和结点数据区分别分配各自的存储区域。最后，将文档对应的各个结点数据存储到结点数据区，并在结点索引表中记录每个结点与该结点数据存储位置的对应关系。The basic idea of the present invention is to pre-configure a document type having a tree structure defined by at least one node type, and define a document storage format including a node index table and a node data area. When storing a document, the document content is first mapped to the corresponding nodes and their data in the corresponding tree structure according to the document type of the document. Then, according to the length of the document, a certain storage space is allocated to the document in the storage medium, and the storage area is further allocated to the node index table and the node data area in the storage space. Finally, each node data corresponding to the document is stored in the node data area, and the correspondence between each node and the storage location of the node data is recorded in the node index table.

在本发明的实施例中，为了叙述的方便，可以将依照上述存储方式存储的文档统称为SurXml文档（由Sursen公司定义的XML文档）。In the embodiment of the present invention, for the convenience of description, the files stored according to the above storage manner may be collectively referred to as a SurXml document. (XML document defined by Sursen).

图1为本发明的文档存储方法总体流程图。如图1所示，该方法包括：FIG. 1 is a general flow chart of a document storage method of the present invention. As shown in Figure 1, the method includes:

步骤101，预先配置至少一个结点类型定义的具有树型结构的文档类型；定义包括结点索引表和结点数据区的文档存储格式。Step 101: Preconfigure a document type having a tree structure defined by at least one node type; and define a document storage format including a node index table and a node data area.

本发明的核心是将文档以树型结构直接存储到存储介质中，本步骤中配置的文档类型即代表了树型结构的具体组织方式。The core of the present invention is to store the document directly in the storage medium in a tree structure. The document type configured in this step represents the specific organization of the tree structure.

步骤102，参照步骤101中配置的文档类型，确定并存储文档的文档类型，并根据该文档类型，将文档的内容映射为对应树型结构中的各个结点及其数据。Step 102: Determine, according to the document type configured in step 101, the document type of the document, and map the content of the document to each node in the corresponding tree structure and the data according to the document type.

步骤103，根据对应树型结构中的各个结点及其数据的长度，为文档分配存储空间，在分配的存储空间中为结点索引表和结点数据区分配各自的存储区域。Step 103: Allocate a storage space for the document according to the length of each node in the corresponding tree structure and the length of the data, and allocate a storage area for the node index table and the node data area in the allocated storage space.

步骤104，在结点数据区中存储结点的数据，并在结点索引表中记录每个结点与该结点数据存储位置的对应关系。简而言之，即在结点索引表中记录结点数据的存储位置。Step 104: Store the data of the node in the node data area, and record the correspondence between each node and the storage location of the node data in the node index table. In short, the storage location of the node data is recorded in the node index table.

至此，本发明的存储方法流程结束。在上述流程中，步骤102和步骤103是顺序执行的，事实上，也可以先执行步骤103再执行步骤102，或者两个步骤同时执行。So far, the storage method flow of the present invention ends. In the above process, step 102 and step 103 are performed sequentially. In fact, step 103 may be performed first and then step 102 may be performed, or two steps may be performed simultaneously.

在以上描述中，虽然给出了“结点数据区”的概念，但本领域技术人员可以理解，该概念是为了描述的方便，并不作为必要技术特征。另外，本领域技术人员还可以理解步骤103为一个可选的步骤，即，在将文档的内容映射为树型结构中的各个结点及其数据后，可以存储结点的数据，并记录结点数据的存储位置。In the above description, although the concept of "node data area" is given, those skilled in the art can understand that the concept is for convenience of description and is not a necessary technical feature. In addition, those skilled in the art can also understand that step 103 is an optional step, that is, after mapping the content of the document into each node in the tree structure and its data, the data of the node can be stored, and the node is recorded. The storage location of the point data.

其中，结点数据通常包括结点的内容，在本发明一实施例中，进一步包括与其他结点的关系信息。值得注意的是，在本发明一实施例中，在结点数据区中存储的是各结点的内容以及结点之间的关系信息，但是在结点索引表记录的却仅仅是结点内容的存储位置，即结点之间的关系信息可以不记录在结点索引表中。The node data usually includes the content of the node. In an embodiment of the present invention, the relationship information with other nodes is further included. It should be noted that, in an embodiment of the present invention, the content of each node and the relationship information between the nodes are stored in the node data area, but only the node content is recorded in the node index table. The storage location, that is, the relationship information between the nodes, may not be recorded in the node index table.

以上注释性描述也同样适应于以下所有实施例，不再赘述。The above annotated descriptions are also applicable to all of the following embodiments, and are not described again.

Embodiments of the invention

上述为本发明的文档在存储介质中存储方法的总体概述，下面通过具体实施例说明该存储方法的具体实施方式。The foregoing is a general overview of a method for storing a document in the storage medium of the present invention. Specific embodiments of the storage method are described below by way of specific embodiments.

实施例：Example:

图2为本发明实施例中文档存储方法具体流程图。如图2所示，该方法包括：FIG. 2 is a specific flowchart of a document storage method according to an embodiment of the present invention. As shown in Figure 2, the method includes:

步骤201，预先配置由至少一个结点类型定义的且具有树型结构的文档类型，定义包括文件头、结点索引表和结点数据区的文档存储格式。Step 201: Preconfigure a document type defined by at least one node type and having a tree structure, and define a document storage format including a file header, a node index table, and a node data area.

本发明一实施例中，配置的文档类型包括下列信息：a、文档中包含哪些类型的结点（结点类型）；b、文档中包含的各种类型的结点的名称、结点中包含的各个属性的名称/类型；c、各种结点之间可能的父子关系。通过上述信息，文档类型即可以描述一种特定的树型结构。举一个简单的文档类型的例子，在该文档类型中，根结点下有若干页面（page）子结点，每个page子结点下有若干文本（text）和图像（image）子结点。具体来说，每个节点类型都定义有一个结点类型标记，唯一标示一个结点类型。本步骤中还定义了带结点索引的文档存储格式，具体为文件头、结点索引表和结点数据区。其中，文件头用于提供访问文件的入口信息，并不能理解为访问文件的入口信息一定存储在文件的头部，实际上，可以约定将入口信息存储在文件的任何部分，如文件的尾部；结点索引表用于记录各结点数据的具体存储位置，以方便检索各结点数据；结点数据区用于存储各结点数据。In an embodiment of the present invention, the configured document type includes the following information: a, which types of nodes (node types) are included in the document; b, names of various types of nodes included in the document, and nodes included in the node The name/type of each attribute; c, possible parent-child relationship between various nodes. With the above information, the document type can describe a specific tree structure. As an example of a simple document type, in the document type, there are several page sub-nodes under the root node, and there are several text and image sub-nodes under each page sub-node. . Specifically, each node type defines a node type tag that uniquely identifies a node type. In this step, a document storage format with a node index is also defined, which is specifically a file header, a node index table, and a node data area. Wherein, the file header is used to provide entry information for accessing the file, and it cannot be understood that the entry information for accessing the file must be stored at the head of the file. In fact, it may be agreed to store the entry information in any part of the file, such as the end of the file; The node index table is used to record the specific storage location of each node data to facilitate retrieval of each node data; the node data area is used to store each node data.

当要存储某文档时，执行以下操作。When you want to store a document, do the following.

步骤202，参照步骤201中配置的文档类型，确定要存储文档的文档类型，并存储该文档类型信息。Step 202: Referring to the document type configured in step 201, determine a document type of the document to be stored, and store the document type information.

本步骤中，确定文档类型后，将该文档类型信息进行存储。这里涉及到文档类型的表示和文档类型信息的存储两方面的内容。In this step, after the document type is determined, the document type information is stored. This involves both the representation of the document type and the storage of the document type information.

文档类型可以通过两种格式来表示：类型表，如自定义的类型表，以及XML Schema/DTD/Relax NG，其中，Relax NG是由OASIS提出的一种文档类型定义方式。Document types can be represented in two formats: type tables, such as custom type tables, and XML Schema/DTD/Relax NG, where Relax NG is a document type definition proposed by OASIS.

（一）类型表格式：(1) Type table format:

类型表是一种数据结构，任何能够描述文档类型信息的数据结构，都可以认为是类型表。这里给出一种示例实现，但实际可能的实现并不限于该示例：A type table is a data structure. Any data structure that can describe information about a document type can be considered as a type table. An example implementation is given here, but the actual possible implementation is not limited to this example:

属性的定义，包括名称和类型。这里，可以为每一个属性定义唯一标记，利用属性标记作为属性的名称；属性的类型可以是属性的数据类型，也可以为每个属性类型定义唯一标记。The definition of the attribute, including the name and type. Here, you can define a unique tag for each attribute, using the attribute tag as the name of the attribute; the type of the attribute can be the data type of the attribute, or you can define a unique tag for each attribute type.

struct attrStruct attr

{{

string name; String name;

string type; String type;

};};

结点的定义：包括结点名称、属性列表、子结点名称列表；The definition of the node: including the node name, attribute list, and list of child node names;

struct nodeStruct node

{{

string name; String name;

vector<attr> attr_list; Vector<attr> attr_list;

vector<string> sub_node_name_list; Vector<string> sub_node_name_list;

};};

文档定义：指定根结点名称Document definition: Specify the root node name

struct documentStruct document

{{

string root_node_name; String root_node_name;

};};

这是一种非常简单的定义方法，对迭代、选择、序列、重复等结构都没有加入，但大部分文档的类型信息也都能够使用上述结构进行描述。This is a very simple definition method that does not include iterations, selections, sequences, repetitions, etc., but most of the document type information can also be described using the above structure.

（二）XML DTD/Schema/Relax NG格式(2) XML DTD/Schema/Relax NG format

XML文档虽然在物理存储上是线性的，但逻辑上也是由结点构成的树型结构。因此，用于描述XML文档类型的DTD/Schema/Relax NG等格式也可以用于描述本发明的文档类型。Although XML documents are linear in physical storage, they are logically a tree structure composed of nodes. Therefore, the DTD/Schema/Relax used to describe the XML document type Formats such as NG can also be used to describe the type of document of the present invention.

此外，无论是DTD、Schema或Relax NG，都支持为结点定义名称、属性、子结点；与类型表格式给出的简单方法相比，这种方法对子结点的迭代、选择、序列、重复有很好的支持，能够描述更复杂的文档类型。In addition, whether it is DTD, Schema or Relax NG, which supports defining names, attributes, and child nodes for nodes; this method has good support for iteration, selection, sequence, and repetition of child nodes, compared with the simple method given by the type table format. Describe more complex document types.

这里给出一个使用Schema描述的文档类型定义，DTD和Relax NG的形式虽然不同，但三者在实质上是比较类似的。另外，实际文档类型并不受限于该示例：Here is a document type definition using the Schema description, DTD and Relax Although the form of NG is different, the three are similar in nature. In addition, the actual document type is not limited to this example:

<?xml version="1.0" encoding="UTF-8"?><?xml version="1.0" encoding="UTF-8"?>

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified"><xs:schema Xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" attributeFormDefault="unqualified">

<xs:element name="document_root"> <xs:element name="document_root">

<xs:annotation> <xs:annotation>

<xs:documentation>Comment describing your root element</xs:documentation> <xs:documentation>Comment describes your Root element</xs:documentation>

</xs:annotation> </xs:annotation>

<xs:complexType> <xs:complexType>

<xs:sequence> <xs:sequence>

<xs:element name="page_obj" type="page"/> <xs:element name="page_obj" type="page"/>

</xs:sequence> </xs:sequence>

</xs:complexType> </xs:complexType>

</xs:element> </xs:element>

<xs:complexType name="page"> <xs:complexType name="page">

<xs:sequence> <xs:sequence>

<xs:choice> <xs:choice>

<xs:element name="text_obj" type="text"/> <xs:element name="text_obj" type="text"/>

<xs:element name="image_obj" type="image"/> <xs:element name="image_obj" Type="image"/>

</xs:choice> </xs:choice>

</xs:sequence> </xs:sequence>

</xs:complexType> </xs:complexType>

<xs:complexType name="text"> <xs:complexType name="text">

<xs:attribute name="encoding" type="xs:string" use="required"/> <xs:attribute name="encoding" type="xs:string" Use="required"/>

<xs:attribute name="text_str" type="xs:string" use="required"/> <xs:attribute name="text_str" type="xs:string" Use="required"/>

</xs:complexType> </xs:complexType>

<xs:complexType name="image"> <xs:complexType name="image">

<xs:attribute name="type" type="xs:string" use="required"/> <xs:attribute name="type" type="xs:string" Use="required"/>

<xs:attribute name="data" type="xs:string" use="required"/> <xs:attribute name="data" type="xs:string" Use="required"/>

</xs:complexType> </xs:complexType>

</xs:schema></xs:schema>

通过上述两种方式的任意一种都可以表示文档类型。下面介绍文档类型信息的存储方式。The document type can be represented by either of the above two methods. The following describes how the document type information is stored.

存储文档类型信息时采用线性字符序列进行存储。具体地，对于利用类型表格式表示的文档类型，可以根据所用的自定义类型表，规定一组序列化函数，将类型表的数据转换为线性字符序列；对于利用DTD/Schema/Relax NG格式表示的文档类型，因为DTD/Schema/Relax NG文件本身就已经是线性字符序列，所以可以直接存储DTD/Schema/Relax NG文件本身。当然，在存储该文档类型信息前，也可以根据需要，预先对要存储的文档类型信息进行加密、压缩和变换等处理，然后将处理结果作为文档类型信息存储。A linear character sequence is stored for storing document type information. Specifically, for the document type represented by the type table format, a set of serialization functions may be specified according to the custom type table used, and the data of the type table is converted into a linear character sequence; for using DTD/Schema/Relax The document type represented by the NG format, because the DTD/Schema/Relax NG file itself is already a linear character sequence, so DTD/Schema/Relax can be stored directly. The NG file itself. Of course, before storing the document type information, the document type information to be stored may be pre-processed, such as encryption, compression, and transformation, and then the processing result is stored as document type information.

存储文档类型信息时，可以将文档类型信息进行远程存储、本地存储或存储在程序逻辑中。下面分别介绍三种存储位置下，存储文档类型信息的不同实现方式。When storing document type information, document type information can be stored remotely, stored locally, or stored in program logic. The following describes the different implementations of storing document type information in three storage locations.

（一）文档类型信息可以在本地存储，所谓本地存储是指在存储文档的具体文件中存储文档类型信息。这里给出一种示例方法：(1) Document type information can be stored locally, and so-called local storage refers to storing document type information in a specific file in which a document is stored. Here is an example method:

在文件中，增加一个自定义的存储区域，用于存储文档类型信息，指定该区域的方法，包括但不限于下述方法：自文件中某个指定位置开始的一段指定长度的区域；在文档类型的树型结构中增加一个特定的结点或属性，作为文档类型信息的存储。In the file, add a custom storage area for storing document type information, and specify the method of the area, including but not limited to the following method: a specified length of area starting from a specified position in the file; in the document A specific node or attribute is added to the type's tree structure as a storage for document type information.

文档类型信息存储在本地后，访问该文档的程序则默认使用文档内部的文档类型信息。After the document type information is stored locally, the program accessing the document uses the document type information inside the document by default.

（二）文档类型信息可以在远程存储，所谓远程存储是指在存储文档外部的其它文件系统中存储文档类型信息。文档类型信息在远程存储时，包括但不限于下述方法：远程或分布式文件系统，如网络文件系统（NFS）、WIN2000分布式文件系统（DFS）、安德鲁文件系统（AFS）；本地文件系统；网页（WEB）服务器；文件传输协议（FTP）服务器。将文档类型信息存储在远程后，还要将远程的文档类型信息的URL或路径信息，存储到SurXml文档内部，选择存储位置的方法，与本地存储方法选择文档类型信息存储位置的方法相同。访问SurXml文档的程序根据文档中保存的URL或路径信息查找文档类型信息。(2) The document type information can be stored remotely, and the so-called remote storage refers to storing document type information in other file systems external to the stored document. Document type information when stored remotely, including but not limited to the following methods: remote or distributed file systems, such as Network File System (NFS), WIN2000 Distributed File System (DFS), Andrew File System (AFS); local file system Web page (WEB) server; File Transfer Protocol (FTP) server. After the document type information is stored in the remote, the URL or path information of the remote document type information is also stored in the SurXml document, and the method of selecting the storage location is the same as the method of selecting the storage location of the document type information by the local storage method. The program that accesses the SurXml document finds the document type information based on the URL or path information saved in the document.

（三）除上述两种对文档类型信息进行显式（即存储在文档内部或外部）的存储外，还可以将文档类型信息存储到访问SurXml文档的程序逻辑中。具体地，可以通过一组应用程序接口（API）函数进行硬编码，在访问SurXml文档内容之前，应用程序需要调用该API函数，在内存中建立文档类型信息数据；或者，直接将文档类型信息存储到访问SurXml文档的应用程序的源代码或二进制映像中，访问SurXml文档的程序可以直接将文档类型信息复制到内存中使用。这种非显式的存储方法，只能支持数量有限的文档类型，而且程序中需要为每一种文档类型分配ID，在SurXml文档中需要存储所使用的文档类型的ID。(3) In addition to the above two types of storage of document type information (ie, stored inside or outside the document), the document type information may also be stored in the program logic for accessing the SurXml document. Specifically, it can be hard coded through a set of application program interface (API) functions. Before accessing the contents of the SurXml document, the application needs to call the API function to create document type information data in the memory; or directly store the document type information. In the source code or binary image of the application that accesses the SurXml document, the program that accesses the SurXml document can directly copy the document type information into memory for use. This non-explicit storage method can only support a limited number of document types, and the program needs to assign an ID to each document type. In the SurXml document, it is necessary to store the ID of the document type used.

存储文档类型信息时，还可以是采取上述三种存储方式中任意组合。如将部分文档类型信息进行远程存储、将部分文档类型信息进行本地存储，或将部分文档类型信息存储在程序逻辑中。When storing the document type information, it is also possible to adopt any combination of the above three storage methods. For example, some document type information is stored remotely, part of the document type information is stored locally, or part of the document type information is stored in the program logic.

步骤203，根据文档类型，将文档的内容映射为对应树型结构中的各个结点及其数据。Step 203: Map the content of the document to each node in the corresponding tree structure and its data according to the document type.

本实施例中，定义结点的数据包括结点的内容信息和位置信息。具体地，结点的内容信息用来描述本结点对应的文档内容，包括结点类型标记、结点长度、结点的属性的名称/标记和值；结点的位置信息用来描述本结点在整个文档对应的树型结构中的位置，又可称为相关结点索引信息。该相关结点索引信息包括结点的父结点ID、结点左右兄弟结点ID、结点最左侧子结点ID、结点的子结点数目、结点的所有子结点ID。其中，结点左右兄弟结点ID和结点的所有子结点ID是可选项，没有这两项内容，也能够将本结点在树型结构中的位置表述清楚，增加这两项内容的目的是提高检索速度。In this embodiment, the data defining the node includes content information and location information of the node. Specifically, the content information of the node is used to describe the content of the document corresponding to the node, including the node type tag, the node length, the name/tag and the value of the node attribute; the location information of the node is used to describe the node. The position of the point in the tree structure corresponding to the entire document may also be referred to as related node index information. The related node index information includes a parent node ID of the node, a left and right brother node ID of the node, a leftmost child node ID of the node, a number of child nodes of the node, and all child node IDs of the node. Among them, the ID of the left and right brother nodes and all the child node IDs of the nodes are optional. Without these two contents, the position of the node in the tree structure can be clearly expressed, and the two contents are added. The goal is to increase the speed of retrieval.

本步骤中，根据文档类型将文档内容映射为不同的结点及其数据。例如，映射PDF文档时，将每个页面映射为一个page结点，将该页面内的文字信息部分和图像信息部分分别映射为该page结点的两个子结点，结点ID分别为A和B。Page结点的内容信息包括：结点类型标记为page结点，结点长度值，结点属性的名称/标记和值包括页眉、页脚和页码等信息。Page结点的相关结点索引信息包括：父结点为PDF文档的根节点，左右兄弟结点为其它page结点，最左侧子结点ID为A，子节点数目为2，所有子节点ID为A、B。In this step, the document content is mapped to different nodes and their data according to the document type. For example, when mapping a PDF document, each page is mapped to a page node, and the text information part and the image information part in the page are respectively mapped to two child nodes of the page node, and the node IDs are respectively A and B. The content information of the page node includes: the node type is marked as a page node, the node length value, the name/tag and the value of the node attribute include information such as a header, a footer, and a page number. The relevant node index information of the page node includes: the parent node is the root node of the PDF document, the left and right sibling nodes are other page nodes, the leftmost child node ID is A, the number of child nodes is 2, and all child nodes The ID is A and B.

步骤204，根据各个结点及其数据的长度，为文档分配存储空间，在存储空间中为文件头、结点索引表和结点数据区分别分配各自的存储区域。Step 204: Allocate a storage space for the document according to the length of each node and its data, and allocate respective storage areas for the file header, the node index table, and the node data area in the storage space.

一般来说，文件头的长度或者是固定的，或者比较短，不需要复杂的存储分配机制支持，可以直接对其分配固定的存储区域；而对于结点索引表和结点数据区来说，其长度则随着结点数目的增加而增加，因此，在本实施例中，采用可收缩的存储分配、回收机制来对结点索引表和结点数据区的存储区域进行分配和组织。另外，该机制还用于大对象的存储分配。Generally speaking, the length of the file header is either fixed or short, and does not require complicated storage allocation mechanism support, and can directly allocate a fixed storage area; and for the node index table and the node data area, The length thereof increases as the number of nodes increases. Therefore, in the present embodiment, the shrinkable storage allocation and recovery mechanism is used to allocate and organize the storage areas of the node index table and the node data area. In addition, this mechanism is also used for storage allocation of large objects.

本发明一实施例中采用的可收缩的存储分配、回收机制为类似于UNIX文件系统中inode/freelist的机制。具体地，将结点索引表和结点数据区分别看做一个UNIX文件，各自对应一个inode。The shrinkable storage allocation and recovery mechanism employed in an embodiment of the present invention is similar to the inode/freelist mechanism in the UNIX file system. Specifically, the node index table and the node data area are respectively treated as a UNIX file, and each corresponds to an inode.

在inode/freelist机制中，整个存储空间分为三大部分：超级块、inode表和数据块，如表1所示。In the inode/freelist mechanism, the entire storage space is divided into three parts: super block, inode table and data block, as shown in Table 1.

超级块 Super block Inode 表 Inode table 数据块 data block

表 1 Table 1

与 UNIX 文件系统类似，在本实施例中，表 1 的超级块和 inode 表用于存储空间的分配和组织，而实际的数据位于数据块部分。 Similar to the UNIX file system, in this embodiment, the super block and inode of Table 1 Tables are used for the allocation and organization of storage space, while the actual data is in the data block section.

表 2 是超级块内部的结构。 Table 2 shows the structure inside the super block.

空闲块表 Free block table 空闲 Inode 表 Idle Inode Table Inode0 Inode0

表 2 Table 2

在表2中，空闲块表用于存储空间的分配和回收。由于inode表是固定大小的一个区域，其记载的inode数目也是确定的，利用空闲INODE表就可以管理inode表中的空闲空间。inode0是本发明一实施例对原有的unix文件系统中超级块的一个扩充，用于对inode表所占据的存储空间进行扩充和收缩。在该实施例中，可以是在超级块中加入一个特殊的inode——inode0，其中记录了inode表所位于的数据块号，这种情况下，inode表位于数据块中。在本发明另一实施例中，超级块中记录的是inode0是一个inode号码，inode0本身存储在inode表中。In Table 2, the free block table is used for allocation and reclamation of storage space. Since the inode table is a fixed-size area, the number of inodes recorded is also determined. The free space in the inode table can be managed by using the idle INODE table. Inode0 is an extension of the super block in the original unix file system according to an embodiment of the present invention, and is used for expanding and shrinking the storage space occupied by the inode table. In this embodiment, a special inode, inode0, may be added to the superblock, where the block number in which the inode table is located is recorded, in which case the inode table is located in the data block. In another embodiment of the present invention, the inode 0 is recorded in the super block as an inode number, and the inode 0 itself is stored in the inode table.

下面简单介绍一下UNIX文件系统中inode与文件的关系。每个文件都对应于一个inode，inode用于记录其所对应的文件中所包含的数据块的块号，也即指明该inode所对应的文件数据的具体存储位置。在每个inode内部保存有一个块号数组，该块号数组的前若干项记录了存储该inode对应文件数据的数据块块号，在该块号数组的最后三项分别记录间接块、二次间接块和三次间接块的块号。其中，所谓间接块是指记录了数据块块号的数据块，通过间接块的内容，可以找到对应数据块的块号，再进一步根据该块号找到存储文件数据的数据块；所谓二次间接块是指记录了间接块块号的数据块，三次间接块是指记录了二次间接块块好的数据块。图3即为块号数组及数组内容对应数据块的示意图。其中，301为数据块、302为间接块、303为二次间接块、304为三次间接块。The following briefly introduces the relationship between inodes and files in the UNIX file system. Each file corresponds to an inode. The inode is used to record the block number of the data block contained in the file corresponding to it, that is, the specific storage location of the file data corresponding to the inode. An array of block numbers is stored in each inode. The first few items of the block number record record the block number of the block in which the file data of the inode is stored, and the indirect block is recorded in the last three items of the block number array. The block number of the indirect block and the cubic indirect block. The so-called indirect block refers to the data block in which the block number of the data block is recorded. Through the content of the indirect block, the block number of the corresponding data block can be found, and the data block storing the file data is further found according to the block number; A block refers to a data block in which an indirect block number is recorded, and a cubic indirect block refers to a data block in which a secondary indirect block is recorded. Figure 3 is a schematic diagram of a block number array and an array block corresponding to the data block. Wherein, 301 is a data block, 302 is an indirect block, 303 is a secondary indirect block, and 304 is a cubic indirect block.

对于比较小的文件，可以只利用inode中前若干数据块块号项记录文件的存储位置；对于比较大的文件，除利用块号数组中的数据块块号项外，还可以利用最后的间接块、二次间接块和三次间接块的块号项记录文件的存储位置，尤其对于结点数据区的存储，间接块可以发挥很大的作用。当然，间接块的块号项数量是可以任意设置的，可以继续包括四次间接块、五次间接块等，也可以没有，视inode对应的文件大小而定。For relatively small files, you can use only the first block of data blocks in the inode to record the storage location of the file; for larger files, you can use the last indirect in addition to the block number of the block in the block number array. The storage location of the block number entry of the block, the secondary indirect block, and the cubic indirect block, especially for the storage of the node data area, can play a large role. Of course, the number of block number entries of the indirect block can be arbitrarily set, and can continue to include four indirect blocks, five indirect blocks, or the like, or not, depending on the file size corresponding to the inode.

在一实施例中，将结点索引表和结点数据区均视为一个UNIX文件，假定为结点索引表设置一个inode，标记为inode1，为结点数据区设置一个inode，标记为inode2；则inode1记录结点索引表的具体存储位置，inode2记录结点数据区的具体存储位置。In an embodiment, the node index table and the node data area are both regarded as a UNIX file, and an inode is set for the node index table, marked as inode1, and an inode is set for the node data area, and is marked as inode2; Then inode1 records the specific storage location of the node index table, and inode2 records the specific storage location of the node data area.

步骤205，在结点数据区中存储结点的数据，并在结点索引表中记录每个结点与该结点数据存储位置的对应关系。Step 205: Store the data of the node in the node data area, and record the correspondence between each node and the storage location of the node data in the node index table.

结点数据在结点数据区中的存储方式有三种：tlv方式、SlottedPage方式和inode方式。There are three ways to store node data in the node data area: tlv mode, SlottedPage mode, and inode mode.

tlv即结点类型标记（tag）＋结点长度（length）＋结点的值（value），在这种存储方式下，所有结点按一定顺序排列，每个结点内部，类型名称在前，接下来存储的是结点长度，最后是结点的属性值和其它相关结点的ID。Tlv is the node type tag (tag) + node length (length) + node value (value), in this storage mode, all nodes are arranged in a certain order, inside each node, the type name is in front Next, the node length is stored, and finally the attribute value of the node and the ID of other related nodes.

在结点数据区中，以tlv方式顺序存储的结点是从头部开始线性排列的。在将所有结点数据存储完毕后，结点数据区可能还有一定量未使用的空闲区域。本实施例中，在结点数据区末尾处，记录空闲区域开始处在数据区中的偏移量，以方便管理空闲区域。当然也可以预先在结点数据区中预留一个结点数据区以记录空闲区域开始处在数据区中的偏移量。In the node data area, the nodes sequentially stored in the tlv manner are linearly arranged from the head. After all node data has been stored, there may be a certain amount of unused free area in the node data area. In this embodiment, at the end of the node data area, the offset of the free area starting in the data area is recorded to facilitate management of the free area. It is of course also possible to reserve a node data area in the node data area in advance to record the offset of the free area starting in the data area.

下面介绍SlottedPage方式。为实现该存储方式，将结点数据区分成大小固定的页，每个结点都位于某个特定的页上，一个页内可以存储多个结点。在页内部的前后两端，分别记录页面内部各个结点偏移量的数组和结点的数据，偏移量数组和结点数据相对增长。在SlottedPage存储方式中，在页面内部，结点的数据是可以在页面的空闲区域内自由移动的。这种方式，在修改结点数据，特别是结点数据长度产生变化时，是比较灵活的，而且分页式的存储比较适合于使用缓存的环境；缺点在于，结点的长度会受到页面大小的制约，只适用于比较小的结点。The following describes the SlottedPage method. To achieve this storage method, the node data is divided into fixed-size pages, each node is located on a specific page, and multiple nodes can be stored in one page. At the front and the back of the page, the data of the array and node of each node offset in the page are recorded respectively, and the offset array and the node data are relatively increased. In the SlottedPage storage mode, inside the page, the data of the node can be freely moved in the free area of the page. In this way, it is more flexible when modifying the node data, especially when the length of the node data changes, and the paging storage is more suitable for the environment using the cache; the disadvantage is that the length of the node is affected by the page size. Constraints only apply to smaller nodes.

对长度比较大的结点或属性，比如图像、视频、音频等数据，无论是以tlv或SlottedPage的方式进行存储，都不太合适。如果使用tlv方式存储结点数据，在加载该结点数据时，会占据大量的内存；而如果使用SlottedPage方式，则由于页面大小的限制，无法建立比较大的结点对象。Nodes or attributes with large lengths, such as images, video, audio, etc., are not suitable for storage in the form of tlv or slotted pages. If you use the tlv method to store the node data, it will occupy a large amount of memory when loading the node data. If you use the SlottedPage method, you cannot create a larger node object due to the limitation of the page size.

本实施例中，对长度比较大的结点，可以采用前述的inode/freelist机制进行存储。具体地，在inode表中为要存储的大结点设置一个inode，利用块号数组记录该结点数据的存储位置，并在结点数据区中记录结点对应的inode号。In this embodiment, the node with a relatively large length can be stored by using the aforementioned inode/freelist mechanism. Specifically, an inode is set in the inode table for the large node to be stored, the storage location of the node data is recorded by the block number array, and the inode number corresponding to the node is recorded in the node data area.

对结点数据进行存储后，还需要在结点索引表中记录该结点与结点数据存储位置的对应关系。针对结点数据的不同存储方式，可以采用不同的方式表示结点数据的存储位置。After storing the node data, it is also necessary to record the correspondence between the node and the storage location of the node data in the node index table. For the different storage methods of the node data, the storage location of the node data can be expressed in different ways.

当结点数据采用tlv方式存储时，可以采用该结点数据在结点数据区中的起始偏移量表示结点数据的存储位置。When the node data is stored in the tlv mode, the starting offset of the node data in the node data area may be used to indicate the storage location of the node data.

当结点数据采用SlottedPage方式存储时，可以采用结点所在的页面地址和记录该结点页内偏移量数组元素的索引表示结点数据的存储位置。这样，当结点数据在页面内移动时，虽然其页内偏移量可能变换，但偏移量数组元素的索引不变，于是在结点索引表中表示的结点数据存储位置也不会发生变化。当检索该结点数据时，结合结点索引表中表示的结点数据存储位置和该页面中偏移量数组元素的值即可以准确定位结点数据的位置。When the node data is stored in the SlottedPage mode, the storage location of the node data may be represented by the page address where the node is located and the index of the offset array element in the node. Thus, when the node data moves within the page, although the offset within the page may change, the index of the offset array element does not change, so the node data storage location indicated in the node index table does not A change has occurred. When the node data is retrieved, the location of the node data can be accurately located by combining the node data storage location indicated in the node index table and the value of the offset array element in the page.

当结点数据采用inode/freelist方式存储时，可以将该结点对应的inode号作为结点数据的存储位置记录在结点索引表中。When the node data is stored in the inode/freelist mode, the inode number corresponding to the node may be recorded in the node index table as the storage location of the node data.

记录结点ID与该结点数据存储位置的对应关系时，可以采用索引的方式记录。即，将结点的ID映射到该结点的数据在结点数据区的存储位置。结点索引的实现方法可以有以下几种：When the correspondence between the node ID and the storage location of the node data is recorded, the index may be recorded. That is, the ID of the node is mapped to the storage location of the data of the node in the node data area. The implementation of the node index can be as follows:

1) 哈希索引1) Hash index

将结点ID作为键值，建立哈希表，哈希表中存储结点数据在结点数据区中的存储位置。The node ID is used as a key value to establish a hash table, and the hash table stores the storage location of the node data in the node data area.

2) B树（或B+树）索引2) B-tree (or B+ tree) index

将结点ID作为键值，建立B树（或B+树）。Create a B-tree (or B+ tree) with the node ID as the key.

3) 线性表3) Linear table

如果结点数目较少，也可以使用线性表来存储结点ID和对应的存储位置。If the number of nodes is small, a linear table can also be used to store the node ID and the corresponding storage location.

在结点数目比较多的情况下，建议使用哈希索引或B树（B+树）索引，在这两种索引的公知实现方法中，都有针对分页存储方式进行实现的算法，因此，可通过控制内存中缓存的页数目，来控制索引表的内存占用量，既可以快速的将结点ID映射到结点的存储位置，同时又不占用大量的内存。In the case of a large number of nodes, it is recommended to use a hash index or a B-tree (B+ tree) index. In the well-known implementation methods of the two indexes, there are algorithms for implementing the paging storage mode, and therefore, Controlling the number of pages cached in memory to control the memory footprint of the index table, can quickly map the node ID to the storage location of the node without taking up a lot of memory.

总之，对于任意结点，需要将结点数据在结点数据区进行存储，再根据该结点数据在结点数据区的存储位置，对应填写结点索引表。In short, for any node, the node data needs to be stored in the node data area, and then the node index table is filled according to the storage location of the node data in the node data area.

步骤206，在文件头中存储访问文档的标识信息和入口信息。Step 206: Store identification information and entry information of the accessed document in the file header.

文件头的主要目的在于对文件进行描述，提供一些元数据以及访问文件内容的入口信息。在本发明一实施例中，并不需要在文件头中存储访问文档的标识信息，标识信息可以用文件后缀来代替。The main purpose of the header is to describe the file, providing some metadata and entry information for accessing the contents of the file. In an embodiment of the present invention, it is not necessary to store the identification information of the access document in the file header, and the identification information may be replaced by a file suffix.

本发明形成的SurXml文档的文件头需要提供的入口信息包括：结点索引表和结点数据区在文件中的存储位置和长度；一个或多个根结点（SurXml文档的逻辑结构是树型的，因此结点可能组成一棵树，也可能是多棵树）的ID。其中，结点索引表和结点数据区在文件中的存储位置为各自对应的inode号。The entry information required by the file header of the SurXml document formed by the present invention includes: a node index table and a storage location and length of the node data area in the file; one or more root nodes (the logical structure of the SurXml document is a tree type) Therefore, the node may form the ID of a tree or a tree. The storage location of the node index table and the node data area in the file is the corresponding inode number.

SurXml的文件头需要提供的描述信息可以是比较随意的，但必须包括一个或多个具备唯一性的标识，使得需要访问文件内容的应用程序能够将文档识别为SurXml。The description information that SurXml's file header needs to provide can be quite arbitrary, but must include one or more unique identifiers so that applications that need access to the file content can identify the document as SurXml.

对于SurXml文件头的结构，此处给出一些可能的方案：For the structure of the SurXml file header, here are some possible scenarios:

1) 使用文本格式，借用ini文件格式，x=y的形式，对二进制数据（偏移量、长度、根结点ID等）转换为文本数据。1) Binary data (offset, length, root node ID, etc.) is converted to text data using a text format, borrowing the ini file format, x=y.

2) 使用文本格式，借用xml文件格式，形如<x>y</x>，对二进制数据的处理同1。2) Use text format, borrow xml file format, like <x>y</x>, treat binary data as 1.

3) 使用二进制格式，按照硬编码的偏移量和长度保存各种元数据和入口信息。3) Use the binary format to store various metadata and entry information in a hard-coded offset and length.

4) 使用二进制格式，将元数据和入口信息的名称和值一同存储，在文件头中顺序存储。4) Store the metadata and the name and value of the entry information in binary format, and store them sequentially in the file header.

5) 其它的文本或二进制格式。5) Other text or binary formats.

下面给出一个文件头的示例，使用了xml形式：An example of a file header is given below, using the xml form:

<License>Unlimited</License><License>Unlimited</License>

<Description>example document</Description> <Description>example Document</Description>

</DataEntry> </DataEntry>

</SurSenXml></SurSenXml>

对于如何在文件中定位文件头数据的起始偏移量和长度，可使用下述几种设计：The following options are available for how to locate the starting offset and length of the header data in a file:

1) 文件头数据的起始偏移量可以是固定的，也可以记录到文件中某个固定的偏移量位置;1) The starting offset of the file header data can be fixed or recorded to a fixed offset position in the file;

2) 文件头数据的长度可以是固定的，也可以记录到文件中某个固定的偏移量位置;2) The length of the header data can be fixed or recorded to a fixed offset position in the file;

3) 设计1）和设计2）的某种组合方式;3) some combination of design 1) and design 2);

4) 对于设计1）和设计2）中在固定的偏移量位置记录文件头数据的起始偏移量和长度的做法，还可以进一步深化，在某个固定的偏移量位置o1记录o2的偏移量，在o2偏移量处记录o3……，在on处记录文件头数据的偏移量和长度。4) For the design of 1) and design 2), the starting offset and length of the file header data are recorded at a fixed offset position, and can be further deepened, and the offset of o2 is recorded at a fixed offset position o1. Shift, record o3... at the o2 offset, and record the offset and length of the file header data at on.

至此，就完成了将一个文档存储在文件中的过程。At this point, the process of storing a document in a file is completed.

在上述存储过程中，步骤206是在步骤205之后执行的，事实上，步骤206也可以与步骤205同时执行，或在步骤205之前执行。In the above stored procedure, step 206 is performed after step 205. In fact, step 206 may also be performed concurrently with step 205 or before step 205.

对于依照上述方式存储后的SurXml文档，当应用程序访问该文档时，首先要确定该SurXml文档类型，然后确定要访问的内容对应的结点ID，通过在结点索引表中查找结点ID对应的表项，确定该结点数据的存储位置，继而可以访问该结点数据。可见，当访问文档时，不需要对整篇文档进行解析处理，可以直接访问结点数据，访问速度快，处理方便，极大地提高了访问性能。For the SurXml document stored in the above manner, when the application accesses the document, first determine the SurXml document type, and then determine the node ID corresponding to the content to be accessed, by searching for the node ID in the node index table. The entry of the node determines the storage location of the node data, and then accesses the node data. It can be seen that when accessing the document, the entire document does not need to be parsed, and the node data can be directly accessed, the access speed is fast, the processing is convenient, and the access performance is greatly improved.

依照上述方式存储好的SurXml文档，可能其内容占用的空间并未完全占据为SurXml文档分配的存储空间，也即为文档分配的存储空间有空闲的区域。本发明的存储方法可以对文档存储空间中的空闲区域以及文档内部结点数据区中的空闲区域进行有效的管理。The SurXml document stored in the above manner may not occupy the storage space allocated for the SurXml document, that is, the storage space allocated for the document has a free area. The storage method of the present invention can effectively manage the free area in the document storage space and the free area in the document internal node data area.

为文档分配的存储空间中空闲空间的管理是通过空闲块表来完成的。空闲块表是分部分存储的，每部分包含一个或多个连续的块，其中记录了存储空间中未分配出去的空闲块的块号。在空闲块表每个部分的第一个空闲块号项均记录了空闲块表下一部分的起始块号，空闲块表最后一个部分的第一个空闲块号项为0，表示本部分是空闲块表的最后一部分。如图4所示为一个空闲块表的示意图。在该图中，401为表1所示的空闲块表，它是整个空闲块表的第一部分，其第一项记录了空闲块表下一部分的起始块号a，据此可以找到空闲块表的下一部分402，依次类推，直到405，在405中第一项的值为0，表明这是空闲块表的最后一部分。The management of free space in the storage space allocated for the document is done through the free block table. The free block table is stored in portions, and each portion contains one or more consecutive blocks in which the block number of the free block that is not allocated in the storage space is recorded. The first free block number entry in each part of the free block table records the starting block number of the lower part of the free block table, and the first free block number entry of the last part of the free block table is 0, indicating that this part is The last part of the free block table. Figure 4 shows a schematic diagram of a free block table. In the figure, 401 is the free block table shown in Table 1, which is the first part of the entire free block table, and the first item records the starting block number a of the lower part of the free block table, from which the free block can be found. The next portion 402 of the table, and so on, up to 405, the value of the first entry in 405 is 0, indicating that this is the last portion of the free block table.

当分配或回收空闲块时，都从空闲块表的尾部进行。首先分配或回收空闲块表中最后一部分指向的空闲块。When a free block is allocated or reclaimed, it is taken from the end of the free block table. First allocate or reclaim the free block pointed to by the last part of the free block table.

为SurXml文档所占用的存储空间是可以扩展和收缩的。存储空间的扩展是比较简单的，只需要将扩充的空闲空间纳入到空闲块表的管理即可，实际上就是将新的空闲块号追加到空闲块表的末尾。The storage space occupied by the SurXml document can be expanded and contracted. The expansion of the storage space is relatively simple. It is only necessary to incorporate the expanded free space into the management of the free block table. In fact, the new free block number is appended to the end of the free block table.

SurXml文档占用存储空间的收缩要复杂一些，图5即为收缩存储空间的流程图。如图5所示，该流程包括：The shrinkage of the storage space occupied by the SurXml document is more complicated. Figure 5 is a flow chart for shrinking the storage space. As shown in Figure 5, the process includes:

步骤501，确定存储空间的目标收缩量和原存储空间中的空闲块数量。Step 501, determining a target shrinkage amount of the storage space and a number of free blocks in the original storage space.

本步骤中，确定原存储空间中的空闲块数量时，除超级块中的那部分空闲块表之外，空闲块表的其余部分占用的数据块也要计入空闲空间。In this step, when determining the number of free blocks in the original storage space, in addition to the portion of the free block in the super block, the data blocks occupied by the rest of the free block table are also counted in the free space.

步骤502，判断存储空间的收缩是否可能，若是，则执行步骤503及其后续步骤，否则结束本流程。In step 502, it is determined whether the shrinkage of the storage space is possible. If yes, step 503 and subsequent steps are performed, otherwise the flow is ended.

本步骤中，判断存储空间的收缩是否可能的方式为：如果存储空间的目标收缩量大于空闲空间的长度，则判定收缩不可能完成，否则判定收缩可以进行。In this step, it is determined whether the shrinkage of the storage space is possible: if the target shrinkage amount of the storage space is greater than the length of the free space, it is determined that the shrinkage is impossible to complete, otherwise the shrinkage can be determined.

步骤503，判断SurXml文档占用存储空间尾部的收缩区域中，是否有非空闲的数据块，若是，则执行步骤504及其后续步骤，否则执行步骤505及其后续步骤。Step 503: Determine whether the SurXml document occupies a non-idle data block in the contraction area of the tail of the storage space. If yes, perform step 504 and subsequent steps, otherwise perform step 505 and subsequent steps.

步骤504，利用SurXml文档占用存储空间的非收缩区域的空闲数据块与收缩区域的非空闲数据块进行置换，同时，更新对应的inode和空闲块表。In step 504, the free data block of the non-shrinking area occupied by the SurXml document is replaced with the non-free data block of the shrinking area, and the corresponding inode and free block table are updated.

步骤505，判断保存空闲块表的数据块是否位于SurXml文档占用存储空间尾部的收缩区域中，若是，则执行步骤506及其后续步骤，否则执行步骤507。Step 505: Determine whether the data block storing the free block table is located in the contraction area of the end of the storage space occupied by the SurXml document, and if yes, perform step 506 and subsequent steps, otherwise step 507 is performed.

步骤506，将位于收缩区域内的空闲块表中描述非收缩区域空闲块的块号项转移到非收缩区域的空闲块表中。Step 506: Transfer the block number item describing the non-shrink area free block in the free block table located in the contraction area to the free block table of the non-shrink area.

步骤507，计算出空闲块表新的表尾，并按指定长度截断SurXml文档占用的存储空间。 In step 507, a new footer of the free block table is calculated, and the storage space occupied by the SurXml document is truncated according to the specified length.

在本发明一个实施例中，可以是将所有空闲块全部收缩。In one embodiment of the invention, all of the free blocks may be shrunk.

上述是对于SurXml文档占用的存储空间的管理。在为结点数据区分配的存储区域中，其空闲空间的组织和管理是根据结点数据的存储方式，在结点数据区占用的存储区域内部进行的。The above is the management of the storage space occupied by the SurXml document. In the storage area allocated for the node data area, the organization and management of the free space is performed inside the storage area occupied by the node data area according to the storage mode of the node data.

在tlv存储方式下，在结点数据区末尾处，记录空闲区域开始处在数据区中的偏移量，以方便管理空闲区域。In the tlv storage mode, at the end of the node data area, the offset of the free area starting in the data area is recorded to facilitate management of the free area.

在SlottedPage存储方式下，结点数据区的各个页内部的空闲空间是自行管理的；如果需要为新结点搜索空闲空间时，可以逐页搜索；要提高搜索的效率，亦可另外建立一个对结点数据区中各个页中空闲空间大小进行索引的空闲页索引，可以使用的索引方法包括：直接记录页面的编号和空闲空间的大小、以空闲空间大小为键值对页面编号排序、B树/B+树（以空闲空间大小为键），等等。结点数据区的空闲页索引存储的分配、组织、回收，可以使用inode/freelist机制，将结点数据区的空闲页索引作为对应某inode的文件处理。In the SlottedPage storage mode, the free space inside each page of the node data area is managed by itself; if it is necessary to search for free space for the new node, it can search page by page; to improve the efficiency of the search, another pair can be established. The index of the free page indexed by the size of the free space in each page of the node data area, the index method that can be used includes: directly recording the number of the page and the size of the free space, sorting the page number with the size of the free space as the key value, B-tree /B+ tree (with free space size as key), and so on. The allocation, organization, and collection of the free page index storage in the node data area may use the inode/freelist mechanism to process the free page index of the node data area as a file corresponding to an inode.

应用上述方法可以方便地管理文档中的空闲区域。当结点数据区中出现较大的空闲区域后，可以根据策略将该空闲区域标记为SurXml文档的空闲数据块，即将该空闲区域从结点数据区中删除。当SurXml文档需要空间收缩操作时，依照前述方法进行压缩。这种空闲区域的管理方法，使文档空间的收放更加方便，有效提高了文档的存储效率。Applying the above method makes it easy to manage free areas in the document. When a large free area appears in the node data area, the free area may be marked as a free data block of the SurXml document according to the policy, that is, the free area is deleted from the node data area. When the SurXml document requires a space contraction operation, compression is performed as described above. The management method of the free area makes the collection and release of the document space more convenient, and effectively improves the storage efficiency of the document.

按照图2所示的方式存储SurXml文档的过程中，涉及到的对结点的操作包括：创建结点、删除结点、添加子结点和修改结点的属性。In the process of storing the SurXml document in the manner shown in Figure 2, the operations involved in the node include: creating a node, deleting a node, adding a child node, and modifying the attributes of the node.

（一）在存储文档时，对于尚未存在的结点，要存储该结点数据均涉及到创建结点的操作，例如在已建立的文档中新添加一个表格。该操作具体包括以下步骤：(One) When storing a document, for nodes that do not yet exist, storing the node data involves the operation of creating a node, such as adding a new table to an established document. The operation specifically includes the following steps:

a、确定结点数据，具体来说，可以是确定结点类型标记和结点属性数据。a. Determine the node data, specifically, the node type tag and the node attribute data.

b、在结点数据区中查找空闲区域，为结点数据分配一定量的空间。具体来说，可以是根据结点属性数据为结点数据分配一定量的空间。 b. Find the free area in the node data area and allocate a certain amount of space for the node data. Specifically, it may be that a certain amount of space is allocated to the node data according to the node attribute data.

在查找空闲区域时，根据结点数据区中结点数据的不同存储方式，采用前述的结点数据区中空闲空间的组织管理方法进行。When searching for a free area, according to different storage manners of the node data in the node data area, the organization management method of the free space in the foregoing node data area is used.

c、在结点索引表中增加一项，分配新的结点ID，并指向分配的结点数据在结点数据区中的存储位置。c. Add an item in the node index table, assign a new node ID, and point to the storage location of the allocated node data in the node data area.

d、将输入的结点数据记录到为该结点分配的在结点数据区的存储位置中。具体来说，将输入的结点类型标记和结点属性数据记录到为该结点分配的在结点数据区的存储位置中。d. Record the input node data into the storage location of the node data area allocated for the node. Specifically, the input node type flag and the node attribute data are recorded into the storage location of the node data area allocated for the node.

（二）在已存在的SurXml文档中，当文档内容进行删减时会涉及到删除结点的操作，例如，将文档中的某表格删除。该操作具体包括以下步骤：(2) In the existing SurXml document, when the content of the document is deleted, the operation of deleting the node is involved, for example, deleting a form in the document. The operation specifically includes the following steps:

a、确定删除结点的ID。a. Determine the ID of the deleted node.

b、在结点索引表中查找结点ID，得到结点数据在结点数据区中的存储位置。b. Find the node ID in the node index table to obtain the storage location of the node data in the node data area.

c、将结点数据对应存储位置的空间标记为空闲。c. Mark the space corresponding to the storage location of the node data as idle.

d、从结点索引表中删除该结点ID对应的索引项。d. The index entry corresponding to the node ID is deleted from the node index table.

（三）在已存在的SurXml文档中，将某个新创建的结点与树型结构中的其它结点建立连接关系时，会涉及到添加子结点的操作，例如，将新添加的表格与其所在的页面建立连接关系。该操作具体包括为更新相关结点的相关结点索引信息。具体包括以下步骤：(3) In the existing SurXml document, when a newly created node is connected with other nodes in the tree structure, the operation of adding the child node is involved, for example, the newly added table Establish a connection with the page on which it is located. The operation specifically includes updating the relevant node index information of the relevant node. Specifically, the following steps are included:

a、确定父结点ID，子结点ID和子结点位置。a. Determine the parent node ID, child node ID, and child node location.

其中，子节点位置是指子结点在父结点的所有子结点中的位置。Where the child node position refers to the position of the child node in all child nodes of the parent node.

b、在结点索引表中查找父子结点的ID，得到父结点数据的存储位置。b. Find the ID of the parent and child nodes in the node index table, and obtain the storage location of the parent node data.

c、在结点数据区中，更新父结点的子结点ID列表，更新子结点的父结点ID值。c. In the node data area, update the child node ID list of the parent node, and update the parent node ID value of the child node.

在本发明一实施例中，当父结点只记录子结点数和第一个子结点的ID时，更新相关结点的相关结点索引信息还可以是：利用子结点的左右兄弟结点表遍历。In an embodiment of the present invention, when the parent node records only the number of child nodes and the ID of the first child node, updating the related node index information of the relevant node may also be: using the left and right brother nodes of the child node Point table traversal.

（四）删除子结点的操作是与添加子结点相对应的，即取消结点间的连接关系，例如，将某表格从当前所在的页面移走。该操作具体包括为更新相关结点的相关结点索引信息。具体包括以下步骤：(4) The operation of deleting the child node corresponds to adding the child node, that is, canceling the connection relationship between the nodes, for example, moving a table from the current page. The operation specifically includes updating the relevant node index information of the relevant node. Specifically, the following steps are included:

a、确定父结点ID和要删除的子结点ID。a. Determine the parent node ID and the child node ID to be deleted.

b、在结点索引表中查找父子结点的ID，得到该对父子结点数据的存储位置。b. Find the ID of the parent-child node in the node index table, and obtain the storage location of the pair of parent-child node data.

c、在结点数据区中，从父结点的子结点ID列表中删除输入的子结点ID，更新子结点的父结点ID值，具体来说可以从子结点中将父结点ID值置为空，也可以是将父结点ID值置为原祖父结点的ID。c. In the node data area, delete the input child node ID from the child node ID list of the parent node, and update the parent node ID value of the child node, specifically, the parent node may be the parent node. The node ID value is set to null, or it may be the ID of the parent node ID value as the original grandparent node.

（五）当对SurXml文档进行修改后，可能会涉及到修改结点属性的操作，例如，将表格中的某一些进行修改。该操作具体包括以下步骤：(5) When the SurXml document is modified, it may involve the operation of modifying the node attributes, for example, modifying some of the tables. The operation specifically includes the following steps:

a、确定结点数据，具体来说确定结点ID、结点属性名称和结点属性值。a. Determine the node data, specifically determine the node ID, node attribute name, and node attribute value.

b、在结点索引表中查找结点ID，得到结点数据的存储位置。b. Find the node ID in the node index table to obtain the storage location of the node data.

c、在结点数据区中找到结点数据，更新结点数据。具体来说，根据结点属性名称，对对应的结点属性值进行更新。c. Find the node data in the node data area and update the node data. Specifically, the corresponding node attribute value is updated according to the node attribute name.

d、如果空间不足，则分配新的空闲空间，并将结点数据复制到新的空间，将原有的结点数据所在数据块标记为空闲，同时更新结点索引表，将对应的表项指向新的存储位置。d. If the space is insufficient, allocate a new free space, copy the node data to the new space, mark the data block where the original node data is located as idle, and update the node index table to replace the corresponding entry. Point to the new storage location.

应用上述方式存储后形成的SurXml文档，可能有时需要进行加密、无损压缩等操作。这些均可以通过对该文档的子树进行相应的可逆变换来实现。本发明实施例中对SurXml文档进行指定的可逆变换操作具体包括以下步骤：Applying the SurXml document formed after the above method is stored may sometimes require operations such as encryption and lossless compression. These can all be achieved by performing a corresponding reversible transformation on the subtree of the document. The reversible transformation operation for specifying the SurXml document in the embodiment of the present invention specifically includes the following steps:

a、确定进行变化的内容对应的子树根结点ID和变换函数。比如，确定子树根结点ID和加密函数。由于进行加密可能是整篇文档，也可能是文档中的某部分，因此本步骤中，需要确定进行加密部分在整个文档中的位置。具体地，即确定加密部分对应的子树根节点。a. Determine the subtree root node ID and the transformation function corresponding to the content to be changed. For example, determine the subtree root node ID and encryption function. Since the encryption may be the entire document or a part of the document, in this step, it is necessary to determine the location of the encrypted portion in the entire document. Specifically, the subtree root node corresponding to the encrypted portion is determined.

b、根据确定的子树根结点ID，遍历子树。b. Traverse the subtree according to the determined subtree root node ID.

本步骤中，遍历可使用深度优先遍历或广度优先遍历。In this step, the traversal can use depth-first traversal or breadth-first traversal.

c、将遍历到的各个结点的ID和数据，记录到缓冲区中，并利用所述变换函数处理所述缓冲区中的数据。c. Record the ID and data of each node traversed into the buffer, and process the data in the buffer by using the transformation function.

比如，将遍历到的各个结点的ID和数据，记录到一个线性的缓冲区中，并利用确定的加密函数处理该缓冲区中的数据。本步骤中的操作即实际完成对指定的文档内容进行加密的过程。For example, the ID and data of each node traversed are recorded in a linear buffer, and the data in the buffer is processed by the determined encryption function. The operation in this step actually completes the process of encrypting the specified document content.

d、更新结点索引表和结点数据。d. Update the node index table and node data.

本步骤中，在结点索引表中，将子树根结点标记为变换树的根；将该子树内除子树根节点外的其它变换结点标记为变换树的内部结点，并将根结点的ID记录到所述其它变换结点索引中。为保证结点间的连接关系，所有变换后的结点ID均保持不变。In this step, in the node index table, the root node of the subtree is marked as the root of the transformation tree; the other transformation nodes in the subtree except the root node of the subtree are marked as internal nodes of the transformation tree, and The ID of the root node is recorded into the other transform node index. In order to ensure the connection relationship between nodes, all transformed node IDs remain unchanged.

在结点数据区中，将利用变换函数（如，加密函数）处理后的缓冲区中的数据，作为子树根结点的数据保存；同时，清空子树中除根结点外的其它结点的数据。In the node data area, the data in the buffer processed by the transform function (for example, the encryption function) is saved as the data of the root node of the subtree; at the same time, the other nodes in the subtree except the root node are emptied. The data.

至此，便完成了对指定文档内容进行变换（如加密）的操作。At this point, the operation of transforming (such as encrypting) the specified document content is completed.

与上述对文档进行可逆变换相对应地，还可以对变换后的文档进行逆变换，得到初始文档。具体操作包括以下步骤：Corresponding to the above-described reversible transformation of the document, the transformed document may also be inversely transformed to obtain an initial document. The specific operations include the following steps:

a、确定进行逆变换的内容对应的子树根结点ID和逆变换函数。a. Determine a subtree root node ID and an inverse transform function corresponding to the content of the inverse transform.

比如，确定子树根结点ID和解密函数。与加密过程对应地，本步骤中，需要确定进行解密的部分对应的子树根节点ID。For example, determine the subtree root node ID and decryption function. Corresponding to the encryption process, in this step, it is necessary to determine the subtree root node ID corresponding to the part to be decrypted.

b、根据子树根结点ID，在结点索引表查找子树根结点数据在结点数据区中的存储位置。b. According to the subtree root node ID, find the storage location of the subtree root node data in the node data area in the node index table.

c、使用逆变换函数处理子树根结点的数据。c. Use the inverse transform function to process the data of the root node of the subtree.

比如，使用解密函数，处理子树根结点的数据。由于加密后的整个子树结点的数据均作为子树根节点数据存储，因此只要对子树根节点数据进行解密，也即对整个子树结点的数据进行解密。For example, use the decryption function to process the data of the root node of the subtree. Since the encrypted data of the entire subtree node is stored as the subtree root node data, as long as the subtree root node data is decrypted, the data of the entire subtree node is decrypted.

d、将逆变换处理后的所述子树根节点的数据按照进行变换时的遍历次序，依次恢复各个结点的数据。d. The data of the root node of the subtree after the inverse transform processing is sequentially restored to the data of each node according to the traversal order when the transform is performed.

比如，从解密函数处理得到结果中，按加密过程的遍历次序，依次恢复各个结点的数据。For example, from the result of the decryption function processing, the data of each node is sequentially restored according to the traversal order of the encryption process.

e、将结点索引表中的子树根结点和该子树各个内部结点都标记为普通结点。e. Mark the subtree root node in the node index table and each internal node of the subtree as ordinary nodes.

至此，逆变换操作结束。At this point, the inverse transform operation ends.

对于其它诸如压缩等可逆的变换和逆变换的过程与上述实施方式相同，只是将相应的加密、解密函数替换为压缩、解压缩等函数即可，这里就不再赘述。The process of reversible transformation and inverse transformation such as compression is the same as that of the above embodiment, except that the corresponding encryption and decryption functions are replaced by functions such as compression and decompression, and will not be described here.

依照上述方法不仅可以对整篇文档进行可逆变换，还可以对其中存储后的任何子树进行变换，从而实现对文档的部分内容进行加密、无损压缩等操作。According to the above method, not only the entire document can be reversibly transformed, but also any subtree stored therein can be transformed, thereby realizing encryption, lossless compression and the like on part of the document.

XML文档是一种常见的文档格式，依照本发明的存储方法形成的SurXml文档还可以与XML格式文档进行相互转换。An XML document is a common document format, and a SurXml document formed in accordance with the storage method of the present invention can also be converted to and from an XML format document.

具体地，将XML文档转换为SurXml文档的具体过程包括：Specifically, the specific process of converting an XML document into a SurXml document includes:

a、得到XML文档的Schema或DTD，根据XML文档的Schema或DTD定义SurXml文档的文档类型。a. Get the schema or DTD of the XML document, and define the document type of the SurXml document according to the schema or DTD of the XML document.

b、解析XML文档，将各个元素分别转换为SurXml文档中的结点，将其属性设置在相应的结点中；并根据XML文档中各个元素的父子关系，设置SurXml文档中各个结点的父子关系。b. Parse the XML document, convert each element into a node in the SurXml document, and set its properties in the corresponding node; and set the parent and child of each node in the SurXml document according to the parent-child relationship of each element in the XML document. relationship.

c、根据步骤a和b的结果生成SurXml文档。c. Generate a SurXml document based on the results of steps a and b.

将SurXml文档转换为XML文档的具体过程包括：The specific process of converting a SurXml document to an XML document includes:

a、得到SurXml的文档类型定义，如果是Schema或DTD或Relax NG，则无需进一步处理；如果是自定义类型表，则需要将自定义的类型表转换为Schema或DTD或Relax NG。a, get the document type definition of SurXml, if it is Schema or DTD or Relax NG, no further processing is required; if it is a custom type table, you need to convert the custom type table to Schema or DTD or Relax NG.

b、遍历SurXml文档中的树型结构，根据各个结点在文档类型定义中对应的定义，将结点转换为XML元素并设置对应的属性。b. Traverse the tree structure in the SurXml document, convert the node into an XML element and set the corresponding attribute according to the corresponding definition of each node in the document type definition.

本步骤中的过程是递归的，子结点将转换为XML文档中的子元素。The process in this step is recursive, and the child nodes are converted to child elements in the XML document.

c、根据步骤a和b的结果生成XML文档。c. Generate an XML document based on the results of steps a and b.

Industrial applicability

由上述技术方案可见，在本发明中，预先配置代表不同树型结构的文档类型，定义文档存储格式包括结点索引表和结点数据区。利用上述带结点索引的文档存储格式，可以实现结点的快速搜索。在存储某文档时，首先根据该文档的文档类型，将文档内容映射为对应树型结构中的各个结点及其数据。通过此种方式，使文档由结点构成，各个结点间通过树型结构组织起来。然后，根据文档长度，在存储介质中为文档分配一定的存储空间，在该存储空间中进一步为结点索引表和结点数据区分别分配各自的存储区域。最后，将文档对应的各个结点数据存储到结点数据区，并在结点索引表中记录每个结点与该结点数据存储位置的对应关系，在文件头中存储访问文档的入口信息。通过上述方式，就可以将一个文档按照树型结构在存储介质中进行存储，同时采用带结点索引的存储格式。对于此类采用树型结构和结点索引方式存储后的文档，当用户对文档的部分内容进行访问时，可以直接对该内容对应的结点进行访问，而不需要预先对整个文档进行处理，节省了系统资源，提高了访问性能；另外，本发明的方法依然保留文档类型的配置，继承了XML文档的优势。利用本发明的文档存储方法可以支持复杂树形结构数据的存储、存储空间小、扩展性好、可增量修改、易于提高存储安全性。As can be seen from the above technical solution, in the present invention, document types representing different tree structures are pre-configured, and the document storage format is defined to include a node index table and a node data area. With the above document storage format with node index, a fast search of nodes can be achieved. When storing a document, the document content is first mapped to the corresponding nodes and their data in the corresponding tree structure according to the document type of the document. In this way, the document is made up of nodes, and the nodes are organized by a tree structure. Then, according to the length of the document, a certain storage space is allocated to the document in the storage medium, and the storage area is further allocated to the node index table and the node data area in the storage space. Finally, each node data corresponding to the document is stored in the node data area, and the correspondence relationship between each node and the storage location of the node data is recorded in the node index table, and the entry information of the access document is stored in the file header. . In the above manner, a document can be stored in a storage medium in a tree structure, and a storage format with a node index is used. For such a document stored in a tree structure and a node index manner, when the user accesses part of the content of the document, the node corresponding to the content can be directly accessed without the need to process the entire document in advance. The system resources are saved and the access performance is improved; in addition, the method of the present invention still retains the configuration of the document type and inherits the advantages of the XML document. The document storage method of the invention can support the storage of complex tree structure data, small storage space, good expansibility, incremental modification, and easy to improve storage security.

更进一步地，本发明对结点索引表和结点数据区分配存储区域时，将结点索引表和结点数据区分别当作两个文件看待，采用与UNIX的inode/freelist近似的机制进行这两个存储区域的分配、组织和回收，使这两个存储区域易于收缩和扩张，从而简化了增加结点及增长结点数据时的操作。Further, when the present invention allocates a storage area to the node index table and the node data area, the node index table and the node data area are treated as two files respectively, and the mechanism similar to UNIX inode/freelist is used. The allocation, organization, and recycling of these two storage areas make the two storage areas easy to shrink and expand, simplifying operations when adding nodes and growing node data.

以上仅为本发明的较佳实施例而已，并非用于限定本发明的保护范围。凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above are only the preferred embodiments of the present invention and are not intended to limit the scope of the present invention. Any modifications, equivalent substitutions, improvements, etc. made within the spirit and scope of the present invention are intended to be included within the scope of the present invention.

Sequence table free content

Claims

A method for storing a document, comprising:

Maps the content of the document to a tree structure according to the specified document type; wherein the type information of each node is defined in the document type;

Storing the contents of the nodes constituting the tree structure and the relationship information between the nodes;

Recording a storage location of the node data in the node index table; wherein the node data includes a node content;

Store the node index table.

According to claim 1 The method is characterized in that the type information of the node includes a node type tag, a node type name, information of all legal attributes of the node type, and a node type of all legal child nodes of the node type.

The method according to claim 2, wherein the attribute information comprises an attribute tag, an attribute name, and an attribute type.

The method of claim 1, wherein the node content comprises a node type tag, a node length, a name/tag and a value of a node attribute; and Or, the node data further includes relationship information between the node and other nodes.

The method according to claim 4, wherein the relationship information between the node and other nodes comprises: a leftmost child node ID of the node And the number of child nodes of the node.

The method according to claim 5, wherein the related node index information further comprises: a parent node identifier ID of the node, and Or, the left and right brother node IDs, and / or, all child node IDs of the node.

The method of claim 1 wherein tlv (type tag, length, value) mode, slottedpage mode, and Any one or combination of the inode/freelist modes stores the content of the nodes and the relationship information between the nodes

The method of claim 7 wherein said tlv The method is: sequentially storing the node type mark, the node length, and the attribute name of each node in a predetermined storage space / a tag and a value; wherein the storage location of each node data is: a starting offset of each node data in the predetermined storage space;

The slottedpage The method is: dividing a predetermined storage space into a plurality of pages, storing data of each node on a specific page, and storing an offset array in the page, indicating that each node in the page is within the page a starting offset; the storage location of the node data is: a page address where the node data is located and an element index corresponding to the node data in the offset array;

The inode/freelist mode is: set an inode for each node And storing the block number of the data block occupied by the node; storing the node data in the data block, and storing the inode number corresponding to the node in the inode table; wherein the storage location of the node data is: Node corresponding Inode number.

According to claim 8 The method is characterized in that content of each node and relationship information between nodes are stored in a node data area, the method further comprising: managing a free space in the node data area.

The method according to claim 9, wherein when the content of each node and the relationship between the nodes are in tlv When the mode is stored in the node data area, the managing the free space in the node data area is:

At the end of the node data area, record the offset of the free area starting in the data area;

When the content of each node and the relationship between the nodes are slottedpage When the mode is stored in the node data area, the managing the free space in the node data area is:

Establish a free page index for indexing the size of the free space in each page in the node data area;

When the content of each node and the relationship information between the nodes are inode/freelist When the mode is stored in the node data area, the managing the free space in the node data area is:

The free block in the node data area is managed by the free block table.

According to claim 1 The method is characterized in that the content of each node and the relationship information between the nodes are stored in the node data area; between the contents of the nodes constituting the tree structure and the nodes Before the relationship information, further includes: The respective storage areas are allocated for the node index table and the node data area.

According to claim 1 The method of the present invention, further comprising: allocating a storage area for the file header; wherein the file header stores the identification information and the entry information of the accessed document.

The method of any of claims 1 further comprising expanding and shrinking the storage space occupied by the document.

The method of claim 13 wherein the inode/freelist mechanism is used to shrink the storage space occupied by the document; The inode/freelist mode is: set an inode for each node to record the data block block number occupied by the node; store the node data in the data block, in the inode The inode number corresponding to the storage node in the table.

The method according to claim 14, wherein the storage space occupied by the extended document comprises: filling a block number of the extended free block into the end of the free block table.

The method according to claim 14, wherein the storage space of the contracted document comprises:

Determining whether there is a non-idle data block in the contracted area at the end of the current storage space of the document, and if so, replacing the free data block of the non-shrinking area of the current storage space with the non-idle data block of the contracted area, and updating the corresponding Inode and free block table;

Determining whether the data block storing the free block table is located in the contraction area of the current storage space, and if so, transferring the block number item describing the non-shrink area free block in the free block table located in the contraction area to the free block table of the non-shrink area in;

Calculate the new footer of the free block table and truncate the current storage space.

According to the claims The method of claim 16, further comprising: before determining whether there is a non-idle data block in the contraction area of the tail of the current storage space of the document, further comprising:

It is possible to determine the shrinkage of the storage space based on the determined target shrinkage amount and the number of free blocks in the current storage space.

According to claim 17 The method is characterized in that the determining the shrinkage of the storage space is possible: determining whether the target shrinkage amount is greater than a free space of a current storage space, and if so, determining that shrinkage of the storage space is possible.

The method according to any one of claims 1 to 18, characterized in that the method further comprises: adding, deleting, and/or the node data Or, modify it.

The method of claim 19, wherein the adding the node data comprises:

Determining node data;

Find free areas and allocate space for node data;

Add an entry in the node index table, assign a new node ID, and point to the storage location allocated for the node data;

The entered node data is recorded to the storage location assigned to the node.

The method of claim 19, wherein deleting the node data comprises:

Determine the ID of the deleted node;

Find the node ID in the node index table to obtain the storage location of the node data.

Marking the space corresponding to the storage location of the node data as idle;

The index entry corresponding to the node ID is deleted from the node index table.

According to claim 20 or 21 The method is characterized in that, when the added node data has a connection relationship with other nodes, or when the deleted node data has a connection relationship with other nodes, the method further includes:

Update the relevant node index information of the relevant node.

The method of claim 19, wherein modifying the node data comprises:

Determining node data;

Update node data.

According to claim 23 The method is characterized in that, when the modification of the node data results in insufficient space, the method further includes:

Allocate new free space, copy the node data to the new space, mark the data block where the original node data is located as idle, update the node index table, and point the corresponding table entry to the new storage location.

A method according to any one of claims 1 to 18, further comprising: transforming and inversely changing the stored document.

The method of claim 25, wherein the transforming the stored document comprises:

Determining a subtree root node ID and a transformation function corresponding to the content to be transformed;

Traversing the subtree according to the subtree root node ID;

Recording the ID and data of each node traversed into the buffer, and processing the data in the buffer by using the transformation function;

In the node index table, mark the root node of the subtree as the root of the transformation tree, mark other nodes in the subtree except the root node of the subtree as internal nodes of the transformation tree, and Recording the subtree root node in other node indexes ID ;

The data in the buffer is saved as data of the root node of the subtree, and data of other nodes in the subtree except the root node of the subtree is cleared.

The method according to claim 25, wherein the inverse transforming the stored document comprises: determining a subtree root node ID corresponding to the content of the inverse transform And inverse transform function;

Finding a storage location of the subtree root node data in the node index table according to the subtree root node ID;

Processing the data of the root node of the subtree using the inverse transform function;

The data of the root node of the subtree after the inverse transform processing is sequentially restored to the data of each node according to the traversal order when the transform is performed;

In the node index table, the subtree root node and each node of the subtree are marked as ordinary nodes.

The method according to any one of claims 1 to 18, further comprising: converting the stored document into an extensible markup language XML document; and / or , convert the XML document to the stored document.

The method of claim 28, wherein said converting said stored document to an extensible markup language XML The documentation includes:

Traversing the tree structure of the stored document, converting the node to an XML element and setting the corresponding attribute according to the corresponding definition of each node in the document type, and converting the child node to Child elements in an XML document;

Generate an XML document based on the results of the above steps.

According to the claims The method of claim 29, wherein when the document type of the stored document is a type table, before the traversing the tree structure of the stored document, the method further comprises:

Convert the document type to Schema , DTD , or Relax NG .

The method of claim 28, wherein converting the XML document to the stored document comprises:

Schema, DTD, or Relax NG based on the XML document Defining document type information of the stored document;

Parsing the XML document, converting each element to a node in the stored document, setting its properties in the corresponding node; and according to XML The parent-child relationship of each element in the document sets the parent-child relationship of each node in the stored document.

The stored document is generated according to the results of the above two steps.