GB2488576A - Compression of XML documents - Google Patents
Compression of XML documents Download PDFInfo
- Publication number
- GB2488576A GB2488576A GB1103550.8A GB201103550A GB2488576A GB 2488576 A GB2488576 A GB 2488576A GB 201103550 A GB201103550 A GB 201103550A GB 2488576 A GB2488576 A GB 2488576A
- Authority
- GB
- United Kingdom
- Prior art keywords
- document
- grammar
- encoding
- structural information
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/3084—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
- H03M7/3097—Grammar codes
-
- G06F17/2247—
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
- G06F40/146—Coding or compression of tree-structured data
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
-
- H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03M—CODING; DECODING; CODE CONVERSION IN GENERAL
- H03M7/00—Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
- H03M7/30—Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
- H03M7/70—Type of the data to be coded, other than image and sound
- H03M7/707—Structured documents, e.g. XML
-
- H04L29/0604—
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/04—Protocols for data compression, e.g. ROHC
-
- H04L29/08792—
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/565—Conversion or adaptation of application format or content
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/565—Conversion or adaptation of application format or content
- H04L67/5651—Reducing the amount or size of exchanged application data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Document Processing Apparatus (AREA)
Abstract
Storage and transmission optimization of similar documents of the XML (eXtensible Markup Language) type by improving the compression. After having encoded (110) and provided (165) a reference document encoded according to an encoding method based on the use of grammars produced during the encoding step of the reference document, new grammars based on the grammars produced during the encoding step of the reference document are generated (130). The generated grammars are then used for encoding (145) a second document, similar to the first one. The encoded compressed second document is provided (165) along with a reference to the reference document. A corresponding decoding method is also disclosed.
Description
METHODS AND DEVICES FOR OPTIMIZING STORAGE AND
TRANSMISSION OF DOCUMENTS OF THE XML TYPE
The present invention concerns the storage and transmission of documents conforming to a markup language, in particular of documents of the XML type (XML being the acronym for "Extensible Markup Language"), and more precisely methods and devices for optimizing storage and transmission of such documents, by improving the compression ratio of documents having similar structure, particularly when there is no means for providing a schema specifying the structure of these documents.
The XML language is a syntax for defining computer languages.
Thus, XML makes it possible to create languages that are adapted for different uses but which may be processed by the same tools.
A document of the XML type (referred to below as an XML document) is composed of elements, each element starting with a start tag comprising the name of the element (for example: <tag>) and ending with an end tag which also comprises the name of the element (for example c/tag>).
Each element may contain other elements in hierarchical manner, constituting a hierarchical structure of data, or text data.
Furthermore, an element may be refined by attributes, each attribute being defined by a name and having a value. The attributes are placed in the start tag of the element they specify (for example <tag attribute="value">).
XML syntax also makes it possible to define comments (for example c!--Comment-->) and processing instructions, which may specify to a computer application the processing operations to apply to the XML document (for example "<?myprocessing?>"), as well as escape sections which make it possible to prevent a section of text from being interpreted as a tag when it has the form thereof (for example "<![CDATA[<text>Escapec/text>]]>" in which <text> is recognized as a string and not as a tag).
In XML terminology, the set of the terms "element" (identified by a start tag), "attribute", "text data", "comment", "processing instruction" and "escape section" defining the XML data are grouped together under the generic name of "item" that represents markup information.
Several different languages based on XML may contain elements with the same name. To be able to mix several different languages, an addition has been made to XML syntax making it possible to define "Namespaces". Two elements are identical only if they have the same name and are situated in the same namespace. A namespace is defined by a URI (acronym for "Uniform Resource Identifier"), for example "http://canon.crf.fr/xml/mylanguage". The use of a namespace in an XML document is via the definition of a prefix which is a shortcut to the URI of that namespace. This prefix is defined using a specific attribute (for example: the attribute xmlns:m1"http://canon.crf.fr/xml/mylanguage" associates the prefix "ml" with the URI "http://canon.crf.fr/xml/mylanguage"). Next, the namespace of an element or of an attribute is specified by preceding its name with the prefix associated with the namespace followed by ":" (for example "cml:tag ml:attribute="value">" indicates that the element tag belongs to the namespace ml and that the same applies for the attribute attribute).
An XML document must be read in memory to be processed. To that end, the whole document may be represented in the memory as a tree.
Accordingly, each part of the XML document can be easily accessed. The Application Protocol Interface (API) known as Document Object Model (DOM) uses such an XML document representation. However, one of its main drawbacks lies in the large amount of memory that is required to memorize the whole document.
Alternatively, an XML document may be represented as a sequence of events. The whole XML document is then processed one event at a time. The methods using such an XML document representation in memory allow XML documents to be read through streaming. They require buffering only one event at a time, thus reducing the maximum memory amount. However, a main drawback of this XML document representation is the inability to access quickly any specific part of the XML document. An example of an API using this XML document representation is the SAX API (SAX standing for "Simple API for XML").
Examples of XML events are start tag event, represented as SE hereafter, end tag event, represented as EE, attribute event, represented as AT, and text event, represented as OH.
The XML format has numerous advantages and has become a standard for storing data in a file or for exchanging data. XML makes it possible in particular to have numerous tools for processing the files generated.
Furthermore, an XML document may be manually edited with a simple text editor. Moreover, as an XML document contains its structure integrated with the data, such a document is very readable even without knowing the specification.
However, the main drawback with XML syntax is to be very verbose.
This may lead to XML documents having a size several times greater than the intrinsic size of the data. In turn, such large sizes of XML documents lead to long processing times when the XML documents are generated and when they are read.
To mitigate this drawback, methods for coding XML documents have been developed. The object of these methods is to code the content of the XML documents in a more compact form, enabling the XML documents to be easily reconstructed. All the resulting encoding formats are collectively named "Binary XML".
A simple method consists, for example, in encoding markup information using a binary format instead of a textual format. Moreover, redundancy in markup information can be eliminated or at least decreased. For example, it is not necessary to specify the name of the element in both the start-tag and the end-tag. Such mechanisms are used by all binary XML formats.
Another type of coding consists in using one or more indexing tables.
Indexes can be used to efficiently encode element or attribute names which are usually repeated in an XML document. In this way, upon the first occurrence of an element name, this name is encoded as a string and an index is associated with it. Further occurrences of this element name are then encoded using the index, instead of the full string. This allows reducing the size of the encoded document, but also speeding up the parsing of the document (when parsing a start-tag, the parser only needs to decode an index instead of parsing a string, and moreover determining which element is being decoded can be achieved by comparing integers instead of strings). For example, specifications such as Fast Infoset or Efficient XML Interchange (EXI) use indexing mechanisms.
Yet another mechanism is to use indexing tables to describe the structure of the elements in an XML document. For example, an indexing table can be used for all the elements sharing a given name. Upon the first occurrence of a child element inside the content of an element with that name, a new entry describing that child element is added in the indexing table. Further occurrences of a similar child element inside that element or another element with the same name can then be encoded using the index associated with this entry. Such a mechanism is used by the EXI specification.
EXI is a Binary XML format, currently being standardized by the W3C (W3C being the acronym for "World Wide Web Consortium"), which uses a set of grammars (which are local indexing tables) to encode the structure of an XML document. These grammars can be built either from an XML Schema describing the XML document or directly from the XML document while being processed.
Content in the XML document is encoded by default using indexed strings. If type information is available, a value can be encoded using a binary representation corresponding to its type. Type information can only be derived from an XML Schema.
To be able to code the items comprised in an XML document, the Efficient XM L Interchange specification divides each of the elements encountered in the document to code into elementary parts called events, for example a start tag or an attribute.
A grammar is composed of a set of productions, each production comprising an XML event description, an associated coding value and the indication of the following grammar to use (for coding the following event). To code an XML event using a grammar, the production containing the most precise description of the XML event is used. The coding value contained in that production is used to represent the event, and the information contained in the event and not described in the production is coded.
A grammar according to Efficient XML Interchange may evolve. In a certain number of cases, after the occurrence of an XML event already described by a production of the grammar (if it is not described by a production, it cannot be coded by the grammar), the grammar is modified to include a new production corresponding to that XML event. This production may either contain a more precise description of the event, reducing the number of pieces of information to code to represent the event, or have a more compact coding value.
The coding values, or "codes", are expressed in the form of "priorities" having, generally, between I and 3 levels. Coding a coding value amounts to coding the values of its priority. According to one coding mode known as "bit-packed", each level is coded over a minimum number of bits to be able to code the greatest value of that level associated with a production of the grammar. For example, for a level taking values from U to 6, the number of coding bits used is 3.
To code an XML document, a set of grammars is used. A few grammars serve for coding the actual structure of the XML document.
Furthermore, for each type of XML element present in the document (a type of XML element being a set of elements having the same name), the same set of grammars is used to code the XML elements of that type.
The grammars used may be generic grammars, common to all the XML documents and constructed on the basis of the XML syntax. They may also be specific to a type of document when they are constructed beforehand from a schema describing the document's structure. This makes it possible to obtain coding (or decoding) that is more efficient in terms of compression.
On decoding, the inverse process is used: the coding value is extracted and makes it possible to identify the coded XML event, as well as the complementary information to decode on the basis of the productions.
Furthermore, on decoding, the same grammar evolution rules are used, making it possible at any time to have a set of grammar rules and productions identical to that which was used on coding.
Two types of alignment are provided by the EXI format to represent the coding values: a byte-aligned format, for which the coded data are aligned with the bytes, that is to say that each new coded value starts on a new byte, and the bit-packed format, for which the data are coded over the smallest possible number of bits, without any padding bits. In the latter mode, the coded data does not therefore necessarily start on a new byte. The first mode may be decoded more efficiently, whereas the second produces files that are more compressed.
To obtain more efficient coding or decoding operations, it has been seen that it is advantageous to configure, in advance, the software engine for coding or decoding with external configuration data representative a priori of the documents to process.
It has thus been envisaged to appropriate the use of the XML
Schema structure description files to that end.
The XML Schema standard or XML schema defines a language enabling the structure of one or more XML documents to be described, through an XML schema document. The latter describes the set of the elements and the attributes which may be present in an XML document in accordance with that XML Schema document, as well as certain relationships between those elements and those attributes.
The conventional use of this standard, concerns the validation of XML documents as being in accordance with the expected structure (that described by the XML schema document).
Due to its descriptive nature, the XML schema document appears to be a good candidate for configuring EXI processors.
More particularly, as an XML schema describes a data model, it is possible to construct grammars representing this model when initializing the coder/decoder. The coding (or decoding) operations by the coder are then reduced to the mere translation of the description in terms of grammars and productions.
The creation of grammars from an XML schema is in particular
described in the EXI specification.
When no XML schema is available for a given XML document, it is possible to build efficient grammars for this document, as disclosed in US Patent Application No 2010/0153837 wherein document statistics are used for building grammars.
However, while EXI has proved to be efficient for transmitting XML documents by reducing the amount of transmitted data, there is a continuous need to decrease the amount of data to be transmitted.
With this objective, the invention is directed to solving this drawback of the prior art so as to improve the compression ratio of documents of the XML type having similar structure.
To that end, the present invention concerns in particular a method of encoding a document to be transferred from a first unit to at least one second unit, the first unit comprising an encoder, the document being similar to another document and the document and said another document being documents conforming to a markup language, the method comprising the steps of: -encoding and providing said another document to the at least one second unit, said another document being encoded according to an encoding method based on the use of at least one grammar produced during the encoding step, said another document forming a reference document; -generating at least one grammar, the at least one generated grammar being based on said at least one grammar produced during the encoding step; -encoding said document according to said at least one generated grammar; and, -providing said document encoded according to said at least one generated grammar to the at least one second unit, along with a reference to the reference document.
Therefore, according to the method of the invention, the compression ratio of documents conforming to a markup language, for example documents of the XML type, is improved by encoding a document by reference to a similar document. As a consequence, the method of the invention allows optimization of the use of a bandwidth of a communication link when providing similar documents and of storage means when storing similar documents.
Advantageously, the method further comprises the steps of obtaining at least one structural information item of said another document and providing the obtained at least one structural information item to the at least one second unit, said at least one generated grammar being based on said at least one grammar produced during the encoding step and on the obtained at least one structural information item. Accordingly, the compression ratio of the documents having similar structure is further improved by taking into account structural information of the documents.
According to a particular embodiment, the at least one structural information item is obtained from a schema associated with said another document, from an application that generated said another document, or from statistics established on the basis of documents similar to said another document. The use of a schema, when available, is an efficient way of obtaining structural information of a corresponding document.
Advantageously, the method further comprises a step of filtering the obtained at least one information item, the step of providing the at least one structural information item to the at least one second unit consisting in providing the filtered obtained at least one structural information item to the at least one second unit, said at least one generated grammar being based on said at least one grammar produced during the encoding step and on the filtered obtained at least one structural information item. Accordingly, only the relevant structural information items are used for further improving the compression ratio of the documents. The filtering step is preferably based on a mode of encoding the document to be transferred, and/or on a measure of coding efficiency of the at least one structural information item.
In a particular embodiment, the method further comprises a step of encoding the filtered obtained at least one structural information item before it is provided to the at least one second unit so as to further improve the compression ratio of the documents. The encoding of the filtered obtained at least one structural information item is preferably based on said at least one grammar produced during the step of encoding said another document so as to optimize the use of the grammars and the compression ratio of the documents.
Advantageously, the method further comprises a step of adding the encoded filtered obtained at least one structural information item or a reference to the encoded filtered obtained at least one structural information item in said document before it is provided to the at least one second unit so that the document can be decoded efficiently by the at least one second unit.
Another object of the invention is a method of decoding a document in a second unit, the document being obtained from a first unit, the second unit comprising a decoder, the document being similar to another document and the document and said another document being documents conforming to a markup language, the document being encoded according to the method described above, the method comprising the steps of: -obtaining said another document, said another document being encoded according to an encoding method based on the use of at least one grammar produced during the encoding or decoding of said another document, said another document forming a reference document; -generating at least one grammar, the at least one generated grammar being based on said at least one grammar produced during the decoding of said another document; -obtaining the document to be decoded; and, -decoding said document according to said at least one generated grammar.
Therefore, according to the method of the invention, the compression ratio of documents conforming to a markup language, for example documents of the XML type, is improved by decoding a document by reference to a similar document. As a consequence, the method of the invention allows optimization of the use of a bandwidth of a communication link when obtaining similar documents and of storage means when storing similar documents.
In a particular embodiment, the method further comprises a step of storing the at least one grammar produced during the decoding step of said another document so that it can be used directly for decoding said document.
Advantageously, the method further comprises a step of identifying a reference to the reference document of which at least one grammar produced during the decoding of said another document is used to generate said generated grammar used for decoding the document. Accordingly, the generated grammar required for decoding the document may be efficiently retrieved.
In a particular embodiment, the method further comprises a step of obtaining at least one structural information item, said at least one generated grammar being based on said at least one grammar produced during the decoding of said another document and on the obtained at least one structural information item. Accordingly, the compression ratio of the documents having similar structure is further improved by taking into account structural information of the documents.
Advantageously, the method further comprises a step of decoding the obtained at least one structural information item so as to further improve the compression ratio of the documents.
According to a particular embodiment said document and said another document are XML documents, said another document being encoded
according to the EXI specification.
Still according to a particular embodiment, indexing tables are used in conjunction with grammars to further improve the compression ratio of the documents.
It is another object of the invention to provide a computer program product readable by a microprocessor, comprising portions of software code adapted to implement each step of the method described above when the program is loaded and executed by the microprocessor.
It is still another object of the invention to provide an encoder and a decoder comprising means for carrying each step of the method described above.
The advantages provided by the computer program product, the encoder, and the decoder are similar to the ones described above in relation with the method of the invention.
Still other particularities and advantages of the invention will appear in the following description, illustrated by the accompanying drawings, in which: -Figure 1, comprising Figure Ia and Figure Ib, illustrates some of the processing steps for encoding XML documents to be transmitted, according to the invention; -Figure 2 illustrates the steps of obtaining structural information items of an XML document and of filtering the obtained structural information items, as shown in Figure la; -Figure 3 illustrates, in flow diagram form, some of the processing steps for decoding a document that has been previously encoded and transmitted according to the invention, for example according to the algorithm illustrated in Figure Ia; and, -Figure 4 shows a particular hardware configuration of an information processing device adapted for an implementation of the methods according to the invention.
In broad terms, the invention aims at increasing the compression ratio of documents conforming to a markup language, in particular of XML documents, transmitted via a communication network from a first unit or a first device to another one or to a plurality of other units or devices, as documents encoded according to a predetermined specification of binary markup language, in particular according to the EXI specification (referred to below as an EXI document), when these documents conforming to the markup language share, at least partially, a similar structure.
In an alternative embodiment, the invention aims at increasing the compression ratio of documents conforming to a markup language, in particular of XML documents, compressed by a first unit of an apparatus and stored in a second unit (storage unit) of the apparatus, for subsequent decoding.
For sake of illustration, the following description mainly concerns the coding/decoding of documents to be transmitted between two devices.
However, it is to be understood that a similar process applies for coding/decoding documents in units of an apparatus where they are to be stored.
Accordingly, the first device, used for transmitting documents in a compressed format, comprises an encoder for encoding, for example, the XML documents to obtain corresponding EX1 documents, while the second device, to which the documents have to be sent in the compressed format, comprises a decoder for decoding, for example, received EXI documents to obtain the corresponding XML documents.
The invention is particularly useful for transmitting information between a data source (also referred to as a producer) and a data addressee (also referred to as a consumer) using web feeds such as Atom or RSS. In such a scenario, each update from the producer typically takes the form of a feed entry, either a new feed entry for new information items or an updated entry for modified information items.
It has been observed that in such a scenario, the producer and the consumer exchange many documents, each document containing one or more feed entries having very similar structures. However, an XML schema describing these entries is generally not available on both sides of the transmission path (producer and consumer) and therefore cannot be used to improve the encoding of the entries. Typically, the XML schema is only available on the producer side.
Furthermore, even if an XML schema is available, it describes a very open structure for the entries in order to allow many different usages of the format. However, in most cases, the entries produced by one source share a regular structure. For example, while the XML schema for Atom allows the elements making the content of an entry appear in any order, these entries are generated in most cases by applicative code that generates the elements in a well-defined order. This means that a more restricted schema can be used, enabling a better encoding of the entries, resulting in decreasing the amount of data to be transmitted.
According to the invention, a first XML document, used as a reference document and referred to below as an XML reference, is transmitted from the first device to the second one as an EXI document encoded according to the EXI specification, in a conventional way, using grammars.
The grammars used to process this reference document, that is to say to encode the XML reference during an encoding step performed in the first device and to decode the corresponding reference EXI document during a decoding step performed in the second device, are kept as reference on both the encoder and the decoder. Alternatively, references to these grammars can be kept in lieu of the grammars themselves.
When following XML documents are transmitted from the first device to the second one, the grammars used for processing the XML reference are reused to encode and decode these following XML documents. This enables an increase in compression ratio between XML documents and corresponding transmitted EXI documents as there is no need to retransmit information contained in these grammars.
In addition, some indexing tables used to process the reference document can also be kept as reference and reused while processing following documents.
In addition, the encoder can use external information to improve these grammars and/or indexing tables. This external information needs to be transmitted once to the decoder, but can then be used to process some or all of the further documents transmitted.
As a variant, some of the grammars or indexing tables may be discarded. For example, if a grammar corresponds to the default grammar as defined by EXI, or if it corresponds to the structure of an element which appears to have a high degree of variability, it may not be useful to reuse it. An indexing table corresponding for example to local names, but containing no entry, could also be discarded.
According to the nature of the XML documents to be transmitted, it is possible to use several XML references. Each XML reference should comprise an indication so that it may be identified as an XML reference and handled as such in a decoder.
The encoding/decoding information items used according to the invention that must be transmitted from the encoder to the decoder are preferably encoded in a specific user field inside the EXI header. This user field is composed of a main element, referred to, for example, as "grammarReuse", containing other elements used for describing the encoding/decoding information items. It is of the following form: cgr;grammarReuse xmlns:gr="http://crf.canon.fr/grammarReuse"> </gr:grammarkeuse> It is to be noted that this main element name ("grammarReuse") is given for the sake of illustration. It is advantageously chosen to be more compact when stored as a string.
It is also possible to encode the encoding/decoding information items inside the field known as "schemald' in the EXI header.
An efficient way to transmit the encoding/decoding information items is to encode the "grammarReuse" element and all its content according to the EXI specification, using schema-based grammars built according to the structure of that element (the grammars are created from the XML Schema describing the "grammarReuse" and its content) and indexing tables initialized from the reference document used. Using the indexing tables from the reference document enables encoding of the description of the elements and attributes from the transmitted document without having to encode their URIs and Names as they are defined by the indexing tables corresponding to the reference document used. Next, "grammarReuse" element and all its content information items are included, without any header, in a specific user field inside the EXI header of the main document.
An XML schema defining the structure of the "grammarReuse" element and its content is given in Appendix, as Code 1. Other structuring of the information items is possible.
When an XML document is selected as an XML reference for encoding other documents, this information should preferably be transmitted to the decoder along with these other documents (otherwise, the decoder should keep the grammars and indexing tables from all the documents it receives from the encoder).
An XML document considered as an XML reference may be transmitted as an EXI document having a particular indication. Such an indication can be an element of the "grammarReuse" element of the EXI header.
For example, the empty element "cgr:reference/>" can be added to the "<gr:grarnmarReuse>" element.
According to this example, a reference document is indicated with the following XML snippet: <gr:grammarReuse xmlns:gr= "http://crf.canon. fr/grammarReuse"> <gr:reference/> </gr:grammarReuse> In other words, the presence of this element means that both the encoder and the decoder should consider this document as a reference document. Therefore, a reference to this document is preferably added to a list of reference documents kept up to date independently by the encoder and the decoder.
Then, to specify that a document is encoded using a specific reference document, a specific element of the "grammarReuse" main element of the EXI header is preferably used. For example, the "cgr:use>" element containing the index of the reference document in the list of reference documents can be used.
As an option, it is possible to specify both absolute references and relative references. An index value greater or equal to zero can be used as an index in the list while a strictly negative value can be used as an index in the reversed list (-1 corresponding to the last element in the list and -list_size representing the first element of the list, IisL size representing the number of reference in the list of references).
To optimize the compactness of the index encoding, if only absolute references are used, the index is advantageously encoded as a bounded integer ranging from 0 to (IisL size -1). When using both absolute and relative references, the index is preferably encoded as a bounded integer ranging from (-IisL size) to (IisLsize -1).
Several possibilities exist for improving such a mechanism, that is to say for reducing the size of the list of references.
A first solution consists in removing a reference document from the list. This can be performed by clearing the list of reference documents when the encoder and the decoder do not exchange documents for a predefined period of time. Any other scheme for clearing or removing entries from the list of reference documents is possible.
Another solution consists in using a n element of the "cgr:deleteReference>" type, containing the index of a reference document to be removed from the list of reference documents. Accordingly, the encoder may remove a reference document from its list and transmit the "ccgr:deleteReference>" element to the decoder with a reference to this reference document so that the decoder may remove it from its list of reference documents.
These solutions give good results in controlled environments where both an encoder and a decoder can easily track the exchanged reference documents. However, in a more open environment where many exchange possibilities exist, these solutions may lead to complex management of the reference documents. By way of illustration, if the encoder is integrated into a web server, it would be costly for it to track the list of reference documents for all the different web clients it is transmitting documents to. In addition, it would mean encoding the same document in different ways depending on the reference documents available at the decoder side.
In such a case, reference documents are still advantageously marked with the "cgr:reference/>" empty element but references are not expressed using an index but rather a URI or any other naming scheme suitable for the usage scenario.
The decoder can use this URI to obtain the reference document from its local storage (if it has received it and stored it). Otherwise, it can retrieve the reference document by dereferencing the URI.
In the interest of compactness, the URI can be specified as a relative URI, based on the URI of the document containing it.
According to such a solution, the encoder is able to use the same list of reference documents for all its web clients: if a client receives a document encoded using a reference document it has not yet received, it is able to request it from the server.
In addition, the storage policy for reference documents can reuse the validity information contained in HTTP headers.
For optimizing the encoding of a document to be transmitted, the encoder should be able to identify the best reference document, or at least a good reference document, that is to say to determine to which reference document, amongst a plurality of reference documents, the document to be transmitted is the most similar to, or at least a reference document that is similar to the document to be transmitted.
First of all, it is noted that due to its configuration, the encoder may know the type of each document. For example, an encoder may have to handle only one type of document. As another example, an encoder inside a web server may use the document type information to sort documents.
However, in some cases, the encoder does not have external information about the type of the documents, or this external information may not be sufficient for the invention to be efficient. For example, an encoder may know that a document is of the xhtml type (xhtml being the acronym for "Extensible HyperText Markup Language"). However, the structure of an xhtml document is very variable and the invention would not be very efficient by using this single information for estimating similarity.
To enable documents to be grouped according to their similarity, some hand-made configuration may be carried out. For example, in the case of documents of the xhtml type, this configuration may be done by grouping the documents depending on their content. For the sake of illustration, documents representing albums (i.e. groups of photos) form a first group while documents representing photos form a second group in a photo-album application.
Another solution concerns the use of similarity measures to detect similar documents.
An example of similarity measures is directed to the comparison of simplified structure of documents. To that end, after having extracted the structure of each analyzed document by removing its content (i.e. removing the textual content and the attribute values), the structures are simplified by detecting repetitions of elements or groups of elements. Next, these simplified structures are compared by counting the number of differences between them.
According to this example, the similarity measure is the ratio of the number of differences over the size of the structures.
As mentioned above, the invention can also be used for storing several XML documents at the same location. A first XML document, used as a reference document, is encoded, using the EXI standard specification, in a first unit and provided to a second unit to be stored. The first and second units typically belong to the same apparatus.
When a second XML document, similar to the first one, is added to the storage, it is encoded using the invention.
In such a scenario, it is possible to store the reference relations, as well as the structural information, separately from the documents. Moreover, as the encoder and decoder share the same information about reference documents and additional structural information, it is easy to remove reference documents and change the structural information used.
In addition, it is easier to use statistics over several documents for finding structural information. A first case is when several similar documents are stored together in the system. A first document is chosen as a reference, and all the documents are studied to find the relevant structural information. A second case is when several similar documents are already stored. Statistics can be extracted from them, enabling structural information to be found. Those documents can then be re-encoded using that structural information to achieve better compression.
Figure 1, comprising Figure 1 a and Figure 1 b, illustrates some of the processing steps for encoding XML documents to be transmitted, according to the invention.
More precisely, Figure Ia illustrates, in flow diagram form, some of the processing steps for encoding XML documents to be transmitted, according to an example of use of the invention while Figure lb illustrates schematically how grammars and indexing tables used for encoding XML documents are generated from a reference document and additional information.
As shown in Figure la, after an XML document to be transmitted or transferred has been selected (step 100), a test is performed to determine whether or not that document is a reference document (step 105). Such a test may be based upon a choice by a user or may be performed automatically. For example, the document is compared to each document from a list of reference documents associated with the encoder. If a measure of similarity between the document to be transmitted and one of the documents of that list is greater than a predetermined threshold, then the document from the list is used as a reference document for the document to be transmitted. Otherwise, the document to be transmitted is considered as a reference document.
If the document to be transmitted is considered as a reference document (Figure ib, reference 170), the encoder encodes this XML document (step 110) so that it can be used as reference (XML reference). This encoding step is performed using the EXI specification in a standard way (Figure ib, step 172), in a schema-less mode, that is to say in a mode in which the grammars to be used are built from the document and not from an XML schema.
As stated above, an indication is preferably added to the EXI header of the encoded document in order to indicate that it is a reference document.
At the end of this step, some of the grammars and indexing tables (i.e., encoding/decoding information, referred to as 174 and 176 in Figure Ib) are kept in order to be retrieved while encoding a following similar document.
Several ways of keeping grammars and indexing tables are possible: -the encoder saves the grammars and the indexing tables in memory; -the encoder serializes the grammars and the indexing tables in a logical stream representing these grammars and indexing tables. Such a logical stream can be stored in a file, a database, or any similar means. The grammars and the indexing tables are later rebuilt by de-serializing the corresponding logical stream; or -the encoder saves this XML reference so that the grammars and the indexing tables are rebuild by re-encoding this document, possibly using a simplified process targeting only the rebuilding of the grammars and the indexing tables and not doing the full encoding process.
It is noted that the EXI specification aims to index many types of values for improving the compactness of XML documents. In the context of the invention, the created indexing tables may vary in usefulness. Accordingly, some of these tables may be discarded.
The indexing tables corresponding to element and attribute names (i.e. the indexing tables used for URIs and local names) can be considered as being very useful because they correspond to the structure of the document which is the common part of similar documents. Conversely, the indexing tables corresponding to content (i.e. element textual content and attribute values) are of less importance because they correspond to information that can vary a lot between similar documents.
As a consequence, by default, only the indexing tables corresponding to URIs and Local Names are kept for the reference document.
The other indexing tables are discarded after processing each document.
Regarding the indexing tables corresponding to prefixes, they are empty if prefixes are not preserved by EXI. If prefixes are preserved, these tables are not empty but generally contain few entries. Accordingly, these tables are preferably discarded (Figure 1 b, reference 178).
As an option, it is possible to specify further tables that should be preserved. For example, if prefix usage is important, then the prefix indexing tables can be preserved. Furthermore, some elements or attributes can take their values in a restricted set. In such a case, the indexing table corresponding to this content can be preserved. The elements or attributes taking their values from a restricted set can be determined either by some external information about the document, or by comparing the number of different values and the number of instances of occurrence for an element or attribute. The preservation of optional tables is advantageously specified jointly with the specification of a document as a reference.
Returning to Figure Ia, in a following step (step 115), structural information items of the document to be transmitted are obtained. It is to be noted that the structural information (referred to as 1 80 in Figure 1 b) can be obtained from several sources.
Figure 2 illustrates the step 115 of obtaining structural information items of a document and the step of filtering the obtained structural information items.
Structural information items can be obtained from an XML schema that describes the structure of the processed XML document and therefore contains much information about the organization of elements and attributes and about the type of the content. Accordingly, if the XML schema is not used to encode the associated XML document to be transferred (because, for example, the XML schema is not available at the decoder side), the XML schema can be used to obtain structural information of the XML document to be transferred.
In addition, information about the values of elements and attributes can also be extracted from the XML schema. For example, the fact that a value is an enumeration of predefined values can be extracted, to better encode the value.
To that end, the XML schema associated with the document to be encoded is obtained (step 200) and then the structural information items contained in this XML schema are extracted (step 205).
Other sources of structural information exist. They can be used in a complementary or alternative way (as suggested by dotted arrows).
For example, the construction process of an XML document can be seen as a potential source of structural information. For the sake of illustration, if a value is given with a specific type (using for example a typed API, or by passing a typed value) instead of the default string type used by XML specification, then this value can be considered as being of this specific type and thus, it can be encoded accordingly.
Another possibility is to use data detectors to determine types of values. The used data detectors may parse values to determine whether or not they match specific types. For the sake of illustration, the date format of values can be determined in such a way, by testing the most common formats for displaying dates. Likewise, other values such as integers and floating point values can be detected. Moreover, the name of the containing element or of the attribute may help in selecting the most relevant data detectors. As an example, if the name of an attribute is "date", its content is probably of the date type.
Still another possibility is to extract structural information from templates used to create documents. Many Web Application Frameworks contain a template mechanism for generating documents. Those templates contain static parts and dynamic parts. The static parts do not change from one document to the other while the dynamic parts, used to include information that depends on the user request, can change.
For the sake of illustration, when considering the generation of an Atom feed from a template, the template would contain a static part describing the structure of the feed and dynamic parts for transferring information specific to the feed, such as the feed title, its update date and so on. In addition, it would contain a specific dynamic part for expressing the characteristic whereby the feed can contain several entries. For each entry, there exists a static part defining the structure of the entry and dynamic parts for transferring the information specific to each entry.
By analyzing such templates, it is possible to obtain information regarding the structure of the generated documents. For example, the possibility of repeating a given element is contained in the template. The hierarchy of the elements and their order are also contained in the template.
The fact than an element is mandatory or not is also contained in the template.
Furthermore, type information can also be extracted by using formatting commands used to integrate dynamic values in the templates.
Finally, the encoder can also rely on statistical information from a set of documents to determine some structural information. Such statistical information can include the possible child elements for an element, their order, whether a given element may be repeated or not, and so on. It can also include the possible attributes for an element, and the type of their values, as well as the type of textual values contained in elements. Statistical information can also be used to determine whether certain attributes or textual values are restricted to a set of values and therefore whether the corresponding indexing tables should be retained or not.
Therefore, structural information contained in the document itself, or in a set of similar documents, can be extracted (step 210): structural information items are based on characteristics and statistics obtained from a set of documents which includes at least the document to encode (and preferably other documents to be transmitted).
After the structural information of the document to be transmitted is obtained, the obtained structural information items are filtered in order to keep only useful structural information (step 120). Figure 2 illustrates an example of some steps of filtering structural information items.
A first filtering step can be based, for example, on the encoding mode used for encoding the document.
If the encoding mode corresponds to the strict EXI encoding mode, the structural information items, guessed from the reference document or obtained through statistics, that are considered to be uncertain, should not be used.
Alternatively, if the encoding mode corresponds to the EXI lax mode, all the structural information items can be used. However, some structural information items may be tagged with confidence marks, the information items associated with a confidence mark that is below a predefined value being discarded. Such confidence marks can be set, for example, by statistical analysis, depending on the number of used samples. It can also be set by a data detector, still using the number of samples used, mitigated by the type of the data detector. For example, an integer data detector may set low confidence marks, as an integer may be a special case of floating point value, whereas a float data detector may set higher confidence marks, as there is less ambiguity regarding the type of floating point values.
Accordingly, the algorithm checks whether or not the strict mode is used (step 215). If the strict mode is used, all uncertain structural information items are removed (step 220), as in this mode, unexpected data in the document cannot be encoded. If the lax mode is used, structural information items with low-confidence are discarded (step 225). This is carried out by removing all items for which the confidence value is below a predefined threshold. As a special case, such a threshold could also be applied to the strict mode case, but with a very high threshold (for example, keeping only structural information with at least a 95% confidence value).
According to the given example, a second filtering step is based on the efficiency of each structural information item (steps 230 and 235). The efficiency of a structural information item is computed as the difference between the encoding cost for encoding that information item and the expected encoding improvement associated with it for the documents to be transmitted. The expected encoding improvement can be measured using the kind of structural information item and the corresponding occurrence number in the reference document.
As a variant, the expected encoding improvement can be measured on the basis of several documents by using the average number of occurrences of structures related to the structural information items in those documents.
As another variant, the expected encoding improvement can be measured on the basis of an expected number of occurrences of structures related to the structural information items in the document to be transmitted, measured either from the reference document or from a set of documents, multiplied by the expected number of documents to be transmitted corresponding to the reference document. This variant is the most precise one, but the expected number of documents to be transmitted may not be available in all cases.
The limit for rejecting a structural information item may be zero: if the encoding cost is greater than the expected encoding improvement, then this information item has to be discarded.
However, it is possible to use a more conservative strategy in which the limit is greater than zero: the encoding cost must be strictly greater, by a predefined value, than the expected encoding improvement.
In addition, structural information items discarded in a first analysis (for example, while selecting them to encode the first document using a given reference document), but close to the limit may be kept to be further evaluated as a function of several documents. Then, after encoding a few documents, another evaluation of these information items is made, and those selected by this second evaluation are then used for encoding further documents and are therefore transmitted to the decoder. Such a filtering step is suggested in Figure I a by the dotted arrow.
It is to be noted that several filtering steps can be performed.
Accordingly, the two filtering steps illustrated in Figure 2 by references 215-225 and 230-235 can be both performed or only one of these steps can be performed as illustrated with dotted arrows. Other filtering steps may also be implemented.
Returning to Figure la, after having filtered the structural information items, the obtained structural information is encoded (step 125) so that it can be transmitted to the decoder for decoding further documents. To that end, the encoder preferably uses the kept indexing tables associated with the reference document in order to obtain more compact encoding (element and attribute names can be encoded using those tables): most of the structural information items used by the invention relate to an element or to an attribute. Therefore, to specify a structural information item, it is necessary to refer to the corresponding element or attribute, which is defined by its name. Since encoding the name would be costly, a more compact encoding can be advantageously used. Such compact encoding is preferably provided by the indexing tables from the reference document: as the element or attribute was used in this reference document, its name was indexed during the encoding of the reference document.
In addition to specifying the name of the element or attribute the structural information item relates to, it is also necessary to specify the type of the item of the structural information, as well as all its relevant details. Common examples are described below.
First, the grammars created from the reference document do not specify any order for the sub-elements contained in an element. Specifying that the children of an element form a sequence with a well-defined order helps to make the encoding of that element more compact. This can be achieved using the following "cgr:sequence>" element: <gr:sequence name= "element-name"> repetitions'c/gr:sequence> where the content of the "cgr:sequence>" element is empty or comprises a set of elements specifying the repetition possibilities for each sub-element composing the sequence. For each sub-element, the repetition specification is one of the following elements: -"ccgr:optional name="element-name">" if the sub-element is optional (zero or one occurrence); -"cgr:unique name="element-name">" if the sub-element must appear once and only once (one occurrence); or -"cgr:repeated name="element-name">" if the sub-element can appear any number of times (zero to an unbounded number of occurrences).
If nothing is specified for a sub-element, it is considered as being allowed any number of times.
In all these elements, the type of the "name" attribute is defined to be a qualified name, enabling to use the indexing tables kept from the reference document. This is also the case for the other attributes containing element, attribute, or type names described hereafter.
A second useful information item is the type of an attribute.
Specifying the type of an attribute enables it to be encoded more efficiently. In addition, it can be useful to specify whether an attribute is required or not. This can be specified as follows: <gr: attribute elem en t= "element-name" name="attribute-name" type= "type-name">requirement</gr:attribute> The value of the attribute "type", indicated by "type-namd' above, is the name of the attribute's type. This attribute "type" is defined as being a qualified name, and therefore, this type is encoded using the indexing tables from the reference document. As those indexing tables were generated for the schema-less encoding mode, they do not contain the name of the types defined in the XML schema specification. Therefore, for the purpose of encoding the structural information items, these indexing tables are enriched with these type names to enable encoding the attribute's type names more efficiently.
Optionally, the indexing tables could be enriched during their selection step as described previously.
An attribute can be specified as required by adding an element "cgr:required>" inside the <gr:attribute>" element specifying that attribute.
Next, the grammars and the indexing tables used for encoding the document at step 110 are retrieved and modified according to the selected structural information, that is to say according to the filtered structural information. In doing so, new grammars and new indexing tables are generated (step 130). These new grammars and new indexing tables (referred to as 182 and 184 in Figure ib) should be used to encode further documents having a structure similar to that of the processed document (to be transmitted and used as reference).
The document or a reference to the document is then added (step 135) into a list of reference documents associated with the encoder along with the encoded filtered structural information, the generated grammars, the retained indexing tables and, preferably, the obtained structural information (so that the structural information items can be filtered again later).
The encoded document is then transmitted (step 165). It is to be noted that the document may be sent as soon as it encoded, that it is to say after step 110 is performed.
In addition, the steps 115 to 135 could be delayed, and be executed only before sending another document using that document as a reference. In such a case, those steps would be executed between the steps 140 and 145 described hereafter, for the first document using this document as a reference.
In such a case, during step 135, the document or a reference to the document is added to a list of reference documents associated with the encoder, along with the generated grammars and the retained indexing tables.
If the document to be transferred should not be considered as a reference document (step 105), a reference document has to be selected (step 140). As described above, the selected reference document is preferably the document of the reference document list associated with the decoder, whose structure is the most similar to that of the document to be transmitted.
After having selected a reference document, the document to be transmitted is encoded (step 145). To that end, the grammars generated in relation with the selected reference document (at step 130) are used.
These grammars can be seen as a special case of schema-based grammars. Therefore, two options are possible for using these grammars: either they are used in a strict mode not allowing any deviation from the structure described in the grammars, or they are used in a lax mode allowing deviations.
Preferably those grammars are used in a lax mode, as their building process may result in the lack of some structural information items. For example, an optional element may not be present in the reference document, nor in the structural information used, and therefore will not be expected by the grammars.
Using the strict mode would prevent encoding of that optional element, while using the lax mode would let this optional element to be encoded as a deviation.
However, in some cases, those grammars may be known as describing all the possibilities for the documents and they can then be used according to the strict mode. Additionally, the encoder may try to use the strict mode and revert to the lax mode in case of failure.
The encoded structural information that has been used to generate the grammars used for encoding the document to be transmitted should be transmitted to the decoder so that the decoder may generate the grammars that should be used for decoding the received document.
Accordingly, a test is performed to determine whether or not the encoded filtered structural information has already been sent to the decoder (step 150). Such a test may be based, for example, on the value of a flag associated with a decoder, the flag being initialized or set to a first value when the structural information items are filtered, and being set to a second value when these structural information items are sent to the corresponding decoder.
If the encoded filtered structural information has not already been sent to the decoder, it is added or attached to the encoded document (step 155) to be transmitted along with the encoded document.
Conversely, if the encoded filtered structural information has already been sent to the decoder, a reference to this information is added to the encoded document (step 160) so that the decoder may retrieve it. Such a reference may consist, for example, of a reference to a previously sent encoded document comprising this encoded filtered structural information.
The encoded filtered structural information or a reference to the encoded filtered structural information can be added, for example, in the header of the document, either in the schemald field of the EXI header, or in a specific
user field.
Then the encoded document is transmitted (step 165). It is typically transmitted to a device comprising a decoder according to the invention.
Figure 3 illustrates, in flow diagram form, some of the processing steps for decoding a document that has been previously encoded and transmitted according to the invention, for example according to the algorithm illustrated in Figure la.
After having received a document to decode (step 300), a test is performed to determine whether or not the received document has been encoded using a reference document (step 305). Such a test can be based upon the content of the header of the received document.
If the received document does not use a reference document, it is decoded using the EXI specification in a standard way (step 310), in a schema-less mode, that is to say in a mode in which the grammars to be used are built from the document and not from an XML schema. As described below by reference to step 350, the grammars and the indexing tables generated during the decoding of the received document may be stored if that document is a reference document.
If the received document is encoded using a reference document, the reference document to be used for decoding the received document is retrieved (step 315). As described above, a reference to the reference document is preferably found in the received document header.
Once the reference document has been identified, the grammars and the indexing tables associated with that reference document are retrieved (step 320).
A test is then performed to determine whether or not filtered structural information relative to the identified reference document has been received jointly with the received document (step 325).
If filtered structural information has been received jointly with the received document, it is obtained from the received document. As described above, such information is preferably encoded in the received document or attached to the received document and encoded with the encoding information associated with the reference document. Accordingly, the encoding information is retrieved from the reference of the reference document and the filtered structural information is decoded (step 330) using the information contained in the indexing tables retrieved at step 320.
New grammars and new indexing tables are then generated (step 335) from the grammars and indexing tables retrieved at step 320 and the decoded filtered structural information, as done in the encoder and described by reference to step 130 of Figure Ia.
The received document is then decoded using the generated grammars (step 340).
If no filtered structural information relative to the identified reference document has been received jointly with the received document, the document is decoded (step 340) using the grammars and the indexing tables retrieved at step 320.
A test is then performed to determine whether the decoded document will be used as a reference document or not (step 345).
If this is the case, the grammars and indexing tables used for decoding it are stored (step 350) for later use when decoding another document using that document as a reference.
Further documents similar to a reference document can be processed in several ways: -they can be directly encoded using generated grammars obtained from the modified grammars and indexing tables of the reference document.
This is the most frequent case. In this case, documents to be decoded use the grammars generated for decoding a first document transmitted after transmitting the reference document (step 340 of Figure 3), and no further structural information is to be transmitted with these documents; -they can be transmitted after applying to them the steps used for transmitting a first document following the transmission of the used reference document (possibly selecting a different set of structural information); or, -they can be transmitted after applying to them the steps used for transmitting a first document following the transmission of the used reference document and using another document as a reference instead of the reference document. This enables the compression to be improved step by step by using more and more structural information.
An example of XML document that can be transmitted more efficiently using the invention is given in the Appendix. Code 2 illustrates a file used to evaluate the invention. This file is an Atom feed containing one entry.
Coding the XML version of the document given in Code 2 requires 865 bytes while encoding the corresponding EXI file, without using any schema, requires 543 bytes. Those two figures correspond to the encoding size obtained using state of the art encoding for XML documents.
Encoding the same document with a similar document as a reference, and without any additional structural information requires 428 bytes.
This figure shows that using the invention in a basic mode, where only a reference document is used, but no additional information items are used, already provides an improvement over the state of the art. As described above, encoding may be further improved by using additional structural information that an Atom feed generator may know. Such structural information items may for example be the following: -the order of the elements contained in another element is invariable; -some elements are not repeated (for example the entry title or identifier) and some elements are mandatory (for example the entry title or identifier); -some attributes are mandatory (for example, the href attribute of the link elements); and, -some elements are of predetermined and known type (for example, the updated element is a date).
Such structural information can be transmitted according to the XML snippet of Code 3. The use of these structural information items enables the file to be reduced to 409 bytes. This shows that adding structural information to the reference document further enables the invention to reduce the size of the encoded document.
As described above, the invention can be implemented in devices comprising an encoder and/or a decoder for transmitting XML documents from a device comprising at least an encoder to another device comprising at least the decoder.
The device comprising at least the encoder comprises means for encoding an XML reference according to the EXI specification. According to a particular embodiment, this device may comprise means for storing the grammars used for encoding the XML reference.
The device comprising at least the encoder further comprises means for obtaining structural information associated with the XML reference and for filtering this structural information. This device also comprises means for encoding the filtered structural information (i.e. additional structural information) using the encoding grammars of the XML reference. It further comprises means for generating grammars from the encoding grammars and indexing tables used to encode the XML reference and the additional structural information as well as means for coding at least one other XML document, similar to the XML reference, with the generated grammars and additional means for transmitting information linking this other XML document to the XML reference.
According to a particular embodiment, this device further comprises means for mandating the storage of the first document.
All the means of the device comprising at least the encoder may be implemented with the use of instructions that can be loaded and executed by a programmable device such as the one illustrated in Figure 4.
The device comprising at least the decoder comprises means for storing decoding grammars and indexing tables associated with an XML reference and means for obtaining an encoded document (EXI document). This device further comprises means for extracting a reference to an XML document and retrieving associated decoding grammars. It also comprises means for extracting structural information from a received encoded document, the structural information being encoded using stored indexing tables, as well as means for generating grammars from decoding grammars and, optionally, extracted structural information. This device further comprises means for decoding the received encoded document with generated grammars.
According to a particular embodiment, the device further comprises means for identifying a document as a reference document, that is to say as an XML reference to be used.
All the means of the device comprising at least the decoder may be implemented with the use of instructions that can be loaded and executed by a programmable device such as the one illustrated in Figure 4.
With reference to Figure 4, a description is now given by way of example of a particular hardware configuration of a processing device adapted for an implementation of the method according to the invention.
A processing device implementing the present invention is for example a micro-computer 400, a workstation, a personal assistant, or a mobile telephone connected to different peripherals.
The peripherals connected to the information processing device comprise for example a digital camera 405, or a scanner or any other means of image acquisition or storage, that is connected to an input/output card (not shown) and supplying multimedia data, possibly in the form of XML documents, to the information processing device.
The device 400 comprises a communication bus 410 to which there are connected: -a central processing unit CPU 415 taking for example the form of a microprocessor; -a read only memory 420 in which may be contained the programs whose execution enables the implementation of the method according to the invention; -a random access memory 425, which, after powering up of the device 400, contains the executable code of the programs of the invention as well as registers adapted to record variables and parameters necessary for the implementation of the invention; -a screen 430 for displaying data and/or serving as a graphical interface with the user, who may thus interact with the programs according to the invention, using a keyboard 435 or any other means such as a pointing device, for example a mouse 440 or an optical stylus; -a hard disk 445 or a storage memory, such as a memory of compact flash type, able to contain the programs of the invention as well as data used or produced on implementation of the invention; -an optional diskette drive 450, or another reader for a removable data carrier, adapted to receive a diskette 455 and to read/write thereon data processed or to process in accordance with the invention; and -a communication interface 460 connected to the telecommunications network 465, the interface 460 being adapted to transmit and receive data.
In the case of audio data, the device 400 is preferably equipped with an input/output card (not shown) which is connected to a microphone 470.
The communication bus 410 permits communication and interoperability between the different elements included in the device 400 or connected to it. The representation of the bus 410 is non-limiting and, in particular, the central processing unit 415 unit may communicate instructions to any element of the device 400 directly or by means of another element of the device 400.
The diskettes 455 can be replaced by any information carrier such as a compact disc (CD-ROM) rewritable or not, a ZIP disk or a memory card.
Generally, an information storage means, which can be read by a micro-computer or microprocessor, integrated or not into the information processing device, and which may possibly be removable, is adapted to store one or more programs whose execution permits the implementation of the method according to the invention.
The executable code enabling the processing device to implement the invention may equally well be stored in read only memory 420, on the hard disk 445 or on a removable digital medium such as a diskette 455 as described earlier. According to a variant, the executable code of the programs is received by the intermediary of the telecommunications network 465, via the interface 460, to be stored in one of the storage means of the device 400 (such as the hard disk 445) before being executed.
The central processing unit 415 controls and directs the execution of the instructions or portions of software code of the program or programs of the invention, the instructions or portions of software code being stored in one of the aforementioned storage means. On powering up of the device 400, the program or programs which are stored in a non-volatile memory, for example the hard disk 445 or the read only memory 420, are transferred into the random-access memory 425, which then contains the executable code of the program or programs of the invention, as well as registers for storing the variables and parameters necessary for implementation of the invention.
It will also be noted that the device implementing the invention or incorporating it may be implemented in the form of a programmed device. For example, such a device may then contain the code of the computer program(s) in a fixed form in an application specific integrated circuit (ASIC).
The device described here and, particularly, the central processing unit 415, may implement all or part of the processing operations described in relation with Figures 1 to 3, to implement the method of the present invention.
Naturally, in order to satisfy local and specific requirements, a person skilled in the art may apply to the solution described above many modifications and alterations all of which, however, are included within the scope of protection of the invention as defined by the following claims.
APPENDIX
<schema targetNamespace= "http://crf.canon. fr/grammarReuse" xmlns= "http://www.w3.org/2OO1/XMLSchema> <element name= "grammarReuse"> <complexType> <sequence> <element name= "reference" minOccurs= "O"j'> <element name= "use" minOccurs-"0" type= "integer"!> <element name= "options" minOccurs= "0"> <complexType> <sequence> <element name= "deleteReference" mThQccurs= "0" type= "integer"!> <element name= "sequence" minOccurs= "0" maxOccurs= "unbounded"> <complexType> <choice> <element name= "optional"!> <element name= "unique"!> <element name= "repeated","> </choice> <attribute name= "name" type= "QName'/> <,"complexType> </element> <element name= "attribute" minoccurs= "0" maxQccurs= "unbounded"> <complexType> <sequence> <element name= "required" minOccurs= "0","> <!sequence> <attribute name= "name" type= "QName","> <attribute name= "type" type= "QName","> <,"complexType> <j'element> <!sequence> </complexType> <,element> <,!sequence> </complexType> <jelement> <,schema> Code 1: Additional information item XML Schema <?xml version = "1.0" encoding= "utf-8"?'> <feed xmlns= "http:,/,/www. w3.org,"2005,"Atom"> <title>Example Feed<1/title> <subtitle>A subtitle. </subtitle> <link href= "http://example.org,/feed," rel= "self",!> <link href= "http://example.org,",/> <id>urn:uuid:60a76c80-d399-1 1d9-b91 C-0003939e0af6<,Iid> <updated>2003-12-13T18:30:02Z<!updated> <author> <name>John Doe<,/name> <email>johndoe@example.com </email> </author> <entry> <title>Atom-Powered Robots Run Amok</title> <link href= "http://example.org/2003/12/13/atomO3" I> <link rel= "alternate" type= "text/html" href= "http://example.org/2003/12/13/atomo3.html"/> <link rel= "edit" href= "http://example.org/2003/12/13/atomo3/edit"/> <id>urn:uuid:1225c695-cfbS-4ebb-aaaa-SOda344efa6a</id> <updated>2003-12-13T18:30:02Z</updated>
<summary>Some text </summary>
c/entry> </feed> Code 2: Atom feed sample <gr:grammarReuse xmlns:gr= "http://crf.carion. fr/grammarReuse" xmlns:xsd= "http://www.w3.org/2OO1/XMLSchema" xmlns:atom= "http://www.w3.org/2OO5/Atom"> <gr:use>-1 </gr:use> <gr:sequence name= "atorn:feed"> <gr:unique name= "atom:title"/> <gr:optional name= "atom:subtitle"/> <gr:unique name= "atom:id"/> <gr:unique name= "atom:updated"/> </gr:sequence> <gr:sequence name= "atom:author"> <gr:unique name= "atom:name"/> <gr:optional name= "atom:emall"/> </gr:sequence> <gr:sequence name= "átom:entry"> <gr:unique name= "atom:title"/> <gr:unique name= "atom:id"/> <gr:unique name= "atom:updated"/>
<gr:optional name= "atom:summary7>
</gr:sequence> <gr:attribute element= "àtom:entry" name= "href" type= "xsd:string"> <gr:required/> </gr:attribute> </gr:grammarReuse> Code 3: XML snippet for structural information items
Claims (29)
- CLAIMS1. A method of encoding a document to be transferred from a first unit to at least one second unit, the first unit comprising an encoder, the document being similar to another document and the document and said another document being documents conforming to a markup language, the method comprising the steps of: -encoding (110) and providing (165) said another document to the at least one second unit, said another document being encoded according to an encoding method based on the use of at least one grammar produced during the encoding step, said another document forming a reference document; -generating (130) at least one grammar, the at least one generated grammar being based on said at least one grammar produced during the encoding step; -encoding (145) said document according to said at least one generated grammar; and, -providing (165) said document encoded according to said at least one generated grammar to the at least one second unit, along with a reference to the reference document.
- 2. The method according to claim 1, further comprising the steps of obtaining (115) at least one item of structural information on said another document and providing said obtained at least one structural information item to the at least one second unit, said at least one generated grammar being based on said at least one grammar produced during the encoding step and on the obtained at least one structural information item.
- 3. The method according to claim 2, wherein the at least one structural information item is obtained from a schema associated with said another document, from an application that generated said another document, or from statistics established on the basis of documents similar to said another document.
- 4. The method according to claim 2 or claim 3, further comprising a step of filtering (120) the obtained at least one information item, the step of providing the at least one structural information item to the at least one second unit consisting in providing the filtered obtained at least one structural information item to the at least one second unit, said at least one generated grammar being based on said at least one grammar produced during the encoding step and on the filtered obtained at least one structural information item.
- 5. The method according to claim 4, wherein the filtering step is based on a mode of encoding the document to be transferred, and/or on a measure of coding efficiency of the at least one structural information item.
- 6. The method according to claim 4 or claim 5, further comprising a step of encoding (125) the filtered obtained at least one structural information item before it is provided to the at least one second unit.
- 7. The method according to claim 6, wherein the encoding of the filtered obtained at least one structural information item is based on said at least one grammar produced during the step of encoding said another document.
- 8. The method according to claim 6 or claim 7 further comprising a step of adding (155, 160) the encoded filtered obtained at least one structural information item or a reference to the encoded filtered obtained at least one structural information item to said document before it is provided to the at least one second unit.
- 9. A method of decoding a document in a second unit, the document being obtained from a first unit, the second unit comprising a decoder, the document being similar to another document and the document and said another document being documents conforming to a markup language, the document being encoded according to the method of any one of claims I to 8, the method comprising the steps of: -obtaining (300) said another document, said another document being encoded according to an encoding method based on the use of at least one grammar produced during the encoding or decoding of said another document, said another document forming a reference document; -generating (310, 335) at least one grammar, the at least one generated grammar being based on said at least one grammar produced during the decoding of said another document; -obtaining (300) the document to be decoded; and, -decoding (340) said document according to said at least one generated grammar.
- 10. The method according to claim 9, further comprising a step of storing (350) the at least one grammar produced during the decoding step of said another document.
- 11. The method according to claim 9 or claim 10, further comprising a step of identifying (315) a reference to the reference document of which at least one grammar produced during the decoding of said another document is used to generate said generated grammar used for decoding the document.
- 12. The method according to any one of claims 9 to 11, further comprising a step of obtaining at least one structural information item, said at least one generated grammar being based on said at least one grammar produced during the decoding of said another document and on the obtained at least one structural information item.
- 13. The method according to claim 12 further comprising a step of decoding (330) the obtained at least one structural information item.
- 14. The method according to any one of claims ito 13 wherein the document and said another document are XML documents, said another document being encoded according to the EXI specification.
- 15. The method according to any one of claims 1 to 14 wherein the first unit is a server device and the second unit is a client device, and wherein said document is transmitted from the server device to the client device via a communication network.
- 16. The method according to any one of claims 1 to 14 wherein the first unit is an encoder unit and wherein the second unit is a storage unit belonging to the same apparatus.
- 17. A computer program product readable by a microprocessor, comprising portions of software code adapted to implement each step of the method according to any one of claims 1 to 16, when it is loaded and executed by the microprocessor.
- 18. An encoder for encoding a document to be transferred from a first unit to at least one second unit, the document being similar to another document and the document and said another document being documents conforming to a markup language, the encoder comprising: -means for encoding and providing said another document to the at least one second unit, said another document being encoded according to an encoding method based on the use of at least one grammar produced during the encoding of said another document, said another document forming a reference document; -means for generating at least one grammar, the at least one generated grammar being based on said at least one grammar produced during the encoding of said another document; -means for encoding said document according to said at least one generated grammar; and, -means for providing said document encoded according to said at least one generated grammar to the at least one second unit, along with a reference to the reference document.
- 19. The encoder according to claim 18, further comprising the means for obtaining at least one item of structural information on said another document and means for providing the obtained at least one structural information item to the at least one second unit, said at least one generated grammar being based on said at least one grammar produced during the encoding of said another document and on the obtained at least one structural information item.
- 20. The encoder according to claim 19, further comprising means for filtering the obtained at least one information item, the means for providing the at least one structural information item to the at least one second unit enabling provision of the filtered obtained at least one structural information item to the at least one second unit, said at least one generated grammar being based on said at least one grammar produced during the encoding of said another document and on the filtered obtained at least one structural information item.
- 21. The encoder according to claim 20, further comprising means for encoding the filtered obtained at least one structural information item before it is provided to the at least one second unit.
- 22. The encoder according to claim 21 further comprising means for adding the encoded filtered obtained at least one structural information item or a reference to the encoded filtered obtained at least one structural information item to said document before it is provided to the at least one second unit.
- 23. A decoder for decoding a document, the document being similar to another document and the document and said another document being documents conforming to a markup language, the document being encoded by the encoder of any one of claims 18 to 22, the decoder comprising: -means for obtaining said another document, said another document being encoded according to an encoding method based on the use of at least one grammar produced during its encoding or decoding, said another document forming a reference document; -means for generating at least one grammar, the at least one generated grammar being based on said at least one grammar produced during the decoding of said another document; -means for obtaining the document to be decoded; and, -means for decoding said document according to said at least one generated grammar.
- 24. The decoder according to claim 23, further comprising means for storing the at least one grammar produced during the decoding of said another document.
- 25. The decoder according to claim 23 or claim 24, further comprising means for identifying a reference to the reference document of which at least one grammar produced during its decoding is used to generate said generated grammar used for decoding the document.
- 26. The decoder according to any one of claims 23 to 25, further comprising means for obtaining at least one structural information item, said at least one generated grammar being based on said at least one grammar produced during the decoding of the another document and on the obtained at least one structural information item.
- 27. The decoder according to claim 26 further comprising means for decoding the obtained at least one structural information item.
- 28. A method, device or computer program for of encoding a document to be transferred from a first unit to at least one second unit substantially as hereinbefore described with reference to the accompanying drawings.
- 29. A method, device or computer program for decoding a document as hereinbefore described with reference to the accompanying drawings.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1103550.8A GB2488576B (en) | 2011-03-02 | 2011-03-02 | Method and devices for optimizing storage and transmission of documents of the xml type |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB1103550.8A GB2488576B (en) | 2011-03-02 | 2011-03-02 | Method and devices for optimizing storage and transmission of documents of the xml type |
Publications (3)
| Publication Number | Publication Date |
|---|---|
| GB201103550D0 GB201103550D0 (en) | 2011-04-13 |
| GB2488576A true GB2488576A (en) | 2012-09-05 |
| GB2488576B GB2488576B (en) | 2013-06-19 |
Family
ID=43904430
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| GB1103550.8A Expired - Fee Related GB2488576B (en) | 2011-03-02 | 2011-03-02 | Method and devices for optimizing storage and transmission of documents of the xml type |
Country Status (1)
| Country | Link |
|---|---|
| GB (1) | GB2488576B (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040083219A1 (en) * | 2002-10-24 | 2004-04-29 | Koninklijke Philips Electronics N.V. | Method and system for reducing code in an extensible markup language program |
| US20080281920A1 (en) * | 2007-05-07 | 2008-11-13 | International Business Machines Corporation | Adaptive parsing and compression of soap messages |
| US20090254882A1 (en) * | 2008-04-07 | 2009-10-08 | Canon Kabushiki Kaisha | Methods and devices for iterative binary coding and decoding of xml type documents |
| US20090287625A1 (en) * | 2008-05-15 | 2009-11-19 | Canon Kabushiki Kaisha | Method and device for coding a structured document and method and device for decoding a document so coded |
| US20100083101A1 (en) * | 2008-09-30 | 2010-04-01 | Canon Kabushiki Kaisha | Methods of coding and decoding a structured document, and the corresponding devices |
| EP2312757A1 (en) * | 2009-09-28 | 2011-04-20 | Thomson Licensing | Method and apparatus for compressing XML documents |
-
2011
- 2011-03-02 GB GB1103550.8A patent/GB2488576B/en not_active Expired - Fee Related
Patent Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20040083219A1 (en) * | 2002-10-24 | 2004-04-29 | Koninklijke Philips Electronics N.V. | Method and system for reducing code in an extensible markup language program |
| US20080281920A1 (en) * | 2007-05-07 | 2008-11-13 | International Business Machines Corporation | Adaptive parsing and compression of soap messages |
| US20090254882A1 (en) * | 2008-04-07 | 2009-10-08 | Canon Kabushiki Kaisha | Methods and devices for iterative binary coding and decoding of xml type documents |
| US20090287625A1 (en) * | 2008-05-15 | 2009-11-19 | Canon Kabushiki Kaisha | Method and device for coding a structured document and method and device for decoding a document so coded |
| US20100083101A1 (en) * | 2008-09-30 | 2010-04-01 | Canon Kabushiki Kaisha | Methods of coding and decoding a structured document, and the corresponding devices |
| EP2312757A1 (en) * | 2009-09-28 | 2011-04-20 | Thomson Licensing | Method and apparatus for compressing XML documents |
Non-Patent Citations (1)
| Title |
|---|
| Journal of Computer and System Sciences, 2009, vol 75, pages 303-322, February 2009, Sherif Sakr, "XML compression techniques: A survey and comparison" * |
Also Published As
| Publication number | Publication date |
|---|---|
| GB201103550D0 (en) | 2011-04-13 |
| GB2488576B (en) | 2013-06-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US8341129B2 (en) | Methods of coding and decoding a structured document, and the corresponding devices | |
| US8533172B2 (en) | Method and device for coding and decoding information | |
| US8601368B2 (en) | Processing method and device for the coding of a document of hierarchized data | |
| CN1166072C (en) | Method, apparatus and system for data compression, transmission, storage and communication | |
| US8346737B2 (en) | Encoding of hierarchically organized data for efficient storage and processing | |
| CN105447099B (en) | Log-structuredization information extracting method and device | |
| US9208256B2 (en) | Methods of coding and decoding, by referencing, values in a structured document, and associated systems | |
| US8914718B2 (en) | Coding a structured document as a bitstream by storing in memory a reference to an entry in a coding dictionary | |
| US20040103105A1 (en) | Subtree-structured XML database | |
| US20120310868A1 (en) | Method and system for extracting and managing information contained in electronic documents | |
| US8892991B2 (en) | Encoder compiler, computer readable medium, and communication device | |
| US8949207B2 (en) | Method and apparatus for decoding encoded structured data from a bit-stream | |
| JP2005538436A (en) | Method and apparatus for encoding / decoding structured text, especially XML text | |
| US8234288B2 (en) | Method and device for generating reference patterns from a document written in markup language and associated coding and decoding methods and devices | |
| US20070143664A1 (en) | A compressed schema representation object and method for metadata processing | |
| WO2008051783A2 (en) | Context-free grammar | |
| JP3865694B2 (en) | Path encoding and decoding method in tree structure of structured document | |
| US20090271695A1 (en) | Method of accessing or modifying a part of a binary xml document, associated devices | |
| US20070283245A1 (en) | Event-based parser for markup language file | |
| US9367528B2 (en) | Method and device for document coding and method and device for document decoding | |
| US20120151330A1 (en) | Method and apparatus for encoding and decoding xml documents using path code | |
| CN101986303A (en) | Digital television HSML analysis method and system applying DOM analysis engine | |
| GB2488576A (en) | Compression of XML documents | |
| O'Connor et al. | Desirable properties for XML update mechanisms | |
| Ruellan | XML entropy study |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PCNP | Patent ceased through non-payment of renewal fee |
Effective date: 20170302 |