[go: up one dir, main page]

WO2011067769A1 - Shared dictionary compression over http proxy - Google Patents

Shared dictionary compression over http proxy Download PDF

Info

Publication number
WO2011067769A1
WO2011067769A1 PCT/IL2010/001023 IL2010001023W WO2011067769A1 WO 2011067769 A1 WO2011067769 A1 WO 2011067769A1 IL 2010001023 W IL2010001023 W IL 2010001023W WO 2011067769 A1 WO2011067769 A1 WO 2011067769A1
Authority
WO
WIPO (PCT)
Prior art keywords
sdch
implementing
over http
protocol according
compression over
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/IL2010/001023
Other languages
French (fr)
Inventor
Ariel Yaloz
Roman Shterenzon
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
InfoGin Ltd
Original Assignee
InfoGin Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by InfoGin Ltd filed Critical InfoGin Ltd
Publication of WO2011067769A1 publication Critical patent/WO2011067769A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Links

Classifications

    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/70Type of the data to be coded, other than image and sound
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/957Browsing optimisation, e.g. caching or content distillation
    • HELECTRICITY
    • H03ELECTRONIC CIRCUITRY
    • H03MCODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091Data deduplication
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/02Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/04Protocols for data compression, e.g. ROHC

Definitions

  • the present invention relates to web page compression techniques and in particular to Shared Dictionary Compression over HTTP (SDCH).
  • SDCH Shared Dictionary Compression over HTTP
  • the present invention provides a proxy server which implements the Shared Dictionary Compression over HTTP protocol.
  • a system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol including page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page, compression decision functionality operative to utilize the page metrics to decide whether to compress the web page, feature vector generating functionality operative to generate a feature vector for the web page, web page comparison functionality operative to compare the feature vector of the web page to feature vectors of other web pages and to thereby identify similarities between the web page and the other web pages, web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages, web page cluster assignment functionality operative to utilize the web page comparison functionality to assign the web page to a cluster of similar web pages, dictionary generating functionality operative to generate and assign a dictionary for each of the clusters of similar web pages, and web page compression functionality operative to compress the web page assigned to the cluster using a dictionary assigned to the cluster.
  • SDCH Shared Dictionary Compression over HTTP
  • the system resides on a proxy server.
  • the proxy is operative to intercept requests from at least one web client to at least one web server to download at least one web page.
  • the page metrics include the number of times the specific web page has been requested during a predefined time duration. Additionally, the page metrics also include the size of the average payload downloaded in response to the request. Additionally, the page metrics also include a product of the number of times the specific web page has been requested during a predefined time duration and the size of the average payload downloaded in response to the request.
  • the feature vector includes an array of page characteristics, the characteristics differentiating between groups of pages.
  • the characteristics include at least one of statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords.
  • the feature vector generating functionality includes parsing functionality and tokenizing functionality.
  • the web page cluster generation functionality includes utilization of at least one similarity distance metric.
  • the at least one similarity distance metric is a Euclidian metric.
  • the at least one similarity distance metric is a Pearson metric.
  • calculation of at least one similarity distance metric is achieved by utilizing at least one of Hierarchical Clustering, K-Means and Fuzzy C-Means.
  • the similarities include similarities of web content creation suites used to create the web pages.
  • the web page comparison functionality is operative to disregard private web pages.
  • the private pages include pages to which a client was redirected from an HTTPS connection as specified by a "Referer" request header.
  • the private pages include pages which are accessed after a POST request which resulted in a cookie or in appended request parameters.
  • the private pages include pages which include a "Cache-Control: private" header.
  • the web page comparison functionality is operative to identify the private web pages by accessing URLs associated with potentially private web pages without the use of a session identifier, and comparing received responses to previously recorded responses received when a session identifier was used.
  • the web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages. Additionally, the web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages by utilizing a generic differencing algorithm.
  • the generic differencing algorithm is VCDiff.
  • the dictionary generating functionality is operative to generate and assign a dictionary for each of the clusters which include at least a predetermined number of web pages.
  • the system is operative to store compressed web pages.
  • the system is operative to compress web pages targeted at SDCH enabled clients.
  • the system is operative to identify SDCH enabled clients by analyzing the Accept-Encoding header for presence of the "sdch" string.
  • Fig. 1 is a simplified flowchart indicating steps in the operation of a system implementing the Shared Dictionary Compression over HTTP protocol, constructed and operative in accordance with a preferred embodiment of the present invention.
  • caching techniques such as HTTP caching
  • HTTP caching stores complete web pages or objects, and is therefore ineffective in optimizing bandwidth usage for web sites for which only a small portion of a page is modified between subsequent downloads.
  • CNN.com http://www.cnn.com
  • color scheme are modified very rarely, while what is actually modified are the stories embedded in the page layout. This presents an opportunity to implement a mechanism for caching partial pages.
  • the HTTP/ 1.1 protocol supports response compression via the Accept-Encoding and Content-Encoding headers.
  • the most commonly used HTTP response compression encoding is gzip, which compresses data that is repeated multiple times within a given response.
  • the HTTP/ 1.1 protocol does not provide a mechanism for compressing data that is repeated over multiple responses.
  • delta encoding Another class of encoding techniques, known as delta encoding, has proven to be effective in compressing inter-response data.
  • Previous efforts to extend the HTTP/1.1 protocol to support delta encoding have focused on encoding an HTTP response as a delta, or a difference, between the current response and a previous version of the current response.
  • One such approach is discussed in RFC3229 "Delta encoding in HTTP" (http://www.rfc-editor.org/rfc/rfc3229.txt). While RFC3229 is effective in reducing the size of the downloaded payload for many types of resources, it may not be suitable for certain classes of responses. Specifically, under RFC3229, deltas can only be applied to responses originating from the same URL.
  • the previously stored instance to which to apply the delta to recreate an entire response is identified using its Last-Modified timestamp or entity-tag.
  • Content hashes can be used to identify previously stored instances, however this may result in retrieving false positives. Additionally, storing all previous responses on the server may not be practical.
  • SDCH Shared Dictionary Compression over HTTP
  • a dictionary used by the SDCH protocol is a file downloaded by the client from the server which contains strings or elements that are likely to appear in subsequent HTTP responses. These elements can be stored in the dictionary which is available to both the client and to the server, and the server can substitute these elements with references to the dictionary, allowing the client to reconstruct the original page from these references. By substituting dictionary references for repeated elements in HTTP responses, the payload size can be reduced.
  • SDCH can only be beneficial when both the client (e.g. web browser) and the web server support the SDCH protocol.
  • adding support for the SDCH protocol requires substantial investment in the content generating software as well as in content generating work flows, and therefore content providers are reluctant to do so. Therefore, for example, users browsing CNN.com with an SDCH enabled web browser cannot benefit from SDCH because currently CNN.com, similarly to the vast majority of internet web servers, does not support SDCH.
  • the SDCH protocol has been implemented in the Google ChromeTM web browser, as well as in Microsoft Internet ExplorerTM with the GoogleTM Toolbar installed.
  • the present invention seeks to provide a SDCH protocol implementation system wherein the SDCH dictionary and the dictionary references are added to HTTP responses by a server other than the web server hosting the web site.
  • the system acts as a proxy which accesses web servers using the standard HTTP 1.0/1.1 protocol, generates the dictionary, and provides the dictionary as well as the related compressed data to an SDCH supported browser.
  • This system removes the burden of dictionary generation and server side SDCH support from the web servers, while providing the benefits of this technology to SDCH enabled browsers.
  • the proxy operates in an intercepting and fully transparent mode leaving source and destination IP addresses intact, and does not interfere with the services and access controls provided to the client by the web server, which depend on requests from the client to the web server originating from the client's IP address.
  • Fig. 1 is a simplified flowchart indicating steps in the operation of a system implementing the Shared Dictionary Compression over HTTP protocol, constructed and operative in accordance with a preferred embodiment of the present invention.
  • the system of Fig. 1 includes page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page, compression decision functionality operative to utilize said page metrics to decide whether to compress the specific web page, feature vector generating functionality operative to generate a feature vector for the web page, web page comparison functionality operative to compare the feature vectors of the web page to feature vectors of other web pages and to thereby identify similarities between the web page and the other web pages, web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages, web page cluster assignment functionality operative to utilize said web page comparison functionality to assign said web page to a cluster of similar web pages, dictionary generating functionality operative to generate and assign a dictionary for each of the clusters of similar web pages, and web page compression functionality operative to compress the web page assigned to the cluster in compliance with the SDCH protocol.
  • the system resides on a proxy server which intercepts requests from web clients to web servers to download web pages.
  • the system continuously monitors incoming requests, and for each web page requested the system tracks the number of times the web page is requested during a predefined time duration and the size of the average payload downloaded in response to the requests.
  • a product of these two numbers is used by the system as a page metric to determine whether a requested page surpasses a predetermined threshold, making it a candidate for compression.
  • a request made by a client to a server to download a particular page is intercepted by the proxy server (100).
  • the page metric for the particular page is calculated (102), and the metric is used by the system to decide whether the particular page should be compressed (104). If the metric is below the predetermined threshold, the page is sent to the client uncompressed (106). If the metric is above the predetermined threshold, the page is marked as a candidate for compression.
  • a dictionary To compress the page, a dictionary must be chosen from multiple dictionaries available to the server.
  • One method of choosing a dictionary is to compress the page using all available dictionaries, and to choose the dictionary which provides the highest compression ratio for the page. In cases where a predefined minimum compression ratio is not achieved, a new dictionary is created.
  • compressing each page using all available dictionaries results in an 0(N 2 ) computation complexity, which uses computational resources inefficiently.
  • the SDCH compression mechanism is more efficient when a single dictionary is shared by multiple similar pages of a domain which share common elements. Therefore, when compressing a particular page, the proxy server identifies a cluster of similar pages in the domain and compresses the particular page using a dictionary corresponding to that cluster.
  • the present invention seeks to provide a more efficient compression mechanism by identifying similarities between pages using feature vectors.
  • a feature vector is an array of page characteristics which differentiate between groups of pages, such as statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords.
  • a feature vector is created for a page to be compressed by parsing and tokenizing the page (108), and is then compared to existing page clusters (110).
  • Clusters of similar pages are built using a similarity distance metric (e.g. Euclidian, Pearson).
  • a similarity distance metric e.g. Euclidian, Pearson.
  • Several algorithms may be used to compute the similarity distance metric, such as Hierarchical Clustering, K-Means and Fuzzy C-Means.
  • the present invention also uses various, heuristics (such as URL structure and page contents) to analyze the pages and to determine whether they have been created using common web content creation suites (for example, phpBB forum generating software). All pages sharing such commonalities are added to one cluster, obviating the need for any additional similarity related computations.
  • a dictionary comprises content which is shared among multiple pages in a cluster.
  • Content of a private nature e.g. content which is accessed only by an authorized client, is not be shared. Therefore, when comparing pages of a cluster to determine which content elements should comprise the dictionary, content of a private nature must be identified and disregarded by the comparison process.
  • the following content elements are marked as potentially including content of a private nature:
  • a session identifier such as a cookie or a session parameter in the URL, such as JSESSIONID.
  • the received responses are compared to previously recorded responses received when a session identifier was used. If the responses do not match, i.e. alternative content was received when using a session identifier, the elements are considered to be private and are disregarded by the comparison process.
  • the remaining pages in the cluster are then first compared by identifying shared HTML tags, and then by identifying shared tag- content using a generic differencing algorithm such as VCDiff (specified in RFC3284; http://www.rfc-editor.org/rfc/rfc3284;txt). Page elements which appear in a percentage of pages that is higher than a predefined percentage threshold are saved as the SDCH dictionary for the cluster.
  • VCDiff specified in RFC3284; http://www.rfc-editor.org/rfc/rfc3284;txt.
  • an SDCH dictionary may be applied to multiple pages, and a particular page may be compressed by multiple different dictionaries.
  • a page which has been marked as a candidate for compression is compared to existing page clusters (110) to determine whether the page shares any similarities with any of the existing clusters (112), and to thereby determine which dictionary is most relevant to be used for compressing the page. If the page is found not to share similarities with any existing clusters, or if the distance between the cluster found to be most similar to the page and the page, as calculated by the generic differencing algorithm, is greater than a predetermined threshold, the system creates a new cluster based on the page (114), and the page is sent uncompressed (116).
  • the system determines whether a dictionary exists for the cluster (118). If a dictionary for the cluster exists, the page is compressed using the dictionary (120), and the compressed page is sent (122). If a dictionary for the cluster does not exist and if the number of pages in the cluster exceeds a predetermined threshold (124), a dictionary is created for the cluster (126). If the number of pages in the cluster does not exceed the predetermined threshold, a dictionary is not created, and the page is sent uncompressed (128).
  • the software maintains a local cache of SDCH compressed pages in full compliance with the cache-related clauses of RFC2616 (http://www.rfc- editor.org/rfc/rfc2616.txt) between the web servers and the proxy, and between the proxy and the clients. It is appreciated that the system only compresses content targeted at SDCH enabled clients. SDCH enabled clients are identified by analyzing the Accept- Encoding header for presence of the "sdch" string.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A system for implementing the SDCH protocol, including page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a webpage, compression decision functionality operative to utilize page metrics to decide whether to compress the webpage, feature vector generating functionality operative to generate a feature vector for the webpage, webpage comparison functionality operative to compare the feature vector of the webpage to feature vectors of other webpages to identify similarities between the webpage and other webpages, webpage cluster generation functionality operative to utilize webpage comparison functionality to generate clusters of similar webpages, webpage cluster assignment functionality operative to utilize the webpage comparison functionality to assign the webpage to a cluster of similar webpages, dictionary generating functionality operative to generate and assign a dictionary for clusters of similar webpages, and webpage compression functionality operative to compress webpages assigned to the cluster using a dictionary assigned to the cluster.

Description

SHARED DICTIONARY COMPRESSION OVER HTTP PROXY
REFERENCE TO RELATED APPLICATIONS
Reference is made to U.S. Provisional Patent Application Serial No. 61/266,247, filed December 3, 2009 and entitled "SHARED DICTIONARY COMPRESSION OVER HTTP PROXY", the disclosure of which is hereby incorporated by reference and priority of which is hereby claimed pursuant to 37 CFR 1.78(a) (4) and (5)(i).
FIELD OF THE INVENTION
The present invention relates to web page compression techniques and in particular to Shared Dictionary Compression over HTTP (SDCH).
BACKGROUND OF THE INVENTION The following patent publications are believed to represent the current state of the art:
U.S. Patent Nos.: 5,991,713 and 6,665,837; and
"A Proposal for Shared Dictionary Compression over HTTP" by Jon Butler, Wei-Hsin Lee, Bryan McQuade and Kenneth Mixter at Google, Inc., September 8, 2008. SUMMARY OF THE INVENTION
The present invention provides a proxy server which implements the Shared Dictionary Compression over HTTP protocol.
There is thus provided in accordance with a preferred embodiment of the present invention a system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol, the system including page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page, compression decision functionality operative to utilize the page metrics to decide whether to compress the web page, feature vector generating functionality operative to generate a feature vector for the web page, web page comparison functionality operative to compare the feature vector of the web page to feature vectors of other web pages and to thereby identify similarities between the web page and the other web pages, web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages, web page cluster assignment functionality operative to utilize the web page comparison functionality to assign the web page to a cluster of similar web pages, dictionary generating functionality operative to generate and assign a dictionary for each of the clusters of similar web pages, and web page compression functionality operative to compress the web page assigned to the cluster using a dictionary assigned to the cluster.
In accordance with a preferred embodiment of the present invention, the system resides on a proxy server. Preferably, the proxy is operative to intercept requests from at least one web client to at least one web server to download at least one web page. Preferably, the page metrics include the number of times the specific web page has been requested during a predefined time duration. Additionally, the page metrics also include the size of the average payload downloaded in response to the request. Additionally, the page metrics also include a product of the number of times the specific web page has been requested during a predefined time duration and the size of the average payload downloaded in response to the request.
Preferably, the feature vector includes an array of page characteristics, the characteristics differentiating between groups of pages. Preferably, the characteristics include at least one of statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords.
Preferably, the feature vector generating functionality includes parsing functionality and tokenizing functionality. Preferably, the web page cluster generation functionality includes utilization of at least one similarity distance metric.
In accordance with a preferred embodiment of the present invention, the at least one similarity distance metric is a Euclidian metric. Alternatively, the at least one similarity distance metric is a Pearson metric. Preferably, calculation of at least one similarity distance metric is achieved by utilizing at least one of Hierarchical Clustering, K-Means and Fuzzy C-Means. Preferably, the similarities include similarities of web content creation suites used to create the web pages.
In accordance with a preferred embodiment of the present invention, the web page comparison functionality is operative to disregard private web pages. Preferably, the private pages include pages to which a client was redirected from an HTTPS connection as specified by a "Referer" request header. Additionally, the private pages include pages which are accessed after a POST request which resulted in a cookie or in appended request parameters. Additionally, the private pages include pages which include a "Cache-Control: private" header. Preferably, the web page comparison functionality is operative to identify the private web pages by accessing URLs associated with potentially private web pages without the use of a session identifier, and comparing received responses to previously recorded responses received when a session identifier was used.
Preferably, the web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages. Additionally, the web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages by utilizing a generic differencing algorithm. Preferably, the generic differencing algorithm is VCDiff.
In accordance with a preferred embodiment of the present invention, the dictionary generating functionality is operative to generate and assign a dictionary for each of the clusters which include at least a predetermined number of web pages. Preferably, the system is operative to store compressed web pages. Additionally, the system is operative to compress web pages targeted at SDCH enabled clients. Additionally, the system is operative to identify SDCH enabled clients by analyzing the Accept-Encoding header for presence of the "sdch" string.
BRIEF DESCRIPTION OF THE DRAWINGS
The present invention will be understood and appreciated more fully from the following detailed description, taken in conjunction with the drawings in which:
Fig. 1 is a simplified flowchart indicating steps in the operation of a system implementing the Shared Dictionary Compression over HTTP protocol, constructed and operative in accordance with a preferred embodiment of the present invention.
DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS
Many of today's World Wide Web sites are not optimized for use by clients having limited bandwidth capacity. While this may not be necessary for clients having broad-band ADSL or cable connectivity to the internet, bandwidth capacity considerations are an issue on capacity-sensitive communication links, such as UMTS, GPRS or satellite links. Additionally, the current situation in which existing bandwidth is used inefficiently is unfavorable for many mobile operators who provide internet access to their customers for a flat monthly fee, regardless of the amount of bandwidth actually utilized. The proliferation of high-end devices such as the Apple® iPhone and Android and Symbian based handsets is expected to generate an increase in mobile traffic, and to thus generate an increase in demand for bandwidth, and, consequently, for relevant hardware and available radio frequencies. Therefore, mobile operators are incentivized to utilize less bandwidth in order to cut costs and to maintain current levels of revenue.
As known to persons skilled in the art, a large portion of internet web pages are modified only slightly over even long periods of time. Therefore, caching techniques, such as HTTP caching, have evolved to reduce bandwidth usage by storing copies of downloaded items, thereby enabling subsequent HTTP requests for the same items to retrieve those items from a local cache instead of redundantly downloading them from the internet. However, HTTP caching stores complete web pages or objects, and is therefore ineffective in optimizing bandwidth usage for web sites for which only a small portion of a page is modified between subsequent downloads. For example, the CNN.com (http://www.cnn.com) homepage layout and color scheme are modified very rarely, while what is actually modified are the stories embedded in the page layout. This presents an opportunity to implement a mechanism for caching partial pages.
In order to reduce the size of downloaded payloads or pages, the HTTP/ 1.1 protocol supports response compression via the Accept-Encoding and Content-Encoding headers. The most commonly used HTTP response compression encoding is gzip, which compresses data that is repeated multiple times within a given response. However, the HTTP/ 1.1 protocol does not provide a mechanism for compressing data that is repeated over multiple responses.
Another class of encoding techniques, known as delta encoding, has proven to be effective in compressing inter-response data. Previous efforts to extend the HTTP/1.1 protocol to support delta encoding have focused on encoding an HTTP response as a delta, or a difference, between the current response and a previous version of the current response. One such approach is discussed in RFC3229 "Delta encoding in HTTP" (http://www.rfc-editor.org/rfc/rfc3229.txt). While RFC3229 is effective in reducing the size of the downloaded payload for many types of resources, it may not be suitable for certain classes of responses. Specifically, under RFC3229, deltas can only be applied to responses originating from the same URL. Additionally, the previously stored instance to which to apply the delta to recreate an entire response is identified using its Last-Modified timestamp or entity-tag. This makes RFC3229 unsuitable for compressing dynamically generated responses which are difficult to identify uniquely using entity tags or last modified timestamps, such as responses to a URL comprising varying query parameters (e.g. a search results page). Content hashes can be used to identify previously stored instances, however this may result in retrieving false positives. Additionally, storing all previous responses on the server may not be practical.
To address these shortcomings, Google Inc. has proposed a mechanism called Shared Dictionary Compression over HTTP (SDCH), which may be used to compress data more effectively and to reduce the volume of data which needs to be transmitted over the network.
A dictionary used by the SDCH protocol is a file downloaded by the client from the server which contains strings or elements that are likely to appear in subsequent HTTP responses. These elements can be stored in the dictionary which is available to both the client and to the server, and the server can substitute these elements with references to the dictionary, allowing the client to reconstruct the original page from these references. By substituting dictionary references for repeated elements in HTTP responses, the payload size can be reduced.
SDCH can only be beneficial when both the client (e.g. web browser) and the web server support the SDCH protocol. However, adding support for the SDCH protocol requires substantial investment in the content generating software as well as in content generating work flows, and therefore content providers are reluctant to do so. Therefore, for example, users browsing CNN.com with an SDCH enabled web browser cannot benefit from SDCH because currently CNN.com, similarly to the vast majority of internet web servers, does not support SDCH. The SDCH protocol has been implemented in the Google Chrome™ web browser, as well as in Microsoft Internet Explorer™ with the Google™ Toolbar installed.
The present invention seeks to provide a SDCH protocol implementation system wherein the SDCH dictionary and the dictionary references are added to HTTP responses by a server other than the web server hosting the web site. The system acts as a proxy which accesses web servers using the standard HTTP 1.0/1.1 protocol, generates the dictionary, and provides the dictionary as well as the related compressed data to an SDCH supported browser. This system removes the burden of dictionary generation and server side SDCH support from the web servers, while providing the benefits of this technology to SDCH enabled browsers. The proxy operates in an intercepting and fully transparent mode leaving source and destination IP addresses intact, and does not interfere with the services and access controls provided to the client by the web server, which depend on requests from the client to the web server originating from the client's IP address.
Reference is now made to Fig. 1, which is a simplified flowchart indicating steps in the operation of a system implementing the Shared Dictionary Compression over HTTP protocol, constructed and operative in accordance with a preferred embodiment of the present invention.
The system of Fig. 1 includes page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page, compression decision functionality operative to utilize said page metrics to decide whether to compress the specific web page, feature vector generating functionality operative to generate a feature vector for the web page, web page comparison functionality operative to compare the feature vectors of the web page to feature vectors of other web pages and to thereby identify similarities between the web page and the other web pages, web page cluster generation functionality operative to utilize the web page comparison functionality to generate clusters of similar web pages, web page cluster assignment functionality operative to utilize said web page comparison functionality to assign said web page to a cluster of similar web pages, dictionary generating functionality operative to generate and assign a dictionary for each of the clusters of similar web pages, and web page compression functionality operative to compress the web page assigned to the cluster in compliance with the SDCH protocol.
Preferably, the system resides on a proxy server which intercepts requests from web clients to web servers to download web pages. The system continuously monitors incoming requests, and for each web page requested the system tracks the number of times the web page is requested during a predefined time duration and the size of the average payload downloaded in response to the requests. A product of these two numbers is used by the system as a page metric to determine whether a requested page surpasses a predetermined threshold, making it a candidate for compression.
As seen in Fig. 1, a request made by a client to a server to download a particular page is intercepted by the proxy server (100). The page metric for the particular page is calculated (102), and the metric is used by the system to decide whether the particular page should be compressed (104). If the metric is below the predetermined threshold, the page is sent to the client uncompressed (106). If the metric is above the predetermined threshold, the page is marked as a candidate for compression.
To compress the page, a dictionary must be chosen from multiple dictionaries available to the server. One method of choosing a dictionary is to compress the page using all available dictionaries, and to choose the dictionary which provides the highest compression ratio for the page. In cases where a predefined minimum compression ratio is not achieved, a new dictionary is created. However, compressing each page using all available dictionaries results in an 0(N2) computation complexity, which uses computational resources inefficiently.
As known to persons skilled in the art, the SDCH compression mechanism is more efficient when a single dictionary is shared by multiple similar pages of a domain which share common elements. Therefore, when compressing a particular page, the proxy server identifies a cluster of similar pages in the domain and compresses the particular page using a dictionary corresponding to that cluster. The present invention seeks to provide a more efficient compression mechanism by identifying similarities between pages using feature vectors. A feature vector is an array of page characteristics which differentiate between groups of pages, such as statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords. A feature vector is created for a page to be compressed by parsing and tokenizing the page (108), and is then compared to existing page clusters (110).
Clusters of similar pages are built using a similarity distance metric (e.g. Euclidian, Pearson). Several algorithms may be used to compute the similarity distance metric, such as Hierarchical Clustering, K-Means and Fuzzy C-Means. The present invention also uses various, heuristics (such as URL structure and page contents) to analyze the pages and to determine whether they have been created using common web content creation suites (for example, phpBB forum generating software). All pages sharing such commonalities are added to one cluster, obviating the need for any additional similarity related computations.
As mentioned hereinabove, a dictionary comprises content which is shared among multiple pages in a cluster. Content of a private nature, e.g. content which is accessed only by an authorized client, is not be shared. Therefore, when comparing pages of a cluster to determine which content elements should comprise the dictionary, content of a private nature must be identified and disregarded by the comparison process. The following content elements are marked as potentially including content of a private nature:
pages to which a client was redirected from an HTTPS connection as specified by a "Referer" request header, as well as any following pages;
pages which are accessed after a POST request which resulted in a cookie or in appended request parameters; and
pages which include a "Cache-Control: private" header.
To determine whether the aforementioned elements are indeed private, their associated URLs are accessed by the system without the use of a session identifier (such as a cookie or a session parameter in the URL, such as JSESSIONID). The received responses are compared to previously recorded responses received when a session identifier was used. If the responses do not match, i.e. alternative content was received when using a session identifier, the elements are considered to be private and are disregarded by the comparison process. The remaining pages in the cluster are then first compared by identifying shared HTML tags, and then by identifying shared tag- content using a generic differencing algorithm such as VCDiff (specified in RFC3284; http://www.rfc-editor.org/rfc/rfc3284;txt). Page elements which appear in a percentage of pages that is higher than a predefined percentage threshold are saved as the SDCH dictionary for the cluster.
It is appreciated that an SDCH dictionary may be applied to multiple pages, and a particular page may be compressed by multiple different dictionaries.
Returning now to Fig. 1, a page which has been marked as a candidate for compression is compared to existing page clusters (110) to determine whether the page shares any similarities with any of the existing clusters (112), and to thereby determine which dictionary is most relevant to be used for compressing the page. If the page is found not to share similarities with any existing clusters, or if the distance between the cluster found to be most similar to the page and the page, as calculated by the generic differencing algorithm, is greater than a predetermined threshold, the system creates a new cluster based on the page (114), and the page is sent uncompressed (116).
If a cluster sufficiently similar to the page is found, the system determines whether a dictionary exists for the cluster (118). If a dictionary for the cluster exists, the page is compressed using the dictionary (120), and the compressed page is sent (122). If a dictionary for the cluster does not exist and if the number of pages in the cluster exceeds a predetermined threshold (124), a dictionary is created for the cluster (126). If the number of pages in the cluster does not exceed the predetermined threshold, a dictionary is not created, and the page is sent uncompressed (128).
The software maintains a local cache of SDCH compressed pages in full compliance with the cache-related clauses of RFC2616 (http://www.rfc- editor.org/rfc/rfc2616.txt) between the web servers and the proxy, and between the proxy and the clients. It is appreciated that the system only compresses content targeted at SDCH enabled clients. SDCH enabled clients are identified by analyzing the Accept- Encoding header for presence of the "sdch" string.
It will be appreciated by persons skilled in the art that the present invention is not limited by what has been particularly shown and described hereinabove. Rather the scope of the present invention includes both combinations and subcombinations of the various features described hereinabove as well as modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not in the prior art.

Claims

C L A I M S
1. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol, said system comprising:
page metrics calculating and aggregating functionality operative to calculate and aggregate page metrics of a web page;
compression decision functionality operative to utilize said page metrics to decide whether to compress said web page;
feature vector generating functionality, operative to generate a feature vector for said web page;
web page comparison functionality operative to compare said feature vector of said web page to feature vectors of other web pages and to thereby identify similarities between said web page and said other web pages;
web page cluster generation functionality operative to utilize said web page comparison functionality to generate clusters of similar web pages;
web page cluster assignment functionality operative to utilize said web page comparison functionality to assign said web page to a cluster of similar web pages;
dictionary generating functionality operative to generate and assign a dictionary for each of said clusters of similar web pages; and
web page compression functionality operative to compress said web page assigned to said cluster using a dictionary assigned to said cluster.
2. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 1 and wherein said system resides on a proxy server.
3. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 2 and wherein said proxy is operative to intercept requests from at least one web client to at least one web server to download at least one web page.
4. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 3 and wherein said page metrics include the number of times said specific web page has been requested during a predefined time duration.
5. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 4 and wherein said page metrics also include the size of the average payload downloaded in response to said request.
6. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 5 and wherein said page metrics also include a product of said number of times said specific web page has been requested during a predefined time duration and said size of the average payload downloaded in response to said request.
7. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 1 and wherein said feature vector includes an array of page characteristics, said characteristics differentiating between groups of pages.
8 A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 7 and wherein said characteristics include at least one of statistical analysis of HTML tags, DOM tree, CSS Class names, HTML tag IDs, page size, elements size, links to external resources and page keywords.
9. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 1 and wherein said feature vector generating functionality includes parsing functionality and tokenizing functionality.
10. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 1 and wherein said web page cluster generation functionality includes utilization of at least one similarity distance metric.
11. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 10 and wherein said at least one similarity distance metric is a Euclidian metric.
12. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 10 and wherein said at least one similarity distance metric is a Pearson metric.
13. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 10 and wherein calculation of at least one similarity distance metric is achieved by utilizing at least one of Hierarchical Clustering, K-Means and Fuzzy C-Means.
14. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 1 and wherein said similarities include similarities of web content creation suites used to create said web pages.
15. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 3 and wherein said web page comparison functionality is operative to disregard private web pages.
16. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 15 and wherein said private pages include pages to which a client was redirected from an HTTPS connection as specified by a "Referer" request header.
17. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 15 and wherein said private pages include pages which are accessed after a POST request which resulted in a cookie or in appended request parameters.
18. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 15 and wherein said private pages include pages which include a "Cache-Control: private" header.
19. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 15 and wherein said web page comparison functionality is operative to identify said private web pages by accessing URLs associated with potentially private web pages without the use of a session identifier, and comparing received responses to previously recorded responses received when a session identifier was used.
20. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 1 and wherein said web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages.
21. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 1 and wherein said web page comparison functionality is operative to compare content of HTML tags which are shared amongst at least two pages by utilizing a generic differencing algorithm.
22. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 21 and wherein said generic differencing algorithm is VCDiff.
23. A system for implementing the Shared Dictionary Compression over HTTP (SDCH) protocol according to claim 1 and wherein said dictionary generating functionality is operative to generate and assign a dictionary for each of said clusters which include at least a predetermined number of web pages.
24. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 1 and wherein said system is operative to store compressed web pages.
25. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 1 and wherein said system is operative to compress web pages targeted at SDCH enabled clients.
26. A system for implementing the Shared Dictionary Compression over
HTTP (SDCH) protocol according to claim 25 and wherein said system is operative to identify SDCH enabled clients by analyzing the Accept-Encoding header for presence of the "sdch" string.
PCT/IL2010/001023 2009-12-03 2010-12-02 Shared dictionary compression over http proxy Ceased WO2011067769A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US26624709P 2009-12-03 2009-12-03
US61/266,247 2009-12-03

Publications (1)

Publication Number Publication Date
WO2011067769A1 true WO2011067769A1 (en) 2011-06-09

Family

ID=44114661

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2010/001023 Ceased WO2011067769A1 (en) 2009-12-03 2010-12-02 Shared dictionary compression over http proxy

Country Status (1)

Country Link
WO (1) WO2011067769A1 (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130114626A1 (en) * 2011-11-08 2013-05-09 Canon Kabushiki Kaisha Methods and network devices for communicating data packets
WO2013079999A1 (en) * 2011-12-02 2013-06-06 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
US20150089052A1 (en) * 2012-05-04 2015-03-26 Qun Yang Lin Context-Aware HTTP Compression
US9973597B1 (en) 2014-12-10 2018-05-15 Amazon Technologies, Inc. Differential dictionary compression of network-accessible content
CN113194430A (en) * 2021-04-28 2021-07-30 杭州电力设备制造有限公司 Switch cabinet sensor network data compression method based on periodic transmission model

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7058843B2 (en) * 2001-01-16 2006-06-06 Infonet Services Corporation Method and apparatus for computer network analysis
US7246306B2 (en) * 2002-06-21 2007-07-17 Microsoft Corporation Web information presentation structure for web page authoring
US7343626B1 (en) * 2002-11-12 2008-03-11 Microsoft Corporation Automated detection of cross site scripting vulnerabilities
US20090064193A1 (en) * 2007-09-05 2009-03-05 Yahoo! Inc. Distributed Network Processing System including Selective Event Logging

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7058843B2 (en) * 2001-01-16 2006-06-06 Infonet Services Corporation Method and apparatus for computer network analysis
US7246306B2 (en) * 2002-06-21 2007-07-17 Microsoft Corporation Web information presentation structure for web page authoring
US7343626B1 (en) * 2002-11-12 2008-03-11 Microsoft Corporation Automated detection of cross site scripting vulnerabilities
US20090064193A1 (en) * 2007-09-05 2009-03-05 Yahoo! Inc. Distributed Network Processing System including Selective Event Logging

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BUTLER ET AL.: "A Proposal for Shared Dictionary Compression over HTTP.", 8 September 2008 (2008-09-08), pages 1 - 5, 7, 9, 11, 14-16, Retrieved from the Internet <URL:http://old.nabble.com/attachmenU19381291/0/Shared_Dictionary_Compression_over_HTTP.pdf> [retrieved on 20110218] *
STREHL ET AL.: "Impact of Similarity Measures on Web-page Clustering", AAAI-2000: WORKSHOP OF ARTIFICIAL INTELLIGENCE FOR WEB SEARCH, July 2000 (2000-07-01), pages 1 - 2, Retrieved from the Internet <URL:http://www.ideal.ece.utexas.edu/papers/strehl_aaai00.pdf> [retrieved on 20110219] *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130114626A1 (en) * 2011-11-08 2013-05-09 Canon Kabushiki Kaisha Methods and network devices for communicating data packets
US9338258B2 (en) * 2011-11-08 2016-05-10 Canon Kabushiki Kaisha Methods and network devices for communicating data packets
WO2013079999A1 (en) * 2011-12-02 2013-06-06 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
WO2013079277A1 (en) * 2011-12-02 2013-06-06 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
US10051090B2 (en) 2011-12-02 2018-08-14 Canon Kabushiki Kaisha Methods and devices for encoding and decoding messages
US20150089052A1 (en) * 2012-05-04 2015-03-26 Qun Yang Lin Context-Aware HTTP Compression
US9973597B1 (en) 2014-12-10 2018-05-15 Amazon Technologies, Inc. Differential dictionary compression of network-accessible content
CN113194430A (en) * 2021-04-28 2021-07-30 杭州电力设备制造有限公司 Switch cabinet sensor network data compression method based on periodic transmission model

Similar Documents

Publication Publication Date Title
US7603483B2 (en) Method and system for class-based management of dynamic content in a networked environment
EP1886472B1 (en) Method for multipart encoding
Palpanas et al. Web prefetching using partial match prediction
US10686726B2 (en) Method for optimizing resource loading at mobile browsers based on cloud-client cooperation
WO2013143363A1 (en) A method and apparatus for data storage and downloading
US6532492B1 (en) Methods, systems and computer program products for cache management using admittance control
US20180302489A1 (en) Architecture for proactively providing bundled content items to client devices
WO2011067769A1 (en) Shared dictionary compression over http proxy
Chen et al. Popularity-based PPM: An effective web prefetching technique for high accuracy and low storage
Psounis Class-based delta-encoding: A scalable scheme for caching dynamic web content
CN107682281A (en) A kind of SDN switch and the application management method of SDN switch
CN103167554B (en) Gateway flow constriction processing method and device, network service providing system
Savant et al. Server-friendly delta compression for efficient web access
Sow et al. Prefetching based on web usage mining
Chen et al. Coordinated data prefetching for web contents
Canali et al. A two-level distributed architecture for efficient Web content adaptation and delivery
WO2003083612A2 (en) System and method for optimizing internet applications
Lindemann et al. Evaluating hardware and software web proxy caching solutions
Jain Optimizing web server performance using data mining techniques
Chang et al. Caching personalised and database-related dynamic web pages
Deng et al. A review of network latency optimization techniques
BROWSER et al. INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
Naaman et al. Evaluation of delivery techniques for dynamic Web content (extended version)
González-Cañete et al. A content-type based evaluation of web Cache replacement policies
Pons Enhancement of Web object speculative retrieval

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10834310

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10834310

Country of ref document: EP

Kind code of ref document: A1