WO2010119262A2

WO2010119262A2 - Apparatus and method for generating advertisements

Info

Publication number: WO2010119262A2
Application number: PCT/GB2010/000771
Authority: WO
Inventors: Rajesh Pampapathi; Michael Oxley; Kevin Keenoy; Boris Galitsky
Original assignee: CONTEXTURED Ltd
Current assignee: CONTEXTURED Ltd
Priority date: 2009-04-17
Filing date: 2010-04-16
Publication date: 2010-10-21
Anticipated expiration: 2011-10-17
Also published as: WO2010119262A3; GB0906639D0

Abstract

An apparatus and method for automatically generating advertisements, the apparatus comprising: a pre-processor module configured to identify and extract main content from an input document, the main content comprising information for generating advertisements; an extraction module configured to identify advertising components from the main content; and an advertisement generator configured to compile at least one advertisement, each advertisement comprising a template retrieved from a template data storage device and at least one identified advertising component.

Description

APPARATUS AND METHOD FOR GENERATING ADVERTISEMENTS

TECHNICAL FIELD

This invention relates to an apparatus and method for automatically generating advertisements, in particular from a webpage, and more particularly for search engine marketing.

BACKGROUND

Advertisements are conventionally written by advertisement writers. A webpage may require an advertisement for each of a plurality of products/services made available on the webpage. Therefore, at least one advertisement needs to be written for each product/service, resulting in numerous advertisements. This can be a very time consuming and expensive process.

It is therefore desirable to provide an automated process for generating advertisements. However, the content of an advertisement is required to reflect the content of the webpage and if the webpage is general (e.g. a homepage), then the advertisement may be required to reflect the content of a collection of underlying webpages. Consequently, it has proved difficult to automatically generate meaningful advertisements.

The aim of this invention is to generate a plurality of advertisements, quickly and efficiently, which effectively summarize the product and/or service that is offered on a webpage.

SUMMARY

In one embodiment of the invention, an apparatus for automatically generating advertisements is provided. The apparatus comprising: a pre-processor module configured to identify and extract main content from an input document, the main content comprising information for generating advertisements; an extraction module configured to identify advertising components from the main content; and an advertisement generator configured to compile at least one advertisement, each advertisement comprising a template retrieved from a template data storage device and at least one identified advertising component.

In another embodiment of the invention, the pre-processor module is further configured to: divide the document into a plurality of sections; and one or more of the following: perform size based analyses on each section of the plurality of sections; perform position based analyses on each section of the plurality of sections; perform linguistic quality based analyses on text of each section of the plurality of sections; and identify the main content taking into account at least one of the said analyses.

In another embodiment of the invention, the step of dividing the document into the plurality of sections comprises: providing the document as a tree structure defining a logical structure of the document; and dividing the document into the sections, each section comprising all text containing nodes at a same relative position in the tree structure.

In another embodiment of the invention, the tree structure comprises a document object model (DOM) tree structure.

In another embodiment of the invention, the step of performing size based analyses on each section, comprises: determining a size of each section; and ranking the sections based on the determined size, the section having the largest determined size being ranked first.

In another embodiment of the invention, the size comprises: a number of characters in each section; a number of words in each section; a number of sentences in each section; or a number of paragraphs in each section.

In another embodiment of the invention, the step of performing position based analyses on each section, comprises: ranking the sections based on a position of each section in the document, the section appearing first in the document being ranked first.

In another embodiment of the invention, the step of determining a quality score for the text of each section comprises: determining a quality score with reference to predetermined linguistic statistics.

In another embodiment of the invention, the step of determining the quality score for the text of each section comprises: assigning a score to each sentence of each section; assigning a score to each section based on a sum of the score assigned to each sentence; and ranking the sections based on the section scores.

In another embodiment of the invention, the section with the highest score is considered to be the main content.

In another embodiment of the invention, either the pre-processor module or the extraction module is further configured to: parse the main content in order to identify a structure of the main content; and wherein the extraction module is further configured to: identify the advertising components from the main content taking into account the structure of the main content.

In another embodiment of the invention, either the pre-processor module or the extraction module is further configured to: derive relationships between text of the main content, and wherein the extraction module is further configured to: identify the advertising components from the main content taking into account the relationships between the text of the main content.

In another embodiment of the invention, either the pre-processor module or the extraction module is further configured to: determine linguistic statistics of the main content, and wherein the extraction module is further configured to: identify the advertising components from the main content taking into account the linguistic statistics of the main content.

In another embodiment of the invention, the apparatus further comprises: an advertising component storage device comprising additional advertising components, and wherein the advertisement generator is further configured to: compile the at least one advertisement using additional advertising components selected using at least one identified advertising component.

In another embodiment of the invention, the additional advertising component is substitutable for the at least one identified advertising component.

In another embodiment of the invention, the apparatus further comprises: an advertising component extraction device configured to obtain additional advertising components from at least one other document, the additional advertising components comprising a relationship with at least one identified advertising component.

In another embodiment of the invention, the at least one other document comprises a webpage.

In another embodiment of the invention, the apparatus further comprises: a linguistics storage device comprising linguistic data, and wherein the advertisement generator is further configured to: compile the at least one advertisement using linguistic data, such that the at least one advertisement is grammatically correct.

In another embodiment of the invention, the apparatus further comprises: an abbreviation module for reducing the size of the at least one advertisement to a pre-determined size.

In another embodiment of the invention, the pre-determined size comprises a predetermined number of characters.

In another embodiment of the invention, reducing the size of the at least one advertisement comprises one or more of: removing non-essential advertising components; replacing words with acronyms or abbreviations; replacing words with shorter words having a similar meaning.

In another embodiment of the invention, each advertising component comprises at least one word.

In another embodiment of the invention, each advertising component comprises at least one of: an entity phrase; a sentiment phrase; a sentiment phrase with entity; a description phrase; a description phrase with entity; an imperative phrase without entity; an imperative phrase with entity; an interrogative phrase; an interrogative phrase with entity; a price phrase; a delivery phrase; a stock information and return information phrase.

In another embodiment of the invention, an entity comprises a product or a service.

In another embodiment of the invention, the template storage device comprises a plurality of templates.

In another embodiment of the invention, the at least one advertisement comprises a link to the input document.

In another embodiment of the invention, the input document comprises a webpage.

In another embodiment of the invention, the apparatus compiles a plurality of advertisements, and the apparatus further comprises: an ordering module configured to order the plurality of advertisements based on advertisement quality analyses.

In one embodiment of the invention, a method of controlling a computer apparatus to automatically generate advertisements is provided. The method comprising: identifying and extracting main content from an input document, the main content comprising information for generating advertisements; identifying advertising components from the main content; and compiling at least one advertisement, each advertisement comprising a template retrieved from a template storage device and at least one identified advertising component.

In another embodiment of the invention, the step of identifying and extracting the main content comprises: dividing the document into a plurality of sections; and one or more of the following: performing size based analyses on each section of the plurality of sections; performing position based analyses on each section of the plurality of sections; performing linguistic quality based analyses on text of each section of the plurality of sections; and identifying the main content taking into account at least one of the said analyses.

In another embodiment of the invention, the step of performing position based analyses on each section, comprises: ranking the sections based on a position of each section in the document, the section appearing first in the document being ranked first.

In another embodiment of the invention, following the step of identifying and extracting the main content, the method further comprises: parsing the main content in order to identify a structure of the main content; and wherein the step of identifying the advertising components comprises: identifying the advertising components from the main content taking into account the structure of the main content.

In another embodiment of the invention, following the step of identifying and extracting the main content, the method further comprises: deriving relationships between text of the main content; and wherein the step of identifying the advertising components comprises: identifying the advertising components from the main content taking into account the relationships between the text of the main content.

In another embodiment of the invention, following the step of identifying and extracting the main content, the method further comprises: determining linguistic statistics of the main content; and wherein the step of identifying the advertising components comprises: identifying the advertising components from the main content taking into account the linguistic statistics of the main content.

In another embodiment of the invention, the method further comprises: compiling the at least one advertisement using additional advertising components retrieved from an advertising component storage device, the additional advertising components selected using the at least one identified advertising component.

In another embodiment of the invention, the method further comprises: compiling the at least one advertisement using additional advertising components retrieved from at least one other document, the additional advertising components comprising a relationship with the at least one identified advertising component.

In another embodiment of the invention, the method further comprises: compiling the at least one advertisement using linguistic data retrieved from a linguistics storage device, such that the at least one advertisement is grammatically correct.

In another embodiment of the invention, the method further comprises: reducing the size of the at least one advertisement to a pre-determined size.

In another embodiment of the invention, the step of reducing the size of the at least one advertisement comprises one or more of: removing non-essential advertising components; replacing words with acronyms or abbreviations; replacing words with shorter words having a similar meaning.

In another embodiment of the invention, a plurality of advertisements are compiled, and the method further comprises: ordering the plurality of advertisements based on advertisement quality analyses.

In one embodiment of the invention, an apparatus for identifying a main content of a document is provided. The apparatus comprising a processor module configured to: divide the document into a plurality of sections; and one or more of the following: perform size based analyses on each section of the plurality of sections; perform position based analyses on each section of the plurality of sections; perform linguistic quality based analyses on text of each section of the plurality of sections; and identify the main content taking into account at least one of the said analyses.

In another embodiment of the invention, the step of dividing the document into the plurality of sections comprises: providing the document as a tree structure defining a logical structure of the document; and dividing the document into the sections, each section comprising all text containing nodes at a same relative position in the tree structure.

In another embodiment of the invention, the tree structure comprises a document object model (DOM) tree structure.

In another embodiment of the invention, the size comprises: a number of characters in each section; a number of words in each section; a number of sentences in each section; or a number of paragraphs in each section.

In another embodiment of the invention, the input document comprises a webpage.

In one embodiment of the invention, a method of controlling a computer apparatus for identifying a main content of an input document is provided. The method comprising: dividing the document into a plurality of sections; and one or more of the following: performing size based analyses on each section of the plurality of sections; performing position based analyses on each section of the plurality of sections; performing linguistic quality based analyses on text of each section of the plurality of sections; and identifying the main content taking into account at least one of the said analyses.

In another embodiment of the invention, the step of determining the quality score for the text of each section comprises: assigning a score to each sentence of each section; assigning a score to each section based on a sum of the score assigned to each sentence; and ranking the sections based on the section scores.

In another embodiment of the invention, the section with the highest score is considered to be the main content.

In one embodiment of the invention, an apparatus for reducing a size of an automatically generated advertisement comprising at least one advertising component to a predetermined size is provided. The apparatus comprising: an abbreviation module configured to remove non-essential advertising components; and/or replace advertising components with acronyms or abbreviations; and/or replace advertising components with shorter advertising components having a similar meaning.

In another embodiment of the invention, the non-essential advertising components comprise: sentiment phrases; description phrases; imperative phrases without entity; interrogative phrases; price phrases; delivery phrases; stock information and returns information phrase.

In another embodiment of the invention, the pre-determined size comprises a predetermined number of characters.

In one embodiment of the invention, a method of controlling a computer apparatus for reducing a size of an automatically generated advertisement comprising at least one advertising component to a pre-determined size is provided. The method comprising: removing non-essential advertising components; and/or replacing advertising components with acronyms or abbreviations; and/or replacing advertising components with shorter advertising components having a similar meaning.

In one embodiment of the invention, an apparatus for automatically generating a summary of a document is provided. The apparatus comprising: a pre-processor module configured to identify and extract main content from an input document, the main content comprising information for generating the summary; an extraction module configured to identify summary components from the main content; and an advertisement generator configured to compile at least one summary, each summary comprising a template retrieved from a template data storage device and at least one identified summary component.

In one embodiment of the invention, a method of controlling a computer apparatus to automatically generate a summary of a document is provided. The method comprising: identifying and extracting main content from an input document, the main content comprising information for generating the summary; identifying summary components from the main content; and compiling at least one summary, each summary comprising a template retrieved from a template storage device and at least one identified summary component.

In one embodiment of the invention, a computer program product comprising programme code means for performing the methods described above is provided.

In one embodiment of the invention, a computer readable medium recorded with computer readable code arranged to cause a computer to perform the methods described above is provided.

In one embodiment of the invention, a computer programme code means for performing the methods described above is provided.

DESCRIPTION OF THE DRAWINGS

For a better understanding of the invention and to show how the same may be carried into effect, reference will now be made by way of example to the accompanying drawings:

Figure 1 illustrates schematically a process for generating advertisements;

Figure 2 illustrates schematically a process for identifying and extracting main content from a webpage;

Figure 3 illustrates schematically an apparatus for generating advertisements;

Figure 4 illustrates schematically a parse tree;

Figure 5 illustrates schematically a process for extracting advertisement components from main content;

Figure 6 illustrates schematically another process for extracting advertisement components from main content; and

Figure 7 illustrates schematically an advertisements generator.

DETAILED DESCRIPTION

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings.

Figure 1 illustrates schematically a process for automatically generating advertisements for a webpage(s), referred to throughout this document as the "base" webpage(s). A base webpage, or a collection of webpages, within a base website is identified for which advertisements are to be generated. Normally, an owner of the base webpage/website will request generation of advertisements for the products and/or services offered on the base webpage. The generated advertisements include a link to the base webpage.

The advertisements may be generated using content from the base webpage, for example using information provided on the base webpage regarding the offered products and/or services. The advertisements may also be generated using content from the base webpage as well as from the other sources, such as other webpages and databases, if appropriate content is not available at the base webpage.

The generated advertisements may reflect the style of the base webpage. Alternatively, the generated advertisements may reflect the style of the webpage on which they are to be displayed, referred to throughout this document as the "distribution" webpage. For example, the base webpage may be an English language webpage, whereas the distribution webpage may be a French language webpage; in this instance, the advertisements may be generated in French, to reflect the style of the distribution webpage. In another example, the base webpage may be a conventional retail webpage, whereas the distribution webpage may be a networking webpage popular with teenagers; in this instance, the advertisements may be generated using words and/or phrases popular with teenagers, in order to reflect the style of the distribution webpage.

Each generated advertisement summarises a product or service that is offered on the base webpage. In one example, the generated advertisement may comprise the product or service; features of the product or service; positive sentiment(s) and a call to action, which are extracted from the base webpage and/or other sources. "Positive sentiment" may be considered to be positive terms describing the product or services. "Calls to action" may be considered to be terms such as "buy now" and "apply now" etc.

Figure 1 illustrates schematically a process for generating advertisements. A webpage 10 or text document 20 is analysed and the main content 30 of the webpage 10 or text document 20 is extracted at step 110. The main content 30 is considered to be the part of the webpage 10/text document 20 which contains information relevant for generating advertisements. For example, many webpages comprise text, such as in the headers, footers, navigation tool bars, menus, etc., which contains no information useful for generating advertisements. Furthermore, webpages may have information such as contact details, terms and conditions, disclaimers, security information, etc., all of which contain no information useful for generating advertisements. Such information is not considered the main content of the webpage. Therefore, it is advantageous to be able to distinguish between the main content 30 of the webpage and the other information provided on the webpage, and then to extract the main content 30 at step 110 and disregard the other information.

Following identification of the main content 30, components for generating the advertisements are then extracted from the main content 30 at step 120. Components are considered to be the product or service (entity); features of the entity; positive sentiments, calls to action, price information, delivery information, availability etc. This list is not exhaustive and any information which is deemed useful for generating advertisements may be considered a component and extracted (if available) at step 120 from the webpage 10 or text document 20. A component may be a single word, or a phrase.

As stated above, the advertisements may be generated using components extracted from the content 30 of the base webpage 10 as well as from the other sources, such as other webpages and databases. For example, if the base webpage 10 does not have a very detailed description of the entity then it is possible to obtain this information at step 130 from other webpages. In addition, if the base webpage 10 describes the entity as "great" (the positive sentiment), an advertisement may be generated using the positive sentiment "fantastic" retrieved from a database comprising equivalent positive sentiments at step 130. Equivalent positive sentiments may be compound, e.g. "cost effective" may be swapped with "efficient". Many components of an advertisement, apart from the entity, price and delivery information, can be obtained from other sources. In some instances, step 130 may not be required.
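The substitution of equivalent positive sentiments at step 130 can be illustrated with a minimal sketch. The lookup table, its entries beyond the "great"/"fantastic" and "cost effective"/"efficient" pairs mentioned above, and the function names are hypothetical and serve only to show the idea; an actual database of equivalent sentiments may be organised quite differently.

```python
# Hypothetical lookup table of interchangeable positive sentiments.
# Only "great" -> "fantastic" and "cost effective" -> "efficient" come
# from the description; "excellent" is an illustrative extra entry.
EQUIVALENT_SENTIMENTS = {
    "great": ["fantastic", "excellent"],
    "cost effective": ["efficient"],
}


def sentiment_variants(phrase):
    """Return the phrase itself followed by any known equivalents."""
    return [phrase] + EQUIVALENT_SENTIMENTS.get(phrase.lower(), [])


def headline_variants(sentiment, entity):
    """Build one candidate headline per sentiment variant."""
    return [
        f"{variant.capitalize()} {entity}"
        for variant in sentiment_variants(sentiment)
    ]


if __name__ == "__main__":
    for headline in headline_variants("great", "loan deals"):
        print(headline)
```

A sentiment with no entry in the table is simply returned unchanged, which corresponds to the case in which step 130 is not required.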

Below is an example of extracted main content 30 of a webpage.

At Barclays we believe in great loan deals, that's why we offer 9.9% APR typical on our loans of £7,500 to £25,000**. It's also why we pledge to pay the difference if you're offered a better deal elsewhere.

What you get with a personal loan from Barclays:

• An instant decision if you're an Online Banking customer and get your money in 3 hours, if accepted†

• Our price guarantee: if you're offered a better deal elsewhere we'll pledge to pay you the difference between loan repayments***

• Apply to borrow up to £25,000

• No fees for arrangement or set up

• Fixed monthly payments, so you know where you are

• Optional tailored Payment Protection Insurance.

The bold text indicates the components of the content 30 which are extracted. However, alternative text could be extracted as the components, such as the phrases "No fees for arrangement" and "Fixed monthly payments" etc.

The entity is the "loan"; features of the entity are "9.9% APR typical", "£7,500 to £25,000", and "get your money in 3 hours", the positive sentiment is "great" loans and the call to action is "Apply to borrow up to £25,000".

The following are component types which are identified and extracted for potential use in compiling an advertisement:

• Entity Phrases - represent the name of the entity. The name of the entity may be multi-word, for example, "loan", "Barclay's loan", "Barclay's loan deal", "special loan deal", "limited offer loan";

• Sentiment Phrases - express sentiment, for example, "Excellent", "highly reliable", "extremely clear", "Great style and outstanding performance";

• Sentiment Phrases with Entity - noun phrases which contain a direct reference to the entity and include positive sentiment adjective phrases, for example "The highly reliable Acer Aspire", "The comfortable and stylish Nike trainers";

• Description Phrases - describe features of an entity using non-sentiment terms without mentioning the entity, for example, "is made to work as hard as you do", "redefines mobile connectivity";

• Description Phrases which explicitly mention the Entity - noun phrases in which an entity phrase occurs, but no sentiment phrases, for example: "Black size 10 Nike Shox trainers", "pure wool single breasted plain suit";

• Imperative Phrases which do not mention the Entity - for example: "Buy now!", "Order now!", "Order now for next day delivery";

• Imperative Phrases which do mention the Entity - for example: "Buy your Nike Shox trainers here!", "Get all car accessories";

• Interrogative Phrases - for example, "Need a break?", "Want to lose weight?";

• Interrogative Phrases which mention the Entity - for example, "Need a holiday?", "Want a new BMW car?";

• Price Phrases - mention money or price, for example, "£295", "Only £295", "Reduced to £399", "On Sale Now", "Reduced to clear", "money back guarantee";

• Delivery Phrases - for example, "24h shipping", "Next day delivery promise";

• Stock Information and Returns Information Phrase - for example, "In stock now", "10 in stock".

The referent of each phrase should be resolved if possible. Typically, sentences refer to a subject either directly or indirectly (via a pronoun such as "it"). In the case of indirect reference, the referent should be resolved in order to establish if it is the entity. It is expected to find sentences which refer to the entity both directly and indirectly, and other sentences which do not refer to the entity at all - either directly or indirectly.

The components which are identified and extracted for potential use in compiling an advertisement may also be scored. A phrase which directly mentions the entity is scored highly; a phrase which refers using a pronoun which can be resolved to the entity is scored less highly; a phrase which refers using a pronoun which cannot be resolved to the entity is scored lower still; and a phrase which refers directly to something other than the entity is given an extremely low score and may not be included in the final advertisements.
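The scoring of phrases by how they refer to the entity can be sketched as follows. Only the ordering of the four cases is taken from the description; the numeric weights, the cut-off value and all names are assumptions made for illustration, and the sketch assumes that referent resolution has already been performed and its outcome is supplied with each phrase.

```python
from enum import Enum


class Reference(Enum):
    DIRECT_ENTITY = "direct_entity"            # phrase names the entity
    RESOLVED_PRONOUN = "resolved_pronoun"      # pronoun resolved to the entity
    UNRESOLVED_PRONOUN = "unresolved_pronoun"  # pronoun that cannot be resolved
    OTHER_SUBJECT = "other_subject"            # names something else entirely


# Assumed numeric weights; only their relative order comes from the text.
SCORES = {
    Reference.DIRECT_ENTITY: 1.0,
    Reference.RESOLVED_PRONOUN: 0.7,
    Reference.UNRESOLVED_PRONOUN: 0.3,
    Reference.OTHER_SUBJECT: 0.0,
}

MIN_SCORE = 0.1  # assumed cut-off below which a phrase is left out


def rank_phrases(phrases):
    """Take (text, Reference) pairs and return the kept texts, best first."""
    scored = [(SCORES[ref], text) for text, ref in phrases]
    kept = [pair for pair in scored if pair[0] >= MIN_SCORE]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]
```

With these assumed values, a phrase referring to something other than the entity falls below the cut-off and is excluded, while the remaining phrases are ordered from direct mention down to unresolved pronoun.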

A webpage 10/text document 20 may not comprise all of the above mentioned types of components and an advertisement does not require all of the above mentioned types of components. A plurality of different advertisements may be generated for the same or different products or services displayed on the same base webpage 10. Therefore, multiple different components and various combinations of components may be selected for use in the plurality of generated advertisements.

Once the components have been extracted, an advertisement 60 can be compiled at step 140. For example, from the components extracted above, the following two advertisements can be generated:

1. Great Loan Deals

9.9% APR typical on loans of £7,500 to £25,000. Apply now!

2. Apply for a Barclays loan

We offer 9.9% APR typical

Get your money in 3 hours!

However, as stated above a plurality of different advertisements may be generated and the two advertisements above are provided as possible examples only.

An advertisement may be compiled using the components together with template data 40 and linguistic data 50. The template data 40 may provide an outline of an advertisement into which the components are to be inserted. Below are four examples of advertisement templates:

1. Headline: <entity phrase><sentiment phrase>
Line1: <interrogative phrase>
Line2: <price phrase> AND <delivery phrase>

2. Headline: <entity phrase>
Line1: <description phrase>
Line2: <price phrase>

3. Headline: <entity phrase>
Line1: <price phrase>

4. Headline: <sentiment phrase>
Line1: <imperative phrase which mentions entity>
Line2: <stock information phrase>

However, numerous different templates 40 may be provided in a template database for selection and use in generating advertisements. The templates may be selected based on the components available, for example, if no description of the entity is available, template 3 may be selected. Alternatively, a missing component may be obtained from other sources in order to complete the advertisement template.
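Selecting a template according to the components available, and then inserting the components, might look like the following minimal sketch. The data layout, the component type names used as dictionary keys and the function names are hypothetical; the four templates mirror the four examples above, and joining two components on one line with a space is an assumption about how "AND" is to be rendered.

```python
# Each template maps an advertisement line to the component types it needs.
TEMPLATES = [
    {"Headline": ["entity", "sentiment"],
     "Line1": ["interrogative"],
     "Line2": ["price", "delivery"]},
    {"Headline": ["entity"], "Line1": ["description"], "Line2": ["price"]},
    {"Headline": ["entity"], "Line1": ["price"]},
    {"Headline": ["sentiment"],
     "Line1": ["imperative_with_entity"],
     "Line2": ["stock"]},
]


def usable_templates(components):
    """Return the templates whose required component types are all present."""
    return [
        template for template in TEMPLATES
        if all(kind in components
               for kinds in template.values()
               for kind in kinds)
    ]


def fill(template, components):
    """Insert the component text into each line of the template."""
    return {
        line: " ".join(components[kind] for kind in kinds)
        for line, kinds in template.items()
    }


if __name__ == "__main__":
    extracted = {"entity": "Barclays Loan", "price": "9.9% APR typical"}
    for template in usable_templates(extracted):
        print(fill(template, extracted))
```

In this example only an entity phrase and a price phrase are available, so only the third template qualifies. Obtaining a missing component from another source would amount to adding an entry to the dictionary before the selection is made.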

Google™ requires that advertisements conform to its constraints. A Google™ advertisement must consist of three lines: the Headline must be 25 characters (or fewer) in length, and Line1 and Line2 must each be 35 characters (or fewer) in length. Therefore, the advertisements may be generated to comply with known constraints.
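A check against the length constraints quoted above can be sketched in a few lines. The limits of 25 and 35 characters are those stated in the preceding paragraph; the function name and the form of the returned messages are illustrative assumptions.

```python
HEADLINE_MAX = 25  # characters, as stated for the headline
LINE_MAX = 35      # characters, as stated for each of the two lines


def violations(headline, line1, line2):
    """Return a list of constraint violations (empty if the ad complies)."""
    problems = []
    if len(headline) > HEADLINE_MAX:
        problems.append(
            f"Headline is {len(headline)} characters (max {HEADLINE_MAX})"
        )
    for name, text in (("Line1", line1), ("Line2", line2)):
        if len(text) > LINE_MAX:
            problems.append(
                f"{name} is {len(text)} characters (max {LINE_MAX})"
            )
    return problems
```

An advertisement for which the returned list is not empty would be a candidate for the abbreviation step.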

Linguistic data 50 is used so that the compiled advertisement 60 makes grammatical sense and forms speech that is considered normal in advertising.

However, following generation of an advertisement 60, it may be desirable to abbreviate the advertisement at step 150 to generate a reduced version 70 of the advertisement. This may be desirable if the area into which an advertisement is to be placed has a limited size or specifies a maximum number of characters. In some instances, step 150 may not be required.

In order to abbreviate a generated advertisement 60, the advertisement is analysed and components which are considered to be non-essential, in accordance with predefined rules, may be removed. In one example, known words may be replaced with well known abbreviations or acronyms, e.g. the phrase "Get A Loan in Hours" may be abbreviated to "Get A Loan in Hrs", or "p.m." may be abbreviated to "pm". In another example, prepositions may be removed, e.g., "to", "in", "for" etc. In another example, determiners may be removed, e.g. the phrase "Get A Great Loan" may be abbreviated to "Get Great Loan". In another example, adjectives may be removed, e.g. the phrase "Get A Great Loan" may be abbreviated to "Get A Loan". In another example, long adjectives may be replaced with shorter adjectives, e.g. "Get A Great Loan" may be abbreviated to "Get A Fab Loan". In another example, a multi-word entity name may be replaced with a single word entity name.

The abbreviation rules may be applied iteratively, such that a first rule is applied, the advertisement abbreviated in accordance with the first rule, the size of the abbreviated advertisement determined, and if the abbreviated advertisement does not meet the size requirements, then a second (different) abbreviation rule is applied ... etc.
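The iterative application of abbreviation rules may be sketched as follows. This is an illustration under stated assumptions: the class `Abbreviator`, the method `abbreviate` and the two example rules are hypothetical, and a real rule set would be considerably richer.

```java
import java.util.List;
import java.util.function.UnaryOperator;

final class Abbreviator {
    // Applies each rule in turn and stops as soon as the text fits maxChars.
    static String abbreviate(String ad, int maxChars, List<UnaryOperator<String>> rules) {
        String current = ad;
        for (UnaryOperator<String> rule : rules) {
            if (current.length() <= maxChars) {
                break; // size requirement met, no further rule needed
            }
            current = rule.apply(current);
        }
        return current;
    }

    // Two illustrative rules modelled on the examples in the text (assumptions).
    static final List<UnaryOperator<String>> EXAMPLE_RULES = List.of(
        s -> s.replace("Hours", "Hrs"),          // well-known abbreviation
        s -> s.replaceAll("\\b(A|An|The) ", "")  // remove determiners
    );
}
```

The order of the rules determines which parts of the advertisement are sacrificed first, so the least destructive rules would normally be listed first.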

Figure 2 illustrates a process for identifying and extracting the main content of the webpage 10/text document 20 (step 110 in figure 1).

The webpage 10/text document 20 is represented as a tree structure at step 1100. In the case of a webpage 10, an XML Document Object Model (DOM) tree structure may be generated using the Java tool JDOM (http://www.jdom.org/) as known in the art. Representing the webpage 10/text document 20 as a tree structure enables its logical structure to be identified, so that the text can be represented in a way that gives a clearer view of the content, structure and design. Although a tree structure is used, any method of representing the text as its various components can be used. In addition, it is also possible to determine the content 30 of the webpage 10/text document 20 without representing the webpage 10/text document 20 as a tree structure.

The process of identifying a chunk of text which is most likely to be the main content 30 of a webpage is relatively complex. Consequently, the process of Figure 2 uses several different heuristics in order to identify the most likely chunk (section) of text as the main content 30. A chunk (section) is defined as a concatenation of all the text containing nodes formed at the same relative position in the tree and may comprise a plurality of sentences/paragraphs.

At step 1110 the size of each chunk of text is determined. The size may be determined based on the number of characters in a chunk, the number of words in a chunk, the number of sentences in a chunk, the number of paragraphs in a chunk, or the overall length of a chunk etc. Before determining the size of each chunk of text any formatting elements (e.g. elements such as bold, italic and fonts) and some low level structural elements (e.g. <span> and <li>) may be collapsed to the same level as the surrounding text.

The chunks of text are then ranked by size at step 1120, with the largest chunk of text being given the first position and the smallest chunk of text being given the last position. In one embodiment only the first n largest chunks of text are ranked by size and the remaining chunks are discarded.

The size of each chunk of text is important since it has been found in some instances that the largest chunk of text is the main content 30. However, this is not always the case. For example, disclaimers, legal small print, or comments left by previous visitors to a webpage may also be large chunks of text. Consequently the chunks of text are ranked according to their position on the page at step 1130. For example, the chunk of text occurring first on the page is given the first position and the chunk of text found at the bottom of the page is given the last position. In one embodiment, only the largest n chunks of text are ranked in order of their position on the page, the remaining chunks having been discarded at step 1120.
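Steps 1120 and 1130 may be sketched as two successive sorts. The following is a minimal illustration, assuming that chunk size is measured in characters; the names `ChunkRanker`, `Chunk` and `candidates` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class ChunkRanker {
    static final class Chunk {
        final String text;
        final int position; // 0 = first chunk on the page

        Chunk(String text, int position) {
            this.text = text;
            this.position = position;
        }
    }

    // Keeps the n largest chunks (step 1120), then orders them by page position (step 1130).
    static List<Chunk> candidates(List<Chunk> chunks, int n) {
        List<Chunk> bySize = new ArrayList<>(chunks);
        bySize.sort(Comparator.comparingInt((Chunk c) -> c.text.length()).reversed());
        List<Chunk> top = new ArrayList<>(bySize.subList(0, Math.min(n, bySize.size())));
        top.sort(Comparator.comparingInt((Chunk c) -> c.position));
        return top;
    }
}
```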

It is then possible to determine likely candidates for the main content 30 based on the identification of the largest chunks of text and the chunks of text which appear first on the page. However, in order to provide greater confidence, the text within the chunks of text which are considered to be likely candidates for the main content 30 is analyzed at step 1140 in order to identify the main content. In one example, the quality of the text is determined at step 1150 in order to identify the main content. The quality of the text within each chunk is determined by applying a number of different criteria and using a number of different quality measures. Text which is determined to be of high quality is then considered to be the main content 30.

In order to determine the quality of the text, a score is provided for each sentence in a chunk. The score is a weighted sum of the following scoring metrics:

1. determine the length (in number of words) of each sentence and provide a score normalised against the longest sentence on the page;

2. determine the number of words in each sentence that occur in the page <title> tag;

3. determine the number of words in each sentence that occur in the page <meta description> tag;

4. determine the number of words in each sentence that occur in the page <meta keywords> tag; and

5. determine the number of words in each sentence that occur in page headings (and apply individual weightings for <h1>,... <h6>).

The weightings are real numbers (in the range negative to positive infinity), which means that some weights can be set to zero, thereby ignoring a particular metric if required. Weights are set for each webpage 10.
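The weighted sum of the sentence metrics may be sketched as below. This is an illustration only: the class `SentenceScorer`, its tokenisation (splitting on anything that is not a letter or digit) and the way the metrics are passed in are assumptions, not details taken from the disclosure.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class SentenceScorer {
    private static final String SPLIT = "[^\\p{L}\\p{Nd}]+";

    // Lower-cased set of the distinct words in a piece of text.
    static Set<String> words(String text) {
        Set<String> out = new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split(SPLIT)));
        out.remove("");
        return out;
    }

    // Number of word tokens in the sentence that also occur in the reference
    // text (for example the content of the page <title> tag).
    static int overlap(String sentence, String reference) {
        Set<String> ref = words(reference);
        int n = 0;
        for (String w : sentence.toLowerCase(Locale.ROOT).split(SPLIT)) {
            if (!w.isEmpty() && ref.contains(w)) {
                n++;
            }
        }
        return n;
    }

    // Weighted sum of metric values; a zero weight ignores the metric.
    static double score(double[] metrics, double[] weights) {
        if (metrics.length != weights.length) {
            throw new IllegalArgumentException("one weight per metric is required");
        }
        double total = 0.0;
        for (int i = 0; i < metrics.length; i++) {
            total += weights[i] * metrics[i];
        }
        return total;
    }
}
```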

Although a score is provided for each sentence in a chunk, a score could be provided for each pre-determined number of characters in a chunk, each pre-determined number of words in a chunk, each pre-determined number of sentences in a chunk, each predetermined number of paragraphs in a chunk etc.

A score is then provided for each chunk. The score is a weighted sum of the following scoring metrics:

1. determine a mean of the score provided for each sentence in the chunk (determined above);

2. determine the length of each chunk (i.e. as the number of sentences in each chunk) normalised against the longest chunk on the page;

3. determine a Type:Token Ratio (TTR) for each sentence of each chunk, and/or each paragraph of each chunk, and/or each sequence of x characters of each chunk. For example, the sentence "dogs, dogs, dogs, fight, fight, fight" has six word tokens and two word types, so the TTR for this sentence is 2:6. Then determine the mean chunk TTR. If the mean chunk TTR is below a predetermined threshold level then the total overall score for the chunk is set to zero, as it is probably stuffed with keywords. Otherwise a weighted score is added for the Type/Token value;

4. sort the chunks from highest to lowest chunk score. The chunks with the highest score are most likely to be main page content.
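The Type:Token Ratio of item 3 may be computed as in the following sketch. The class name `TypeTokenRatio` and the tokenisation rule (case-insensitive, splitting on anything that is not a letter or digit) are assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TypeTokenRatio {
    // Returns (number of distinct words) / (number of words), or 0 for empty text.
    static double ttr(String text) {
        Set<String> types = new HashSet<>();
        int tokens = 0;
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (t.isEmpty()) {
                continue; // split can yield an empty leading element
            }
            types.add(t);
            tokens++;
        }
        return tokens == 0 ? 0.0 : (double) types.size() / tokens;
    }
}
```

For the example sentence "dogs, dogs, dogs, fight, fight, fight" this yields two types over six tokens, matching the 2:6 ratio given above.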

In another example, "Good" sentences are determined at step 1155 in order to identify the main content. In order to identify "Good" sentences, the text of the webpage 10 is split into individual sentences. Each sentence is then provided with a score, as a weighted sum of:

1. the sentence length. The mean sentence length for the webpage 10 is determined (rounded up to the next whole number and at least n words long). The score for each sentence length is then, negative if the sentence length is below the mean sentence length, or positive if the sentence length is above the mean sentence length;

2. whether the sentence begins with a capital letter. If the sentence begins with a capital letter a weighting is added to the score; and

3. the Punctuation:Token Ratio (PTR) for each sentence. A score is provided based on the closeness of the PTR to a predetermined average "good language" PTR for a sentence. The further the PTR diverges from the "good language" PTR, the lower the score. For example, the sentence "This, for example, is a punctuated sentence." has 3 punctuation marks and 7 words, so the PTR is 3:7 = 0.429. The "good language" PTR is set as 0.232, so the sentence receives a score of (1-(0.429-0.232)) = 0.803.
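The PTR metric of item 3 may be sketched as follows, assuming that the score is one minus the absolute divergence from the "good language" PTR. The class `PunctuationScore`, the set of characters counted as punctuation and the whitespace-based word count are all assumptions made for illustration.

```java
final class PunctuationScore {
    static final double GOOD_PTR = 0.232; // "good language" value quoted in the text

    // Punctuation marks divided by whitespace-separated words.
    static double ptr(String sentence) {
        int punctuation = 0;
        for (char c : sentence.toCharArray()) {
            if (",.;:!?".indexOf(c) >= 0) { // assumed punctuation set
                punctuation++;
            }
        }
        String trimmed = sentence.trim();
        int words = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
        return words == 0 ? 0.0 : (double) punctuation / words;
    }

    // The further the PTR diverges from GOOD_PTR, the lower the score.
    static double score(String sentence) {
        return 1.0 - Math.abs(ptr(sentence) - GOOD_PTR);
    }
}
```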

The average sentence score is then determined for all the sentences and sequences of "good" sentences (chunks) identified. A "good" sentence is defined as a sentence with a quality score above a "cut off" level. The "cut off" level is determined as the average sentence score plus a constant, the constant being specific to each webpage 10. Sentences with a score below the "cut off" are discarded.
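The "cut off" test may be sketched as below, assuming the sentence scores have already been computed. The class `GoodSentenceFilter` and its method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class GoodSentenceFilter {
    // Returns the indices of sentences whose score exceeds (mean score + pageConstant).
    static List<Integer> goodIndices(double[] scores, double pageConstant) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        double cutOff = scores.length == 0 ? 0.0 : sum / scores.length + pageConstant;
        List<Integer> good = new ArrayList<>();
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > cutOff) {
                good.add(i);
            }
        }
        return good;
    }
}
```

Runs of consecutive indices in the result correspond to the sequences of "good" sentences that are scored next.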

Again, although a score is provided for each sentence, a score could be provided for each pre-determined number of characters, each pre-determined number of words, each predetermined number of sentences, each pre-determined number of paragraphs etc.

The identified sequences of "good" sentences are then scored according to the following:

1. the number of words in the sequence that occur in the page <title> tag;

2. the number of words in the sequence that occur in the page <meta description> tag;

3. the number of words in the sequence that occur in the page <meta keywords> tag;

4. the number of words in the sequence that occur in the page headings (individual weightings for <h1>,... <h6>);

5. the length of the sequence (i.e. number of sentences in the sequence) normalised against longest sequence on the webpage;

6. the determined Type:Token Ratio (TTR) for each sequence of sentences. If the TTR for the sequence is below an average "good language" TTR for a sequence of sentences, then the total overall score for the sequence is set to zero. Otherwise a weighted score is added for the Type/Token value.

Finally, the sequences of sentences are sorted based on the calculated scores, from highest to lowest score. The sequence of sentences with the highest score is considered to be the main content 30.

Alternatively, or in addition to determining the TTR and/or the PTR, the Punctuation:Character Ratio (PCR) for each sentence of each chunk, and/or each paragraph of each chunk, and/or each sequence of x characters of each chunk could be determined. For example, the sentence "This, for example, is a punctuated sentence." has 3 punctuation marks and 38 characters (without spaces), so the PCR is 3:38, and has 3 punctuation marks and 44 characters (with spaces), so the PCR is 3:44.

The average "good language" TTR for a sentence, for a paragraph and for a sequence of x characters, the average "good language" PTR for a sentence, for a paragraph and for a sequence of x characters, the average "good language" PCR for a sentence, for a paragraph and for a sequence of x characters, and the average "good language" sentence length are determined with reference to a collection of documents which are considered to be written using "good" language. With reference to the English language, documents held by the British National Corpus are used to determine the "good language" averages. However, other databases of good language could be used. These predetermined averages are stored in a database 310 accessible by a pre-processor module 300 illustrated in Figure 3.

Finally, the identified main content 30 is extracted at step 1160.

In some embodiments, the main content 30 needs to be cleaned and formatted. In order to do this all punctuation symbols are mapped into "." or "?" simplifying the extraction. In addition, all non-standard characters (such as *, • etc.) are eliminated and proper ends of sentences are forced.

Steps 1100 to 1160 of Figure 2 form the process of identifying and extracting the main content 30 from the webpage 10/text 20. However, these steps do not have to be performed in the order illustrated in Figure 2, and in some instances not all the steps may be required. For example it may be preferable to rank the chunks by their position on the page (step 1130) before determining the size of each chunk (step 1110) and ranking the size of each chunk (step 1120). In addition or alternatively, it may be preferable to determine the quality of the text (step 1150)/identify the "Good" sentences (step 1155) before determining the size and position of each chunk (steps 1110 to 1130). In addition or alternatively, it may be preferable to determine the quality of the text (step 1150)/identify the "Good" sentences (step 1155) without determining the size and position of each chunk. In addition or alternatively, it may be preferable to determine the size and position of each chunk without determining the quality of the text (step 1150)/identifying the "Good" sentences (step 1155). These examples are not limiting.

Following extraction of the main content 30 at step 1160, the relationships between components of the main content 30 are determined at step 1170. The main content of a webpage may identify and describe several products and then may list the prices for all the products in a separate section. Therefore, the relationship between each product name, description of the product and product price etc. is derived.

In order to derive the relationship at step 1170, it is possible to render the webpage through a browser and then to analyse the rendered image on a pixel by pixel basis using image analysis, in order to determine the relationships between the text of the webpage, any image and any other text of the webpage, and then derive a link between the text. It is advantageous to render the webpage as an image since items which appear close together on a webpage may be located large distances apart in the underlying code.

In another embodiment, it is possible to look for repeating patterns in the tree generated at step 1100, for example by identifying the order and the position of the text with reference to a heading (potentially in bold), to determine text which is linked.

For example, a common HTML pattern for listing product item(s) is:

<td><h1>, <img alt = ""> <p> <p>...</td>, such that the <h1> tag will contain a reference to an entity, <img alt = ""> will contain a link to an entity image, the 'alt' attribute may repeat the name of the entity, and the <p> tags will contain product descriptions and price information.

These tag sequences may be analysed in two ways. Firstly, following examination of the content it may be determined that the <h1> tag and the 'alt' attribute repeat an entity name, and the first few <p> tags contain good quality text followed by a <p> with price information indicated by the occurrence of the "£" sign followed by numerical information. In such a situation, it can be deduced that the price and description are associated with the entity. Secondly, in listing pages with multiple entities listed, the above pattern may be repeated for each item that is listed on the page. The fact that the pattern recurs is a strong indication that the page is a listing of numerous products. A repeating sequence of 'good' text followed by price text is a strong indicator that the page is a product listings page.
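The price indicator described above (a "£" sign followed by numerical information) may be detected with a regular expression. The following is a minimal sketch; the class `PriceDetector` and the exact pattern, which tolerates thousands separators and an optional pence part, are assumptions.

```java
import java.util.regex.Pattern;

final class PriceDetector {
    // Pound sign (written as a Unicode escape), optional space, digits with
    // optional commas, and an optional one- or two-digit decimal part.
    private static final Pattern PRICE =
        Pattern.compile("\u00A3\\s?\\d[\\d,]*(\\.\\d{1,2})?");

    // True if the text of a <p> element appears to contain a price.
    static boolean containsPrice(String text) {
        return PRICE.matcher(text).find();
    }
}
```

A run of paragraphs for which `containsPrice` is false, followed by one for which it is true, would match the "good text followed by price text" sequence described above.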

Repeating tag patterns may be analysed by traversing the DOM tree. However, such patterns may also be searched in a flat structure.

Alternatively, the relationships between components of the main content 30 determined at step 1170 could be performed before the main content 30 is extracted at step 1160.

A parse tree is generated for each sentence of the main content at step 1180. A parse tree is used to split each sentence into its component clauses and phrases, grouping the parts-of-speech (such as: noun phrases, verb phrases and prepositional phrases etc.) and then splitting out each individual part-of-speech (such as: nouns, determiners, verbs, prepositions etc.) in order to analyse the grammar of the sentence. The parse tree can also be used to identify the tense of each part-of-speech and identify whether each part-of-speech is plural or singular.

Figure 4 illustrates an exemplary parse tree for the sentence: "The cat sat on the mat.". The sentence is divided up into a noun phrase "the cat" and a verb phrase "sat on the mat". The noun phrase is then separated into a determiner "the" and a noun "cat". The verb phrase is separated into a verb "sat" and a prepositional phrase "on the mat", which itself is separated into the preposition "on" and a noun phrase "the mat". Finally, the noun phrase "the mat" is separated into a determiner "the" and a noun "mat". A parse tree is used to determine the grammar of each sentence.
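The parse tree for the example sentence may be represented by a simple recursive data structure. The sketch below is illustrative only; the class `ParseNode` is hypothetical, and the node labels (S, NP, VP, PP, DT, NN, VBD, IN) follow the common Penn Treebank convention, which is an assumption rather than something specified here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class ParseNode {
    final String label;             // phrase or part-of-speech label
    final String word;              // non-null only for leaf nodes
    final List<ParseNode> children; // empty for leaf nodes

    private ParseNode(String label, String word, List<ParseNode> children) {
        this.label = label;
        this.word = word;
        this.children = children;
    }

    static ParseNode leaf(String label, String word) {
        return new ParseNode(label, word, List.of());
    }

    static ParseNode phrase(String label, ParseNode... children) {
        return new ParseNode(label, null, Arrays.asList(children));
    }

    // Reassembles the words covered by this node, left to right.
    String text() {
        if (word != null) {
            return word;
        }
        return children.stream().map(ParseNode::text).collect(Collectors.joining(" "));
    }

    // The tree described for "The cat sat on the mat."
    static ParseNode example() {
        return phrase("S",
            phrase("NP", leaf("DT", "The"), leaf("NN", "cat")),
            phrase("VP", leaf("VBD", "sat"),
                phrase("PP", leaf("IN", "on"),
                    phrase("NP", leaf("DT", "the"), leaf("NN", "mat")))));
    }
}
```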

The generation of parse trees is well known in the art and will not be described in further detail. Any type of parser could be used with the method and apparatus for generating advertisements. For example, a parser could attribute labels to each of the components of a sentence in order to determine the structure of the sentence, without generating a parse tree.

Following generation of a parse tree, the statistics of the content 30 are determined at step 1190. The statistics indicate the probability of a word occurring in a particular context based on the sequence of words and the number of times a word appears. In addition, the statistics may be used to identify the number of times a word appears in the content 30. Known statistical and logical processors may be used to determine the statistics and the role each word plays in each sentence. The logical processor is used to check for grammar correctness and defines rules of grammar.

Referring again to Figure 1, following extraction of the main content and generation of the parse tree and statistics at step 110, the components of the advertisement are extracted at step 120. One process for extracting the components is illustrated in Figure 5 and another process is illustrated in Figure 6. Figures 3 and 7 schematically illustrate apparatus for advertisement generation.

An advertisement may be compiled using the apparatus illustrated schematically in Figure 3. As can be seen from Figure 3 the apparatus 2000 comprises a pre-processor module 300 for extracting the main content 30 from a webpage 10 or a text document 20, and an advertisement generator 400 for extracting components from the main content 30 and compiling an advertisement. Either the pre-processor module 300 or the advertisement generator 400 may generate the parse trees and statistics in order to analyse the content 30.

The process for extracting components illustrated in Figure 5 is phrase extraction using a state-machine (which is a process well understood in the art). The process proceeds by iterating through a sentence word-by-word, maintaining a context (context is defined as a sequence of preceding words and/or succeeding words) of variable length, and deciding at each word whether or not to begin phrase extraction, terminate extraction or indeed to ignore the word altogether - and possibly also the phrase that the word is a component of.

As illustrated in figure 5, the component extraction process begins at Step 1200 and is defined by a set of complex rules. For example the begin extraction rule may search the content 30 for a capital letter at the beginning of the sentence in order to signify the beginning of the extraction or may look for the first word of the content 30. The extraction module 422 will then examine the first word in the sentence. In one example the extraction module 422 may be looking for a noun in order to begin extraction. If the word is a noun then the extraction module 422 will extract the word at Step 1220. The extraction module 422 then determines whether to terminate extraction. For example a termination rule may be the identification of a full stop signifying the end of a sentence. If the extraction is terminated then the extraction module 422 moves on to Step 1240. If the extraction is not terminated, the extraction module 422 iterates to the next word at Step 1210. In addition, if the word is not a noun (and thus not to be extracted), then the extraction module 422 will iterate to the next word at Step 1210.

Following termination of extraction at step 1230, the extracted component is then checked at step 1240. In order to be accepted at step 1240 the extracted component must comply with certain constraints, stored in the constraints module 440. For example, if the extracted component is an imperative phrase it must be capable of being reformulated into an explicit imperative. If the extracted component does not comply with the constraints and with natural language requirements set out in a linguistics module 426 then the extracted component is discarded. However, if it is acceptable then, if appropriate, the imperative expression is reformulated into an explicit imperative at Step 1250. Finally, an advert is compiled at Step 1260.
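The word-by-word loop of Figure 5 may be sketched as a two-state machine. This is a simplified illustration: the class `PhraseExtractor`, its nested `Word` type and the use of predicates for the begin and terminate rules are assumptions, and the acceptance check of step 1240 is omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class PhraseExtractor {
    static final class Word {
        final String content;
        final String pos; // part-of-speech label

        Word(String content, String pos) {
            this.content = content;
            this.pos = pos;
        }
    }

    // Collects words from the first word accepted by `begin` up to, but not
    // including, the first later word accepted by `end`.
    static List<String> extract(List<Word> sentence, Predicate<Word> begin, Predicate<Word> end) {
        List<String> phrase = new ArrayList<>();
        boolean extracting = false;
        for (Word w : sentence) {
            if (!extracting) {
                if (begin.test(w)) { // begin rule fires
                    extracting = true;
                    phrase.add(w.content);
                }
                // otherwise the word is ignored and the loop moves on
            } else if (end.test(w)) { // terminate rule fires
                break;
            } else {
                phrase.add(w.content);
            }
        }
        return phrase;
    }
}
```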

The following is an example signature of a method that could be used to define a rule used to begin extraction:

    String startOfImperativeExpression(Word word, Word curPrev)

This rule returns either a first word to be included in the extraction (step 1220) or a flag to skip one or two words (step 1210). Its two arguments are the current word and the previous word. When considering each, the apparatus references the preceding word(s) and subsequent word(s) in order to analyse the context of the current word. The precise context length may vary.

The following is an example of a rule that is used to transform an extraction into an imperative form that is used in the final advert:

    if (curPrev != null && curPrev.getContent() != null
        && curPrev.getContent().equalsIgnoreCase("we")
        && word.getContent().equalsIgnoreCase("sell"))
        return "get";

This rule reads: if the previous word is 'we' and the current word is 'sell', convert the expression into get <something> from the original expression ' we sell <something>', where <something> is an unchangeable component to be included in the resultant advertisement. The verb "sell" is used here as an example. In reality there is a set of verbs which are treated in an identical way. The exact set of verbs that are used may be varied in order to alter the range and type of adverts that are generated. Similarly the pronoun "we" may be replaced with alternative terms which refer to the seller, including the name of the seller.
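Since a set of verbs and a set of seller terms are treated identically, the rule may be generalised with set lookups, as in the sketch below. The class `ImperativeRewriter` and the contents of both sets are hypothetical; only "we" and "sell" are taken from the example above.

```java
import java.util.Locale;
import java.util.Set;

final class ImperativeRewriter {
    // Assumed, configurable sets; only "sell" and "we" come from the example above.
    static final Set<String> OFFER_VERBS = Set.of("sell", "offer", "provide");
    static final Set<String> SELLER_TERMS = Set.of("we");

    // Returns "get" when the pair matches "<seller term> <offer verb>", otherwise null.
    static String startOfImperative(String current, String previous) {
        if (previous != null && current != null
                && SELLER_TERMS.contains(previous.toLowerCase(Locale.ROOT))
                && OFFER_VERBS.contains(current.toLowerCase(Locale.ROOT))) {
            return "get";
        }
        return null;
    }
}
```

Varying either set changes the range of adverts produced without touching the rule itself.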

Below is another example of a rule used to begin extraction, but this rule is used to extract imperative verbs:

    if (POS_First != null && POS_First.startsWith("V")
        && properPosition
        && !POS_First.equals("VBN")
        && !POS_First.equals("VBG")
        && !POS_First.equals("VBZ")
        && !word.getContent().equalsIgnoreCase({EXCLUDED_SET})
        && (curPrev == null
            || (curPrev.getPOSLabel() != null
                && !curPrev.getPOSLabel().startsWith("MD")))
    ) {
        // Begin extraction process...
    }

Here, POS_First represents the part of speech information for the current word, which will be the first word in the sequence of words which will form the extracted phrase. The "{EXCLUDED_SET}" is a set of words that are defined in the constraints module 440 as indicating a phrase that cannot be easily converted into an imperative advertisement line. As a default this set consists of the words {"do", "is", "are"}; however, this set may be altered in order to achieve variations on the advertisements that are produced. "MD" is a non-standard part of speech tag indicating a "modal verb".

The extraction module 422 may use a flag 'bExprBeingExtracted'. If bExprBeingExtracted=false then the current word is checked as a candidate for the first word of an extraction; otherwise, the extraction module 422 checks if the extraction should be terminated with the given word.

The following is an example of a rule used to terminate extraction. The signature of the process that implements the rule is given by:

    Boolean[] endOfImperExpression(Word word, Word wordPrev, List<Word> accumedTerms)

This method takes as inputs the current and previous word, and the list of accumulated words from the beginning of the extraction. The method returns as an array three boolean variables which indicate:

1. Whether extraction should be terminated;

2. Whether extraction is acceptable;

3. Whether extraction contains an entity.

The following code fragment is another example of a rule used to terminate extraction:

    // stop before gerund
    if (word.getPOSLabel().startsWith("VBG")
        && wordPrev != null && wordPrev.getPOSLabel() != null
        && wordPrev.getPOSLabel().startsWith("NN")) {
        // Complete extraction process ...
        return new Boolean[] {true, bExtractionAcceptable, false};
    }

The above shows a code fragment from a method which defines the extraction process. The code fragment itself defines a rule that terminates the extraction process when a gerund verb is encountered immediately after a noun.

The flag, "bExtractionAcceptable", encodes the boolean result of a number of acceptability tests which are performed during the process of extraction, up until the termination conditions are encountered. If, and only if, all the tests have been found to be positive, this flag will be in a "true" state. The tests are stored as constraints in the constraints module 440.

Another process for extracting the components is based on traversal of the parse tree and is illustrated in figure 6. The process of figure 6 relies on the parsing of a sentence and the construction of a parse tree (syntactic tree) which divides a sentence into its component linguistic phrases - noun phrase, verb phrase, prepositional phrase etc. Extraction of specific advertisement phrases is then a matter of traversing the parse tree, extracting each phrase, deciding which type of advertisement phrase it is and storing it as one of the defined component phrases of an advertisement (i.e. as an "Entity Phrase", "Sentiment Phrase" etc.).

Both the process of figure 5 and the process of figure 6 lead to the same result: a set of phrases which are identified as belonging to one or none of the advertisement component phrases.

As discussed with reference to figure 4, the parse tree separates out the components of each sentence. It is possible to determine the location of specific components for extraction by examining the parse tree, and in particular the phrases (part-of-speech groupings each containing one or more words) within each sentence.

Using the process of figure 6, the phrases are identified and extracted based on the categories described above (Entity Phrases; Sentiment Phrases; Sentiment Phrases with Entity; Description Phrases; Description Phrases which explicitly mention the Entity; Imperative Phrases which do not mention the Entity; Imperative Phrases which do mention the Entity; Interrogative Phrases; Interrogative Phrases which mention the Entity; Price Phrases; Delivery Phrases; Stock Information and Returns Information Phrase). The extracted phrases may overlap, but can never be exactly the same, for example, "Acer Aspire" and "The extremely powerful Acer Aspire" may be considered as two separate phrases (the former is an "Entity Phrase", the latter is a "Description Phrase which contains an Entity") and both may be extracted from the sentence, "The extremely powerful Acer Aspire is the very latest in laptop technology".

Entity Phrases may be identified by two methods. 1) by having a lookup catalogue of products and services which may be stored in the linguistics storage device 430; or 2) identifying the noun phrase in which they occur and pruning away all non-noun words: i.e. adjectives and determiners etc. Weighted frequency counts and catalogues may be used to define and identify which words are entity words and which are regular nouns. The weightings are decided during the extraction phase.

In order to identify Sentiment Phrases a large dictionary of positive sentiment terms and phrases is referenced. This dictionary may be stored in the linguistics storage device 430. Identifying the sentiments in the web page 10/text document 20 enables the apparatus to bias these sentiments for use when compiling advertisements over the use of general sentiments which exist in the linguistics storage device 430. However, some advertisements may be compiled using sentiments which are either synonymous with sentiment terms occurring in the page or are found to frequently co-occur with the sentiment terms found in the page. This is done by looking at text from other sources which are about the same entity (e.g. other webpages, user product reviews, product descriptions etc.).

Sentiment Phrases with Entity are identified as noun phrases from the parsed text and if they contain direct reference to the entity AND contain positive sentiment phrases they are extracted and stored in the extracted components storage device 424.

Description Phrases are identified as verb phrases from the parsed text. The verb phrases are stored in full in the extracted components storage device 424 and are pruned, adapted and re-worded depending on which verb category the "head verb" belongs to. The "head verb" is the main verb of a verb clause. In the above example, the two head verbs are "is" and "redefines".

Imperative Phrases which do not mention the Entity are identified from the parsed text when a verb phrase is the only phrase in a clause or sentence and the head verb is an imperative verb. Linguistically, imperative phrases are characterised by certain verbs in their base form (e.g. answer, buy, get, order). The Imperative phrases are augmented by a dictionary of imperative phrases stored in the linguistics storage device 430. As in the case of Sentiment Phrases, the extracted Imperative Phrases can be used to bias the apparatus to prefer the Imperative Phrases used in the web page 10/text document 20. We define Imperative Phrases in advertisement generation in a way that is different to their more common linguistic definition. By "Imperative Phrases" we mean "any phrase which can be used to formulate an imperative (in the linguistic sense) line of an advertisement". Examples are any verb phrases which make an offer, guarantee or claim. For example, "[We] guarantee the best products" is a phrase (not including the pronoun "we") from which the advertisement line, "Get the best products", can be formulated. The exact set of verbs and phrase constructions which are considered to be imperative can be varied to vary the behaviour of the system. The pronoun "we", though not strictly part of the imperative verb phrase, is important because it ensures that it is the vendor of the product or service to be advertised who is making the offer. Such pronouns are ideally resolved to the referent. Hence an ideal "Imperative Phrase" as defined in our system would occur in the following format: "[Entity] ensures maximum satisfaction."

Imperative Phrases which do mention the Entity include both standard imperatives which mention the Entity and phrases which are identified as convertible into imperative form, for example: "We offer a wide range of loans" and "We guarantee the lowest rates of interest", which are convertible into "Get a wide range of loans" and "Get the lowest rates of interest Guaranteed!".

Price Phrases do not belong to any linguistic group. Consequently, specific pattern matching algorithms which search for approximate matches to all known ways of expressing these concepts are used.

Stock Information and Returns Information Phrases also do not belong to any linguistic group. Consequently, specific pattern matching algorithms which search for approximate matches to all known ways of expressing these concepts are used.

For example, noun phrases are the main source of advertisement headlines, which are the main features introducing the entity and/or reference to its full name and brand. Below is an example of an entity name rule which is applied to the noun phrase:

<brand> <product modifier> <product name/entity> for/with

The entity may also be derived from the title and/or keyword lists.

In the context of advertisement generation sentences encouraging potential customers to perform an action, be it physical or mental, are searched for as imperative expressions. The start of an imperative expression is either an imperative verb ("sign up for ..."), or an expression indicating that certain activity is expected to be performed by a user ("we allow immediate cash withdrawal"). In the latter case the expression is reformulated to derive an explicit imperative: ("get immediate cash withdrawal", "take advantage of immediate cash withdrawal").

Below is an example of an imperative rule:

    if (occursInImperativeList   // we have an implicit list of verbs which are
                                 // relevant to product webpages, so we can be more
                                 // specific than just using part of speech constraints
        && properPosition        // we have a set of constraints to verify that this verb
                                 // is in the part of expression which can be referred
                                 // to as 'beginning'
        && !POS_First.equals("VBN")
        && !POS_First.equals("VBG")
        && !POS_First.equals("VBZ"))   // no conflict with other Verb-based rules

The proper position for an imperative is either at the beginning of a sentence or following a reference to the entity. Imperatives which occur within the scope of a direct reference to an entity gain the highest score or weight.

Extraction of an entity-based noun group is cross checked against 'conventional' linguistic noun groups (stored in the linguistics storage device 430) obtained, for example, from the construction of a parse tree using known methods and grammars.

Below is an example of a rule used to identify the start of a noun group:

if ( (nounGroupCount==0
      && (POS_First.startsWith("NN") || POS_First.startsWith("JJ")))
  && (curPrev!=null && curPrev.getContent()!=null
      && ( curPrev.getPOSLabel().startsWith("NN")
        || curPrev.getPOSLabel().startsWith("JJ")
        || curPrev.getPOSLabel().startsWith("PRP"))))
{
    nounGroupCount+=2; // we form two first words of noun group
    // this is a first word, combining two
    // (NOUN, start of NOUN GROUP) of expression to be extracted
    return curPrev.getContent()+"_"+word.getContent();
}

As the text is iterated through, typically a context length of one is maintained, meaning that the current word (curr) and the word that immediately precedes it (prev) are examined. Certain combinations will trigger the extraction of a noun phrase. Such a combination may be, for example, prev = "the" and curr = "cat"; this combination of determiner ("the") followed by a noun ("cat") indicates that we are in a noun group, and extraction begins.

It is possible that noun groups are composed of many words which may belong to different linguistic categories; for example, "The fat black cat". In such a case, as the word sequence is iterated through, the words that form the noun group are extracted, and preceding words are concatenated to form a context that is composed of many words. Hence at some stage in the extraction of the above noun group, prev = "The fat black" and curr = "cat". In such a way, the context length (number of words) is elastic.
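The elastic context described above can be sketched as follows. This is a simplified illustration rather than the rules of the apparatus: it assumes Penn Treebank style labels (DT, JJ, NN), and the class NounGroupSketch, its Token type and the acceptance test (a group must end in a noun) are hypothetical. It does not reproduce nounGroupCount or the entity check.

```java
import java.util.ArrayList;
import java.util.List;

final class NounGroupSketch {
    /** A word together with its part-of-speech label. */
    static final class Token {
        final String content;
        final String pos;
        Token(String content, String pos) { this.content = content; this.pos = pos; }
    }

    private static boolean canContinue(String pos) {
        return pos.startsWith("NN") || pos.startsWith("JJ");
    }

    // Stores the context as a noun group if it ends in a noun, then clears it.
    private static void flush(StringBuilder prev, String lastPos, List<String> groups) {
        if (prev.length() > 0 && lastPos != null && lastPos.startsWith("NN")) {
            groups.add(prev.toString());
        }
        prev.setLength(0);
    }

    /** Collects runs of determiner, adjective and noun tokens that end in a noun. */
    static List<String> extract(List<Token> tokens) {
        List<String> groups = new ArrayList<>();
        StringBuilder prev = new StringBuilder(); // the elastic context
        String lastPos = null;
        for (Token curr : tokens) {
            boolean inGroup = prev.length() > 0;
            if (inGroup && canContinue(curr.pos)) {
                prev.append(' ').append(curr.content); // context grows by one word
                lastPos = curr.pos;
                continue;
            }
            if (inGroup) {
                flush(prev, lastPos, groups); // current word ends the group
            }
            if (curr.pos.equals("DT") || canContinue(curr.pos)) {
                prev.append(curr.content); // current word starts a new group
                lastPos = curr.pos;
            }
        }
        flush(prev, lastPos, groups);
        return groups;
    }
}
```

When the tokens of "The fat black cat" are processed, the context passes through "The", "The fat" and "The fat black" before "cat" is appended, which matches the growth of prev described above.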

Below is an example of a rule used to identify the end of a noun group:

// stopped being a noun, but could be CC (a number)
if ( nounGroupCount>0 && (word==null || word.getPOSLabel()==null || (!word.getPOSLabel().startsWith("NN") && !word.getPOSLabel().startsWith("CC"))))
{
    if (nounGroupCount>2 && bEntity) { // long enough and with entity
        nounGroupCount = 0; // end of extraction, and acceptable
        return new Boolean[] {true, true, true};
    } else {
        nounGroupCount = 0; // end of extraction, but unacceptable
        return new Boolean[] {true, false, true};
    }
} // end
// not an end of extraction, noun group has started
// and keep going: increment the count
if (nounGroupCount>0 && word.getPOSLabel().startsWith("NN"))
{
    nounGroupCount++;
}

Following extraction of the components, whether individual words or phrases, the advertisement is compiled. The advertisement may be compiled using a template created for the advertisement generation apparatus and stored in the template storage device 460, such as the templates discussed above, with the extracted words/phrases inserted into appropriate placeholders of the template. Furthermore, as stated above, if some of the information required is not available from the base webpage it may be obtained from other sources. For example, the <description phrase> may be obtained from a different webpage, or a <sentiment> may be obtained from the linguistics storage device 430, which can store a plurality of positive sentiment terms appropriate for use in advertisements. The linguistics storage device 430 may also store a plurality of terms and phrases which are appropriate for advertisement generation, grouped into equivalent expressions. For example, if the base webpage recites "in stock now", the phrases "currently available" and "available now" are considered to be equivalents, and so can be substituted during advertisement generation.
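The insertion of extracted components into the placeholders of a template might be sketched as below. The class name TemplateFiller, the placeholder syntax it recognises (lower-case words in angle brackets) and the decision to return null when a component is missing are assumptions made for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TemplateFiller {
    // Assumed placeholder form: lower-case words between angle brackets.
    private static final Pattern PLACEHOLDER = Pattern.compile("<([a-z ]+)>");

    /** Returns the filled template, or null if any placeholder has no component. */
    static String fill(String template, Map<String, String> components) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = components.get(m.group(1));
            if (value == null) {
                return null; // caller may look for the component elsewhere
            }
            // quoteReplacement guards against '$' or '\' in the component text.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A null result signals that a required component is missing, at which point it could be sought from another webpage or from stored sentiment terms, as described above.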

Alternatively, an advertisement may be generated using templates which have been obtained from existing advertisements. For example, an existing advertisement may be identified for use in advertisement generation. The entity name may be removed from a copy of the existing advertisement and replaced with an extracted entity name in order to generate a new advertisement.

When compiling an advertisement, the advertisement compiler module 470 combines extracted components from the syntactic processor 420 with templates from the template processor 450 to generate the advertisement 60. These advertisements may then be abbreviated, if required, by the abbreviation module 480. In some embodiments the advertisement will not need to be abbreviated, in which instance the advertisement is output directly from the advertisement compiler 470.

The components extracted by the extraction module 422 may be stored in the extracted components storage device 424. The extracted components may be assigned labels following extraction indicating the type of component (e.g. entity, description, sentiment etc.) or may be stored in tables of the storage device, each table storing a different type of component. The compiler module 470 is then able to retrieve the appropriate type of component for placement in the placeholders within the templates. The same extracted components may be used in several generated advertisements, within different combinations, such that a plurality of different advertisements can be generated from the same base webpage.
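Generating several advertisements from the same labelled components can be sketched as follows. The class name AdCombiner and the representation of the store as a map from component type to a list of candidate texts are assumptions made for the example; the sketch simply forms every combination for one template.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AdCombiner {
    /**
     * Builds one advertisement per combination of stored components.
     * Placeholders whose type is absent from the store are left unfilled.
     */
    static List<String> combine(String template, Map<String, List<String>> store) {
        List<String> ads = new ArrayList<>();
        ads.add(template);
        for (Map.Entry<String, List<String>> e : store.entrySet()) {
            String placeholder = "<" + e.getKey() + ">";
            if (!template.contains(placeholder)) {
                continue; // this component type is not used by the template
            }
            List<String> next = new ArrayList<>();
            for (String partial : ads) {
                for (String value : e.getValue()) {
                    next.add(partial.replace(placeholder, value));
                }
            }
            ads = next;
        }
        return ads;
    }
}
```

The number of advertisements grows multiplicatively with the number of candidates per type, so in practice the output would be passed to a ranking or filtering step.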

Figure 7 illustrates schematically the advertisement generator 400. The advertisement generator 400 comprises a syntactic processor 420 connected to a linguistics storage device 430. The linguistics storage device 430 may store information about language, such as rules for sentence formation and grammar. The syntactic processor 420 is able to analyse the content 30 via the linguistics module 426 and can store updates to the linguistic data stored in the linguistics storage device 430 as a result of the analysis of the content 30. Consequently it can be said that the system is self-learning in that it is constantly determining further information about language.

It is also possible for the advertisement generator 400 to analyse content 30 in order to determine templates for advertisements. For example the content 30 may be a plurality of webpages which already comprise advertisements for goods and/or services. These advertisements may be analysed by the linguistics module 426 and template advertisements derived from the existing advertisements. The template processor 450 may then analyse the advertisements and extract the structure, leaving placeholders for the components. Such templates are then stored in the storage device 460. The extracted components can be stored in the storage device 424 by the extraction module 422. The advertisement compiler module 470 then receives data from the syntactic processor 420 and the template processor 450 in order to compile an advertisement 60. The advertisement compiler module 470 may receive a complete advertisement from the template processor with placeholders for the insertion of the goods/services name, the name of the provider and the price, or may receive extracted components such as lines of text from the syntactic processor 420 for the formulation of an advertisement without a "fill in the blank" template advertisement.

The advertisement compiler module 470 also comprises the reformulation module 472, which reformulates imperative expressions as explicit imperatives. However, the reformulation module 472 may instead be provided at the syntactic processor 420.

The advertisement generator 400 also includes an abbreviation module 480, which abbreviates the advertisements, in order to meet specific requirements, such as number of characters in an advertisement etc. as described above.
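A possible shape for the abbreviation step is sketched below. The order of operations (substitute shorter expressions first, then drop non-essential components from the end) is an assumption, as are the class name AbbreviationSketch, its Component type and the table of shorter expressions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class AbbreviationSketch {
    /** A piece of advertisement text, flagged as essential or not. */
    static final class Component {
        final String text;
        final boolean essential;
        Component(String text, boolean essential) { this.text = text; this.essential = essential; }
    }

    private static String join(List<Component> parts, Map<String, String> shorter, boolean substitute) {
        StringBuilder sb = new StringBuilder();
        for (Component c : parts) {
            String text = c.text;
            if (substitute) {
                for (Map.Entry<String, String> e : shorter.entrySet()) {
                    text = text.replace(e.getKey(), e.getValue());
                }
            }
            if (sb.length() > 0) sb.append(' ');
            sb.append(text);
        }
        return sb.toString();
    }

    /** Shortens the advertisement towards maxChars; the result may still be too long. */
    static String abbreviate(List<Component> components, Map<String, String> shorter, int maxChars) {
        List<Component> working = new ArrayList<>(components);
        String ad = join(working, shorter, false);
        if (ad.length() <= maxChars) {
            return ad; // no abbreviation required
        }
        ad = join(working, shorter, true); // step 1: shorter equivalents
        for (int i = working.size() - 1; i >= 0 && ad.length() > maxChars; i--) {
            if (!working.get(i).essential) {
                working.remove(i); // step 2: drop non-essential components
                ad = join(working, shorter, true);
            }
        }
        return ad;
    }
}
```

If the essential components alone exceed the limit, the sketch returns the over-long text unchanged, leaving the caller to reject the advertisement.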

The type of advertisement generated is not limited and the process and apparatus can be applied to numerous different types of advertisements, such as paragraph advertisements, estate agent advertisements, financial service advertisements, electrical goods advertisements etc. In addition the process and apparatus can be applied to numerous different types of base webpages, for example, retail shopping sites, universities, governments, debt management companies, banks etc.

The apparatus can generate a plurality of advertisements for the same or different goods and services displayed on a webpage 10/text document 20. The apparatus 2000 illustrated in figure 3 may comprise an advertisement ranker (not illustrated) which ranks the plurality of advertisements based on a predefined quality score (as judged by the apparatus based on the advertisement's relevance to the base webpage and more general quality of phrasing). The score may be based on an accumulation of the scores assigned to the components during the extraction phase. The advertisements may then be ordered based on that score. For example, if the webpage 10 uses certain sentiments, for example "great", then advertisements generated using the same sentiment are given a higher ranking than advertisements generated with a similar sentiment, since it is known that the webpage owner favours the sentiment used in their webpage.

In addition, the proximity of phrases which act as different components in an advertisement also affects the score of the final advertisement. An advertisement which is composed of phrases drawn from the same sentence is given a higher ranking than an advertisement composed from two sentences taken from the beginning and end of a long passage of text.
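One way the accumulated component scores and the proximity of the source phrases could be combined is sketched below. The linear penalty on the distance between sentences, the class name AdRanker and its Candidate type are assumptions; no particular scoring formula is specified above.

```java
import java.util.ArrayList;
import java.util.List;

final class AdRanker {
    /** A candidate advertisement with per-component scores and source sentence positions. */
    static final class Candidate {
        final String text;
        final double[] componentScores;
        final int[] sentenceIndexes;
        Candidate(String text, double[] componentScores, int[] sentenceIndexes) {
            this.text = text;
            this.componentScores = componentScores;
            this.sentenceIndexes = sentenceIndexes;
        }
    }

    /** Sum of component scores minus a penalty proportional to the sentence spread. */
    static double score(Candidate c, double distancePenalty) {
        double total = 0;
        for (double s : c.componentScores) total += s;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i : c.sentenceIndexes) {
            min = Math.min(min, i);
            max = Math.max(max, i);
        }
        int spread = c.sentenceIndexes.length == 0 ? 0 : max - min;
        return total - distancePenalty * spread;
    }

    /** Returns a copy of the candidates ordered from highest to lowest score. */
    static List<Candidate> rank(List<Candidate> candidates, double distancePenalty) {
        List<Candidate> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(score(b, distancePenalty), score(a, distancePenalty)));
        return sorted;
    }
}
```

Under this sketch, two advertisements with identical component scores are separated only by how far apart their source sentences lie, so the one drawn from a single sentence ranks first.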

The exact number of advertisements generated will depend on the content of the base webpage 10. The more content there is and the more the content is appropriate for advertisements, the more advertisements the apparatus will output. The apparatus 2000 may, in certain circumstances, reject possible advertisements if they are considered to be of poor quality. However, the threshold for rejection may vary depending on the number of advertisements required and the amount of content available. Therefore, a lower threshold for quality will apply if a large number of advertisements are required from the same base webpage, or if there is only a limited amount of content available at the base webpage. A quality threshold may also be imposed by a parameter that may be set manually. Increasing this threshold would generally cause the system to output fewer advertisements, each of greater quality; but if set too high, the system may produce no advertisements. The reverse occurs if the threshold is lowered, which may lead to advertisements of very poor quality being output by the system.
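A variable rejection threshold might be sketched as follows. The relaxation scheme (lower the threshold in fixed steps until enough advertisements pass or a floor is reached) is an assumption chosen for illustration, as are the class name QualityFilter and its parameters.

```java
import java.util.ArrayList;
import java.util.List;

final class QualityFilter {
    private static List<Double> atLeast(List<Double> scores, double threshold) {
        List<Double> kept = new ArrayList<>();
        for (Double s : scores) {
            if (s >= threshold) kept.add(s);
        }
        return kept;
    }

    /**
     * Keeps scores at or above the threshold, relaxing the threshold by
     * 'step' (never below 'floor') while fewer than 'required' pass.
     */
    static List<Double> accept(List<Double> scores, int required,
                               double threshold, double floor, double step) {
        if (step <= 0) {
            throw new IllegalArgumentException("step must be positive");
        }
        double t = threshold;
        while (true) {
            List<Double> kept = atLeast(scores, t);
            if (kept.size() >= required || t <= floor) {
                return kept; // enough advertisements, or cannot relax further
            }
            t = Math.max(floor, t - step);
        }
    }
}
```

The floor plays the role of the manually set parameter: raising it yields fewer, better advertisements, and setting it above every score yields none.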

Furthermore, it is possible to learn which advertisements (and thus which templates) are more successful from which of the plurality of generated advertisements the owner of the webpage 10 selects. For example, one hundred advertisements may be generated for a webpage 10 which are ranked 1 to 100, with the advertisement ranked 100 considered the most desirable. Based on the webpage owner's selections, the ranking of the advertisements may be altered. For example, if the owner does not select the advertisement ranked 100, then the score for that advertisement will be reduced. In addition, the score can also be adjusted based on how often each advertisement is clicked by an end user. For example, if one hundred advertisements are generated for a webpage 10 and are ranked 1 to 100, but the eightieth advertisement is clicked the most, then it is possible to change the scoring applied to future advertisements, increasing the scoring applied to the template of the eightieth advertisement.
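The feedback described above could be recorded per template as sketched below. The additive update, the class name TemplateFeedback and the use of a template identifier string are assumptions; how scores are actually adjusted is not specified above.

```java
import java.util.HashMap;
import java.util.Map;

final class TemplateFeedback {
    // Running score adjustment per (hypothetical) template identifier.
    private final Map<String, Double> adjustments = new HashMap<>();

    /** Raises the template's score when its advertisement is selected or clicked, lowers it otherwise. */
    void record(String templateId, boolean positive, double delta) {
        adjustments.merge(templateId, positive ? delta : -delta, Double::sum);
    }

    /** Returns the accumulated adjustment, or 0.0 for an unseen template. */
    double adjustmentFor(String templateId) {
        return adjustments.getOrDefault(templateId, 0.0);
    }
}
```

The accumulated adjustment would then be added to the score of any future advertisement built from that template before ranking.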

Although advertisements are discussed in detail above, the apparatus and method disclosed can be used in order to generate summaries of text documents etc.

The invention has been described with particular illustrative embodiments. It is to be understood that the invention is not limited to the above-described embodiments and that various changes and modifications may be made by those of ordinary skill in the art without departing from the scope of the invention.

Claims

1. An apparatus for automatically generating advertisements, the apparatus comprising: a pre-processor module configured to identify and extract main content from an input document, the main content comprising information for generating advertisements; an extraction module configured to identify advertising components from the main content; and an advertisement generator configured to compile at least one advertisement, each advertisement comprising a template retrieved from a template data storage device and at least one identified advertising component.

2. The apparatus according to claim 1, wherein the pre-processor module is further configured to: divide the document into a plurality of sections; and one or more of the following: perform size based analyses on each section of the plurality of sections; perform position based analyses on each section of the plurality of sections; perform linguistic quality based analyses on text of each section of the plurality of sections; and identify the main content taking into account at least one of the said analyses.

3. The apparatus according to claim 2, wherein the step of dividing the document into the plurality of sections comprises: providing the document as a tree structure defining a logical structure of the document; and dividing the document into the sections, each section comprising all text containing nodes at a same relative position in the tree structure.

4. The apparatus according to claim 3, wherein the tree structure comprises a document object model (DOM) tree structure.

5. The apparatus according to any one of claims 2 to 4, wherein the step of performing size based analyses on each section, comprises: determining a size of each section; and ranking the sections based on the determined size, the section having the largest determined size being ranked first.

6. The apparatus according to claim 5, wherein the size comprises: a number of characters in each section; a number of words in each section; a number of sentences in each section; or a number of paragraphs in each section.

7. The apparatus according to any one of claims 2 to 6, wherein the step of performing position based analyses on each section, comprises: ranking the sections based on a position of each section in the document, the section appearing first in the document being ranked first.

8. The apparatus according to any one of claims 2 to 7, wherein the step of determining a quality score for the text of each section comprises: determining a quality score with reference to predetermined linguistic statistics.

9. The apparatus according to claim 8, wherein the step of determining the quality score for the text of each section comprises: assigning a score to each sentence of each section; assigning a score to each section based on a sum of the score assigned to each sentence; and ranking the sections based on the section scores.

10. The apparatus according to any one of claims 5 to 9, wherein the section with the highest score is considered to be the main content.

11. The apparatus according to any one of claims 1 to 10, wherein either the preprocessor module or the extraction module is further configured to: parse the main content in order to identify a structure of the main content; and wherein the extraction module is further configured to: identify the advertising components from the main content taking into account the structure of the main content.

12. The apparatus according to any one of claims 1 to 11, wherein either the preprocessor module or the extraction module is further configured to: derive relationships between text of the main content, and wherein the extraction module is further configured to: identify the advertising components from the main content taking into account the relationships between the text of the main content.

13. The apparatus according to any one of claims 1 to 12, wherein either the preprocessor module or the extraction module is further configured to: determine linguistic statistics of the main content, and wherein the extraction module is further configured to: identify the advertising components from the main content taking into account the linguistic statistics of the main content.

14. The apparatus according to any one of claims 1 to 13, further comprising: an advertising component storage device comprising additional advertising components, and wherein the advertisement generator is further configured to: compile the at least one advertisement using additional advertising components selected using at least one identified advertising component.

15. The apparatus according to claim 14, wherein the additional advertising component is substitutable for the at least one identified advertising component.

16. The apparatus according to any one of claims 1 to 15, further comprising: an advertising component extraction device configured to obtain additional advertising components from at least one other document, the additional advertising components comprising a relationship with at least one identified advertising component.

17. The apparatus according to claim 16, wherein the at least one other document comprises a webpage.

18. The apparatus according to any one of claims 1 to 17, further comprising: a linguistics storage device comprising linguistic data, and wherein the advertisement generator is further configured to: compile the at least one advertisement using linguistic data, such that the at least one advertisement is grammatically correct.

19. The apparatus according to any one of claims 1 to 18, further comprising: an abbreviation module for reducing the size of the at least one advertisement to a pre-determined size.

20. The apparatus according to claim 19, wherein the pre-determined size comprises a pre-determined number of characters.

21. The apparatus according to claim 19 or 20, wherein reducing the size of the at least one advertisement comprises one or more of: removing non-essential advertising components; replacing words with acronyms or abbreviations; replacing words with shorter words having a similar meaning.

22. The apparatus according to any one of claims 1 to 21, wherein each advertising component comprises at least one word.

23. The apparatus according to claim 22, wherein each advertising component comprises at least one of: an entity phrase; a sentiment phrase; a sentiment phrase with entity; a description phrase; a description phrase with entity; an imperative phrase without entity; an imperative phrase with entity; an interrogative phrase; an interrogative phrase with entity; a price phrase; a delivery phrase; a stock information and return information phrase.

24. The apparatus according to claim 23, wherein an entity comprises a product or a service.

25. The apparatus according to any one of claims 1 to 24, wherein the template storage device comprises a plurality of templates.

26. The apparatus according to any one of claims 1 to 25, wherein the at least one advertisement comprises a link to the input document.

27. The apparatus according to any one of claims 1 to 26, wherein the input document comprises a webpage.

28. The apparatus according to any one of claims 1 to 17, wherein the apparatus compiles a plurality of advertisements, and the apparatus further comprises: an ordering module configured to order the plurality of advertisements based on advertisement quality analyses.

29. A method of controlling a computer apparatus to automatically generate advertisements, the method comprising: identifying and extracting main content from an input document, the main content comprising information for generating advertisements; identifying advertising components from the main content; and compiling at least one advertisement, each advertisement comprising a template retrieved from a template storage device and at least one identified advertising component.

30. The method according to claim 29, wherein the step of identifying and extracting the main content comprises: dividing the document into a plurality of sections; and one or more of the following: performing size based analyses on each section of the plurality of sections; performing position based analyses on each section of the plurality of sections; performing linguistic quality based analyses on text of each section of the plurality of sections; and identifying the main content taking into account at least one of the said analyses.

31. The method according to claim 30, wherein the step of dividing the document into the plurality of sections comprises: providing the document as a tree structure defining a logical structure of the document; and dividing the document into the sections, each section comprising all text containing nodes at a same relative position in the tree structure.

32. The method according to claim 30 or 31, wherein the step of performing size based analyses on each section, comprises: determining a size of each section; and ranking the sections based on the determined size, the section having the largest determined size being ranked first.

33. The method according to any one of claims 30 to 32, wherein the step of performing position based analyses on each section, comprises: ranking the sections based on a position of each section in the document, the section appearing first in the document being ranked first.

34. The method according to any one of claims 30 to 33, wherein the step of determining a quality score for the text of each section comprises: determining a quality score with reference to predetermined linguistic statistics.

35. The method according to claim 34, wherein the step of determining the quality score for the text of each section comprises: assigning a score to each sentence of each section; assigning a score to each section based on a sum of the score assigned to each sentence; and ranking the sections based on the section scores.

36. The method according to any one of claims 32 to 35, wherein the section with the highest score is considered to be the main content.

37. The method according to any one of claims 29 to 36, wherein following the step of identifying and extracting the main content the method further comprises: parsing the main content in order to identify a structure of the main content; and wherein the step of identifying the advertising components comprises: identifying the advertising components from the main content taking into account the structure of the main content.

38. The method according to any one of claims 29 to 37, wherein following the step of identifying and extracting the main content, the method further comprises: deriving relationships between text of the main content; and wherein the step of identifying the advertising components comprises: identifying the advertising components from the main content taking into account the relationships between the text of the main content.

39. The method according to any one of claims 29 to 38, wherein following the step of identifying and extracting the main content, the method further comprises: determining linguistic statistics of the main content; and wherein the step of identifying the advertising components comprises: identifying the advertising components from the main content taking into account the linguistic statistics of the main content.

40. The method according to any one of claims 29 to 39, further comprising: compiling the at least one advertisement using additional advertising components retrieved from an advertising component storage device, the additional advertising components selected using the at least one identified advertising component.

41. The method according to any one of claims 29 to 40, further comprising: compiling the at least one advertisement using additional advertising components retrieved from at least one other document, the additional advertising components comprising a relationship with the at least one identified advertising component.

42. The method according to any one of claims 29 to 41, further comprising: compiling the at least one advertisement using linguistic data retrieved from a linguistics storage device, such that the at least one advertisement is grammatically correct.

43. The method according to any one of claims 29 to 41, further comprising: reducing the size of the at least one advertisement to a pre-determined size.

44. The method according to claim 43, wherein the step of reducing the size of the at least one advertisement comprises one or more of: removing non-essential advertising components; replacing words with acronyms or abbreviations; replacing words with shorter words having a similar meaning.

45. The method according to any one of claims 29 to 44, wherein a plurality of advertisements are compiled, and the method further comprises: ordering the plurality of advertisements based on advertisement quality analyses.

46. An apparatus for identifying a main content of a document, the apparatus comprising a processor module configured to: divide the document into a plurality of sections; and one or more of the following: performing size based analyses on each section of the plurality of sections; performing position based analyses on each section of the plurality of sections; performing linguistic quality based analyses on text of each section of the plurality of sections; and identify the main content taking into account at least one of the said analyses.

47. The apparatus according to claim 46, wherein the step of dividing the document into the plurality of sections comprises: providing the document as a tree structure defining a logical structure of the document; and dividing the document into the sections, each section comprising all text containing nodes at a same relative position in the tree structure.

48. The apparatus according to claim 47, wherein the tree structure comprises a document object model (DOM) tree structure.

49. The apparatus according to any one of claims 46 to 48, wherein the step of performing size based analyses on each section, comprises: determining a size of each section; and ranking the sections based on the determined size, the section having the largest determined size being ranked first.

50. The apparatus according to claim 49, wherein the size comprises: a number of characters in each section; a number of words in each section; a number of sentences in each section; or a number of paragraphs in each section.

51. The apparatus according to any one of claims 46 to 50, wherein the step of performing position based analyses on each section, comprises: ranking the sections based on a position of each section in the document, the section appearing first in the document being ranked first.

52. The apparatus according to any one of claims 46 to 50, wherein the step of determining a quality score for the text of each section comprises: determining a quality score with reference to predetermined linguistic statistics.

53. The apparatus according to claim 52, wherein the step of determining the quality score for the text of each section comprises: assigning a score to each sentence of each section; assigning a score to each section based on a sum of the score assigned to each sentence; and ranking the sections based on the section scores.

54. The apparatus according to any one of claims 49 to 53, wherein the section with the highest score is considered to be the main content.

55. The apparatus according to any one of claims 46 to 54, wherein the input document comprises a webpage.

56. A method of controlling a computer apparatus for identifying a main content of an input document, the method comprising: dividing the document into a plurality of sections; and one or more of the following: performing size based analyses on each section of the plurality of sections; performing position based analyses on each section of the plurality of sections; performing linguistic quality based analyses on text of each section of the plurality of sections; and identifying the main content taking into account at least one of the said analyses.

57. The method according to claim 56, wherein the step of dividing the document into the plurality of sections comprises: providing the document as a tree structure defining a logical structure of the document; and dividing the document into the sections, each section comprising all text containing nodes at a same relative position in the tree structure.

58. The method according to claim 56 or 57, wherein the step of performing size based analyses on each section, comprises: determining a size of each section; and ranking the sections based on the determined size, the section having the largest determined size being ranked first.

59. The method according to any one of claims 56 to 58, wherein the step of performing position based analyses on each section, comprises: ranking the sections based on a position of each section in the document, the section appearing first in the document being ranked first.

60. The method according to any one of claims 56 to 59, wherein the step of determining a quality score for the text of each section comprises: determining a quality score with reference to predetermined linguistic statistics.

61. The method according to claim 60, wherein the step of determining the quality score for the text of each section comprises: assigning a score to each sentence of each section; assigning a score to each section based on a sum of the score assigned to each sentence; and ranking the sections based on the section scores.

62. The method according to any one of claims 58 to 61, wherein the section with the highest score is considered to be the main content.

63. An apparatus for reducing a size of an automatically generated advertisement comprising at least one advertising component to a pre-determined size, the apparatus comprising: an abbreviation module configured to remove non-essential advertising components; and/or replace advertising components with acronyms or abbreviations; and/or replace advertising components with shorter advertising components having a similar meaning.

64. The apparatus according to claim 63, wherein each advertising component comprises at least one word.

65. The apparatus according to claim 63 or 64, wherein each advertising component comprises at least one of: an entity phrase; a sentiment phrase; a sentiment phrase with entity; a description phrase; a description phrase with entity; an imperative phrase without entity; an imperative phrase with entity; an interrogative phrase; an interrogative phrase with entity; a price phrase; a delivery phrase; a stock information and return information phrase.

66. The apparatus according to claim 65, wherein an entity comprises a product or a service.

67. The apparatus according to claim 65 or 66, wherein the non-essential advertising components comprise: sentiment phrases; description phrases; imperative phrases without entity; interrogative phrases; price phrases; delivery phrases; stock information and returns information phrase.

68. The apparatus according to any one of claims 63 to 67, wherein the predetermined size comprises a pre-determined number of characters.

69. A method of controlling a computer apparatus for reducing a size of an automatically generated advertisement comprising at least one advertising component to a pre-determined size, the method comprising: removing non-essential advertising components; and/or replacing advertising components with acronyms or abbreviations; and/or replacing advertising components with shorter advertising components having a similar meaning.

70. An apparatus for automatically generating a summary of a document, the apparatus comprising: a pre-processor module configured to identify and extract main content from an input document, the main content comprising information for generating the summary; an extraction module configured to identify summary components from the main content; and an advertisement generator configured to compile at least one summary, each summary comprising a template retrieved from a template data storage device and at least one identified summary component.

71. A method of controlling a computer apparatus to automatically generate a summary of a document, the method comprising: identifying and extracting main content from an input document, the main content comprising information for generating the summary; identifying summary components from the main content; and compiling at least one summary, each summary comprising a template retrieved from a template storage device and at least one identified summary component.

72. A computer program product comprising programme code means for performing the method according to any one of claims 29 to 45, 56 to 64, 69 and 71.

73. A computer readable medium recorded with computer readable code arranged to cause a computer to perform the method according to any one of claims 29 to 45, 56 to 64, 69 and 71.

74. A computer programme code means for performing the method according to any one of claims 29 to 45, 56 to 64, 69 and 71.