GB2458490A

GB2458490A - Displaying the summary of a text file

Info

Publication number: GB2458490A
Application number: GB0805156A
Authority: GB
Inventors: Ian Matthew Haynes
Original assignee: Triad Group PLC
Current assignee: Triad Group PLC
Priority date: 2008-03-20
Filing date: 2008-03-20
Publication date: 2009-09-23
Also published as: GB0805156D0

Abstract

A method of processing a text file comprises receiving a text file, detecting a plurality of terms within the text file, calculating the importance of each detected term within the text file and displaying a generated summary of the text file. The displayed summary comprises one or more of the detected terms, with the text size of each detected term being in proportion to the calculated importance of the term. The importance can be calculated by identifying the frequency of each detected term within the file. Preferably, the summary of the text file comprises only the n most important detected terms, where n is an integer such as n = 3.

Description

PROCESSING A TEXT FILE

This invention relates to a method of, and system for, processing a text file.

It is known to provide a summary of a text file. For example, if an individual searches electronically through a database of files, via a suitable searching interface, then the documents that are returned as the result of the search are commonly abbreviated to their title and/or possibly their abstract.

The title and abstract are usually created by the author of the document either when the document is first created or when the document is added to the database. In some circumstances, an abstract is composed by a specialist author, separate from the original author. All of the known techniques for producing summaries of documents rely on human input, which has a cost implication and also relies on the expertise of the author of the summary.

It is therefore an object of the invention to improve upon the known art.

According to a first aspect of the present invention, there is provided a method of processing a text file comprising receiving a text file, detecting a plurality of terms within the text file, calculating the importance of each detected term within the text file, generating a summary of the text file, the summary comprising one or more of the detected terms, and displaying the summary of the text file, the text size of each detected term being in proportion to the calculated importance of the term.

According to a second aspect of the present invention, there is provided a system for processing a text file comprising a processor arranged to receive a text file, to detect a plurality of terms within the text file, to calculate the importance of each detected term within the text file, and to generate a summary of the text file, the summary comprising one or more of the detected terms, and a display device arranged to display the summary of the text file, the text size of each detected term being in proportion to the calculated importance of the term.

According to a third aspect of the present invention, there is provided a computer program product on a computer readable medium for processing a text file, the product comprising instructions for receiving a text file, detecting a plurality of terms within the text file, calculating the importance of each detected term within the text file, generating a summary of the text file, the summary comprising one or more of the detected terms, and displaying the summary of the text file, the text size of each detected term being in proportion to the calculated importance of the term.

Owing to the invention, it is possible to provide an automated summary of a text file that will provide a result that is clear and concise, and also one that provides further information visibly about the importance of the detected terms within the original text file.

Advantageously, the step of calculating the importance of each detected term within the text file comprises calculating the frequency of each detected term within the text file. This method of determining the importance of a term within the text file is very robust and easily executed, as all that is needed is to count the occurrences of the specified terms within the text file. These terms are then displayed in the summary with a size that is dependent on the frequency with which they are found within the original text file. Other methods of determining the importance of the terms within the document are possible. For example, the context with which terms are used or other detected components (such as numbers) that are linked to the terms can be used to determine the relative (or absolute) importance of terms within the text file.
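By way of illustration only, the frequency-based calculation of importance might be sketched as follows. The function name `term_frequencies`, the tokenising pattern and the sample text are assumptions made for this sketch and do not appear in the specification.

```python
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Count how often each word-like term occurs, ignoring case."""
    # Assumption: a term is a run of letters, digits, '+' or '#'.
    terms = re.findall(r"[A-Za-z0-9+#]+", text.lower())
    return Counter(terms)

# Hypothetical sample text standing in for a text file.
sample = "Java developer. Built Java and SQL systems; tuned SQL, SQL, SQL."
freq = term_frequencies(sample)
print(freq["sql"], freq["java"])
```

Here the count of each term serves directly as its importance; any other scoring scheme could be substituted without changing the later steps.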

For example, if a database contains curricula vitae (CVs) of individuals, then when a user searches for particular skills, they are presented with a list of suitable candidates. Each candidate is shown with the full list of the skills contained in their CV. In order to emphasise the relative strength of each skill, as represented by the number of occurrences in the CV or the number of years' experience, the size of the font (or point size) changes. The more times a skill is mentioned in the CV, the larger the size of the font.

Preferably, the method further comprises accessing a list of terms for detecting within the text file. By providing a list of terms that are relevant to the context of the text file which are used to search the file, there is no need to use any complicated processing of the text file to determine the terms that are to be used in the summary. A list is provided of terms, and the text file is then parsed to find the importance of these terms within the document (for example using the frequency to determine the importance).
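A minimal sketch of detection driven by a supplied list of terms is given below, assuming that importance is taken to be frequency. The function name `count_listed_terms`, the whole-term matching rule and the sample CV text are hypothetical choices made for illustration.

```python
import re

def count_listed_terms(text: str, term_list: list[str]) -> dict[str, int]:
    """Return the number of occurrences of each listed term in the text."""
    counts = {}
    for term in term_list:
        # Whole-term, case-insensitive match; re.escape keeps names such as "C++" literal.
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        counts[term] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

# Hypothetical list of skill labels and CV text.
skills = ["ORACLE", "SQL", "JAVA", "PHP"]
cv = "Oracle DBA. Wrote SQL reports on Oracle; some Java. JavaScript is not listed."
print(count_listed_terms(cv, skills))
```

Because only the listed terms are searched for, no analysis of the file is needed to decide which terms are candidates for the summary; in this sketch "JavaScript" is not counted as an occurrence of "JAVA".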

Ideally, the summary of the text file comprises only the n most important detected terms, where n is an integer, for example n = 3. In many situations it will be appropriate to limit the number of terms that are contained within the summary to avoid the summary becoming too large and unwieldy. By imposing a predetermined limit on the number of terms within the summary, only the most relevant information is provided to the user.
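The restriction to the n most important terms might be sketched as below. The function name `top_n_terms` and the example scores are assumptions; the specification does not say how ties are to be resolved, so this sketch simply keeps the terms in their original order when scores are equal.

```python
def top_n_terms(importance: dict[str, float], n: int = 3) -> list[tuple[str, float]]:
    """Return the n highest-scoring (term, importance) pairs, most important first."""
    # sorted() is stable, so terms with equal scores keep their input order.
    ranked = sorted(importance.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]

# Hypothetical importance scores for four detected terms.
scores = {"ORACLE": 8, "SQL": 4, "JAVA": 2, "PHP": 1}
print(top_n_terms(scores, n=3))
```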

Advantageously, the step of displaying the summary of the text file comprises displaying the text size of each detected term in direct proportion to the calculated importance of the term. This is one clever way in which the summary can be provided to the user. The terms that are presented in the summary are sized in direct proportion to their calculated importance. For example, if a first term is mentioned eight times in the text file, and a second term is mentioned four times in the text file, then in the summary, the first term will have a text size that is twice that of the text size of the second term.
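One possible way of realising the direct proportion is sketched below, using the eight and four occurrences of the example above. The function name `text_sizes` and the choice of scaling the most important term to a maximum of 32 points are assumptions; any constant of proportionality would preserve the two-to-one ratio.

```python
def text_sizes(importance: dict[str, float], max_points: float = 32.0) -> dict[str, float]:
    """Map each term to a point size directly proportional to its importance."""
    largest = max(importance.values(), default=0)
    if largest <= 0:
        return {term: 0.0 for term in importance}
    # size = max_points * importance / largest, so ratios between terms are preserved.
    return {term: max_points * score / largest for term, score in importance.items()}

sizes = text_sizes({"first term": 8, "second term": 4})
print(sizes)
```

With these values the first term is rendered at twice the size of the second.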

Embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings, in which:

Figure 1 is a schematic diagram of a system for processing a text file,
Figure 2 is a schematic diagram of the processing of the text file,
Figure 3 is a flowchart of the method of processing the text file, and
Figure 4 is a schematic diagram of a graphical display on a display device.

Figure 1 illustrates a system which can be used to process text files.

The system of this Figure is a conventional personal computer (PC) used in a desktop environment, but could equally be a networked workstation for example. The system comprises a display device 10, which can be any suitable display for viewing documents, such as a CRT display or flat panel display capable of displaying text. The system also comprises a processing component 12, which in turn comprises a large number of processing, storage and input/output elements. Two such elements are illustrated, being a processor 14 and a database 16. The processor 14 is arranged to carry out process tasks and to control the image shown by the display device 10. The database 16 is a local storage device that stores information for use by the processor 14. In the Figure, the database 16 is shown connected to the processor 14 by a local bus. The system also includes conventional user interface devices 18, being a keyboard 18a and a mouse 18b.

The processor 14 has access to one or more text files. These could be stored locally by the database 16, or they could be accessed via a network connection to a remote database. For example, a user of the system of Figure 1 may wish to carry out a search of CVs in a remote database. These CVs are represented as individual text files within the remote database. As text files are recalled from the remote database, they are processed by the processor 14 which generates a summary of each text file that is returned, for display to the user.

Figure 2 illustrates schematically the processing of a text file 20, which is handled by the processor 14. The file 20, as discussed above, can be recalled locally from an existing database, or can be received from a remote database. The processor 14 is arranged to receive the text file 20 and to detect a plurality of terms within the text file 20. In one embodiment, the processor 14 derives the terms that it is detecting directly from the document itself, for example simply looking for the most common terms within the document 20. In the illustrated embodiment, however, the local database 16 stores a list 24 which specifies those terms 26 that are to be detected within the file 20. For example, if the file 20 is a CV, then the list 24 of terms 26 will specify precise terms 26 to be detected within the text file 20 such as skill labels and so on.

Once the processor 14 has detected the plurality of terms within the file 20, then the processor 14 is arranged to calculate the importance of the detected terms. The importance can be defined in many different ways. One simple way is for the processor 14 to calculate the frequency of each detected term 26 within the file 20. Other methods, for example in the CV embodiment, might relate to the number of years that are linked to a specific job or skill.
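As a hedged illustration of the alternative measure mentioned for the CV embodiment, the sketch below totals the number of years linked to a skill. The specification does not say how years are to be associated with a term; the phrasing matched here ("<number> years of <term>" and close variants), the function name `years_linked_to` and the sample CV text are all assumptions.

```python
import re

def years_linked_to(text: str, term: str) -> int:
    """Sum the year counts that immediately precede the term in phrases like '5 years of Oracle'."""
    pattern = (
        r"(\d+)\s+years?\s+(?:of\s+)?(?:experience\s+(?:with|in)\s+)?"
        + re.escape(term)
        + r"(?!\w)"
    )
    # findall returns the captured digit strings, one per matching phrase.
    return sum(int(years) for years in re.findall(pattern, text, flags=re.IGNORECASE))

# Hypothetical CV text.
cv = "5 years of Oracle administration, 2 years experience with SQL, and 1 year Oracle consulting."
print(years_linked_to(cv, "Oracle"), years_linked_to(cv, "SQL"), years_linked_to(cv, "PHP"))
```

A real CV would need a considerably more tolerant parser; the sketch shows only that a number linked to a term can take the place of a raw occurrence count.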

Once the processor 14 has calculated the importance of the terms 26, then the processor 14 is arranged to generate a summary 22 of the text file 20, the summary 22 comprising one or more of the detected terms 26.

The method of processing the text file 20 is summarised in Figure 3 and comprises, firstly, at step S1, receiving the text file. The second step S2 is the step of detecting the plurality of terms within the text file 20, and this is followed by the step S3 of calculating the importance of each detected term within the text file 20. As discussed above, a simple mode of working out the importance of the detected terms is to work out the frequency with which those terms appear in the file 20. The next step is the step S4 of generating the summary 22 of the text file 20, the summary 22 comprising one or more of the detected terms, and finally, the system, at step S5, is arranged to display the summary 22 of the text file 20, the text size of each detected term being in proportion to the calculated importance of the term.

The method of Figure 3 may also include accessing the list 24 of terms for detecting within the text file 20. This extra step will take place between steps S1 and S2 of Figure 3. The processor 14 is arranged to execute steps S1 to S4 in turn for the file 20. The processor 14 may concurrently or consecutively process more than one text file 20, depending upon the application that is being used to generate the summary 22. For example, if the user is executing a query (such as an SQL query) against a database, then each of the hits that are returned by the query will be processed according to the flowchart of Figure 3.
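The consecutive or concurrent processing of several returned files might be sketched as follows, assuming that each hit is available as a string and that importance is frequency. The names `TERMS`, `summarise` and `hits`, and the use of a thread pool for the concurrent case, are illustrative assumptions rather than features of the described system.

```python
import re
from concurrent.futures import ThreadPoolExecutor

TERMS = ["ORACLE", "SQL", "JAVA", "PHP"]  # hypothetical contents of the list of terms

def summarise(text, terms=TERMS, n=3):
    """Detect listed terms, score each by frequency, and keep the n most frequent."""
    scores = {}
    for term in terms:
        pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            scores[term] = count
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:n]

# Hypothetical query hits, one text file per candidate.
hits = [
    "Oracle DBA: Oracle tuning, Oracle backups, SQL.",
    "PHP and Java developer; PHP frameworks.",
    "SQL analyst.",
]

consecutive = [summarise(cv) for cv in hits]      # one file after another
with ThreadPoolExecutor() as pool:                # several files at once
    in_parallel = list(pool.map(summarise, hits))

print(consecutive == in_parallel)
```

Each file is summarised independently of the others, so the order or degree of parallelism does not affect the summaries produced.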

An example of the type of result that would be generated using the system of Figure 1, according to the methodology of Figures 2 and 3, is shown in Figure 4, which shows an application window 28 on the display device 10.

The user has made a search, for example, for candidates within a specific geographic locality. The search has returned three results, candidates 1 to 3.

Each candidate has a text file (their CV) associated with them, but the processor 14 has generated the respective summaries 22a, 22b and 22c for each of the text files returned. These summaries 22 can be generated as and when they are needed, or they can be pre-generated and simply recalled from a suitable storage medium. The layout shown on the screen 10 is only one example; a wide variety of different arrangements are possible. It is sufficient that at least one summary 22 is displayed to the user via the display device 10.

The advantage of the system is delivered by the fact that in displaying the summary 22 of the text files, the text size of each detected term is in proportion to the calculated importance of the term. In the summaries 22 shown in Figure 4, the terms "ORACLE", "SQL", "JAVA" and "PHP" have been detected in the files that make up the CVs of the different candidates. These terms could have been recalled from the list 24, as described above, or could have been detected directly in the text files. Each term is displayed at a size that reflects the calculated importance of the respective term. For example, it is clear that candidate 1 has the most importance with respect to the term "ORACLE". Similar judgements can readily be made about other terms. The processing carried out by the processor 14 supports the display of the summaries that are shown in the Figure. The size of the term, as displayed, depends upon the importance (such as the number of occurrences) that the processor 14 assigns to the term.
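Purely as an illustration of how a summary of this kind could be rendered, the sketch below emits an HTML fragment in which each term carries its own font size. The specification does not prescribe any display technology; the use of HTML, the function name `render_summary_html` and the point sizes shown are assumptions.

```python
from html import escape

def render_summary_html(sizes: dict[str, float]) -> str:
    """Render each term as a span whose font size is the given point size."""
    spans = [
        f'<span style="font-size:{size:g}pt">{escape(term)}</span>'
        for term, size in sizes.items()
    ]
    return " ".join(spans)

# Hypothetical sizes for one candidate's summary.
print(render_summary_html({"ORACLE": 32.0, "SQL": 16.0}))
```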

Claims

1. A method of processing a text file comprising
* receiving a text file,
* detecting a plurality of terms within the text file,
* calculating the importance of each detected term within the text file,
* generating a summary of the text file, the summary comprising one or more of the detected terms, and
* displaying the summary of the text file, the text size of each detected term being in proportion to the calculated importance of the term.
2. A method according to claim 1, wherein the step of calculating the importance of each detected term within the text file comprises calculating the frequency of each detected term within the text file.
3. A method according to claim 1 or 2, and further comprising accessing a list of terms for detecting within the text file.
4. A method according to claim 1, 2 or 3, wherein the summary of the text file comprises only the n most important detected terms, where n is an integer.
5. A method according to any preceding claim, wherein the step of displaying the summary of the text file comprises displaying the text size of each detected term in direct proportion to the calculated importance of the term.
6. A system for processing a text file comprising
* a processor arranged to receive a text file, to detect a plurality of terms within the text file, to calculate the importance of each detected term within the text file, and to generate a summary of the text file, the summary comprising one or more of the detected terms, and
* a display device arranged to display the summary of the text file, the text size of each detected term being in proportion to the calculated importance of the term.
7. A system according to claim 6, wherein the processor is further arranged, when calculating the importance of each detected term within the text file, to calculate the frequency of each detected term within the text file.
8. A system according to claim 6 or 7, and further comprising a database arranged to store a list of terms for detecting within the text file.
9. A system according to claim 6, 7 or 8, wherein the summary of the text file comprises only the n most important detected terms, where n is an integer.
10. A system according to any one of claims 6 to 9, wherein the display device is further arranged, when displaying the summary of the text file, to display the text size of each detected term in direct proportion to the calculated importance of the term.
11. A computer program product on a computer readable medium for processing a text file, the product comprising instructions for
* receiving a text file,
* detecting a plurality of terms within the text file,
* calculating the importance of each detected term within the text file,
* generating a summary of the text file, the summary comprising one or more of the detected terms, and
* displaying the summary of the text file, the text size of each detected term being in proportion to the calculated importance of the term.
12. A method according to claim 11, wherein the instructions for calculating the importance of each detected term within the text file comprise instructions for calculating the frequency of each detected term within the text file.
13. A method according to claim 11 or 12, and further comprising instructions for accessing a list of terms for detecting within the text file.
14. A method according to claim 11, 12 or 13, wherein the summary of the text file comprises only the n most important detected terms, where n is an integer.
15. A method according to any one of claims 11 to 14, wherein the instructions for displaying the summary of the text file comprise instructions for displaying the text size of each detected term in direct proportion to the calculated importance of the term.