GB1602591A

GB1602591A - Data comparison apparatus

Info

Publication number: GB1602591A
Application number: GB2489878A
Authority: GB
Original assignee: Xerox Corp
Current assignee: Xerox Corp
Priority date: 1978-05-31
Filing date: 1978-05-31
Publication date: 1981-11-11

Description

(54) DATA COMPARISON APPARATUS (71) We, XEROX CORPORATION of Xerox Square, Rochester, New York, United States of America, a corporation organised under the laws of the State of New York, United States of America, do hereby declare the invention, for which we pray that a patent may be granted to us, and the method by which it is to be performed, to be particularly described in and by the following statement:- This invention relates to a data comparison apparatus, and in particular to an apparatus for enabling records to be rapidly retrieved from a store of information by their contents.

In order to store large amounts of information efficiently and economically, several types of memory devices have been developed in recent years. Two such memory devices are the magnetic drum and the magnetic disc, which store information in the form of binary bits represented by the magnetisation state of material coating the surface of a moving member. Transducer heads are provided in close proximity to the moving surface to read or detect changes of magnetic flux as the surface passes by. The heads convert the changes of flux into electrical signals and also can be energised to write, ie. change the state of the magnetic flux, on the moving surface. The bits are stored on "tracks" on the drum or disc, a track being the area passing underneath a particular head during rotation of the drum or disc. The reading operation does not destroy the magnetic state of the material passing under the heads, this state being changed only by a writing operation.

Although such devices can store large amounts of information, the retrieval of information from the store has presented problems.

One way of accessing information stored in large capacity memories is one in which a record is accessed or identified by any portion of its contents. The memory device, when this accessing technique is used, is sometimes referred to as an associative memory device. Where the record contains human language, minor spelling errors or variations occur which would prevent a record being accessed in this way.

Thus it would be desirable to be able to access a record by an approximate match to a portion of its contents.

Accessing a record by any portion of its contents is most conveniently done by examining the whole of each record. To do this in a reasonable time with files of a practical size, it is necessary to perform the examination by the fastest means possible.

It is therefore an object of the present invention to provide a data comparison apparatus of the associative memory kind with tolerance to minor spelling errors and variations.

According to the present invention there is provided an apparatus for comparing a profile string of bit patterns with a source string of bit patterns to determine whether the profile string is matched to a predetermined extent with any succession of bit patterns forming part of the source string, comprising a first addressable store containing the profile string and arranged to provide in each of a succession of time periods an output representing the presence and the position in, or the absence from, the profile string of each bit pattern of the source string, the first addressable store being addressable by each bit pattern of the source string combined with the previously identified position, a second addressable store containing data representative of said predetermined extent of matching, the second addressable store being addressable by the output of the first addressable store, and means responsive to the output of the second addressable store to provide an indication when matching to said predetermined extent has occurred.

An arrangement of this kind allows the degree of approximation in matching to be set up at will be placing binary integers in a random access memory, and is capable of searching a source string faster than a conventional computer.

The general principles of operation of such an arrangement will now be described.

The degree of similarity between two strings of characters A and B can be broken down into two factors. The first factor is the number of characters in string A which occur in string B. The second factor is the degree to which the sequence of characters in string A matches the sequence of characters in string B. By means of a suitable arrangement of logic circuitry, it is possible to generate a score by accumulating integers selected from a table according to these two factors. The factors are determined in the following way. The source string of characters is scanned character by character from left to right.

Each character of the source string will either occur in the profile string or not. If we take a character which does occur in the profile string and move on to the next one which occurs in the profile string, each of these two will occupy a particular position in the profile string (which may be different from its position in the source string).

Between these two "matched characters in the source string will be some number of characters which do not occur in the profile string (this number may, of course, be zero).

By counting the number of characters between these matched characters, we have a measure of the first factor, and by looking at their relative positions in the profile string we have a measure of the second factor. These three quantities are used together to select an addition to the score from the table, that is: (a) The position in the profile string of the first character (b) The position in the profile string of the second character (c) The number of characters between them.

Now the object is to discover if there is a section of the source string which is similar to the profile string rather than to assess the degree of similarity between the source string as a whole and the profile string. The score is therefore accumulated repeatedly over fixed lengths of the source string, a new length starting further along the source string each time by one character.

The successive scores thus apply to successive overlapping sections of the source string, rather like a window moving along the source string. To avoid repeatedly adding in the contribution of each character over and over a ain, each contribution is added in once when it occurs and is then subtracted a fixed number of characters later, the number of characters being equal to the matching length desired.

One of the problems of this approach is the existence of duplicated characters in the profile string. When this happens a character in the source string will occur in two or more positions in the profile string.

This situation can be dealt with by allowing one of the positions to be selected according to the position occupied by the last source string character occurring in the profile string.

An embodiment of the device in accordance with the invention will now be described, by way of example, with reference to the accompanying drawings, in which: Figure 1 is a circuit diagram in block form of the device; Figure 2 is a table showing the contents of a random access memory which is included in the circuit of Figure 1; Figure 3 is a chart showing a simulation of a matching operation being carried out by an apparatus according to the invention; and Figure 4 is a chart showing the scores resulting from performing the matching operation on a number of spelling variations of a word.

Referring to the drawings, it will be assumed that it is desired to search a store of information for the word "PROCEEDING". This profile word is loaded into a random access memory 1, to establish the table depicted in Figure 2. In the simple example shown, the characters of the alphabet (in upper case only) constitute the co-ordinates for the rows, and the columns are for the different previous positions. Thus, the letters P, R, 0, and C always have positions 1, 2, 3 and 4 respectively, since each of these letters appears only once in the profile word. Thus, no matter what the previous position, a P will always have position I attributed to it.

Thus, the whole row in the table designated by P is filled with 1. Similarly, the row for R is filled with 2, and so on. When we come to E, however, which occurs twice in the profile word, the row of-5s extends as far as the previous position 4, the remainder of the row being filled with 6s. Thus, a first E would be given position 5, and any following E would be given position 6. The remaining letters of the word PROCEEDING are allocated positions 7 to 10. Letters not occurring in the word are allocated a 0.

Referring now to Figure 1, the characters of a source string are fed in succession, along input lines 2, to the random access memory (RAM) 1. Also forming part of the address is the position in the profile word of the preceding source string character. This appears on input lines 3. The output of RAM 1, which represents the position of the current character as determined by the table of Figure 2, is passed via lines 4 to a second random access memory 5. The output of the first RAM 1 is also passed to a position register 6 which stores the present position for one clocking period, and then passes it on lines 3 to the first RAM 1 as mentioned above. Zeros in the output of RAM 1 are detected by a zero detector 7.

Zeros recorded by the detector 7 are counted by a non-match counter 8.

If the character occurs, the present position, the previous position, and the contents of the non-match counter 8 are used as the address for the second random access memory 5 and the non-match counter 8 is zeroed. If the character does not occur the second memory is not accessed and instead the non-match counter is incremented.

The output word from the second RAM 5 is treated as a signed binary integer which is passed to an adder 9. The output word of RAM 5 is also passed via delay store 10 and negator 11 to the adder 9. The output of adder 9 goes to a score register 12, which also provides an input to the adder 9. The delay store 10 has the effect of reducing by one the accumulated score in the score register 12 a preselected number of score increments after a given score increment was first added to the score. The contents of score register 12 are compared in a comparator 13 with the contents of a match limit register 14. If the content of the score register 12 is greater than that of the match limit register 14, a hit is indicated on line 15.

Note that the position of the character as read from RAM 1 is selected not only by the character code, but by the position of the last character stored in the position register 6. This fulfils two functions: 1. It deals with duplicated characters in the pattern by selecting the position nearest to the position of the last character.

2. It allows one to do "wild card" matching (matching in which any characters are accepted in the designated position) by filling up RAM 1 such that for the wild card position all characters advance the position by one.

The circuitry shown will be connected to a computer and a storage device by a suitable interface. For a given type of search, the second RAM 5 will be loaded by the computer with a predetermined set of integers to programme the required matching characteristics. If the characteristics do not change, a read only memory might advantageously be used.

Prior to each search, the first RAM 1 will be loaded by the computer with the position that each character of the alphabet occupies in the search pattern, as described above, the match limited register 14 loaded with the score which indicates a hit, and a delay register 16 will be loaded with the length over which the socre is to be accumulated.

By making a few basic assumptions, it is possible to calculate the sizes of the various registers. In what follows all the variables are integers.

Let Maximum length of pattern be: p=2q-1 Let size of alphabet be: a=2b , No. of locations in first RAM: 2b+q 2b .Length of word in first RAM: q bits Length of Position Register 6: q bits Let length of Non-Match Counter 8 be: u bits ..No. of locations in second RAM: 22q+U Let length of word in second RAM be: d bits Let length of Score Register 12 be: v bits Taking a typical example for interactive English text retrieval: Length of pattern p=15 .-. q=4 Alphabet a=64 b=6 No. of locations in first RAM: =960 Length of word in first RAM: =4 bits Length of Position Register 6=4 bits Assume that we are not interested in more than 8 non-matching characters in a row: Length of Non-Match Counter 8 u=3 bits No. of locations in second RAM: 2048 Assume that we do not want to alter the score by more than +4 at any time.

Length of word in second RAM: 3 bits Assume that the maximum score we would want to accumulate corresponding to an exact match, is 4xp=64.

.Length of Score Register 12=6 bits.

The speed of operation of an apparatus in accordance with the invention will now be considered. The total time to complete one character and to be ready for the next depends on the time taken to perform each step in the sequence. Some operations can be overlapped, leading to the following sequence of elapsed time in the cycle.

1. Access the first RAM 1 for present character. Compare score register 12 with match limit register 14 for previous character and signal hit if required. Store integer into delay store 10 for previous character.

2. Access second RAM 5 for present character or increment non-match counter 8.

3. Add integer from second RAM 5 and negated integer from the delay store 10 into score register 12, set position register 6.

The three operations in the first of these steps occur simultaneously as do the two operations in the third step, so that the duration of the step is the longest of the operation. The total time for one character will be the sum of the duration of these three steps.

The apparatus of the invention provides a number of advantageous features. Two strings of bit patterns, such as characters, can of course be compared by means of a sequence of ordinary digital computer instructions. The logic will perform the process faster than a computer implemented in equivalent logic components. The number of separate steps to handle one character is about the same as the number of steps to execute a single computer instruction. A computer would require at least three or four such instructions to handle one character.

The apparatus has the further advantage that the way of computing the degree of similarity is adjustable over a wide range. In particular, the number of characters in the source string which occur in the profile string can be weighted separately to the extent of their being out of the correct sequence. Weights can be assigned according to the position in the profile string thus allowing the first part of the profile string to be given prominence if required. As mentioned above, wild card matching can be implemented.

Furthermore, designated sets of characters, such as space, comma, full stop, can be treated as if identical by loading the first RAM approximately.

Referring now to Figure 3, the chart shows an example of a matching operation using the algorithm: IF ((present position-previous position)= 1 )AND (intervening character=0) THEN score increment=l ELSE score increment=0 In this example, the profile string is the word "PROCEEDING", and the word "PRXXEEDING" appears on the source string preceded by a random set of characters. The first complete line of the chart represents the arrival on input lines 2 of a succession of individual characters in the source string. The second line indicates the position accorded to that character by the first RAM 1. The third line represents the score increment, as read from the second RAM 5 when a character occurs in the profile string and the fourth line represents the score accumulated over the current character and the preceding nine characters occurring in the profile string.

In this particular example the maximum score accumulated is 6, which can be arranged to cause a "match" to be indicated, or not, depending on the threshold score selected.

In Figure 4, there are listed a number of examples of variations on the profile string "PROCEEDING", and the scores which result when the matching operation is carried out using the same algorithm as described above in connection with Figure 3.

WHAT WE CLAIM IS: 1. Apparatus for comparing a profile string of bit patterns with a source string of bit patterns to determine whether the profile string is matched to a predetermined extent with any succession of bit patterns forming part of the source string, comprising a first addressable store containing the profile string and arranged to provide in each of a succession of time periods an output representing the presence and the position in, or the absence from, the profile string of each bit pattern of the source string, the first addressable store being addressable by each bit pattern of the source string combined with the previously identified position, a second addressable store containing data representative of said predetermined extent of matching, the second addressable store being addressable by the output of the first addressable store, and means responsive to the output of the second addressable store to provide an indication when matching to said predetermined extent has occurred.

2. The apparatus of Claim I wherein the second addressable store is addressable by the output of a position register and a nonmatch counter both of which are responsive to the output of the first addressable store, in addition to the direct output of the first addressable store.

3. The apparatus of Claim 2 wherein the first addressable store is addressed by said position register to provide said previously identified position.

4. The apparatus of any one of Claims 1 to 3 wherein the means responsive to the output of the second addressable store includes a counting arrangement for accumulating a count over a succession of said time periods.

5. The apparatus of Claim 4 wherein the counting arrangement includes a delay store and a negator whereby the count accumulated by the counting arrangement is decremented by one, a pre-selected number of count increments after a given increment was first added to the count.

6. An apparatus for comparing a profile string of bit patterns with a source string of bit patterns, substantially as hereinbefore described with reference to the accompanying drawings.

**WARNING** end of DESC field may overlap start of CLMS **.

Claims

**WARNING** start of CLMS field may overlap end of DESC **. number of advantageous features. Two strings of bit patterns, such as characters, can of course be compared by means of a sequence of ordinary digital computer instructions. The logic will perform the process faster than a computer implemented in equivalent logic components. The number of separate steps to handle one character is about the same as the number of steps to execute a single computer instruction. A computer would require at least three or four such instructions to handle one character. The apparatus has the further advantage that the way of computing the degree of similarity is adjustable over a wide range. In particular, the number of characters in the source string which occur in the profile string can be weighted separately to the extent of their being out of the correct sequence. Weights can be assigned according to the position in the profile string thus allowing the first part of the profile string to be given prominence if required. As mentioned above, wild card matching can be implemented. Furthermore, designated sets of characters, such as space, comma, full stop, can be treated as if identical by loading the first RAM approximately. Referring now to Figure 3, the chart shows an example of a matching operation using the algorithm: IF ((present position-previous position)= 1 )AND (intervening character=0) THEN score increment=l ELSE score increment=0 In this example, the profile string is the word "PROCEEDING", and the word "PRXXEEDING" appears on the source string preceded by a random set of characters. The first complete line of the chart represents the arrival on input lines 2 of a succession of individual characters in the source string. The second line indicates the position accorded to that character by the first RAM 1. The third line represents the score increment, as read from the second RAM 5 when a character occurs in the profile string and the fourth line represents the score accumulated over the current character and the preceding nine characters occurring in the profile string. In this particular example the maximum score accumulated is 6, which can be arranged to cause a "match" to be indicated, or not, depending on the threshold score selected. In Figure 4, there are listed a number of examples of variations on the profile string "PROCEEDING", and the scores which result when the matching operation is carried out using the same algorithm as described above in connection with Figure 3. WHAT WE CLAIM IS:

1. Apparatus for comparing a profile string of bit patterns with a source string of bit patterns to determine whether the profile string is matched to a predetermined extent with any succession of bit patterns forming part of the source string, comprising a first addressable store containing the profile string and arranged to provide in each of a succession of time periods an output representing the presence and the position in, or the absence from, the profile string of each bit pattern of the source string, the first addressable store being addressable by each bit pattern of the source string combined with the previously identified position, a second addressable store containing data representative of said predetermined extent of matching, the second addressable store being addressable by the output of the first addressable store, and means responsive to the output of the second addressable store to provide an indication when matching to said predetermined extent has occurred.