SYSTEM AND METHOD OF FILTERING ELECTRONIC MESSAGES
CROSS REFERENCE TO RELATED APPLICATIONS
[01] This application claims priority from U.S. provisional patent application, serial no. 60/633,434, entitled "System and Method of Filtering Electronic Messages," filed December 7, 2004. The disclosure of this provisional patent application is incorporated herein in its entirety by this reference.
BACKGROUND OF THE INVENTION
1. Field of the Invention
[02] The present invention relates to a method and system of communicating information over a data network. More particularly, it relates to filtering unwanted and/or unsolicited information based on a user's social network of contacts, and even more particularly, to efficient methods and systems of filtering unwanted and/or unsolicited messages using dynamically created and updated structures, such as Bloom filters. These filters also serve to preserve the privacy of the social contact network data.
2. Description of the Related Art
[03] In today's communications environment, there exists a significant and costly issue of constantly receiving unsolicited and unwanted information from outside sources. Much of the unsolicited information is sent by other companies and/or individuals in the form of electronic mail messages (e-mails) desiring to promote their products and/or services. Such e- mails, sent in bulk and containing unsolicited information, are often referred to as spam.
[04] Spam is flooding the Internet with many copies of the same message, in an attempt to force the message on people who would not otherwise choose to receive it. In 2003, it was reported that spam accounts for roughly 40 percent of all e-mail traffic in the U.S. This flood of unsolicited messages sent over the Internet is growing so fast, that by some estimates,
spam may soon account for half of all U.S. e-mail traffic. See Jonathon Krim, "Spam's Cost to Business Escalates; Bulk E-mail Threatens Communication Arteries," Washington Post, pg. AOl, March 13, 2003. Most spam is for commercial advertising, often promoting dubious products, get-rich-quick schemes, or quasi-legal services. Spam costs the sender very little to send — most of the costs are paid for by the recipient or the carriers rather than by the sender and due to the volume of spam these unwanted costs are becoming significant for both individuals and organizations alike.
[05] To restrict the amount of spam that is received, users or companies have employed, with mixed success, spam blocking software that attempts to identify spam using various techniques to identify the messages as containing information unwanted by the recipient. For example, a black-list technique compares an e-mail address of a sender of a message with a list of e-mail addresses from which messages are to be blocked. This type of list is referred to as a "black-list." A white-list technique compares the e-mail address in an incoming message with a list of e-mail addresses from which e-mail messages will be accepted.
[06] However, these conventional filtering mechanisms often are not effective, since many times unsolicited e-mail still finds its way into the e-mail boxes of users or certain e- mail messages are blocked when they should be allowed to pass to a recipient. Accordingly, there has been a long-felt and unresolved need to reduce the amount of spam that continues to defy most legal and technical efforts to stamp it out.
SUMMARY OF THE INVENTION
[07] Methods and systems are described here that efficiently determine whether a piece of source information lies within some small number of degrees of separation between members of a social network. A social network can consist of people who know each other, and have information relating to each other in their respective address books. Conventional techniques for filtering information, specifically spam, do not utilize a user's social network of contacts to filter out unwanted information.
[08] In an exemplary embodiment, the relationships between individuals and/or organizations, expressed as a social network, are exploited in order to combat spam. That is, certain filters, such as Bloom filters, can be efficiently used to express the aggregate of the contact identifiers in a user's personal electronic address book. Those filters can be shared with other contacts and used so that a user can identify specific wanted e-mails in order to determine whether to filter those e-mail messages as spam.
[09] Yet another exemplary embodiment involves dynamically exchanging filters, such as Bloom filters, and using such filters to form further aggregations according to the number of degrees of separation to be tested.
[10] According to yet another exemplary embodiment, the source data from which filters are established can be obfuscated and hence the privacy of an individual's source information can be preserved.
[11] Yet another exemplary embodiment relates to identifying wanted messages in terms of the number of degrees of separation within a social network of the sender from the recipient, as a large proportion of wanted messages are, many times, within two degrees of separation.
[12] According to yet another exemplary embodiment, there is provided a method of filtering including: applying a filter to identity information associated with unwanted and/or unsolicited information sent to a user; and inhibiting passage of the unwanted and/or unsolicited information if a result of applying the filter indicates that the identity information does not correspond to a desired sender, wherein the filter is based on a plurality of identities of entities and the filter does not contain information that identifies said entities.
[13] Also, according to yet another exemplary embodiment, a filtering apparatus includes a user's electronic address book, in which the electronic address book comprises an identity of at least one trusted source of the user. The apparatus also includes a filter based on a
plurality of identities of entities and which does not contain information that identifies said entities. The filter is applied to identity information associated with unwanted and/or unsolicited information received from a sender. The filter inhibits passage of the unwanted and/or unsolicited information if the filter indicates that the identity information does not correspond to a desired sender.
[14] According to yet another exemplary embodiment, there is provided a method of making a filter based on a first filter containing information representing identities of a first set of entities, and a second filter containing information representing identities of a second set of entities, the method includes: combining the first filter with the second filter to produce a combined filter, wherein the information contained in the first filter is based on unique identifiers of the first set of entities and the information contained in the second filter is based on unique identifiers of the second set of entities, and wherein the combined filter does not contain information that identifies any particular entity of the first or second sets of entities.
[15] According to yet another exemplary embodiment of the present invention, there is provided a method of filtering using a combined filter based on a first filter containing information representing identities of a first set of entities and a second filter containing information representing identities of a second set of entities, the method including: receiving messages and filtering the messages using the combined filter to inhibit messages from entities not represented by the combined filter, wherein the combined filter does not contain information that identifies any particular entity of the first or second sets of entities.
[16] In the exemplary embodiments, the filter can have a size that is independent of the number of the plurality of identities on which the filter is based. Also, the filter can be a Bloom filter.
[17] A computer-readable recording medium for recording a program can be used for executing methods related to exemplary embodiments of the present invention. Also, a
computer program product for enabling a computer to execute methods related to exemplary embodiments of the present invention, can be used.
BRIEF DESCRIPTION OF THE DRAWINGS
[18] Embodiments of the invention are described below in detail with reference to the following drawings in which like reference numerals refer to like elements wherein:
[19] Fig. 1 is a diagram that illustrates a physical arrangement of user terminals in a wide area network and a local area network, where the user terminals are connected to each other and the terminals belong to users in a social network.
[20] Fig. 2 is a conceptual diagram of a social network of contacts within two degrees of separation of a user U.
[21] Fig. 3 illustrates an exemplary embodiment of an electronic address book having contacts and a Bloom filter therein.
[22] Fig. 4 represents a Bloom filter that contains hash bits representing the contacts shown in the address book of Fig. 3.
[23] Fig. 5 is a flowchart illustrating an exemplary method of creating a Bloom filter.
[24] Fig. 6 is a flowchart illustrating an exemplary method of applying a Bloom filter to received information.
[25] Fig. 7 illustrates three Bloom filters having hash bits that represent the identifiers of various different contacts Al, A2, and A3 in a social network.
[26] Fig. 8 illustrates a composite Bloom filter that is a combination of the Bloom filters shown in Fig. 7.
[27] Fig. 9 illustrates the list of contacts that correspond to the Bloom filters shown in Fig. 7, respectively.
[28] Fig. 10 shows a list of contacts that correspond to the composite Bloom filter shown in Fig. 8.
[29] Fig. 11 is a flowchart illustrating an exemplary method of combining Bloom filters.
DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS [30] The following describes in detail exemplary embodiments of the invention, with reference to the accompanying drawings.
[31] A typical networked computer communication system 1 is illustrated in Fig. 1.
In such a system, users communicate via e-mail or by way of other methods of networked communications, such as instant messaging, by sending electronic messages from one computer, through one or more networks, to another computer connected to the communication system. Here, a local area network (LAN) 2 connects two computers 3 and 4 from which users can send and receive electronic messages, such as e-mail messages. The LAN typically is connected via a gateway 5 to a wide area network (WAN) 6 for communication with computers 7 and 8 that are located remotely from local computers 3 and 4. Computers 7 and 8 can be located either locally with one another or remotely from one another.
[32] Figure 2 is a conceptual diagram illustrating electronic address books in a network of nodes in a data communications network, such as is shown in Fig. 1. The electronic address books in Fig. 2 illustrate a social network of contacts in which the depicted users have up to two degrees of separation in the social network. A user U has an electronic address book 10 that contains contact information for users Al, A2, A3 and other users (not shown). These users for which contact information is contained within user U's address book are deemed to be within one degree of separation from user U in the social network.
[33] User Al, for example, also has an electronic address book 12 that contains contact information for users Bl, B2, B3 and other persons (not illustrated). The entries in user
Al' s address book are deemed to be within two degrees of separation from user U. Similarly, any entries in user A2's and user A3's address books, 14 and 16, respectively, would also be considered to be within two degrees of separation of user U. Likewise, if users Bl, B 2 or B3 have entries in their address books, 18, 20 and 22, respectively, those entries are considered to be within three degrees of separation from user U. It will be appreciated that the number of degrees of separation from a particular user within a social network is related to the number of address books that the user would have to consult to find the desired contact information. For example, the number of address books to be searched can be exponentially related to the number of degrees of separation between two users of the social network.
[34] An exemplary address book 300 is shown in Fig. 3. The address book can contain a list of people or other contacts (300A through 3001) known by the user of that address book. The list of contacts in the electronic address book 300 can represent the user's social contacts. A Bloom filter 320, representing the identities of the contacts 300A - 3001 present in the user's address book can be stored in the address book as shown in Fig. 3. Alternatively, the filter can be stored separately.
[35] Fig. 4 shows an example Bloom filter 400. The Bloom filter 400 shown in
Fig. 4 includes hash bits that correspond to the identifiers of the contacts (e.g., e-mail addresses, telephone numbers) shown in an address book, such as the address book of Fig. 3. The number of false-positive responses of a Bloom filter is related to the length of the filter: the longer the Bloom filter, the lower the ratio of false-positive responses. A false-positive response occurs when the Bloom filter indicates a positive response to an input from an identity that is not one of the identities encoded by the Bloom filter.
[36] As an example, the Bloom filter can have a length of 16 kilobytes (KB). That is, if a user has 100 contacts, and each of the user's contacts has 100 contacts, there would be 10,000 possible contacts for the user. If a false positive rate of l-in-1000 (i.e., the rate at which a SPAM message would pass through the filter) is desirable, then a filter length of about 16 KB
should be used. The Bloom filter 400, as discussed above, can be compressed - especially tor transmission across a network, making it even more manageable. See "Compressed Bloom Filters, " Mitzenmacher, Michael, Harvard University, which is incorporated herein by reference.
[37] This example assumes, at most, 2-degrees of separation between a user and his/her possible contacts, with a 0.1% false positive rate. As the degrees of separation increase, so does the number of possible contacts, and the false positive rate can tolerably increase. For example, if the number of possible contacts increases to 100,000, a tolerable false positive rate might be 3.15%. The size of the Bloom filter in such an instance could be increased to a manageable size of about 64 KB. In the instance that the number of possible contacts rises to 1,000,000, a tolerable false positive rate might be 6.25%, and the corresponding size of the Bloom filter could be set to approximately 1 MB. Therefore, as indicated above, the size of the Bloom filter is not prohibitively large and is, in fact, manageable and easily shared considering that it can be compressed.
[38] Participants within a social network can exchange contact information to better identify messages from other participants within the social network and therefore minimize receipt of unwanted messages. These participants can use a variety of techniques to exchange such contact information. One such technique, such as that described in the following example, uses the concept of electronic contracts to facilitate the exchange of contact information.
[39] Each user in a social network has a digital identity, for example, an email address, that is unique within the social network. The digital identities of the contacts can be reflected in an electronic address book, such as the electronic address book 300 in Fig. 3. For example, the entry 300A in the electronic address book can contain the digital identifier of Mary J. Smith (not shown). The owner of electronic address book 300 can have an agreement with Mary J. Smith to share specific information that the two have agreed to electronically exchange. In fact, the two can enter into a contractual agreement to share that information so that when the
information is changed it is automatically sent, according to the terms and conditions of the contract, to the other person. An identity server associated with the user of electronic address book 300 manages the contractual relationships between the user and others in the user's social network by using the digital identities to automatically exchange data that the parties have agreed to share along with any changes to that data as the changes occur. These automatic data exchanges can be accomplished using techniques described, for example, in WO 2005015470, entitled "Social Network of Identities and Query Method Therefor," filed on July 15, 2004, and which is incorporated by reference herein. Accordingly, when a member of the user's social network adds, deletes or changes an entry in that member's address book, the new information is automatically shared with other users who have a contract in place with the member to share that information.
[40] One aspect of using a Bloom filter to represent entities known among contacts is that these filters can be exchanged without exchanging the actual contents of each others address books, thereby maintaining the privacy of the data in the contact's address book. Thus, the contents of one's address book will not be revealed to another by exchanging Bloom filters since the Bloom filters do not specify any particular identifier. Accordingly, the Bloom filters constructed to represent known entities, do not contain information that identifies any of those entities. Rather, in response to inputting an identifier into the Bloom filter, the filter merely will indicate if the identifier is one for which the Bloom filter has been designed to produce a positive output. Since Bloom filters are virtually impossible to descramble, they provide a level of security in that the filter does not contain information that identifies any particular entity.
[41] Yet another aspect of using Bloom filters is that these filters are relatively small in size. Being relatively small in size allows Bloom filters to be easily exchanged or transferred amongst users, for example. The size of the filter can be maintained relatively small at least because the filter's size is independent of the number of identities represented therein.
The size is predetermined, such as 10 kilobytes, and it remains that size regardless of the number of identities it is designed to represent.
[42] It is believed, that a vast majority of e-mail that a user desires to receive is sent either by someone the user knows or from someone who is known by a person the user knows. For example, e-mail sent from a user's friends/acquaintances or from friends/acquaintances of the user's friends/acquaintances is likely to account for most of the e- mail the user would like to receive.
[43] By using a user's social network of digital identities, incoming electronic messages are filtered based on the degree of separation between the recipient of the message and the sender. For example, as a user receives an incoming electronic message, such as an e-mail message, a filter is used to compare the sender's identity (e.g., the sender's e-mail address) with identities (e.g., e-mail addresses) held in the user's electronic address book and with identities (e.g., e-mail addresses) held in the electronic address books of the contacts listed in the user's address book. If the sender's identity is found to be within two degrees of separation from the user, by the filter having a positive response to the sender's identity, then the filter passes the message to the user. However, if the filter provides a negative response indicating that the sender's identity is not represented by the filter, and hence is separated from the user by more than two degrees of separation, the e-mail is not passed to the user. Alternatively, the e-mail can be flagged as possible spam and passed to the user if the filter gives a negative response to the sender's identity.
[44] A filter can be derived from the aggregated set of identifiers of interest from the address book. For example, in the address book of user U, a Bloom filter can be created that represents the e-mail identifiers (e.g., e-mail addresses) of all contacts in user U's address book.
[45] Bloom filters are randomized data-structures for concisely representing a set in order to support membership queries. Bloom filters were developed by Burton Bloom in 1970 for the purpose of spell checking and for many years it was seldom mentioned in other contexts,
except for database optimization. The basic function of a Bloom filter is to recognize data globs that it has seen before. That is, a Bloom filter can be used to quickly test membership in a large set using multiple hash functions to convert the large set into an array of bits. Bloom filters are described at http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html, in Li Fan, Pei Cao, and Jussara Almeida, "Summary Cache: A Scalable Wide-Area Web Cache Sharing Protocol", Department of Computer Science, University of Wisconsin-Madison (http://citeseer.ist.psu.edu/463143.html'), and in Bloom, Burton, "Space/time Tradeoffs in Hash Coding with Allowable Errors," Communications of the ACM (Volume 13 / Number 7 / July 1970), pp. 422-426, which are incorporated herein by reference.
[46] A property of all Bloom filters is that they do not produce any false negatives, and only produce a finite possibility of false positives - the probability of false positives is related to the predefined bit string length used for the Bloom filter. Utilizing this Bloom filter feature, user U can effectively establish an efficient white list of contacts, which represents the contacts from which e-mails will be accepted.
[47] According to an exemplary embodiment, a Bloom filter can be created as set forth in Fig. 5, using well known techniques for generating Bloom filters from the identifiers held in a user's electronic address book. A first e-mail address, as an identifier character string, is obtained from the address book S500, and is used to generate a Bloom filter by passing the identifier through one or more hashing algorithms that yields a bit string of a pre-designated length S510. Then, it is determined whether the bit string produced in S510 is the initial bit string produced for the filter S520. If the produced bit string is the initial bit string, then a filter is generated based on the initially produced bit string S530. Otherwise, the produced bit string is logically OR'ed with a previously established filter of hash bits S540, to form a combined filter. The address book is checked to determine if any contacts remain to be processed. S550. If so, the identifier for the next contact in the address book to be processed is obtained (S560) and the process flow returns to S510. If no more contacts are to be processed, the process ends.
[48] For example, an e-mail address that is obtained for the next contact to be included in the filter is passed through the same hashing algorithm referred to in S510, to yield a second bit string of the same predefined length which is then logically OR'ed with the initial string to create an aggregate (combined) bit string. Yet another e-mail address identifier can be added to the filter by passing the identifier through the same hashing algorithm to create yet another bit string of the same predefined length which is then logically OR'ed with the aggregate string, and so on, thereby creating a combined filter. The resultant aggregate string, or Bloom filter 11, as created, for example, according to the method illustrated in Fig. 5, for user U can be used to test whether the sending address for an incoming e-mail message is a member of user U's address book.
[49] To test an incoming message from a sender against the created filter, the e- mail address of the sender is passed through the same hashing algorithm described above to yield a test bit string [T] of the same predefined length as the Bloom filter 11, as illustrated in S610 of Fig. 6. The test bit string is compared to the Bloom filter 11 to determine if they match, S620. If the set bits of the test bit string are determined in S630 to be identical to the set bits of the aggregate string, or Bloom filter 11, then there is high probability that the sending address is within user U's address book. Accordingly, the e-mail is received S650. If any of the set test bits are at variance with the set bits of the aggregate string 11 then the sending address is determined not to be in user U's address book, and the e-mail is rejected S640. In practice the following sequence can be used to perform the operations shown in Fig. 6:
[R] = [U] AND [T]
IF [R] XOR [T] <> 0 then [T] is not a member of the address book, where [U] represents the set bits of the Bloom filter of user U and [T] represents the set hashed bits of a sender's e-mail address.
[50] Figure 7 illustrates three Bloom filters 13, 15, and 17, one for each of the different users Al, A2 and A3 shown in Fig. 2. The respective filters contain hash bits representing the identifiers of the contacts in the respective users' address books. As illustrated,
each of the filters of the respective users contains different bits of hash, which reflects the different sets of identifiers of the contacts in each the users' address books. Although the Bloom filters 13, 15 and 17 are shown, for ease of illustration, with each having a length of 8 bits, it will be understood that filters having other lengths, for example, 16k bits, can be used. The three Bloom filters 13, 15 and 17 shown in Fig. 7, can be combined into a single combined filter 39, shown in Fig. 8. The combined filter 30 is generated by logically "ORing" the bit strings of filters 13, 15 and 17.
[51] Fig. 9 illustrates electronic address book contacts of users Al, A2, and A3, respectively, and these lists of contacts correspond, respectively, to Bloom filters 13, 15, and 17 shown in Fig. 7. With respect to user Al, the Bloom filter 13 for user Al, shown in Fig. 7, represents identifiers for two people: John Smith and Mary Doe. Filter 13 can be used to filter all incoming information, including filtering out e-mail that is not from John Smith and/or Mary Doe. Similarly, Fig. 9 shows a Bloom filter for user A2 that represents identifiers for three people: Kay Bee, John Smith and Jane Marie. Likewise, Fig. 9 also shows a Bloom Filter for user A3 that represents identifiers for three people: Calvin Bernard, Sheldon Smith, and Marie Calendar. While the filters illustrated in Fig. 9 represent identifiers of people, the filters can represent identifiers of other entities, such as businesses, organizations or groups, for example.
[52] Fig. 10 illustrates a combined filter that is a combination of the three Filters shown in Fig. 9. As can be seen in Fig. 10, the combined filter represents identifiers for the seven people listed in that figure. Of course, it will be understood that the Bloom filters illustrated in Figs. 9 and 10 do not contain the identifiers of the people listed in those figures, but rather merely represent identifiers of those people, since a hash function is applied to those identifiers. By using hash functions applicable to Bloom filters, the identifiers used to create the filters can not be determined by examining the filters, since the filters do not contain information that identifies any particular one of the contacts. Accordingly, a level of security is provided by using Bloom filters to create the combined filter 30 shown in Fig. 8.
[53] A user can utilize the Bloom filters of each of his or her contacts for his/her own benefit. Referring to Fig. 11, which illustrates one example of exchanging contact information, user U can determine whether contracts are established that provide for the exchange of Bloom filters with that user's contacts (SHOO in Fig. 11). If such contracts exist, user U receives each of the Bloom filters of its contacts (Sl 120) and logically ORs (Sl 130) those filters together with user U's own filter, to establish a filter that includes hash bits representing the aggregate contacts of Al, A2, A3 etc. This aggregate of contacts represents all contacts within two degrees of separation from user U. Therefore, by using U's own filter together with the aggregate filter received from all contacts (Al, A2, A3 ...) user U can establish a very efficient mechanism for testing the addresses of received e-mail to determine if the senders are contacts who are within two degrees of separation from user U.
[54] By using Bloom filters to represent e-mail contacts within two degrees of separation from the user, the size of the exchanged filters can be much smaller than the total set of e-mail address identifiers used in their construction. Accordingly, a node within a social network need only transmit its Bloom filter representation of its address book to the other nodes in the network, rather than send each of the identifiers in its address book. Transmitting only the node's Bloom filter instead of each of the identifiers in the node's address book can significantly reduce the amount of data that must be exchanged between nodes in the network to facilitate the reduction of spam.
[55] According to an exemplary embodiment, a filter mechanism can also operate to determine whether e-mails are received within a specified number of degrees of separation from user U.
[56] In another exemplary embodiment, the filters can be updated constantly with little impact on the network. For example, if user Al adds a new contact to his address book, then user Al can recompute the filter 13 representing identifiers of interest in his address book. Al must then send this updated filter across a communications system to user U, who must then
recompute the aggregate filter of all his contacts (Al, A2, A3...)- User U may cache individual filters for Al, A2, A3 etc. to make the recomputation more efficient. Use of an underlying data exchange communications network, such as a network using techniques based on contracts to exchange specific items of information, can make this communication exchange highly effective.
[57] Also, the invention may be embodied in a computer program product, as will now be explained.
[58] On a practical level the software, that enables the computer system to perform the operations described in detail herein, may be supplied on any one of a variety of media. Furthermore, the actual implementation of the approach and operations of the invention are actually statements written in a computer language. Such computer language statements, when executed by a computer, cause the computer to act in accordance with the particular content of the statements. Furthermore, the software that enables a computer system to act in accordance with the invention may be provided in any number of forms including, but not limited to, original source code, assembly code, object code, machine language, compressed or encrypted versions of the foregoing, and any and all equivalents.
[59] One of skill in the art will appreciate that "media", or "computer-readable media", as used here, may include a diskette, a tape, a compact disc, an integrated circuit, a ROM, a CD, a cartridge, a memory stick or card, a remote transmission via a communications circuit, or any other medium useable by computers, including those now known or hereafter developed. For example, to supply software for enabling a computer system to operate in accordance with the invention, the supplier might provide a disc or might transmit the software in some form via satellite transmission, via a direct telephone link, or via the Internet. Thus, the term, "computer readable medium" is intended to include all of the foregoing and any other medium by which software may be provided to a computer.
[60] Although the enabling software might be "written on" a disc, "embodied in" an integrated circuit, "carried over" a communications circuit, "stored in" a memory chip, or
"loaded in" a cache memory, it will be appreciated that, for the purposes of this application, the software will be referred to simply as being "in" or "on" the computer readable medium. Thus, the terms "in" or "on" are intended to encompass the above mentioned and all equivalent and possible ways in which software can be associated with a computer readable medium.
[61] For the sake of simplicity, therefore, the term "computer program product" is thus used to refer to a computer readable medium, as defined above, which has on it any form of software to enable a computer system to operate according to any embodiment of the invention.
[62] The foregoing embodiments and advantages are merely exemplary and are not to be construed as limiting the present invention. The description of the present invention is intended to be illustrative, and is not intended to limit the scope of the claims. Many alternatives, modifications, and variations will be apparent to those skilled in the art.