CN100589453C - An anti-spam processing device and method - Google Patents
- Publication number
- CN100589453C, CN200610001083A
- Authority
- CN
- China
- Prior art keywords
- spam
- template
- send
- legitimate
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses an anti-spam processing device and method. The device includes a mail receiving/delivery unit, a general mail control unit, an anti-spam processing unit, and a mail database that stores spam templates and legitimate mail templates in advance. After receiving a mail receiving/sending request, the mail receiving/delivery unit sends the mail to be received/sent to the anti-spam processing unit. The anti-spam processing unit compares the mail with the spam templates and the legitimate mail templates respectively, marks the mail with a type mark according to the comparison result, and then either sends it to the general mail control unit for processing control or returns it to the mail receiving/delivery unit. When receiving/sending mail, the device and method can identify and filter spam based on dynamically generated mail templates, thereby achieving a better anti-spam effect.
Description
Technical field
The invention relates to electronic mail (E-mail) processing technology, and in particular to an anti-spam processing device and method.
Background art
At present, spam is rampant on the Internet and causes great trouble for many network users. To address this problem, network service providers are studying anti-spam methods so as to better filter out spam on the network.
Existing anti-spam methods mainly fall into the following categories: 1) filtering spam with rule-based keyword strings; 2) classifying and identifying spam with statistical algorithms, such as the Bayesian algorithm; 3) setting sender blacklists and whitelists to intercept spam; 4) control methods such as rate limiting on sending behavior. These methods can be used alone or in combination, and are suitable for mail processing devices on mail servers and/or personal clients. Figure 1 shows the structure of an existing mail processing device, which includes a mail receiving/delivery unit 11 and a general mail control unit 12.
The general mail control unit 12 encodes/decodes and stores mail, displays mail content to the user, and so on. It in turn includes a legitimate mail processing module and a spam processing module. The legitimate mail processing module performs follow-up processing on mail carrying a legitimate-mail mark, such as placing the mail in the inbox or sending it out. The spam processing module performs follow-up processing on mail carrying a spam mark, such as placing the mail in the junk folder, or not sending it or restricting its sending.
The mail receiving/delivery unit 11 receives instructions from the general mail control unit 12 and, following existing mail transfer protocols, receives/sends new mail between mail servers and/or personal clients.
Practical use has shown that the above anti-spam methods only respond to spam passively, and spammers can easily adopt countermeasures to interfere with these anti-spam technologies.
For example, against checksum-type methods such as the message-digest algorithm (MD5, Message-Digest algorithm 5), spammers introduce a small amount of randomly varying content when sending spam, so that the spam mails are not strictly identical.
Against statistical methods such as Bayesian filtering, spammers insert a large number of noise words/characters into the mail to distort the feature ratios used for spam identification.
Against blacklists, whitelists, and sending-address filtering, spammers change the sender/sending address so that near-identical mails are sent through different server IPs.
Against language-dependent methods, spammers change the language or character set of the mail body, so that techniques such as Chinese word segmentation cannot filter out spam effectively.
Against traffic-control methods, spammers slow down their sending frequency or limit the number of mails sent from a single account, so that the spam cannot be identified.
As the above shows, almost all commonly used anti-spam technologies have been defeated by spammers. In other words, anti-spam technology is updated far more slowly than spam-sending technology, so the network has never been able to provide a satisfactory anti-spam effect.
Summary of the invention
In view of this, the main object of the present invention is to provide an anti-spam processing device, so that a mail processing device can identify and filter spam based on mail templates when receiving/sending mail, thereby achieving a better anti-spam effect.
Another object of the present invention is to provide an anti-spam processing method that uses mail templates to identify and filter spam, and further generates and adjusts the mail templates by actively extracting mail features.
To achieve the above objects, the technical solution of the present invention is implemented as follows:
An anti-spam processing device includes a mail receiving/delivery unit and a general mail control unit. The mail processing device further includes an anti-spam processing unit and a mail database.
The mail database includes a legitimate mail list library, a spam template library, and a full mail sample library. The legitimate mail list library stores legitimate mail templates, and the spam template library stores spam templates.
After receiving a mail receiving/sending request, the mail receiving/delivery unit sends the mail to be received/sent to the anti-spam processing unit and saves it in the full mail sample library.
The anti-spam processing unit obtains the legitimate mail templates from the legitimate mail list library and compares them with the mail to be received/sent. If the legitimate mail templates contain a mail similar to the mail to be received/sent, the unit marks the mail with a legitimate-mail mark.
Otherwise, it obtains the spam templates from the spam template library and compares them with the mail to be received/sent. If the spam templates contain a mail similar to the mail to be received/sent, the unit marks the mail with a spam mark.
Otherwise, it searches the full mail sample library for mails similar to the mail to be received/sent. When the number of similar mails exceeds a predetermined value, it extracts a mail template for the mail according to the Alignment comparison algorithm, marks the mail with a spam mark or a legitimate-mail mark according to preset information, and saves the extracted mail template to the corresponding mail database according to the type mark.
The anti-spam processing unit sends the mail, now carrying a type mark, to the general mail control unit for processing control, or returns it to the mail receiving/delivery unit.
The anti-spam processing unit includes a spam determination module and a mail template extraction module.
After receiving the mail to be received/sent, the spam determination module sends query requests to the legitimate mail list library and the spam template library respectively to determine the type of the mail, and sends a template extraction instruction to the mail template extraction module when the number of mails similar to the mail exceeds the predetermined value.
After receiving this instruction, the mail template extraction module extracts a mail template using the Alignment comparison algorithm and saves it in the mail database of the corresponding type.
The mail processing device further includes a full mail library maintenance module, a mail list library maintenance module, and a spam template library maintenance module.
Operations on the full mail sample library, the legitimate mail list library, and the spam template library are then performed through the full mail library maintenance module, the mail list library maintenance module, and the spam template library maintenance module respectively.
The mail processing device further includes an error feedback processing module, which sends a modification request to the anti-spam processing unit according to user feedback. After receiving the request, the anti-spam processing unit sends a modification notice to the mail list library maintenance module and/or the spam template library maintenance module, which modify the corresponding data records in the legitimate mail list library and/or the spam template library.
Alternatively, the error feedback processing module sends the modification notice directly to the mail list library maintenance module and/or the spam template library maintenance module, which modify the corresponding data records in the legitimate mail list library and/or the spam template library.
When the mail processing device is deployed on a personal client, the general mail control unit, at the user's request, sends a modification notice to the mail list library maintenance module and/or the spam template library maintenance module, which modify the corresponding data records in the legitimate mail list library and/or the spam template library.
When the mail receiving/delivery unit receives a mail, the anti-spam processing unit determines the type of the mail, adds the corresponding type mark, and sends the mail to the general mail control unit.
According to the type mark, the general mail control unit saves the mail in the corresponding location and displays it to the user: a mail with a legitimate-mail mark is saved in the inbox, and a mail with a spam mark is saved in the junk folder.
The mail processing device may also be deployed on a mail server.
An anti-spam processing method is applied in a mail processing device that contains a mail receiving/delivery unit, a general mail control unit, an anti-spam processing unit, and a mail database. The mail database includes a legitimate mail list library that stores legitimate mail templates, a spam template library that stores spam templates, and a full mail sample library that stores unprocessed mail samples. The method includes the following steps:
a. When receiving/sending mail, the mail receiving/delivery unit sends the mail to be received/sent to the anti-spam processing unit and saves it in the full mail sample library.
b. According to a comparison strategy, the anti-spam processing unit compares the legitimate mail templates in the legitimate mail list library with the mail to be received/sent one by one, and determines whether a mail similar to the mail to be received/sent exists. If so, it marks the mail with a legitimate-mail mark; otherwise it performs step c.
c. According to the comparison strategy, the anti-spam processing unit compares the spam templates in the spam template library with the mail to be received/sent one by one, and determines whether a mail similar to the mail to be received/sent exists. If so, it marks the mail with a spam mark; otherwise it performs step d.
d. The anti-spam processing unit searches the full mail sample library for mails similar to the mail to be received/sent and determines whether their number exceeds a predetermined value. If it does, the unit uses the Alignment comparison algorithm to extract the content shared by the mail and its similar mails and generates a new mail template from it, determines the type of the mail according to preset information, marks the mail with the corresponding type mark, and saves the mail template to the mail database that corresponds to the determined type.
e. The anti-spam processing unit sends the mail, now carrying a type mark, to the general mail control unit for processing control, or returns it to the mail receiving/delivery unit.
In step b or c, the anti-spam processing unit obtains the comparison result according to the Alignment comparison algorithm.
In step d, the anti-spam processing unit determines the mail type according to blacklists and whitelists.
The comparison strategy includes comparing mail content, comparing mail format, comparing the sequence of communication commands used to send the mail, or any combination of these three approaches.
In step b or step c, whether a mail similar to the mail to be received/sent exists is determined as follows: a similarity threshold is set in advance, and each result of the one-by-one comparison is checked against the corresponding similarity threshold. If the result exceeds the threshold, the compared item counts as a mail similar to the mail to be received/sent; otherwise it does not.
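The ordering of steps b and c, together with the threshold test, can be sketched as below. This is a minimal illustration rather than the patented implementation: the function names, the 0.8 default threshold, and the use of Python's `difflib.SequenceMatcher` ratio as the similarity score are all assumptions, with `SequenceMatcher` standing in for the Alignment comparison algorithm.

```python
from difflib import SequenceMatcher
from typing import Iterable

LEGITIMATE, SPAM, UNKNOWN = "legitimate", "spam", "unknown"


def similarity(mail: str, template: str) -> float:
    """Stand-in similarity score in [0, 1]; the source uses an Alignment algorithm."""
    return SequenceMatcher(None, mail, template, autojunk=False).ratio()


def classify(mail: str,
             legit_templates: Iterable[str],
             spam_templates: Iterable[str],
             threshold: float = 0.8) -> str:
    """Apply step b, then step c; UNKNOWN means step d would be needed."""
    # Step b: legitimate templates are checked first.
    if any(similarity(mail, t) > threshold for t in legit_templates):
        return LEGITIMATE
    # Step c: only reached when no legitimate template is similar enough.
    if any(similarity(mail, t) > threshold for t in spam_templates):
        return SPAM
    return UNKNOWN
```

Because the legitimate templates are consulted first, a mail that resembles both a legitimate template and a spam template is marked legitimate, which matches the order of steps b and c.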
As the above technical solution shows, the anti-spam processing device of the present invention adds an anti-spam processing unit to an existing mail processing device. The unit compares the mail to be received/sent with the templates stored in advance in the mail database and distinguishes spam from legitimate mail according to existing information. After the mail type is determined, the features of the mail are extracted and saved in the mail database as a new mail template for later comparisons, so that the mail database is updated dynamically. This improves the reliability with which the mail processing device identifies spam, makes its anti-spam handling more flexible, and achieves a better anti-spam effect. Here, the mail processing device may be a mail server or a mail program installed on a personal client.
In addition, the anti-spam processing method of the present invention extracts features such as sending behavior, content, and structure from a large number of mails to generate spam templates and legitimate mail templates, and keeps these templates dynamically adjusted. It then uses the Alignment comparison algorithm to compare the mail to be received/sent with these templates, and distinguishes and filters out spam by judging the similarity between the mail and the templates. The method is therefore highly practicable and can improve the accuracy of spam identification.
Based on this anti-spam technology, the present invention can better intercept spam and ensure that the normal communication of mail system users is not disturbed by it. Furthermore, a mail system can use the method to set spam warning and handling policies, for example providing anti-spam services of different granularity to paying VIP users and ordinary free users.
Brief description of the drawings
Figure 1 shows the structure of a mail processing device in the prior art;
Figure 2 is a schematic diagram of string comparison using the Alignment comparison algorithm in a preferred embodiment of the present invention;
Figure 3 shows the structure of the anti-spam processing device of the present invention;
Figure 4 shows the detailed anti-spam processing flow of the present invention.
Detailed description of the embodiments
To make the objects, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments.
The anti-spam processing device and method of the present invention actively look for regularities in the sending behavior, content, and structure of mails, and introduce the Alignment technique into the mail system to extract spam templates and/or legitimate mail templates, which serve as the criteria for identifying spam.
The Alignment technique is a method from bioinformatics. It is widely used to find identical strings in DNA sequences, for example to reveal whether a certain sequence occurs frequently in a DNA database. The characters in a DNA sequence are biological symbols such as A, T, C, and G; other characters such as spaces, line breaks, special symbols, and non-English letters are ignored in bioinformatics. When Alignment is used to compare two input strings, operations such as deletion and insertion are allowed on the inputs, so that the two strings are aligned before being compared and their largest possible matching string can be found.
To make the Alignment technique applicable to mail systems, the present invention extends the set of characters it processes to cover all ASCII characters, special characters (such as carriage return, line feed, and whitespace), and Chinese characters (characters 127 to 256), so that Alignment can handle arbitrary Chinese and English strings. Figure 2 is a schematic diagram of string comparison with Alignment. The top and bottom rows are the input strings to be compared, where '-' denotes an inserted gap. The middle row is the comparison result, where '|' denotes a match and '.' denotes a mismatch. Once the comparison result is obtained, a preset criterion decides whether the two input strings are similar. For example, for strings of 25 characters, the two inputs are judged similar if the comparison result contains more than 13 '|' marks.
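The three-row comparison described for Figure 2 can be reproduced with a textbook global alignment. The sketch below uses the Needleman-Wunsch dynamic program as a stand-in; the scoring values (match 1, mismatch -1, gap -1), the function names, and the choice to print '.' under gap columns are assumptions, not details taken from the source. Python strings are Unicode, so Chinese characters are handled like any other character.

```python
from typing import Tuple


def align(a: str, b: str, match: int = 1, mismatch: int = -1,
          gap: int = -1) -> Tuple[str, str, str]:
    """Globally align a and b; return (top row, comparison row, bottom row)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + step,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, mid, bottom = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and a[i - 1] == b[j - 1]
        step = match if same else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            mid.append("|" if same else ".")
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")  # gap inserted into b
            mid.append(".")
            i -= 1
        else:
            top.append("-")  # gap inserted into a
            bottom.append(b[j - 1])
            mid.append(".")
            j -= 1
    return "".join(reversed(top)), "".join(reversed(mid)), "".join(reversed(bottom))


def is_similar(a: str, b: str, min_matches: int) -> bool:
    """Judge similarity by counting '|' marks, as in the 25-character example."""
    return align(a, b)[1].count("|") > min_matches
```

With the example from the text, two 25-character strings would be tested with `is_similar(a, b, 13)`.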
As Figure 2 shows, Alignment reveals the content shared by two strings. In an E-mail environment, specific Alignment comparison algorithms such as FASTA and BLAST can therefore be used to extract the spam templates and/or legitimate mail templates hidden in a large number of mails, for example the XML text of a certain web page and structure, or an article title template. This makes the anti-spam technology more targeted and improves the accuracy of spam interception. Taking Figure 2 again as an example, once the two input strings have been compared, their identical content can be taken to generate a new template.
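Taking the identical content of two mails as a template might look like the following sketch. It does not use FASTA or BLAST; it substitutes `difflib.SequenceMatcher.get_matching_blocks` from the Python standard library to find the shared stretches. The function name and the `*` wildcard that marks differing regions are illustrative assumptions.

```python
from difflib import SequenceMatcher


def extract_template(a: str, b: str, wildcard: str = "*") -> str:
    """Keep the content shared by a and b; collapse each differing stretch to a wildcard."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    prev_a = prev_b = 0
    for i, j, size in matcher.get_matching_blocks():
        # Anything skipped in either string since the last block differs.
        if (i > prev_a or j > prev_b) and (not parts or parts[-1] != wildcard):
            parts.append(wildcard)
        if size:
            parts.append(a[i:i + size])
        prev_a, prev_b = i + size, j + size
    return "".join(parts)
```

For instance, `extract_template("Dear Alice, you won $100!", "Dear Bob, you won $999!")` yields a template in which the name and the amount are replaced by wildcards while the fixed wording is kept.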
To obtain spam templates and/or legitimate mail templates, a mail database must be set up to store a large number of E-mails as comparison objects, and a large number of comparison results must accumulate over time to serve as a reference. When the method is first deployed, the mail databases, namely the full mail sample library, the legitimate mail list library, and the spam template library, are all empty. From then on, whenever the mail processing device receives/sends a mail, it saves that mail as a sample in the full mail sample library.
Once the number of mail samples in the full mail sample library reaches a certain level, the mail processing device starts working: it uses the Alignment comparison algorithm to compare each newly received mail with the samples in the full mail sample library and determines the type of the mail according to the degree of similarity. To improve efficiency, only samples of a size close to that of the mail may be selected for comparison.
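The size-based preselection and the count of similar samples can be sketched as follows. The 20% size tolerance, the 0.8 similarity threshold, the `min_similar` cutoff, and the function names are all assumptions chosen for illustration, and `difflib.SequenceMatcher` again stands in for the Alignment comparison algorithm.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def similar_samples(mail: str, samples: Sequence[str],
                    threshold: float = 0.8,
                    size_tolerance: float = 0.2) -> List[str]:
    """Return the stored samples that resemble `mail`, skipping samples of very different size."""
    low = len(mail) * (1 - size_tolerance)
    high = len(mail) * (1 + size_tolerance)
    hits = []
    for sample in samples:
        if not low <= len(sample) <= high:
            continue  # cheap length check avoids the costly comparison
        if SequenceMatcher(None, mail, sample, autojunk=False).ratio() > threshold:
            hits.append(sample)
    return hits


def should_extract_template(mail: str, samples: Sequence[str],
                            min_similar: int = 3) -> bool:
    """True when the number of similar samples exceeds the predetermined value."""
    return len(similar_samples(mail, samples)) > min_similar
```

The length check is much cheaper than the comparison itself, so filtering on it first keeps the number of expensive comparisons small as the sample library grows.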
Figure 3 shows the structure of the anti-spam processing device of the present invention. The mail processing device includes a mail receiving/delivery unit 31, a general mail control unit 32, an anti-spam processing unit 33, a full mail sample library 34, a legitimate mail list library 35, and a spam template library 36.
The mail receiving/delivery unit 31 and the general mail control unit 32 work in the same way as in the prior art, so they are not described again here.
The anti-spam processing unit 33 in turn includes a spam determination module 331 and a mail template extraction module 332.
The full mail sample library 34 stores unprocessed mail samples. Each data record in the legitimate mail list library 35 is a legitimate mail template, and each legitimate mail template is generated by comparing more than one legitimate mail and extracting their identical content. The spam template library 36 stores spam templates, which are generated in the same way as legitimate mail templates and are not described again here.
To avoid operating directly on the mail databases (the full mail sample library 34, the legitimate mail list library 35, and the spam template library 36), a maintenance module can be provided for each database: the full mail library maintenance module 341, the mail list library maintenance module 351, and the spam template library maintenance module 361. All operation instructions for a database are then issued through its maintenance module. These operations include retrieving data records from the mail database and adding, deleting, and modifying data records in it.
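One way to picture a maintenance module is as a thin wrapper that owns the underlying store and exposes only the four operations named above. The class below is a hypothetical sketch: the class name, the integer record ids, and the in-memory dictionary are assumptions, since the source does not specify a storage format.

```python
from typing import Dict, List, Optional


class LibraryMaintenanceModule:
    """Hypothetical wrapper; callers never touch the underlying store directly."""

    def __init__(self) -> None:
        self._records: Dict[int, str] = {}
        self._next_id = 1

    def add(self, record: str) -> int:
        """Store a record and return its id."""
        record_id = self._next_id
        self._records[record_id] = record
        self._next_id += 1
        return record_id

    def get(self, record_id: int) -> Optional[str]:
        """Retrieve one record, or None if the id is unknown."""
        return self._records.get(record_id)

    def all_records(self) -> List[str]:
        """Return a copy of all records, e.g. for one-by-one comparison."""
        return list(self._records.values())

    def modify(self, record_id: int, record: str) -> bool:
        """Replace an existing record; report whether the id existed."""
        if record_id not in self._records:
            return False
        self._records[record_id] = record
        return True

    def delete(self, record_id: int) -> bool:
        """Remove a record; report whether the id existed."""
        return self._records.pop(record_id, None) is not None
```

One instance per database (34, 35, 36) would mirror modules 341, 351, and 361.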
After receiving a mail receiving/sending request, the mail receiving/delivery unit 31 sends the mail to be received/sent to the spam determination module 331 for processing, and at the same time saves the mail in the full mail sample library 34 through the full mail library maintenance module 341.
After receiving the new mail, the spam determination module 331 sends query requests to the full mail sample library 34, the legitimate mail list library 35, and the spam template library 36 through the full mail library maintenance module 341, the mail list library maintenance module 351, and the spam template library maintenance module 361 respectively, and determines the type of the new mail from the data records in these databases. The spam determination module 331 then marks the mail as legitimate or spam, and either sends it to the general mail control unit 32 for processing control or returns it to the mail receiving/delivery unit 31.
In addition, according to the result from the spam determination module 331, the mail template extraction module 332 extracts a new mail template and stores it in the legitimate mail list library 35 or the spam template library 36. If the new mail is legitimate, a new legitimate mail template is obtained, and likewise for spam. The anti-spam device of the present invention can therefore generate and adjust its own mail templates dynamically, which makes its anti-spam technology hard to defeat.
The error feedback processing module 37 adjusts the records of the relevant database according to feedback from users/administrators, such as a report that a mail was misclassified. The error feedback processing module 37 can send the record adjustment request to the spam determination module 331, which then adjusts the records of the relevant database. Alternatively, the error feedback processing module 37 sends the record adjustment request directly to the mail list library maintenance module 351 and/or the spam template library maintenance module 361 to adjust the records of the relevant database.
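The source does not say exactly how a record is adjusted after a misclassification report. One plausible reading, shown here purely as an assumption, is that the offending template is removed from the wrong library and added to the right one. The function name and the use of plain Python sets for the two libraries are hypothetical.

```python
from typing import Set


def apply_feedback(template: str, correct_type: str,
                   legit_library: Set[str], spam_library: Set[str]) -> None:
    """Move `template` into the library that matches the user's correction."""
    if correct_type not in ("legitimate", "spam"):
        raise ValueError("correct_type must be 'legitimate' or 'spam'")
    if correct_type == "legitimate":
        target, other = legit_library, spam_library
    else:
        target, other = spam_library, legit_library
    other.discard(template)  # no error if the template was not there
    target.add(template)
```

A user report that a newsletter was wrongly filed as spam would then translate into `apply_feedback(template, "legitimate", legit_library, spam_library)`.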
In addition, if the mail processing device of the present invention is deployed on a personal client, the general mail control unit 32 can also, directly at the user's request, send record adjustment requests to the mail list library maintenance module 351 and/or the spam template library maintenance module 361 to operate on the records of the relevant database.
Based on the anti-spam processing device described above, the processing flow of the anti-spam technology of the present invention is shown in Figure 4 and includes the following steps:
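Step 401: After the mail processing device produces a new mail to be received/sent, it saves the mail in the full mail sample library and sends the mail to the anti-spam processing unit for type determination.
Step 402: After receiving the mail, the spam determination module in the anti-spam processing unit sends a query request to the mail list library maintenance module and uses the data records of the legitimate mail list library to determine whether the mail is legitimate. If so, go to step 409; otherwise, go to step 403.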
Each data record stored in the legitimate mail list library is a legitimate mail template. These templates are generated by comparing legitimate mails with the Alignment comparison algorithm and extracting the content they share. Using the Alignment comparison algorithm, the spam determination module compares the raw content of the mail to be received/sent with the legitimate mail templates one by one. Because the raw content of the mail is alphabetic text, the Alignment comparison algorithm can process it as a string.
For the comparison itself, a comparison strategy can be set according to the characteristics of E-mail, for example: comparing the similarity of E-mail content (including the mail header); comparing the similarity of E-mail format; or, when analyzing the sending behavior of spam, comparing the similarity of the communication command sequences used to send the E-mail.
To compare E-mail formats, the format characters of each mail are converted, in order of appearance, into a format sequence, and the format sequences are then compared pairwise as strings. Normally only three characters, TAB (\t), carriage return (\r), and line feed (\n), are treated as format characters, so one possible format sequence is "\t\r\n\r\n\r\n". In practice, the format characters that make up the E-mail format can be configured as needed, for example by also treating punctuation marks as format characters. E-mail content and/or communication command sequences can be compared in a similar way.
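Converting a mail into its format sequence amounts to keeping only the format characters in their original order. A minimal sketch follows; the function and constant names are assumptions, while the default character set (TAB, carriage return, line feed) is the one given in the text.

```python
from typing import Collection

# Default format characters named in the text: TAB, carriage return, line feed.
FORMAT_CHARS = frozenset("\t\r\n")


def format_sequence(raw_mail: str,
                    format_chars: Collection[str] = FORMAT_CHARS) -> str:
    """Keep only the format characters of a mail, in order of appearance."""
    return "".join(ch for ch in raw_mail if ch in format_chars)
```

Passing a larger set, for example `FORMAT_CHARS | set(",.!?")`, treats punctuation as format characters as well. The resulting sequences can then be compared as ordinary strings.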
The mail system can set a different similarity threshold for each kind of compared content, for example 95% for mail headers, 80% for the mail body, and 98% for attachments such as pictures. Two items are judged similar only if the comparison result exceeds the configured similarity threshold. These thresholds can be adjusted according to user feedback; the details are not repeated here.
In addition, the comparison results for the different parts can be combined, according to specified rules, into a composite similarity index that serves as the criterion for identifying legitimate mail and spam. For example, if two input mails have a header similarity > 95% and a body similarity > 80%, and both contain the same picture attachment, the two mails are judged to be similar.
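The per-part thresholds and the example rule can be written down as a small sketch. The threshold values come from the text; the dictionary keys and function names are assumptions made for illustration, and the rule shown is only the one example the description gives, not a general scoring formula.

```python
# Example thresholds from the description; adjustable on user feedback.
THRESHOLDS = {
    "header": 0.95,
    "body": 0.80,
    "attachment": 0.98,
}


def part_is_similar(part: str, score: float) -> bool:
    """A part counts as similar only if its score exceeds the threshold."""
    return score > THRESHOLDS[part]


def mails_are_similar(header_sim: float, body_sim: float,
                      same_picture_attachment: bool) -> bool:
    """Example composite rule: header > 95%, body > 80%, and both mails
    contain the same picture attachment."""
    return (
        part_is_similar("header", header_sim)
        and part_is_similar("body", body_sim)
        and same_picture_attachment
    )
```

The comparison is strict, so a body similarity of exactly 80% does not pass, which matches the requirement that the result exceed the threshold.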
Through the above process, the spam determination module can readily determine whether the mail is legitimate.
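Step 403: The spam determination module sends a query request to the spam template library maintenance module and judges, according to the data records of the spam template library, whether the mail is spam. If so, step 404 is executed; otherwise step 405 is executed.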
In this step, spam is judged in much the same way as legitimate mail, so the details are not repeated here.
Step 404: The spam determination module marks the mail as spam, assigns it a high-risk spam indicator, and sends it to the general mail control unit; step 410 is then executed.
The high-risk spam indicator includes the spam similarity, the number of mails covered by the spam template, and so on. A covered-mail count of 10,000, for example, means that the spam template was extracted from 10,000 spam mails.
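Step 405: The spam determination module sends a query request to the all-mail library maintenance module and searches the mail samples stored in the all-mail sample library for mails similar to the mail. If the number of similar mails exceeds a preset threshold T, step 406 is executed; otherwise step 407 is executed directly.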
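Step 406: The spam determination module sends a template extraction instruction to the mail template extraction module. On receiving the instruction, the mail template extraction module selects from the all-mail sample library at least one mail sample similar to the mail to be received/sent, and compares the mail with the sample using the Alignment comparison algorithm to generate a new mail template.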
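Step 407: The spam determination module judges the type of the mail according to auxiliary information configured in the module itself. If the mail is spam, it notifies the mail template extraction module to save the new mail template to the spam template library, and step 408 is executed; if the mail is legitimate, the new mail template is saved to the legitimate mail list library, and step 409 is executed.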
The auxiliary information refers to blacklists, whitelists and the like. For example, if a mail was delivered from a trusted address, that address is placed on the whitelist, and when the spam determination module later receives mail from the same sending address, it judges that mail to be legitimate. Spam information can be configured in the same way; the details are not repeated here.
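A minimal sketch of a blacklist/whitelist lookup of this kind. The function name, the address normalization and the `None` result for unknown senders are assumptions; the description does not say what happens when a sender appears on neither list.

```python
from typing import Optional, Set


def classify_by_sender(sender: str, whitelist: Set[str],
                       blacklist: Set[str]) -> Optional[str]:
    """Return "legitimate", "spam", or None when the lists give no answer."""
    address = sender.strip().lower()  # normalization is an assumption
    if address in whitelist:
        return "legitimate"
    if address in blacklist:
        return "spam"
    return None  # undecided: behaviour not specified in the description


# Hypothetical addresses used only for illustration.
whitelist = {"colleague@example.com"}
blacklist = {"bulk@example.net"}
result = classify_by_sender(" Colleague@Example.com ", whitelist, blacklist)
```

Checking the whitelist first means a trusted address always wins if it was accidentally placed on both lists; that ordering is a design choice of this sketch.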
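Step 408: The spam determination module marks the mail as spam, assigns it a medium spam indicator, and sends it to the general mail control unit; step 410 is then executed.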
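Step 409: The spam determination module marks the mail as legitimate, assigns it a legitimate-mail indicator, and sends it to the general mail control unit.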
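Step 410: The general mail control unit performs subsequent processing on the mail according to its type identifier; the details are not repeated here.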
Mail carrying the spam identifier is refused or sent only in limited quantity by the mail processing device, while mail judged to be legitimate is sent normally.
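The decision flow of steps 403 to 410, preceded by the legitimate-template check, can be sketched end to end as follows. This is an illustration under several assumptions, not the patented implementation: `difflib.SequenceMatcher` stands in for the Alignment comparison algorithm, a single similarity threshold is used for every comparison, a mail matching a legitimate template is assumed to go straight to step 409, the auxiliary information is reduced to a blacklist, and senders not on it are treated as legitimate. All names are hypothetical.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Set, Tuple


def sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def make_template(a: str, b: str, min_len: int = 4) -> str:
    """Join the fragments shared by both mails into a template string."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return "".join(a[m.a:m.a + m.size] for m in blocks if m.size >= min_len)


@dataclass
class MailDatabase:
    legit_templates: List[str] = field(default_factory=list)
    spam_templates: List[str] = field(default_factory=list)
    samples: List[str] = field(default_factory=list)
    blacklist: Set[str] = field(default_factory=set)


def judge(mail: str, sender: str, db: MailDatabase,
          threshold: float = 0.8, T: int = 2) -> Tuple[str, str]:
    """Return (type identifier, indicator) for one mail."""
    # Legitimate-template check (assumed to lead to step 409).
    if any(sim(mail, t) > threshold for t in db.legit_templates):
        return "legitimate", "legitimate"
    # Steps 403 and 404: match against the spam template library.
    if any(sim(mail, t) > threshold for t in db.spam_templates):
        return "spam", "high"
    # Step 405: count similar mails in the all-mail sample library.
    similar = [s for s in db.samples if sim(mail, s) > threshold]
    new_template = ""
    if len(similar) > T:
        # Step 406: extract a new template from the mail and one sample.
        new_template = make_template(mail, similar[0])
    # Step 407: decide by auxiliary information (blacklist only here).
    if sender in db.blacklist:
        if new_template:
            db.spam_templates.append(new_template)
        return "spam", "medium"  # step 408
    if new_template:
        db.legit_templates.append(new_template)
    return "legitimate", "legitimate"  # step 409


def control(type_identifier: str) -> str:
    """Step 410: refuse or limit spam, send legitimate mail normally."""
    return "refuse_or_limit" if type_identifier == "spam" else "send_normally"
```

Because a new template is stored as soon as the mail type is known, the next mail of the same kind is caught at the template check instead of requiring another scan of the sample library.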
The anti-spam processing described above can be triggered whenever mail is received or sent, which strengthens anti-spam processing during mail exchange. The mail processing device may treat receiving and sending differently when determining the mail type. For example, the all-mail sample library must be queried for received mail, but this need not be done when sending mail. As another example, when the mail processing device is deployed on a mail server, received mail is compared against a larger mail database, whereas sent mail only needs to be compared against a small set of mail templates.
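One way to express the different handling of received and sent mail is a small policy function that lists which libraries to consult. The enum, the library labels and the function name are illustrative assumptions; only the rule that the all-mail sample library is queried for received mail and skipped for sent mail comes from the description.

```python
from enum import Enum
from typing import List


class Direction(Enum):
    RECEIVE = "receive"
    SEND = "send"


def libraries_to_query(direction: Direction) -> List[str]:
    """Return the libraries consulted for a mail in the given direction."""
    libraries = ["legitimate_templates", "spam_templates"]
    if direction is Direction.RECEIVE:
        # Received mail is also checked against the all-mail sample library.
        libraries.append("all_mail_samples")
    return libraries
```

Keeping the outgoing path to template lookups only keeps sending cheap, at the cost of not discovering new templates from outgoing mail.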
In addition, the mail processing device can be deployed on both the mail server and the personal client, giving the whole mail system a stronger anti-spam capability. When a personal client sends a mail to the mail server, it scans the mail against the data records in its own mail database to determine the mail's type; after receiving the mail, the mail server can perform a further type determination and, where appropriate, extract a new mail template. In practice, the anti-spam processing of the mail system is not limited to this arrangement: the mail scan can be triggered on any mail processing device as required; the details are not repeated here.
As the above embodiments show, the anti-spam processing device and method of the present invention compare the mail to be received/sent with templates stored in advance in the mail database, distinguish spam from legitimate mail on the basis of existing information, and, once the mail type has been determined, extract the features of the mail as a new mail template for subsequent processing, thereby achieving a better anti-spam effect.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200610001083A CN100589453C (en) | 2006-01-16 | 2006-01-16 | An anti-spam processing device and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101005462A (en) | 2007-07-25 |
CN100589453C (en) | 2010-02-10 |
Family
ID=38704332
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200610001083A Active CN100589453C (en) | 2006-01-16 | 2006-01-16 | An anti-spam processing device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100589453C (en) |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101330476B (en) * | 2008-07-02 | 2011-04-13 | 北京大学 | Method for dynamically detecting junk mail |
WO2010135861A1 (en) * | 2009-05-25 | 2010-12-02 | Chiao Hakfung | Mail system, junk mail processor and method for marking junk mails |
CN101699818B (en) * | 2009-11-11 | 2012-07-04 | 海南电力试验研究所 | Anti-spam management system and method thereof |
CN102098332B (en) * | 2010-12-30 | 2014-04-16 | 北京新媒传信科技有限公司 | Method and device for examining and verifying contents |
CN103188136B (en) * | 2011-12-30 | 2016-04-27 | 盈世信息科技(北京)有限公司 | A kind of filtrating mail information saving method, mail server and e-mail system |
CN103841094B (en) * | 2012-11-27 | 2017-04-12 | 阿里巴巴集团控股有限公司 | Method and device for judging mail types |
CN103795612B (en) * | 2014-01-15 | 2017-09-12 | 五八同城信息技术有限公司 | Rubbish and illegal information detecting method in instant messaging |
CN105306342B (en) * | 2015-09-29 | 2019-04-09 | 武汉钢铁(集团)公司 | A kind of processing method and system of non-standard mailing system information errors |
CN105871701A (en) * | 2016-05-30 | 2016-08-17 | 周奇 | Email handling method and device |
CN106066884A (en) * | 2016-06-06 | 2016-11-02 | 珠海市小源科技有限公司 | A kind of information security recognition methods and device |
CN107819664A (en) * | 2016-09-12 | 2018-03-20 | 阿里巴巴集团控股有限公司 | A kind of recognition methods of spam, device and electronic equipment |
CN107171937A (en) * | 2017-05-11 | 2017-09-15 | 翼果(深圳)科技有限公司 | The method and system of anti-rubbish mail |
CN107171944B (en) * | 2017-06-27 | 2020-06-16 | 北京二六三企业通信有限公司 | Junk mail identification method and device |
CN108769140A (en) * | 2018-05-09 | 2018-11-06 | 国家计算机网络与信息安全管理中心 | A kind of realtime graphic Text region caching acceleration system |
CN115567476B (en) * | 2022-09-28 | 2024-10-15 | 建信金融科技有限责任公司 | Junk mail detection method, device, storage medium and computer program product |
Non-Patent Citations (1)
Title |
---|
基于ART神经网络的垃圾邮件过滤技术. 马凤云,刘培玉.信息技术与信息化,第4期. 2005 * |
Also Published As
Publication number | Publication date |
---|---|
CN101005462A (en) | 2007-07-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100589453C (en) | An anti-spam processing device and method | |
US10042919B2 (en) | Using distinguishing properties to classify messages | |
JP4598774B2 (en) | Method and apparatus for filtering email spam based on similarity measures | |
EP1492283B1 (en) | Method and device for spam detection | |
US9906554B2 (en) | Suspicious message processing and incident response | |
US8935348B2 (en) | Message classification using legitimate contact points | |
US20050050150A1 (en) | Filter, system and method for filtering an electronic mail message | |
US20190319905A1 (en) | Mail protection system | |
US7984029B2 (en) | Reliability of duplicate document detection algorithms | |
US7725475B1 (en) | Simplifying lexicon creation in hybrid duplicate detection and inductive classifier systems | |
US20070283000A1 (en) | Method and system for phishing detection | |
US20060259551A1 (en) | Detection of unsolicited electronic messages | |
RU2710739C1 (en) | System and method of generating heuristic rules for detecting messages containing spam | |
US8473556B2 (en) | Apparatus, a method, a program and a system for processing an e-mail | |
CN103001848B (en) | Rubbish mail filtering method and device | |
EP1733521B1 (en) | A method and an apparatus to classify electronic communication | |
HK1176187A (en) | Managing unwanted communications using template generation and fingerprint comparison features |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |