CN115473676A - Phishing mail detection method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN115473676A (CN202210946321.5A)
- Authority
- CN
- China
- Prior art keywords
- mailbox
- detected
- sender
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1416—Event detection, e.g. attack signature detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/07—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail characterised by the inclusion of specific contents
- H04L51/08—Annexed information, e.g. attachments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1408—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
- H04L63/1425—Traffic logging, e.g. anomaly detection
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/14—Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
- H04L63/1441—Countermeasures against malicious traffic
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Embodiments of the present invention provide a phishing email detection method and apparatus, an electronic device, and a storage medium, relating to the technical field of network security. The method includes: obtaining a pre-trained phishing email detection model, enterprise internal information related to the enterprise mailbox service, and mail gateway logs; determining, based on the enterprise internal information and the mail gateway logs, the email features of an email to be detected; and inputting the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the model. The phishing email detection model is obtained by training a binary classification model on the email features of labeled historical emails and the label values of those emails. The invention can reduce the probability of missed detections and false positives for phishing emails and improve the reliability of phishing email detection.
Description
Technical Field
The present invention relates to the technical field of network security, and in particular to a phishing email detection method and apparatus, an electronic device, and a storage medium.
Background
Email is an important tool for everyday work communication. In a phishing email, an attacker sends an email disguised as a normal one to induce the recipient to visit a malicious link or open a malicious attachment, with the aim of taking control of the recipient's host or stealing the recipient's private data. For attackers, breaching the security perimeter through phishing emails is one of the commonly used attack methods. Enterprises therefore need to detect phishing emails promptly among massive volumes of email in order to reduce their security risk.
In the prior art, one approach matches email information against keywords and judges any email that matches a keyword to be a phishing email. This approach can only detect emails whose information contains keywords from the keyword library, and it is prone to false positives. Another approach uses an email sandbox to execute the attachments of emails that contain attachments, analyzes and monitors the behavior of the executed attachments, and decides from the results whether the email is a phishing email, thereby finding attachments that contain malicious code. However, an attacker can bypass sandbox detection by encrypting and compressing the attachment, which makes detection unreliable.
Summary of the Invention
In view of the problems in the prior art, embodiments of the present invention provide a phishing email detection method and apparatus, an electronic device, and a storage medium.
Specifically, embodiments of the present invention provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a phishing email detection method, the method including:
obtaining a pre-trained phishing email detection model, enterprise internal information related to the enterprise mailbox service, and mail gateway logs;
determining, based on the enterprise internal information and the mail gateway logs, the email features of an email to be detected;
inputting the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the phishing email detection model; the email types include phishing email and non-phishing email; the phishing email detection model is obtained by training a binary classification model on the email features of labeled historical emails and the label values of the labeled historical emails.
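The three steps above can be sketched as a small pipeline. This is only an illustrative sketch: the names `detect_mail_type`, `toy_extract`, and `toy_model` are hypothetical, the dictionary-based mail representation is an assumption, and the toy rule stands in for a trained model that the source does not specify in detail.

```python
from typing import Callable, Dict, List

# Hypothetical type aliases; the source does not define concrete data types.
Mail = Dict[str, str]
FeatureExtractor = Callable[[Mail], List[float]]
Model = Callable[[List[float]], int]  # returns 1 for phishing, 0 otherwise


def detect_mail_type(mail: Mail, extract: FeatureExtractor, model: Model) -> str:
    """Run the flow: mail -> email features -> model -> email type."""
    features = extract(mail)
    label = model(features)
    return "phishing" if label == 1 else "non-phishing"


def toy_extract(mail: Mail) -> List[float]:
    # Toy stand-in: a single feature, 1.0 if the attachment field is non-empty.
    return [1.0 if mail.get("attachment_name") else 0.0]


def toy_model(features: List[float]) -> int:
    # Placeholder rule for illustration only; not a trained classifier.
    return 1 if features[0] > 0.5 else 0
```

In a real deployment the extractor would produce the full feature set and the model would be the trained binary classifier; only the calling shape is shown here.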
Further, the email features include at least one of the following:
features for distinguishing whether an email is disguised as an internal email, including at least one of the following: whether the email contains an attachment, the anomaly level corresponding to the email attachment type, whether the email attachment name contains Chinese, the similarity between the email attachment name and intranet email attachment names, the similarity between the email subject and intranet email subjects, the similarity between the sender mailbox domain name and the intranet mailbox domain name, the similarity between the sender mailbox name and the intranet mailbox domain name, the similarity between the sender nickname and intranet mailbox nicknames, and the similarity between the sender nickname and the enterprise's internal organizations;
features for distinguishing whether the sender-recipient relationship of an email is a normal one, including at least one of the following: the number of emails historically sent by external mailboxes, the number of email recipients, the number of departments corresponding to the email recipients, the number of emails the recipient's department has historically received from this sender, the number of emails the recipient has historically received from this sender, and the number of emails the recipient has historically received whose sender is an external mailbox.
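As an illustration, the two feature groups listed above could be held in one record and flattened into a numeric vector for the model. The class name `MailFeatures`, the field names, the field order, and the choice of integer versus float types are all assumptions made for this sketch; the source only names the features.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass
class MailFeatures:
    # Group 1: is the email disguised as an internal email?
    has_attachment: int = 0
    attachment_anomaly_level: int = 0
    attachment_name_has_chinese: int = 0
    attachment_name_similarity: float = 0.0
    subject_similarity: float = 0.0
    sender_domain_similarity: float = 0.0
    sender_name_similarity: float = 0.0
    nickname_similarity: float = 0.0
    nickname_org_similarity: float = 0.0
    # Group 2: is the sender-recipient relationship a normal one?
    external_sent_count: int = 0
    recipient_count: int = 0
    recipient_department_count: int = 0
    department_received_from_sender: int = 0
    recipient_received_from_sender: int = 0
    recipient_received_external: int = 0

    def to_vector(self) -> List[float]:
        # Flatten the fields, in declaration order, into a list of floats.
        return [float(value) for value in astuple(self)]
```

Because the method requires only "at least one" of these features, a given embodiment may populate a subset and leave the rest at their defaults.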
Further, determining the email features of the email to be detected based on the enterprise internal information and the mail gateway logs includes at least one of the following:
when the value of the attachment field corresponding to the email to be detected in the mail gateway logs is non-empty, determining that the email to be detected contains an attachment; when the value of the attachment field corresponding to the email to be detected is empty, determining that the email to be detected does not contain an attachment;
determining the anomaly level corresponding to the attachment type of the email to be detected, based on the file suffix of the email attachment and a preset correspondence between file suffixes and anomaly levels;
when the attachment name of the email to be detected matches a preset regular expression, determining that the attachment name contains Chinese, the regular expression being used to test whether an attachment name contains Chinese characters; when the attachment name does not match the preset regular expression, determining that the attachment name does not contain Chinese;
extracting from the history log the attachment names of at least one historical email whose sender mailbox is an internal enterprise mailbox; segmenting the attachment name of each historical email into words to obtain word groups; computing the frequency of each word in the word groups to obtain a word-frequency set; segmenting the attachment name of the email to be detected into words to obtain a text word group; matching the text word group against the word-frequency set to obtain the frequency of each word in the text word group; computing the average of those frequencies; and normalizing the average to obtain the similarity between the email attachment name and intranet email attachment names;
extracting from the history log the subjects of historical emails whose sender mailbox is an internal enterprise mailbox; segmenting the subject of each historical email into words to obtain word groups; computing the frequency of each word in the word groups to obtain a word-frequency set; segmenting the subject of the email to be detected into words to obtain a text word group; matching the text word group against the word-frequency set to obtain the frequency of each word in the text word group; computing the average of those frequencies; and normalizing the average to obtain the similarity between the email subject and intranet email subjects;
extracting the sender mailbox domain name from the sender mailbox of the email to be detected; determining the similarity between the sender mailbox domain name and the intranet mailbox domain name;
extracting the sender mailbox name from the sender mailbox of the email to be detected; determining the similarity between the sender mailbox name and the intranet mailbox domain name;
extracting from the history log the sender nicknames of historical emails whose sender mailbox is an internal enterprise mailbox; segmenting the sender nickname of each historical email into words to obtain word groups; computing the frequency of each word in the word groups to obtain a word-frequency set; segmenting the sender nickname of the email to be detected into words to obtain a text word group; matching the text word group against the word-frequency set to obtain the frequency of each word in the text word group; computing the average of those frequencies; and normalizing the average to obtain the similarity between the sender nickname and intranet mailbox nicknames;
segmenting each internal organization in the enterprise internal organization information set into words to obtain word groups; computing the frequency of each word in the word groups to obtain a word-frequency set; segmenting the sender nickname of the email to be detected into words to obtain a text word group; matching the text word group against the word-frequency set to obtain the frequency of each word in the text word group; computing the average of those frequencies; and normalizing the average to obtain the similarity between the sender nickname and the enterprise's internal organizations;
counting, from the history log, the historical emails whose sender mailbox is not an internal enterprise mailbox, to obtain the number of emails historically sent by external mailboxes;
determining the number of email recipients based on the number of recipient mailboxes of the email to be detected;
extracting the recipient mailbox name from the recipient mailbox of the email to be detected; determining the department corresponding to the recipient mailbox based on the recipient mailbox name and the enterprise employee-to-department mapping set; counting the distinct departments corresponding to the recipient mailboxes to obtain the number of departments corresponding to the email recipients;
extracting the recipient mailbox name from the recipient mailbox of the email to be detected; determining the department corresponding to the recipient mailbox based on the recipient mailbox name and the enterprise employee-to-department mapping set; counting, from the history log, the number of emails that the department corresponding to the recipient mailbox received from the sender mailbox of the email to be detected within the target historical period;
counting, from the history log, the number of emails that the recipient mailbox of the email to be detected received from the sender mailbox of the email to be detected;
counting, from the history log, the number of emails received by the recipient mailbox of the email to be detected whose sender mailbox is an external mailbox.
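The word-frequency similarity used above for attachment names, subjects, nicknames, and organization names follows the same steps each time, so one sketch covers all four. Several details here are assumptions because the source does not specify them: the tokenizer (lowercase alphanumeric runs plus single CJK characters, where a real system would likely use a proper Chinese word segmenter), the normalization (dividing the average by the largest frequency in the set so the result lies in [0, 1]), and the function names.

```python
import re
from collections import Counter
from typing import Iterable, List


def tokenize(text: str) -> List[str]:
    # Assumption: lowercase ASCII alphanumeric runs and single CJK characters
    # stand in for the word segmentation step.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def build_word_freq(history_texts: Iterable[str]) -> Counter:
    # Word-frequency set built from internal historical texts.
    freq: Counter = Counter()
    for text in history_texts:
        freq.update(tokenize(text))
    return freq


def frequency_similarity(text: str, freq: Counter) -> float:
    # Look up each word's frequency, average, then normalize by the
    # maximum frequency (an assumed normalization).
    words = tokenize(text)
    if not words or not freq:
        return 0.0
    mean = sum(freq[word] for word in words) / len(words)
    return mean / max(freq.values())
```

Under these assumptions, a text built entirely from the most common internal words scores 1.0 and a text with no known words scores 0.0, which matches the intent of measuring how closely an external email imitates internal wording.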
Further, determining the email features of the email to be detected based on the enterprise internal information and the mail gateway logs includes:
when the email attribute information of the email to be detected includes a sender mailbox name, determining the mailbox domain name information of the email to be detected based on that sender mailbox name;
when the enterprise internal information includes a set of internal enterprise mailbox domain names and the mailbox domain name information of the email to be detected does not match that set, determining the email features of the email to be detected based on the enterprise internal information, the mail gateway logs, and the email attribute information of the email to be detected.
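A minimal sketch of this pre-filter follows. It assumes that "match" means exact membership of the lowercased domain in the internal domain set, and that emails with no parsable sender domain are skipped; the source states neither. The `domain_similarity` helper is likewise an assumption: the source does not say how domain similarity is computed, so the standard-library `difflib.SequenceMatcher` ratio is used here purely as a stand-in.

```python
from difflib import SequenceMatcher
from typing import Optional, Set


def extract_domain(sender_mailbox: str) -> Optional[str]:
    # Return the part after the last "@", lowercased, or None if absent.
    if "@" not in sender_mailbox:
        return None
    domain = sender_mailbox.rsplit("@", 1)[1].strip().lower()
    return domain or None


def needs_feature_extraction(sender_mailbox: str, internal_domains: Set[str]) -> bool:
    # Features are determined only when the sender domain is not internal.
    domain = extract_domain(sender_mailbox)
    if domain is None:
        return False  # assumption: behavior for a missing domain is unspecified
    return domain not in internal_domains


def domain_similarity(domain: str, internal_domains: Set[str]) -> float:
    # Assumed metric: best SequenceMatcher ratio against any internal domain.
    if not internal_domains:
        return 0.0
    return max(SequenceMatcher(None, domain, d).ratio() for d in internal_domains)
```

A look-alike domain such as `examp1e.com` fails the exact match, so it proceeds to feature extraction, and its high similarity to `example.com` is what the domain-similarity feature is meant to expose.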
Further, before obtaining the pre-trained phishing email detection model, the enterprise internal information related to the enterprise mailbox service, and the mail gateway logs, the method further includes:
obtaining labeled historical emails and the enterprise internal information;
determining, based on the labeled historical emails and the enterprise internal information, the email features corresponding to the labeled historical emails;
training a binary classification model on the email features corresponding to the labeled historical emails and the label values of the labeled historical emails to obtain the phishing email detection model;
wherein the label value of a labeled historical email indicates whether that email is a phishing email.
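The training step can be illustrated with a small binary classifier. The source only says "binary classification model" at this point, so the logistic regression below, trained by plain stochastic gradient descent, is a stand-in chosen for brevity; the function names, learning rate, and epoch count are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], float]  # (weights, bias)


def sigmoid(z: float) -> float:
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(features: Sequence[Sequence[float]], labels: Sequence[int],
          lr: float = 0.5, epochs: int = 500) -> Model:
    # labels: 1 = phishing, 0 = non-phishing (the label values).
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * err * xi for w, xi in zip(weights, x)]
            bias -= lr * err
    return weights, bias


def predict(model: Model, x: Sequence[float]) -> int:
    weights, bias = model
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0
```

In practice the feature rows would be the vectors computed from labeled historical emails, and count-valued features would typically be scaled before training a model of this kind.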
Further, the enterprise internal information includes at least one of the following:
a set of enterprise internal organization information;
a set of enterprise employee-to-department mapping information;
a set of enterprise internal mailboxes;
a set of enterprise internal mailbox domain names.
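One way to hold the four sets listed above, together with the department-count feature that depends on the employee-to-department mapping, is sketched below. The class and field names are hypothetical, and the sketch assumes the mapping is keyed by the mailbox name (the part before "@") and that recipients missing from the mapping are ignored.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class EnterpriseInfo:
    organizations: Set[str] = field(default_factory=set)
    # Assumed shape: mailbox name (local part) -> department name.
    employee_department: Dict[str, str] = field(default_factory=dict)
    internal_mailboxes: Set[str] = field(default_factory=set)
    internal_domains: Set[str] = field(default_factory=set)


def recipient_department_count(recipients: Iterable[str], info: EnterpriseInfo) -> int:
    # Map each recipient to a department, then count distinct departments.
    departments = set()
    for mailbox in recipients:
        name = mailbox.split("@", 1)[0].lower()
        department = info.employee_department.get(name)
        if department is not None:
            departments.add(department)
    return len(departments)
```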
Further, the mail gateway logs include email attribute information for N emails, the email attribute information including at least one of the following:
sender nickname;
sender mailbox;
recipient mailbox;
email subject;
email attachment name;
email attachment type.
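The attribute fields above map naturally onto a record type, from which the three attachment-related features can be derived. In this sketch the record layout, the suffix-to-level table `SUFFIX_ANOMALY_LEVEL` and its values, and the regular expression (the CJK Unified Ideographs range U+4E00 to U+9FFF) are all illustrative assumptions; the source only says that a preset table and a preset regular expression exist.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class MailRecord:
    sender_nickname: str = ""
    sender_mailbox: str = ""
    recipient_mailboxes: List[str] = field(default_factory=list)
    subject: str = ""
    attachment_name: str = ""
    attachment_type: str = ""


# Hypothetical preset table: file suffix -> anomaly level (higher = riskier).
SUFFIX_ANOMALY_LEVEL = {"exe": 3, "js": 3, "zip": 2, "docm": 2, "pdf": 1}
# Assumed pattern for "contains Chinese characters".
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def has_attachment(mail: MailRecord) -> int:
    # Non-empty attachment field means the email contains an attachment.
    return 1 if mail.attachment_name.strip() else 0


def attachment_anomaly_level(mail: MailRecord) -> int:
    name = mail.attachment_name
    suffix = name.rsplit(".", 1)[1].lower() if "." in name else ""
    return SUFFIX_ANOMALY_LEVEL.get(suffix, 0)  # unknown suffixes default to 0


def attachment_name_has_chinese(mail: MailRecord) -> int:
    return 1 if CHINESE_CHAR.search(mail.attachment_name) else 0
```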
In a second aspect, an embodiment of the present invention further provides a phishing email detection apparatus, including:
an obtaining module, configured to obtain a pre-trained phishing email detection model, enterprise internal information related to the enterprise mailbox service, and mail gateway logs;
a determining module, configured to determine, based on the enterprise internal information and the mail gateway logs, the email features of an email to be detected;
a detection module, configured to input the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the phishing email detection model; the email types include phishing email and non-phishing email; the phishing email detection model is obtained by training a binary classification model on the email features of labeled historical emails and the label values of the labeled historical emails.
In a third aspect, an embodiment of the present invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the phishing email detection method of the first aspect when executing the program.
In a fourth aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements the phishing email detection method of the first aspect.
In a fifth aspect, an embodiment of the present invention further provides a computer program product storing executable instructions that, when executed by a processor, cause the processor to implement the phishing email detection method of the first aspect.
The phishing email detection method and apparatus, electronic device, and storage medium provided by embodiments of the present invention determine the email features of an email to be detected based on enterprise internal information and mail gateway logs, and use a pre-trained phishing email detection model to detect phishing emails in real time and judge whether the email to be detected is a phishing email. Because the phishing email detection model is obtained by training a binary classification model on the email features of labeled historical emails and the label values of those emails, the method generalizes better, applies to different phishing email variants, reduces the probability of missed detections and false positives, and improves the reliability of phishing email detection.
Brief Description of the Drawings
To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is the first schematic flowchart of the phishing email detection method provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of the training method for the phishing email detection model provided by an embodiment of the present invention;
FIG. 3 is the second schematic flowchart of the phishing email detection method provided by an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the phishing email detection system provided by an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the phishing email detection apparatus provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention.
Detailed Description
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
FIG. 1 is the first schematic flowchart of the phishing email detection method provided by an embodiment of the present invention. As shown in FIG. 1, the phishing email detection method includes the following steps:
Step 101: Obtain a pre-trained phishing email detection model, enterprise internal information related to the enterprise mailbox service, and mail gateway logs.
It should be noted that the phishing email detection method provided by the embodiments of the present invention is applicable to real-time phishing email detection. The method may be executed by a phishing email detection apparatus, for example an electronic device, or by a control module in the phishing email detection apparatus that executes the phishing email detection method. The electronic device may include a mobile phone, a tablet computer, a desktop computer, or the like.
In the prior art, to raise the success rate of inducing recipients to visit malicious links or open malicious attachments, attackers often use basic information they have collected about the target to disguise phishing emails through the sender nickname, sender mailbox, email subject, email content, email attachment name, and similar information.
The embodiments of the present invention use a pre-trained phishing email detection model to detect phishing emails in real time. The phishing email detection model is a machine learning model trained on labeled historical emails and enterprise internal information. It is used to evaluate the email type of the email to be detected.
Optionally, the enterprise internal information mentioned in the embodiments of the present invention may include at least one of the following: a set of enterprise internal organization information; a set of enterprise employee-to-department mapping information; a set of enterprise internal mailboxes; a set of enterprise internal mailbox domain names.
Optionally, the mail gateway logs may include a target log and a history log of the mail gateway. The target log includes email attribute information for N emails, the N emails include the email to be detected, and N is a positive integer. The target log is, for example, the log generated by the mail gateway in real time. The history log is, for example, the mail gateway's log within a target historical period; in practice, the history log is extracted from a database according to the configured target historical time range. For example, if the target historical period is the 6 months preceding the current date, the email logs from the last 6 months are extracted from the database as the history log. Any one of the N emails in the target log can serve as the email to be detected. With the phishing email detection method provided by the embodiments of the present invention, each of the N emails in the target log can be checked separately to judge whether it is a phishing email.
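Selecting the history log by time range can be sketched as a simple filter. The record layout (a dictionary with a `received_at` timestamp) and the function name are hypothetical, and approximating "6 months" as 183 days is an assumption; a real system would more likely push this condition into the database query.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def select_history(logs: List[Dict], now: datetime, window_days: int = 183) -> List[Dict]:
    # Keep records whose timestamp falls within the window ending at `now`.
    start = now - timedelta(days=window_days)
    return [record for record in logs if start <= record["received_at"] < now]
```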
可选地,本发明实施例中提及的邮件属性信息可以包括以下至少一项:发件人昵称;发件人邮箱;收件人邮箱;邮件主题;邮件附件名称;邮件附件类型。Optionally, the email attribute information mentioned in the embodiment of the present invention may include at least one of the following: sender's nickname; sender's email address; recipient's email address; email subject; email attachment name; email attachment type.
Step 102: Determine the email features of the email to be detected based on the enterprise internal information and the email gateway log.
The email features are used to distinguish whether an email is disguised as an internal email, and/or whether the sender-recipient relationship in the email is a normal sender-recipient relationship.
Optionally, the email features are used by the phishing email detection model to evaluate the email type of the email to be detected. The email features may include at least one of the following: 1) features for distinguishing whether an email is disguised as an internal email; 2) features for distinguishing whether the sender-recipient relationship in an email is a normal sender-recipient relationship. Table 1 shows the specific content of the email features mentioned in this embodiment.
The features for distinguishing whether an email is disguised as an internal email include at least one of the following: whether the email contains an attachment; the anomaly level corresponding to the attachment type; whether the attachment name contains Chinese characters; the similarity between the attachment name and intranet email attachment names; the similarity between the email subject and intranet email subjects; the similarity between the sender mailbox domain name and the intranet mailbox domain name; the similarity between the sender mailbox name and the intranet mailbox domain name; the similarity between the sender nickname and intranet mailbox nicknames; and the similarity between the sender nickname and the enterprise's internal organizations.
The features for distinguishing whether the sender-recipient relationship in an email is a normal sender-recipient relationship include at least one of the following: the number of emails historically sent by the external mailbox; the number of recipients of the email; the number of departments to which the recipients belong; the number of emails the recipient's department has historically received from this sender; the number of emails the recipient has historically received from this sender; and the number of emails the recipient has historically received from external mailboxes.
Table 1
Step 103: Input the email features of the email to be detected into the phishing email detection model, and obtain the email type of the email to be detected output by the model. The email type is either phishing email or non-phishing email. The phishing email detection model is obtained by training a binary classification model on the email features of labeled historical emails and the label values of those labeled historical emails.
It should be noted that the email attribute information mentioned in this embodiment does not include the email body; that is, the email body is not used when extracting features from the email to be detected, which effectively protects the privacy of the email. Based on the sender nickname, sender mailbox, recipient mailbox, email subject, attachment name, and attachment type of the email to be detected, together with the enterprise internal information, email features are extracted that can distinguish whether the email is disguised as an internal email and/or whether its sender-recipient relationship is normal. The email type of the email to be detected is then determined by the pre-trained phishing email detection model.
The phishing email detection method provided by this embodiment determines the email features of the email to be detected from the enterprise internal information and the email gateway log, and uses the pre-trained phishing email detection model to detect phishing emails in real time. Because the model is obtained by training a binary classification model on the email features and label values of labeled historical emails, the method generalizes better, applies to different phishing email variants, reduces the probability of false negatives and false positives, and improves the reliability of phishing email detection.
Optionally, each email feature is determined by a method suited to that feature. Specifically, determining the email features of the email to be detected based on the enterprise internal information and the email gateway log may include at least one of the following modes:
Mode 1: the email features include whether the email contains an attachment.
If the attachment field of the email to be detected in the target log is non-empty, it is determined that the email contains an attachment; if the attachment field is empty, it is determined that the email does not contain an attachment.
Mode 2: the email features include the anomaly level corresponding to the attachment type.
The anomaly level corresponding to the attachment type of the email to be detected is determined from the file suffix of its attachment and a preset correspondence between file suffixes and anomaly levels.
In practice, when the email to be detected contains an attachment, the file suffix of the attachment is extracted. For example, when the attachment field value is "产品使用手册.zip" ("Product User Manual.zip"), the extracted suffix is "zip"; the anomaly level for the attachment type is then looked up in the preset correspondence between file suffixes and anomaly levels. This correspondence is given by the anomaly level table shown in Table 2, which may be a mapping from file suffix types to anomaly levels provided by security analysts. The higher the anomaly level of the email to be detected, the more likely it is to be a phishing email.
Table 2
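Modes 1 and 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the suffix-to-level values in `SUFFIX_ANOMALY_LEVEL` are hypothetical placeholders (the actual contents of Table 2 are not reproduced here), and treating unknown suffixes as level 0 is an assumption.

```python
import os

# Hypothetical mapping; real values would come from Table 2 / security analysts.
SUFFIX_ANOMALY_LEVEL = {"exe": 3, "zip": 2, "pdf": 1}
DEFAULT_LEVEL = 0  # assumption: unknown suffix or no attachment maps to 0


def has_attachment(attachment_field):
    """Mode 1: a non-empty attachment field means the email has an attachment."""
    return bool(attachment_field and attachment_field.strip())


def attachment_anomaly_level(attachment_name):
    """Mode 2: look up the anomaly level by the attachment's file suffix."""
    if not has_attachment(attachment_name):
        return DEFAULT_LEVEL
    suffix = os.path.splitext(attachment_name)[1].lstrip(".").lower()
    return SUFFIX_ANOMALY_LEVEL.get(suffix, DEFAULT_LEVEL)
```

With the example from the text, `attachment_anomaly_level("产品使用手册.zip")` extracts the suffix "zip" and returns whatever level the table assigns to it.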
Mode 3: the email features include whether the attachment name contains Chinese characters.
If the attachment name of the email to be detected matches a preset regular expression, it is determined that the attachment name contains Chinese characters; the regular expression is used to test whether an attachment name contains Chinese characters. If the attachment name does not match the preset regular expression, it is determined that the attachment name does not contain Chinese characters.
For example, when the email to be detected has an attachment, a regular expression is used to test whether the attachment name contains Chinese characters. If it does, the match result is yes; otherwise, the match result is no.
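A minimal sketch of Mode 3 follows. The source does not give the regular expression itself; the pattern below, which covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), is an assumption chosen for illustration.

```python
import re

# Assumption: "Chinese character" means a code point in the basic
# CJK Unified Ideographs block; the source does not specify the pattern.
CHINESE_CHAR = re.compile(r"[\u4e00-\u9fff]")


def attachment_name_contains_chinese(name):
    """Return True if the attachment name has at least one Chinese character."""
    return bool(name) and CHINESE_CHAR.search(name) is not None
```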
Mode 4: the email features include the similarity between the attachment name and intranet email attachment names. Determining this feature for the email to be detected includes steps 4_1 to 4_6:
Step 4_1: Extract from the historical log the attachment name of at least one historical email whose sender mailbox is an enterprise internal mailbox.
The historical log is, for example, the log of the email gateway within the target historical period. The target historical time range is specified by expert analysts, for example the past 6 months. In practice, the historical log is extracted from the database according to the configured time range; for example, if the target historical period is the 6 months preceding the current date, the email logs of the most recent 6 months are extracted as the historical log. Attachment names are extracted only from historical emails that were sent from an enterprise internal mailbox and that contain an attachment.
Step 4_2: Segment the attachment name of each historical email into words to obtain word sets.
A word segmentation algorithm is applied to each attachment name. For example, if the attachment name is "员工福利调整说明" ("Employee Benefits Adjustment Notice"), the resulting word set is "员工 (employee), 福利 (benefits), 调整 (adjustment), 说明 (notice)".
Step 4_3: Compute the frequency of each word across the word sets to obtain a word frequency set.
The word frequency set is computed by counting how many times each word occurs in the attachment names. For example, suppose three emails within the specified period have attachments, named "员工福利调整说明" ("Employee Benefits Adjustment Notice"), "员工信息" ("Employee Information"), and "放假说明" ("Holiday Notice"). Segmentation yields three word sets: "员工, 福利, 调整, 说明", "员工, 信息", and "放假, 说明". The frequency of "员工" is then 2, "说明" is 2, "福利" is 1, "调整" is 1, and "放假" is 1. The word frequency set of attachment names is shown in Table 3.
Table 3
Step 4_4: Segment the attachment name of the email to be detected into words to obtain a text word group.
For example, if the email to be detected has an attachment named "放假调整说明.pdf" ("Holiday Adjustment Notice.pdf"), the text to segment is "放假调整说明", and segmentation yields the word group "放假 (holiday), 调整 (adjustment), 说明 (notice)".
Step 4_5: Match the text word group against the word frequency set to obtain the frequency of each word in the text word group.
For the word group "放假, 调整, 说明", the values matched from the attachment-name word frequency set in Table 3 are "1, 1, 2". If a word in the text word group does not appear in the word frequency set, its frequency is 0.
Step 4_6: Compute the average of the word frequencies in the text word group, and normalize the average to obtain the similarity between the attachment name and intranet email attachment names.
The frequencies of the words in the text word group are summed and the sum is divided by the number of words in the group to obtain the average; the average is then normalized to obtain the similarity.
After normalization the value is less than or equal to 1, and the normalized value is used to represent the similarity: the higher the value, the more similar the attachment name is to intranet email attachment names.
It should be noted that the above explanation of normalization applies to every embodiment of this patent application that involves normalization; to avoid repetition, it is not restated below.
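Steps 4_2 to 4_6 can be sketched as below. Two points are assumptions rather than statements from the source: the input names are taken as already segmented (the segmentation algorithm is not named in the text), and normalization is done by dividing the average by the largest frequency in the word frequency set, which is one simple way to keep the result at or below 1.

```python
from collections import Counter


def build_word_frequencies(segmented_names):
    """Step 4_3: count occurrences of each word across all word sets."""
    counts = Counter()
    for words in segmented_names:
        counts.update(words)
    return counts


def frequency_similarity(query_words, frequencies):
    """Steps 4_5 and 4_6: average the matched frequencies, then normalize.

    Words missing from the frequency set contribute 0. Normalizing by the
    maximum frequency is an assumption; the source only requires a value <= 1.
    """
    if not query_words or not frequencies:
        return 0.0
    mean = sum(frequencies.get(w, 0) for w in query_words) / len(query_words)
    return mean / max(frequencies.values())


history = [["员工", "福利", "调整", "说明"], ["员工", "信息"], ["放假", "说明"]]
freqs = build_word_frequencies(history)
similarity = frequency_similarity(["放假", "调整", "说明"], freqs)
```

For the worked example in the text, the matched frequencies are 1, 1, and 2, the average is 4/3, and under the assumed max-frequency normalization the similarity is 2/3. The same routine applies to the subject, nickname, and organization features described in Modes 5, 8, and 9.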
Mode 5: the email features include the similarity between the email subject and intranet email subjects.
Extract from the historical log the subjects of historical emails whose sender mailbox is an enterprise internal mailbox; segment each subject into words to obtain word sets; compute the frequency of each word across the word sets to obtain a word frequency set; segment the subject of the email to be detected to obtain a text word group; match the text word group against the word frequency set to obtain the frequency of each word in the group; compute the average of these frequencies; and normalize the average to obtain the similarity between the email subject and intranet email subjects.
Mode 6: the email features include the similarity between the sender mailbox domain name and the intranet mailbox domain name.
Extract the sender mailbox domain name from the sender mailbox of the email to be detected, and determine its similarity to the intranet mailbox domain name, for example by using a Python algorithm library.
For example, if the intranet mailbox domain name is "mail.com" and the sender mailbox is zhangsan@mail1.com, the extracted sender mailbox domain name is "mail1.com".
Mode 7: the email features include the similarity between the sender mailbox name and the intranet mailbox domain name.
Extract the sender mailbox name from the sender mailbox of the email to be detected, and determine its similarity to the intranet mailbox domain name.
For example, if the intranet mailbox domain name is "mail.com" and the sender mailbox is mail@qq.com, the extracted sender mailbox name is "mail".
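Modes 6 and 7 can be sketched with Python's standard library. The text mentions only "a Python algorithm library" without naming one, so using `difflib.SequenceMatcher` here is an assumption, as is taking the highest score when several intranet domain names are configured.

```python
from difflib import SequenceMatcher


def split_mailbox(address):
    """Split 'name@domain' into (mailbox name, domain name)."""
    name, _, domain = address.rpartition("@")
    return name, domain


def string_similarity(a, b):
    """Similarity ratio in [0, 1]; SequenceMatcher is an assumed choice."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def domain_similarity(sender, internal_domains):
    """Mode 6: sender domain name vs. intranet domain names (best match)."""
    _, domain = split_mailbox(sender)
    return max((string_similarity(domain, d) for d in internal_domains),
               default=0.0)


def mailbox_name_similarity(sender, internal_domains):
    """Mode 7: sender mailbox name vs. intranet domain names (best match)."""
    name, _ = split_mailbox(sender)
    return max((string_similarity(name, d) for d in internal_domains),
               default=0.0)
```

A look-alike domain such as "mail1.com" scores close to 1 against "mail.com", and a mailbox name such as "mail" at an unrelated domain still scores well above 0, which is the signal both features are meant to capture.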
Mode 8: the email features include the similarity between the sender nickname and intranet mailbox nicknames.
Extract from the historical log the sender nicknames of historical emails whose sender mailbox is an enterprise internal mailbox; segment each nickname into words to obtain word sets; compute the frequency of each word across the word sets to obtain a word frequency set; segment the sender nickname of the email to be detected to obtain a text word group; match the text word group against the word frequency set to obtain the frequency of each word in the group; compute the average of these frequencies; and normalize the average to obtain the similarity between the sender nickname and intranet mailbox nicknames.
Mode 9: the email features include the similarity between the sender nickname and the enterprise's internal organizations.
Segment each internal organization name in the set of enterprise internal organization information into words to obtain word sets; compute the frequency of each word across the word sets to obtain a word frequency set; segment the sender nickname of the email to be detected to obtain a text word group; match the text word group against the word frequency set to obtain the frequency of each word in the group; compute the average of these frequencies; and normalize the average to obtain the similarity between the sender nickname and the enterprise's internal organizations.
Mode 10: the email features include the number of emails historically sent by the external mailbox.
Count in the historical log the historical emails whose sender mailbox is not an enterprise internal mailbox, to obtain the number of emails historically sent by the external mailbox.
For example, a sender mailbox that does not belong to the set of enterprise internal mailboxes is treated as an external mailbox. Using the email gateway's historical log for a given time range (for example, the past 6 months), the number of emails historically sent by each external mailbox is counted per mailbox name.
Mode 11: the email features include the number of recipients of the email.
The number of recipients is determined from the number of recipient mailboxes of the email to be detected.
In practice, an email may be addressed to one or more recipients when it is sent, and the number of recipient mailboxes of the email to be detected is used as the number of recipients.
Mode 12: the email features include the number of departments to which the recipients belong.
Extract the recipient mailbox names from the recipient mailboxes of the email to be detected; determine the department of each recipient mailbox from its mailbox name and the set of employee-to-department mapping information; then deduplicate the departments and count them to obtain the number of departments to which the recipients belong.
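Modes 11 and 12 can be sketched as follows. The shape of the employee-to-department mapping is an assumption: here it is a hypothetical dictionary keyed by mailbox name (the part before "@"), and recipients missing from the mapping are simply ignored.

```python
def recipient_count(recipients):
    """Mode 11: number of recipient mailboxes on the email."""
    return len(recipients)


def recipient_department_count(recipients, employee_to_department):
    """Mode 12: number of distinct departments among the recipients.

    employee_to_department is a hypothetical dict from mailbox name to
    department; unknown recipients are skipped (an assumption).
    """
    departments = set()
    for address in recipients:
        mailbox_name = address.split("@", 1)[0]
        department = employee_to_department.get(mailbox_name)
        if department is not None:
            departments.add(department)
    return len(departments)
```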
Mode 13: the email features include the number of emails the recipient's department has historically received from this sender.
Extract the recipient mailbox name from the recipient mailbox of the email to be detected; determine the department of the recipient mailbox from its mailbox name and the set of employee-to-department mapping information; then count in the historical log the emails that this department received, within the target historical period, from the sender mailbox of the email to be detected.
Mode 14: the email features include the number of emails the recipient has historically received from this sender.
Count in the historical log the emails that the recipient mailbox of the email to be detected received from the sender mailbox of the email to be detected.
Mode 15: the email features include the number of emails the recipient has historically received from external mailboxes.
Count in the historical log the emails received by the recipient mailbox of the email to be detected whose sender mailbox is an external mailbox.
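The historical counts in Modes 10, 13, 14, and 15 can be sketched as simple scans over the historical log. The record layout is hypothetical: each record is assumed to be a dictionary with a `sender` string and a `recipients` list, and the department mapping is again assumed to be keyed by mailbox name.

```python
from collections import Counter


def external_send_counts(history, internal_mailboxes):
    """Mode 10: emails sent per external mailbox (sender not internal)."""
    return Counter(r["sender"] for r in history
                   if r["sender"] not in internal_mailboxes)


def department_received_from_sender(history, department, sender,
                                    employee_to_department):
    """Mode 13: emails from `sender` that reached anyone in `department`."""
    count = 0
    for r in history:
        if r["sender"] != sender:
            continue
        departments = {employee_to_department.get(a.split("@", 1)[0])
                       for a in r["recipients"]}
        if department in departments:
            count += 1
    return count


def received_from_sender(history, recipient, sender):
    """Mode 14: emails `recipient` received from `sender`."""
    return sum(1 for r in history
               if r["sender"] == sender and recipient in r["recipients"])


def received_from_external(history, recipient, internal_mailboxes):
    """Mode 15: emails `recipient` received from any external mailbox."""
    return sum(1 for r in history
               if recipient in r["recipients"]
               and r["sender"] not in internal_mailboxes)
```

Low counts for a given sender and recipient pair suggest that the two have little prior correspondence, which is the kind of abnormal sender-recipient relationship these features are intended to expose.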
Optionally, determining the email features of the email to be detected based on the enterprise internal information and the email gateway log may include:
Step 1: When the email attribute information of the email to be detected includes the sender mailbox name, determine the mailbox domain name information of the email to be detected from the sender mailbox name.
Step 2: When the enterprise internal information includes the set of enterprise internal mailbox domain names and the mailbox domain name information of the email to be detected does not match that set, determine the email features of the email to be detected based on the enterprise internal information, the email gateway log, and the email attribute information of the email to be detected.
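Steps 1 and 2 amount to a gate in front of feature extraction, sketched below. The function name is hypothetical, and what happens to emails whose domain does match the internal set is not stated in this passage, so the sketch only reports whether extraction should proceed.

```python
def needs_feature_extraction(sender_mailbox, internal_domains):
    """Return True when the sender's domain is not an internal domain.

    Step 1: derive the domain from the sender mailbox.
    Step 2: proceed to feature extraction only if it does not match
    the set of enterprise internal mailbox domain names.
    """
    domain = sender_mailbox.rpartition("@")[2].lower()
    return domain not in internal_domains
```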
Optionally, this embodiment of the present invention provides a training method for the phishing email detection model. Offline training runs at a user-specified interval; for example, if the user schedules training for the early hours of every Monday, the offline training service trains the model at that time. Fig. 2 is a schematic diagram of the training method of the phishing email detection model provided by this embodiment. As shown in Fig. 2, the method includes the following steps:
Step 201: Obtain the model training dependencies.
The model training dependency data includes labeled historical emails and the enterprise internal information.
1) Labeled email log information. The fields used from the email log information are the sender nickname, sender mailbox name, recipient, email subject, attachment name, and attachment type.
2) Enterprise internal information, which includes: the set of enterprise internal organization information, the set of employee-to-department mapping information, the set of enterprise internal mailboxes, and the set of enterprise internal mailbox domain names.
When the offline training service starts, it extracts the labeled historical email information and the enterprise internal information from the database according to the given time range. For example, if the time range is the past 6 months, the email log information of the most recent 6 months is extracted.
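The time-range extraction in step 201 can be sketched as a filter over log records. The record layout (a `timestamp` field holding a `datetime`) is hypothetical, and approximating a month as 30 days is an assumption made to keep the example within the standard library; a real deployment would more likely express the range in the database query.

```python
from datetime import datetime, timedelta


def select_recent(records, now, months=6):
    """Keep records whose timestamp falls within the last `months` months.

    Assumption: one month is approximated as 30 days.
    """
    cutoff = now - timedelta(days=30 * months)
    return [r for r in records if cutoff <= r["timestamp"] <= now]
```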
Step 202: Determine the email features of the labeled historical emails based on the labeled historical emails and the enterprise internal information.
Optionally, determining the email features of the labeled historical emails based on the labeled historical emails and the enterprise internal information may include at least one of the following modes:
Mode a: the email features include whether the email contains an attachment.
If the attachment field of the labeled historical email is non-empty, it is determined that the email contains an attachment; if the attachment field is empty, it is determined that the email does not contain an attachment.
Mode b: the email features include the anomaly level corresponding to the attachment type.
The anomaly level corresponding to the attachment type of the labeled historical email is determined from the file suffix of its attachment and the preset correspondence between file suffixes and anomaly levels.
Mode c: the email features include whether the attachment name contains Chinese characters.
If the attachment name of the labeled historical email matches the preset regular expression, it is determined that the attachment name contains Chinese characters; if it does not match, it is determined that the attachment name does not contain Chinese characters.
For example, when the labeled historical email has an attachment, a regular expression is used to test whether the attachment name contains Chinese characters. If it does, the match result is yes; otherwise, the match result is no.
Mode d: the email features include the similarity between the attachment name and intranet email attachment names. Determining this feature for a labeled historical email includes: extracting from the historical log the attachment name of at least one historical email whose sender mailbox is an enterprise internal mailbox; segmenting each attachment name into words to obtain word sets; computing the frequency of each word across the word sets to obtain a word frequency set; segmenting the attachment name of the labeled historical email to obtain a text word group; matching the text word group against the word frequency set to obtain the frequency of each word in the group; computing the average of these frequencies; and normalizing the average to obtain the similarity between the attachment name and intranet email attachment names.
Mode e: the email features include the similarity between the email subject and intranet email subjects.
Extract from the historical log the subjects of historical emails whose sender mailbox is an enterprise internal mailbox; segment each subject into words to obtain word sets; compute the frequency of each word across the word sets to obtain a word frequency set; segment the subject of the labeled historical email to obtain a text word group; match the text word group against the word frequency set to obtain the frequency of each word in the group; compute the average of these frequencies; and normalize the average to obtain the similarity between the email subject and intranet email subjects.
Mode f: the email features include the similarity between the sender mailbox domain name and the intranet mailbox domain name.
Extract the sender mailbox domain name from the sender mailbox of the labeled historical email, and determine its similarity to the intranet mailbox domain name.
For example, if the sender mailbox is zhangsan@mail1.com, the extracted sender mailbox domain name is "mail1.com"; the intranet mailbox domain name is assumed to be "mail.com".
Mode g: the email features include the similarity between the sender mailbox name and the intranet mailbox domain name.
Extract the sender mailbox name from the sender mailbox of the labeled historical email, and determine its similarity to the intranet mailbox domain name.
For example, if the sender mailbox is mail@qq.com, the extracted sender mailbox name is "mail"; the intranet mailbox domain name is assumed to be "mail.com".
Method h: the email features include the similarity between the sender's nickname and intranet mailbox nicknames:
Extract, from the historical log, the sender nicknames of historical emails whose sender mailbox is an internal enterprise mailbox; segment the sender nickname of each such historical email into words to obtain phrase sets; compute the word frequency of each word in each phrase set to obtain a word frequency set; segment the sender nickname of the marked historical email into words to obtain a text phrase; match the text phrase against the word frequency set to obtain the word frequency of each word in the text phrase; compute the average of those word frequencies; normalize the average to obtain the similarity between the sender's nickname and intranet mailbox nicknames.
Method i: the email features include the similarity between the sender's nickname and the enterprise's internal organizations:
Segment each internal organization in the enterprise internal organization information set into words to obtain phrase sets; compute the word frequency of each word in each phrase set to obtain a word frequency set; segment the sender nickname of the marked historical email into words to obtain a text phrase; match the text phrase against the word frequency set to obtain the word frequency of each word in the text phrase; compute the average of those word frequencies; normalize the average to obtain the similarity between the sender's nickname and the enterprise's internal organizations.
Method j: the email features include the number of emails historically sent by the external mailbox:
Count, from the historical log, the historical emails whose sender mailbox is not an internal enterprise mailbox, to obtain the number of emails historically sent by the external mailbox.
For example, a sender mailbox that does not belong to the set of internal enterprise mailboxes is treated as an external mailbox; based on the mail gateway's historical logs within a given time range (for example, the last 6 months), the number of emails historically sent by each external mailbox is counted per mailbox name.
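The counting in method j can be sketched as below. The log record layout (a dict with `time` and `sender` keys) and the 180-day default window are assumptions made for this illustration; the text only says "a given time range (for example, the last 6 months)".

```python
from collections import Counter
from datetime import datetime, timedelta


def count_external_senders(log_records, internal_mailboxes, now, window_days=180):
    """Count emails per external sender mailbox within the look-back window."""
    cutoff = now - timedelta(days=window_days)
    counts = Counter()
    for record in log_records:
        sender = record["sender"].lower()
        # A sender outside the internal mailbox set is treated as external.
        if record["time"] >= cutoff and sender not in internal_mailboxes:
            counts[sender] += 1
    return counts
```

For a given email, the feature value would then be `counts[sender]`, which is 0 for an external mailbox never seen in the window.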
Method k: the email features include the number of email recipients:
Determine the number of email recipients based on the number of recipient mailboxes of the marked historical email.
In practice, a marked historical email may have been sent to one or more recipients, and the number of its recipient mailboxes can be used as the number of email recipients.
Method L: the email features include the number of departments corresponding to the email recipients:
Extract the recipient mailbox names from the recipient mailboxes of the marked historical email; based on the recipient mailbox names and the enterprise employee-to-department mapping information set, determine the department corresponding to each recipient mailbox; deduplicate and count the departments corresponding to the recipient mailboxes to obtain the number of departments corresponding to the email recipients.
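Methods k and L reduce to a length and a distinct count. The following minimal sketch assumes that the employee-to-department mapping is keyed by mailbox name (the part before "@"); that key format is an assumption, as the text does not describe the mapping's layout.

```python
def recipient_count(recipients):
    """Method k: the number of recipient mailboxes."""
    return len(recipients)


def recipient_department_count(recipients, employee_to_department):
    """Method L: the number of distinct departments among the recipients."""
    departments = set()
    for address in recipients:
        mailbox_name = address.split("@", 1)[0].lower()
        department = employee_to_department.get(mailbox_name)
        if department is not None:  # recipients missing from the mapping are skipped
            departments.add(department)
    return len(departments)


employee_to_department = {"li": "finance", "wang": "finance", "zhao": "hr"}
recipients = ["li@corp.com", "wang@corp.com", "zhao@corp.com", "guest@corp.com"]
n_recipients = recipient_count(recipients)
n_departments = recipient_department_count(recipients, employee_to_department)
```

An email addressed to many recipients spread over many departments gives a high value for both features, which is the pattern these features are intended to expose.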
Method m: the email features include the number of emails that the recipient's department has historically received from this sender:
Extract the recipient mailbox name from the recipient mailbox of the marked historical email; based on the recipient mailbox name and the enterprise employee-to-department mapping information set, determine the department corresponding to the recipient mailbox; count, from the historical log, the number of emails that this department received from the sender mailbox of the marked historical email within the target historical period.
Method n: the email features include the number of emails that the recipient has historically received from this sender:
Count, from the historical log, the number of emails that the recipient mailbox of the marked historical email received from the sender mailbox of the marked historical email.
Method p: the email features include the number of emails that the recipient has historically received from external mailboxes:
Count, from the historical log, the number of emails received by the recipient mailbox of the marked historical email whose sender mailbox is an external mailbox.
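The three history-based counts in methods m, n and p can be computed in one pass over the log, as in the sketch below. The record layout (`sender` string, `recipients` list) and the mapping keyed by mailbox name are assumptions for illustration, and the "target historical period" filter of method m is left out for brevity.

```python
def relationship_counts(history, sender, recipient, employee_to_department, internal_mailboxes):
    """Return the counts of methods m, n and p for one (sender, recipient) pair."""

    def department_of(address):
        return employee_to_department.get(address.split("@", 1)[0].lower())

    target_department = department_of(recipient)
    department_from_sender = 0   # method m
    recipient_from_sender = 0    # method n
    recipient_from_external = 0  # method p
    for record in history:
        from_sender = record["sender"] == sender
        to_recipient = recipient in record["recipients"]
        if from_sender and to_recipient:
            recipient_from_sender += 1
        if from_sender and target_department is not None and any(
            department_of(r) == target_department for r in record["recipients"]
        ):
            department_from_sender += 1
        if to_recipient and record["sender"] not in internal_mailboxes:
            recipient_from_external += 1
    return {
        "department_from_sender": department_from_sender,
        "recipient_from_sender": recipient_from_sender,
        "recipient_from_external": recipient_from_external,
    }
```

Low values for all three counts indicate a sender and recipient with no established relationship, which is what these features are meant to capture.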
Step 203: based on the email features corresponding to the marked historical emails and the label values of the marked historical emails, train a binary classification model to obtain the phishing email detection model; the label value of a marked historical email indicates whether that email is a phishing email.
During model training, at least one of the email features shown in Table 1, together with the label values of the marked historical emails, can be used to train the binary classification model with the eXtreme Gradient Boosting (XGBoost) algorithm. After the model offline training service completes training, the trained phishing email detection model file and the model dependency files are sent to the model real-time detection service for real-time detection of mail gateway logs.
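Before a binary classifier such as XGBoost can be trained, the per-email features and label values have to be arranged as a feature matrix and a label vector. The sketch below shows only that preparation step and does not call any training library. The feature names in `FEATURE_ORDER` and the record layout are hypothetical, since Table 1 is not reproduced in this passage.

```python
# Hypothetical subset of features, in a fixed column order.
FEATURE_ORDER = [
    "has_attachment",
    "subject_similarity",
    "sender_domain_similarity",
    "recipient_count",
]


def to_training_set(labeled_emails, feature_order=FEATURE_ORDER):
    """Build (X, y): one row of floats per email, label 1 for phishing and 0 otherwise."""
    X, y = [], []
    for email in labeled_emails:
        # Missing features default to 0.0 (an assumption of this sketch).
        X.append([float(email["features"].get(name, 0.0)) for name in feature_order])
        y.append(1 if email["is_phishing"] else 0)
    return X, y
```

Keeping the column order fixed matters because the real-time detection service must build its feature vectors in exactly the same order as the offline training service.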
FIG. 3 is the second schematic flowchart of the phishing email detection method provided by an embodiment of the present invention. As shown in FIG. 3, the phishing email detection method includes the following steps:
Step 301: load the pre-trained phishing email detection model, the enterprise internal information related to the enterprise mailbox service, and the historical logs of the mail gateway;
When the detection service for the phishing email detection model starts, it first loads into memory the phishing email detection model file and the data that detection depends on, such as the enterprise internal information.
Step 302: obtain the target log of the mail gateway in real time; the target log includes the email attribute information of N emails, and the N emails include the email to be detected;
After the model real-time detection service has started, it obtains the target log of the mail gateway in real time for phishing email detection.
Step 303: determine whether the mailbox domain name information of the email to be detected matches the set of internal enterprise mailbox domain names. If it does not match, go to step 304; if it matches, exit the phishing email detection process;
In practice, after the target log information of the mail gateway is received, and where the email attribute information of the email to be detected includes the sender's mailbox name, the mailbox domain name information of the email to be detected is determined from that sender mailbox name. The mailbox domain name information is compared with the set of internal enterprise mailbox domain names to determine whether the email to be detected is an internal email. If it is an internal email, the phishing email detection process exits directly; if it is not, the process goes to step 304 to further determine whether the email to be detected is a phishing email.
Step 304: after the email is determined to be an external email, determine the email features of the email to be detected based on the enterprise internal information, the historical logs, and the email attribute information of the email to be detected.
For example, the email features of the email to be detected are computed from the loaded enterprise internal information and from the sender nickname, sender mailbox name, email subject, recipient mailbox names, and email attachment information of the email to be detected.
Step 305: input the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the model; the email type is either phishing email or non-phishing email. If the phishing email detection model determines that the email to be detected is a phishing email, go to step 306; if it determines that the email is a normal email, exit the phishing email detection process.
Step 306: generate a corresponding phishing email alarm event and send it to the alarm module.
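The control flow of steps 303 to 306 can be sketched as a single function. The feature extractor, model, and alarm sink are passed in as callables so that the sketch stays independent of any particular library; all names and the email dict layout are hypothetical.

```python
def detect(email, internal_domains, extract_features, model, alert):
    """Run steps 303-306 for one email and return 'internal', 'phishing' or 'normal'."""
    # Step 303: emails from an internal domain leave the detection process.
    sender_domain = email["sender"].rpartition("@")[2].lower()
    if sender_domain in internal_domains:
        return "internal"
    # Step 304: compute the email features for the external email.
    features = extract_features(email)
    # Step 305: the model decides between phishing and non-phishing.
    if model(features):
        # Step 306: generate an alarm event and hand it to the alarm module.
        alert({
            "type": "phishing_email",
            "sender": email["sender"],
            "subject": email.get("subject", ""),
        })
        return "phishing"
    return "normal"
```

In a deployment, `model` would wrap the trained classifier loaded in step 301, and `alert` would forward the event to the alarm module.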
The phishing email detection method provided by the embodiment of the present invention uses low-sensitivity attribute information in the email information to extract features that can effectively distinguish whether an email is disguised as an internal email, and uses low-sensitivity sender and recipient information together with historical sending and receiving behavior to extract features that can distinguish whether an email reflects an abnormal sender-recipient relationship. Because feature analysis uses only low-sensitivity attributes of the email information, email privacy is effectively protected. Because the method is based on a phishing email detection model, it generalizes better and can be applied to different phishing email variants; it can address both the incomplete detection of encrypted compressed attachments by email sandboxes and the incomplete coverage of keyword matching, reducing the probability of false negatives and false positives and improving the reliability of phishing email detection.
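FIG. 4 is a schematic structural diagram of the phishing email detection system provided by an embodiment of the present invention. As shown in FIG. 4, the phishing email detection system includes a mail gateway log module 401, a model offline training service module 402, a model real-time detection service module 403, and a phishing email alarm service module 404, wherein:
Mail gateway log module 401: the mail gateway log is the data source for analysis and detection throughout the system. Each mail log entry contains the sender nickname, sender mailbox name, email subject, recipient mailbox name, email content, attachment information, and so on.
Model offline training service module 402: trains the model used for phishing email detection with a machine learning algorithm, based on the marked historical mail log information and the enterprise internal information.
Model real-time detection service module 403: performs real-time anomaly detection on the mail gateway logs using the offline-trained phishing email detection model, generates a corresponding alarm event for each mail log entry determined to be a phishing email, and sends the event to the phishing email alarm service module 404.
Phishing email alarm service module 404: receives the alarm events sent by the model real-time detection service module 403 and issues the corresponding alarm notifications.
The embodiment of the present invention trains a machine learning model on the mail log information of the enterprise mail gateway and on the sending and receiving behavior data between mailboxes, and uses the trained model to detect phishing emails in real time.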
The phishing email detection device provided by the present invention is described below; the device described below and the phishing email detection method described above may be read with reference to each other.
FIG. 5 is a schematic structural diagram of the phishing email detection device provided by an embodiment of the present invention. As shown in FIG. 5, the phishing email detection device 500 includes an acquisition module 501, a first determination module 502, and a detection module 503, wherein:
the acquisition module 501 is configured to acquire the pre-trained phishing email detection model, the enterprise internal information related to the enterprise mailbox service, and the mail gateway log;
the first determination module 502 is configured to determine the email features of the email to be detected based on the enterprise internal information and the mail gateway log;
the detection module 503 is configured to input the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the model; the email type is either phishing email or non-phishing email; the phishing email detection model is obtained by training a binary classification model on the email features corresponding to marked historical emails and the label values of those marked historical emails.
The phishing email detection device provided by the embodiment of the present invention determines the email features of the email to be detected based on the enterprise internal information and the mail gateway log, and uses the pre-trained phishing email detection model to detect phishing emails in real time and determine whether the email to be detected is a phishing email. Because the phishing email detection model is obtained by training a binary classification model on the email features corresponding to marked historical emails and the label values of those emails, the approach generalizes better, can be applied to different phishing email variants, can reduce the probability of false negatives and false positives, and improves the reliability of phishing email detection.
Optionally, the email features include at least one of the following:
features for distinguishing whether an email is disguised as an internal email, including at least one of the following: whether the email contains an attachment, the anomaly level corresponding to the email attachment type, whether the email attachment name contains Chinese, the similarity between the email attachment name and intranet email attachment names, the similarity between the email subject and intranet email subjects, the similarity between the sender's mailbox domain name and the intranet mailbox domain name, the similarity between the sender's mailbox name and the intranet mailbox domain name, the similarity between the sender's nickname and intranet mailbox nicknames, and the similarity between the sender's nickname and the enterprise's internal organizations;
features for distinguishing whether the sender-recipient relationship in an email is a normal one, including at least one of the following: the number of emails historically sent by the external mailbox, the number of email recipients, the number of departments corresponding to the email recipients, the number of emails that the recipient's department has historically received from this sender, the number of emails that the recipient has historically received from this sender, and the number of emails that the recipient has historically received from external mailboxes.
Optionally, the first determination module 502 is specifically configured to perform at least one of the following:
when the value of the attachment field corresponding to the email to be detected in the mail gateway log is non-empty, determine that the email to be detected contains an attachment; when the value of that attachment field is empty, determine that the email to be detected does not contain an attachment;
determine the anomaly level corresponding to the email attachment type of the email to be detected, based on the file suffix of its email attachment and a preset correspondence between file suffixes and anomaly levels;
when the email attachment name of the email to be detected matches a preset regular expression, determine that the email attachment name contains Chinese, the regular expression being used to test whether an email attachment name contains Chinese characters; when the email attachment name does not match the preset regular expression, determine that the email attachment name does not contain Chinese;
extract, from the historical log, the email attachment name of at least one historical email whose sender mailbox is an internal enterprise mailbox; segment the attachment name of each such historical email into words to obtain phrase sets; compute the word frequency of each word in each phrase set to obtain a word frequency set; segment the email attachment name of the email to be detected into words to obtain a text phrase; match the text phrase against the word frequency set to obtain the word frequency of each word in the text phrase; compute the average of those word frequencies; normalize the average to obtain the similarity between the email attachment name and intranet email attachment names;
extract, from the historical log, the email subjects of historical emails whose sender mailbox is an internal enterprise mailbox; segment the subject of each such historical email into words to obtain phrase sets; compute the word frequency of each word in each phrase set to obtain a word frequency set; segment the subject of the email to be detected into words to obtain a text phrase; match the text phrase against the word frequency set to obtain the word frequency of each word in the text phrase; compute the average of those word frequencies; normalize the average to obtain the similarity between the email subject and intranet email subjects;
extract the sender's mailbox domain name from the sender's mailbox of the email to be detected; determine the similarity between the sender's mailbox domain name and the intranet mailbox domain name;
extract the sender's mailbox name from the sender's mailbox of the email to be detected; determine the similarity between the sender's mailbox name and the intranet mailbox domain name;
extract, from the historical log, the sender nicknames of historical emails whose sender mailbox is an internal enterprise mailbox; segment the sender nickname of each such historical email into words to obtain phrase sets; compute the word frequency of each word in each phrase set to obtain a word frequency set; segment the sender nickname of the email to be detected into words to obtain a text phrase; match the text phrase against the word frequency set to obtain the word frequency of each word in the text phrase; compute the average of those word frequencies; normalize the average to obtain the similarity between the sender's nickname and intranet mailbox nicknames;
segment each internal organization in the enterprise internal organization information set into words to obtain phrase sets; compute the word frequency of each word in each phrase set to obtain a word frequency set; segment the sender nickname of the email to be detected into words to obtain a text phrase; match the text phrase against the word frequency set to obtain the word frequency of each word in the text phrase; compute the average of those word frequencies; normalize the average to obtain the similarity between the sender's nickname and the enterprise's internal organizations;
count, from the historical log, the historical emails whose sender mailbox is not an internal enterprise mailbox, to obtain the number of emails historically sent by the external mailbox;
determine the number of email recipients based on the number of recipient mailboxes of the email to be detected;
extract the recipient mailbox names from the recipient mailboxes of the email to be detected; based on the recipient mailbox names and the enterprise employee-to-department mapping information set, determine the department corresponding to each recipient mailbox; deduplicate and count the departments corresponding to the recipient mailboxes to obtain the number of departments corresponding to the email recipients;
extract the recipient mailbox name from the recipient mailbox of the email to be detected; based on the recipient mailbox name and the enterprise employee-to-department mapping information set, determine the department corresponding to the recipient mailbox; count, from the historical log, the number of emails that this department received from the sender mailbox of the email to be detected within the target historical period;
count, from the historical log, the number of emails that the recipient mailbox of the email to be detected received from the sender mailbox of the email to be detected;
count, from the historical log, the number of emails received by the recipient mailbox of the email to be detected whose sender mailbox is an external mailbox.
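The three attachment-related determinations listed above (attachment present, anomaly level by file suffix, Chinese characters in the name) can be illustrated as follows. The suffix-to-level table and the specific regular expression are assumptions: the text only states that a preset correspondence and a preset regular expression exist, without giving their contents.

```python
import os
import re

# Assumed pattern: any character in the CJK Unified Ideographs block.
CHINESE_PATTERN = re.compile(r"[\u4e00-\u9fff]")

# Hypothetical preset correspondence between file suffix and anomaly level.
SUFFIX_ANOMALY_LEVEL = {".exe": 3, ".js": 3, ".zip": 2, ".docm": 2, ".pdf": 1}


def attachment_features(attachment_name):
    """Derive the three attachment features from the attachment field of a log entry."""
    if not attachment_name:  # empty attachment field: no attachment
        return {"has_attachment": 0, "anomaly_level": 0, "name_has_chinese": 0}
    suffix = os.path.splitext(attachment_name)[1].lower()
    return {
        "has_attachment": 1,
        # Suffixes missing from the table get level 0 (an assumption of this sketch).
        "anomaly_level": SUFFIX_ANOMALY_LEVEL.get(suffix, 0),
        "name_has_chinese": 1 if CHINESE_PATTERN.search(attachment_name) else 0,
    }
```

For instance, an attachment named "发票.exe" yields a present attachment with the highest assumed anomaly level and a name containing Chinese.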
Optionally, the first determination module 502 is specifically configured to:
where the email attribute information of the email to be detected includes the sender's mailbox name, determine the mailbox domain name information of the email to be detected based on that sender mailbox name;
where the enterprise internal information includes the set of internal enterprise mailbox domain names, and the mailbox domain name information of the email to be detected does not match that set, determine the email features of the email to be detected based on the enterprise internal information, the mail gateway log, and the email attribute information of the email to be detected.
Optionally, the device further includes:
an acquisition module, configured to acquire the marked historical emails and the enterprise internal information;
a second determination module, configured to determine the email features corresponding to the marked historical emails based on the marked historical emails and the enterprise internal information;
a training module, configured to train a binary classification model based on the email features corresponding to the marked historical emails and the label values of the marked historical emails, to obtain the phishing email detection model;
wherein the label value of a marked historical email indicates whether that email is a phishing email.
Optionally, the enterprise internal information includes at least one of the following:
the enterprise internal organization information set;
the enterprise employee-to-department mapping information set;
the set of internal enterprise mailboxes;
the set of internal enterprise mailbox domain names.
Optionally, the mail gateway log includes the email attribute information of N emails, and the email attribute information includes at least one of the following:
sender nickname;
sender mailbox;
recipient mailbox;
email subject;
email attachment name;
email attachment type.
FIG. 6 is a schematic diagram of the physical structure of the electronic device provided by an embodiment of the present invention. As shown in FIG. 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640, wherein the processor 610, the communications interface 620, and the memory 630 communicate with one another through the communication bus 640. The processor 610 may invoke logic instructions in the memory 630 to perform the following method:
acquire the pre-trained phishing email detection model, the enterprise internal information related to the enterprise mailbox service, and the mail gateway log;
determine the email features of the email to be detected based on the enterprise internal information and the mail gateway log;
input the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the model; the email type is either phishing email or non-phishing email; the phishing email detection model is obtained by training a binary classification model on the email features corresponding to marked historical emails and the label values of those marked historical emails.
In addition, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, implements the following method:
acquire the pre-trained phishing email detection model, the enterprise internal information related to the enterprise mailbox service, and the mail gateway log;
determine the email features of the email to be detected based on the enterprise internal information and the mail gateway log;
input the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the model; the email type is either phishing email or non-phishing email; the phishing email detection model is obtained by training a binary classification model on the email features corresponding to marked historical emails and the label values of those marked historical emails.
In yet another aspect, an embodiment of the present invention further provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions which, when executed by a computer, implement the following method:
acquire the pre-trained phishing email detection model, the enterprise internal information related to the enterprise mailbox service, and the mail gateway log;
determine the email features of the email to be detected based on the enterprise internal information and the mail gateway log;
input the email features of the email to be detected into the phishing email detection model to obtain the email type of the email to be detected output by the model; the email type is either phishing email or non-phishing email; the phishing email detection model is obtained by training a binary classification model on the email features corresponding to marked historical emails and the label values of those marked historical emails.
以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性的劳动的情况下,即可以理解并实施。The device embodiments described above are only illustrative, and the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in One place, or it can be distributed to multiple network elements. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. It can be understood and implemented by those skilled in the art without any creative efforts.
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到各实施方式可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件。基于这样的理解,上述技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在计算机可读存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行各个实施例或者实施例的某些部分所述的方法。Through the above description of the implementations, those skilled in the art can clearly understand that each implementation can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware. Based on this understanding, the essence of the above technical solution or the part that contributes to the prior art can be embodied in the form of software products, and the computer software products can be stored in computer-readable storage media, such as ROM/RAM, magnetic discs, optical discs, etc., including several instructions to make a computer device (which may be a personal computer, server, or network device, etc.) execute the methods described in various embodiments or some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210946321.5A CN115473676B (en) | 2022-08-08 | 2022-08-08 | Phishing email detection method, device, electronic device and storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210946321.5A CN115473676B (en) | 2022-08-08 | 2022-08-08 | Phishing email detection method, device, electronic device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115473676A (en) | 2022-12-13 |
| CN115473676B (en) | 2025-03-11 |
Family
ID=84366560
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210946321.5A Active CN115473676B (en) | 2022-08-08 | 2022-08-08 | Phishing email detection method, device, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115473676B (en) |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119766494A (en) * | 2024-12-03 | 2025-04-04 | 天翼云科技有限公司 | Target mail detection method, device, computer equipment and storage medium |
| CN120017305A (en) * | 2024-12-30 | 2025-05-16 | 中国工商银行股份有限公司 | A method, device, equipment and storage medium for detecting phishing emails |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110868378A (en) * | 2018-12-17 | 2020-03-06 | 北京安天网络安全技术有限公司 | Phishing mail detection method and device, electronic equipment and storage medium |
| US20200089204A1 (en) * | 2017-05-31 | 2020-03-19 | Siemens Aktiengesellschaft | Industrial control system and network security monitoring method therefor |
| CN111404806A (en) * | 2020-03-16 | 2020-07-10 | 深信服科技股份有限公司 | Method, device and equipment for detecting harpoon mails and computer readable storage medium |
| CN113408281A (en) * | 2021-07-14 | 2021-09-17 | 北京天融信网络安全技术有限公司 | Mailbox account abnormity detection method and device, electronic equipment and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115473676B (en) | 2025-03-11 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10187407B1 (en) | | Collaborative phishing attack detection |
| US11063897B2 (en) | | Method and system for analyzing electronic communications and customer information to recognize and mitigate message-based attacks |
| US7899866B1 (en) | | Using message features and sender identity for email spam filtering |
| US10243989B1 (en) | | Systems and methods for inspecting emails for malicious content |
| US8468597B1 (en) | | System and method for identifying a phishing website |
| US20140230050A1 (en) | | Collaborative phishing attack detection |
| US8370930B2 (en) | | Detecting spam from metafeatures of an email message |
| US8495735B1 (en) | | System and method for conducting a non-exact matching analysis on a phishing website |
| CN110213152B (en) | | Method, device, server and storage medium for identifying junk mails |
| CN110519150B (en) | | Mail detection method, device, equipment, system and computer readable storage medium |
| US8112484B1 (en) | | Apparatus and method for auxiliary classification for generating features for a spam filtering model |
| US20060149820A1 (en) | | Detecting spam e-mail using similarity calculations |
| CN108418777A (en) | | Method, device and system for detecting phishing emails |
| CN113408281B (en) | | Mailbox account anomaly detection method and device, electronic equipment and storage medium |
| CN115473676A (en) | | Phishing mail detection method and device, electronic equipment and storage medium |
| US20250077661A1 (en) | | Email Security Detection Apparatus, Method and Device, and Storage Medium |
| Taylor et al. | | A model to detect spam email using support vector classifier and random forest classifier |
| CN112039874A (en) | | Malicious mail identification method and device |
| RU2750643C2 (en) | | Method for recognizing a message as spam through anti-spam quarantine |
| US20090228565A1 (en) | | System for detecting information leakage in outbound e-mails without using the content of the mail |
| CN113240297A (en) | | Phishing mail detection method and system |
| CN103841006A (en) | | Method and device for intercepting junk mails in cloud computing system |
| Ishak et al. | | Distance-based hoax detection system |
| CN108876233A (en) | | Mail sensitive information detection method, system, equipment and storage medium in express delivery industry |
| CN110380952B (en) | | Mail receiving and sending method and device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |