CN112364173A - IP address mechanism tracing method based on knowledge graph - Google Patents

IP address mechanism tracing method based on knowledge graph

Info

Publication number
CN112364173A
Authority
CN
China
Prior art keywords
name
weight
organization
institution
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011130373.2A
Other languages
Chinese (zh)
Other versions
CN112364173B (en
Inventor
周玉金
孙治
张志勇
刘方
陈剑锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Electronic Technology Cyber Security Co Ltd
Original Assignee
China Electronic Technology Cyber Security Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Electronic Technology Cyber Security Co Ltd
Priority to CN202011130373.2A
Publication of CN112364173A
Application granted
Publication of CN112364173B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/45 Network directories; Name-to-address mapping
    • H04L61/4505 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols
    • H04L61/4511 Network directories; Name-to-address mapping using standardised directories; using standardised directory access protocols using domain name system [DNS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/12 Applying verification of the received information
    • H04L63/126 Applying verification of the received information the source of the received data

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of information technology, and in particular to a knowledge-graph-based method for tracing the organization behind an IP address. Addressing the isolation and dispersion of security information in cyberspace, the invention starts from practical application and solves the problem that IP and domain name mappings in cyberspace security carry no information about the owning organization. It closes the gap between IP, domain name and organization information and gathers more comprehensive security information for network defense and monitoring. Candidate organization names with higher probability are effectively screened from disordered search results, and the best inference of the owning organization is obtained efficiently from the IP address. The accuracy of inferring the owning organization from an IP address in the cyberspace security field is thereby ensured.

Description

An IP address organization tracing method based on a knowledge graph

Technical field

The invention relates to the field of network information technology, and in particular to a method for tracing the organization behind an IP address based on a knowledge graph.

Background

Information technology is developing ever faster, and the threat sources and attack methods in network security keep changing. Security personnel urgently need to improve their awareness of network security information, perceive it from every angle, and discover and master more of it in time. Only then can they seize the initiative in cyberspace attack and defense and "know themselves and the enemy".

At present, most information in the network security field is isolated and scattered, which makes it difficult for security personnel to use it effectively to solve practical problems.

Existing invention patents in the cyberspace security field that solve practical security problems with knowledge graphs include the following.

A malicious domain name detection method based on a knowledge graph (publication number: CN110290116A). This method detects malicious domain names using a knowledge graph and offers a new perspective on the task, but it considers only the domain name dimension of cyberspace. It does not jointly consider how domain name information relates to IP addresses, network assets and other dimensions of security information. The relations among IPs, domain names and other security information often hold rich hidden information and knowledge, and mining them more deeply can solve practical security problems better.

A malicious domain name matching method based on DNS-mapped IPs (publication number: CN108737385A). This scheme maps known malicious domain names to their IPs in real time through DNS and matches access to malicious domain names against full IP traffic. However, it only uses DNS to resolve domain names and matches access behavior on IP traffic. Solving other security problems from the IP-to-domain mapping would require mining the hidden information between IPs and domain names further.

A distributed security event correlation analysis method based on a knowledge graph (publication number: CN108270785A). This scheme builds a network security knowledge graph covering a basic dimension, a vulnerability dimension, a threat dimension, an alarm event dimension and an attack rule dimension, and uses it to correlate security events. It links security information across the dimensions of cyberspace security resources and analyzes security threat events from a macro perspective, but it does not address solutions to concrete security problems. In practice, different algorithmic processes still have to be designed on top of the network security knowledge graph to solve them.

It follows that the current cyberspace security field has the following shortcomings:

(1) IP address information in cyberspace cannot be mapped directly to the organization it belongs to;

(2) search results are cluttered and unordered, making it difficult to obtain the best inference;

(3) clustering results are imprecise, with very large clustering errors.

Consequently, the network information security field still lacks a practical and effective method for analyzing and obtaining useful security information from the network and then applying it to real problems. As a result, solutions to many real-world problems are less efficient, and their reliability cannot be effectively improved.

Summary of the invention

To overcome the defects of the prior art described above, the invention provides a knowledge-graph-based method for tracing the organization behind an IP address. Starting from practical needs, it takes IP addresses detected in cyberspace and combines search, clustering and a knowledge graph to infer the organization each address belongs to, mining the deeper value hidden in IP addresses. This is of great significance for attack, defense and monitoring in cyberspace security.

To achieve the above object, the invention adopts the following technical solution:

A knowledge-graph-based method for tracing the organization behind an IP address, comprising:

Obtaining domain name information: for a valid IP address whose organization is to be inferred, obtain the corresponding domain name information through reverse DNS resolution;

Obtaining key domain name information: truncate and filter the domain name information to obtain its key information;

Obtaining analysis samples: search on the key domain name information to obtain a number of corresponding web pages, then rank and filter the pages and keep several of them as analysis samples;

Sample processing: analyze the analysis samples and obtain the text content they contain;

Entity extraction: apply a named entity recognition model to the text content and identify all organization names in it;

Weight calculation: compute and assign a weight to each organization name according to the rank of the analysis sample and the tag element of the web page in which the text content appears;

Entity clustering: cluster the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering result, yielding several classes of organization name sets;

Selecting the organization name: from the weight of each organization name, compute the weight of each class of organization name set by weighted average, select the set with the largest weight, and take the organization name with the largest weight in that set as the tracing result.

In the tracing method disclosed above, the IP address is reverse-resolved and the resulting domain name information is processed further, so that useful information can be extracted from the retrieved web page tags, web page text and similar content for in-depth analysis. Possible target organizations are screened out in this way, and after the calculations above the most likely one is selected as the inferred result, completing the trace.
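The first step, reverse DNS resolution, can be sketched with the Python standard library. This is a minimal illustration rather than the patent's implementation: the function name `reverse_lookup` and the injectable `resolver` parameter are assumptions introduced here so the lookup can be exercised without network access.

```python
import socket
from typing import Callable, Optional


def reverse_lookup(
    ip: str,
    resolver: Callable[[str], tuple] = socket.gethostbyaddr,
) -> Optional[str]:
    """Return the primary host name registered for `ip`, or None.

    `resolver` defaults to socket.gethostbyaddr, which returns a
    (hostname, aliases, addresses) triple. It is a parameter only so
    that a stub can be passed in during testing (an assumption of
    this sketch, not something the source describes).
    """
    try:
        hostname, _aliases, _addresses = resolver(ip)
    except OSError:
        # socket.herror and socket.gaierror both derive from OSError;
        # they are raised when no PTR record can be resolved.
        return None
    return hostname
```

An address with no PTR record yields `None`, which a caller would treat as "no domain name information available" and skip.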

Further, the domain name information can be processed in several ways. One feasible and preferred way is as follows: in the step of obtaining key domain name information, the domain name information is processed with regular expressions that strip its prefix and suffix, leaving the key information in the domain name.
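A minimal sketch of this prefix and suffix stripping follows. The source does not list which prefixes and suffixes are removed, so the patterns below (`www`, `mail`, `ns1` and similar host labels, and a top-level domain optionally preceded by a generic second-level label such as `.edu` or `.co`) are illustrative assumptions, as is the function name `domain_key`.

```python
import re

# Hypothetical host-label prefixes to drop; the source does not enumerate them.
_PREFIX = re.compile(r"^(?:www\d*|mail|ftp|ns\d*|smtp|webmail)\.", re.IGNORECASE)
# Hypothetical suffix rule: a final label of two or more letters, optionally
# preceded by a generic second-level label (e.g. ".edu.cn", ".co.uk").
_SUFFIX = re.compile(r"(?:\.(?:com|net|org|edu|gov|ac|co))?\.[a-z]{2,}$", re.IGNORECASE)


def domain_key(domain: str) -> str:
    """Strip a leading service label and the trailing TLD from a host name."""
    name = domain.strip().rstrip(".")  # drop whitespace and a trailing root dot
    name = _PREFIX.sub("", name)
    name = _SUFFIX.sub("", name)
    return name
```

Under these assumed patterns, `domain_key("mail.tsinghua.edu.cn")` leaves `tsinghua` as the search term. A production version would more likely rely on a maintained public-suffix list than on a hand-written pattern.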

Further, the search usually returns a large number of candidate pages containing a great deal of noise, so only the more relevant pages are chosen as analysis samples, typically about 20. When the selected analysis samples are processed, regular expressions and web document processing rules are used to remove the web page tags and obtain the text content.
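The tag removal step can be sketched as below. The source describes regular expressions and document processing rules; this sketch swaps in the standard-library `html.parser` module instead, because it handles nested and malformed markup more robustly than regular expressions. The names `html_to_text` and `_TextExtractor` are introduced here for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if not self._skip_depth:
            text = data.strip()
            if text:
                self.parts.append(text)


def html_to_text(html: str) -> str:
    """Return the text content of an HTML document with all tags removed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)
```

The resulting plain text is what a named entity recognition model would receive in the entity extraction step.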

Further, every analysis sample influences the final inference, but not to the same degree, so during inference the influence of each sample is graded. One feasible way is as follows: to compute and assign organization-name weights according to the rank of the analysis samples, a weight is derived from the rank of the web page, and one of the weight factors is computed as follows:

[Formula image: Figure BDA0002734951210000041]

where ω_i is the weight of the analysis sample ranked i.

Further, the weight of an analysis sample comprises two weight factors. One is the weight of the web page itself among all analysis samples, which depends directly on its rank among them. The other is tied to the tag elements in each page: to compute and assign organization-name weights according to the tag element containing the text, each web page tag element is given a semantic weight ω_j ∈ [0, 10] according to its contribution to the topic of the page, and this semantic weight reflects how important the text content is to the search result.

Further still, from the web page weight and the semantic weight of the tag element, the weight of an organization name is computed as follows:

[Formula image: Figure BDA0002734951210000042]

where W_ij is the weight of organization name j in web page i; ω_i is the weight factor of web page i; ω_j is the semantic weight of organization name j; a tree is built from the element tags, and h_j is the level, in that tree, of the tag element containing organization name j, with h_j ∈ [0, 4]; tf_ij is the term frequency of organization name j in web page i; and idf_j is the inverse document frequency of name term j,

[Formula image: Figure BDA0002734951210000051]

N_i denotes all organization name entries found in web page i, and n_ij is the number of terms in web page i that contain organization name j.

Further, before entity clustering, at least the aliases of each candidate organization name and its superior/subordinate affiliations are looked up in the prior knowledge graph to form candidate organization entity pairs and, from them, triples. The head and tail entities of the triples form the entity set {(h_i, t_i)}.

Further, paths in the knowledge graph that connect the two entities (h_i, t_i) of a triple are selected. Each path is treated as one feature and its feature value is computed, and together these values form the feature vector of the triple.

Specifically, the feature value of a path can be computed as follows:

[Formula image: Figure BDA0002734951210000052]

In the formula, P = R_1…R_l denotes a path, s is the start node, e is the tail node, and e′ is an intermediate node. h_{s,P′}(e′) denotes the probability that, under relation type R_l, the entity pair (s, e′) can be connected through path P′. The term

[Formula image: Figure BDA0002734951210000053]

denotes the probability that node e′ reaches the tail node e by a random walk under relation type R_l; this probability indicates whether the relation R_l exists between the entity pair (e′, e).
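The formula images are not available in this text, but the definitions match the recursion used in the standard path-ranking approach: h_{s,P}(e) = Σ over e′ of h_{s,P′}(e′) · P(e | e′; R_l), where P′ = R_1…R_{l−1}. The sketch below assumes that recursion, and further assumes that a random walker picks uniformly among the R_l-neighbors of e′. The graph layout (`graph[relation][node]` as a list of neighbors) and the function name `path_features` are hypothetical.

```python
from collections import defaultdict


def path_features(graph, start, path):
    """Return {node: probability} of reaching each node from `start`
    by a random walk that follows the relation types in `path` in order.

    graph: dict mapping relation name -> dict mapping node -> list of
           neighbor nodes reachable through that relation.
    Assumption: at each step the walker chooses uniformly among the
    neighbors available under the current relation type.
    """
    dist = {start: 1.0}  # h_{s, empty path}: all mass on the start node
    for relation in path:
        edges = graph.get(relation, {})
        nxt = defaultdict(float)
        for node, prob in dist.items():
            targets = edges.get(node, [])
            for target in targets:
                # h_{s,P}(e) += h_{s,P'}(e') * P(e | e'; R_l)
                nxt[target] += prob / len(targets)
        dist = dict(nxt)
    return dist
```

For a pair (h_i, t_i), the feature value of a path is the entry for t_i in `path_features(graph, h_i, path)`, taken as zero when t_i is absent; collecting these values over several paths gives the feature vector of the triple.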

After the probabilities for each feature vector have been computed, the feature vectors are used to train a logistic regression classifier, which judges from the predicted path probability whether a relation of the specified type exists between two entities. A high predicted probability means the two entities are likely to be related as aliases, superior/subordinate organizations or another specified relation, so they should be placed as an entity pair in the same organization name set. A low predicted probability means such a relation is unlikely, so the two entities should be placed in two different entity pairs.
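The classifier step can be illustrated with a small logistic regression trained by batch gradient descent. This is a dependency-free sketch under stated assumptions: the learning rate, epoch count and function names are choices made here, and the source does not say which training procedure or library is used.

```python
import math


def predict_proba(weights, bias, x):
    """Probability that the specified relation holds for feature vector x."""
    z = bias + sum(w * xk for w, xk in zip(weights, x))
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    features: list of path feature vectors, one per entity pair.
    labels:   1 if the pair has the specified relation, else 0.
    """
    n = len(features)
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = predict_proba(weights, bias, x) - y
            for k, xk in enumerate(x):
                grad_w[k] += error * xk
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias
```

A pair whose predicted probability exceeds a chosen threshold (0.5 is a common default, assumed here) would be treated as linked and placed in the same organization name set.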

Compared with the prior art, the invention has the following beneficial effects:

(1) Addressing the isolation and dispersion of security information in cyberspace, and starting from practical application, it uses a valid IP address to reasonably infer the owning organization. This resolves the difficulty that IP and domain name mappings carry no organization information, closes the gap between IP, domain name and organization information, and gathers more comprehensive security information for attack, defense and monitoring in cyberspace security.

(2) By setting reasonable weight factors for web pages and their tag elements, it computes the weight of target terms in network information in a principled way, effectively screens the more probable candidate organization names from cluttered search results, and efficiently obtains the best inference of the organization an IP address belongs to.

(3) It exploits the relational connectivity of the network security knowledge graph to improve the precision of search result clustering, ensuring the accuracy of inferring the owning organization from an IP address in the cyberspace security field.

Description of drawings

To explain the technical solutions of the embodiments more clearly, the drawings used in the embodiments are briefly introduced below. The drawings show only some embodiments of the invention and should not be regarded as limiting its scope; a person of ordinary skill in the art can derive other related drawings from them without creative effort.

FIG. 1 is a flowchart of the invention as implemented in the embodiment.

FIG. 2 is a tree diagram of web page tag elements.

Detailed description

The invention is further explained below with reference to the drawings and a specific embodiment.

The description of the embodiment is intended to aid understanding of the invention and does not limit it. The specific structural and functional details disclosed here serve only to describe example embodiments. The invention may be embodied in many alternative forms and should not be construed as limited to the embodiment set forth here.

Example

This embodiment addresses the many information security problems of cyberspace in the prior art, in particular the difficulty of tracing an address back to its owning organization while security information remains isolated and scattered. It proposes a tracing approach that finds usable information among the hidden information associated with an IP address, then computes and infers the most likely owning organization.

Specifically, as shown in FIG. 1, this embodiment discloses a knowledge-graph-based method for tracing the organization behind an IP address, comprising:

S01. Obtaining domain name information: for a valid IP address whose organization is to be inferred, obtain the corresponding domain name information through reverse DNS resolution;

S02. Obtaining key domain name information: truncate and filter the domain name information to obtain its key information;

S03. Obtaining analysis samples: search on the key domain name information to obtain a number of corresponding web pages (the pages can be fetched with a web crawler), then rank and filter the pages and keep several of them as analysis samples;

S04. Sample processing: analyze the analysis samples and obtain the text content they contain;

S05. Entity extraction: apply a named entity recognition model to the text content and identify all organization names in it;

S06. Weight calculation: compute and assign a weight to each organization name according to the rank of the analysis sample and the tag element of the web page in which the text content appears;

S07. Entity clustering: cluster the entities by the edit distance between organization names, while using a prior knowledge graph to constrain and guide the clustering result, yielding several classes of organization name sets;

S08. Selecting the organization name: from the weight of each organization name, compute the weight of each class of organization name set by weighted average, select the set with the largest weight, and take the organization name with the largest weight in that set as the tracing result.

In the tracing method disclosed above, the IP address is reverse-resolved and the resulting domain name information is processed further, so that useful information can be extracted from the retrieved web page tags, web page text and similar content for in-depth analysis. Possible target organizations are screened out in this way, and after the calculations above the most likely one is selected as the inferred result, completing the trace.
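Steps S07 and S08 can be sketched together as follows. Several details are assumptions made for illustration: the greedy single-pass clustering, the relative distance threshold `max_ratio`, the representation of knowledge graph guidance as a set of `linked` name pairs, and the use of a plain mean for the weight of a set (the source says "weighted average" without specifying the averaging weights).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def _similar(a, b, max_ratio, linked):
    # Knowledge graph guidance: pairs known to be aliases or affiliated
    # are merged regardless of how different the strings look.
    if (a, b) in linked or (b, a) in linked:
        return True
    return edit_distance(a, b) <= max_ratio * max(len(a), len(b))


def cluster_names(names, max_ratio=0.4, linked=frozenset()):
    """Greedy clustering: join the first cluster holding a similar name."""
    clusters = []
    for name in names:
        for cluster in clusters:
            if any(_similar(name, other, max_ratio, linked) for other in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


def pick_organization(clusters, weights):
    """Choose the heaviest cluster (by mean weight), then its heaviest name."""
    best = max(clusters, key=lambda c: sum(weights[n] for n in c) / len(c))
    return max(best, key=lambda n: weights[n])
```

In this sketch a pair such as ("Acme Corp", "Acme Holdings") is too far apart by edit distance alone to be merged under the default threshold, so it ends up in one cluster only when the knowledge graph supplies the link.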

When the method is applied in practice, the domain name information can be processed in several ways. This embodiment uses one feasible and preferred way: in the step of obtaining key domain name information, the domain name information is processed with regular expressions that strip its prefix and suffix, leaving the key information in the domain name.

When analysis samples are obtained with this method, the search usually returns a large number of candidate pages containing a great deal of noise, so only the more relevant pages are chosen as analysis samples, typically about 20. When the selected analysis samples are processed, regular expressions and web document processing rules are used to remove the web page tags and obtain the text content.

Once the samples are obtained, every analysis sample influences the final inference, but not to the same degree, so during inference the influence of each sample is graded. One feasible way is as follows: to compute and assign organization-name weights according to the rank of the analysis samples, a weight is derived from the rank of the web page, and one of the weight factors is computed as follows:

[Formula image: Figure BDA0002734951210000081]

where ω_i is the weight of the analysis sample ranked i. That is, the selected analysis samples are ranked, and a sample's rank determines its weight. In general, pages are ranked from first to last by how relevant the web document content is to the search term; if an entry comes from a well-known site such as wiki or baidu, the page is moved up appropriately. The higher a page ranks, the more analytical value its content has and the larger its weight ω_i.

本实施例采用的方法中,分析样本的权重包括两个权重因子,其中一个权重因子为网页本身在所有分析样本中所占的权重,与其在所有分析样本中的影响力排序存在直接关系,上述内容中已经进行详细描述说明;另一个权重因子与每个网页所含的标签元素有关,所述的根据文本内容所在网页标签的标签元素,对机构名称进行权重计算和配置,根据网页标签元素对网页文档主题的贡献度,给网页标签元素赋予对应的语义权重ωj∈[0,10],并以该网页标签元素的语义权重反应文本内容对搜索结果的重要程度。In the method adopted in this embodiment, the weight of the analysis samples includes two weight factors, one of which is the weight of the webpage itself in all the analysis samples, which is directly related to the ranking of its influence in all the analysis samples. The content has been described in detail; another weight factor is related to the label element contained in each web page. According to the label element of the page label where the text content is located, the weight of the organization name is calculated and configured. The contribution of the topic of the web page document gives the corresponding semantic weight ω j ∈ [0, 10] to the web page tag element, and reflects the importance of the text content to the search results with the semantic weight of the web page tag element.

Terms contained in tag elements that contribute more to the document topic better reflect the topic of the web page. For example, the text inside the <title> element within <head> reflects the topic better than terms appearing in <h1> or <p> within <body>, and terms in <h1> and <h2> within <body> contribute more to the topic than terms in tags such as <p> and <li>. Following this reasoning, a tree of web page tag elements is built according to how important each tag element is to the document, as shown in Figure 2. Based on this tree, the semantic weights of the different tag elements are assigned as follows:

Figure BDA0002734951210000101

Following the semi-structured definition format of XML documents, the HTML tag elements are modeled as a tree in which each tag element is a node. Levels are first set by the element/child-element containment relationship and then raised or lowered according to importance. For example, an HTML document contains a head element and a body element, but the text in the title and meta elements inside head generally contributes more to the page topic, so the body element is lowered by one level, placing it at the same level as the title element inside head. The other elements are handled in the same way.

In an actual calculation, if a candidate institution name A appears in the <title> element of head, the web page is quite likely to be the home page of institution A, and h_j takes the value 2.
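The level lookup described above can be sketched as follows. This is only an illustration: the `TAG_LEVEL` table, the `DEFAULT_LEVEL` fallback, and the names `NameLevelFinder` and `name_levels` are hypothetical. Only the value 2 for `<title>` comes from the text; the other levels are assumptions chosen to stay within h_j ∈ [0, 4].

```python
from html.parser import HTMLParser

# Hypothetical level table: only "title" -> 2 is stated in the text; the
# remaining values are assumptions that stay inside the range [0, 4].
TAG_LEVEL = {"title": 2, "h1": 3, "h2": 3, "p": 4, "li": 4}
DEFAULT_LEVEL = 4  # assumed fallback for tags not listed above
VOID_TAGS = {"meta", "br", "img", "link", "input", "hr"}  # never closed


class NameLevelFinder(HTMLParser):
    """Record the tree level of each tag whose text contains a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.stack = []   # currently open tags, innermost last
        self.levels = []  # one level per occurrence of the name

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.name in data and self.stack:
            self.levels.append(TAG_LEVEL.get(self.stack[-1], DEFAULT_LEVEL))


def name_levels(html, name):
    """Return the assumed level h_j for every occurrence of `name`."""
    finder = NameLevelFinder(name)
    finder.feed(html)
    finder.close()
    return finder.levels


page = ("<html><head><title>Acme University</title></head>"
        "<body><h1>Welcome</h1><p>Acme University news</p></body></html>")
levels = name_levels(page, "Acme University")
```

Here the occurrence inside `<title>` maps to level 2, matching the example in the text, while the occurrence inside `<p>` falls to the assumed level 4.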

Given the web page weight and the semantic weights of the tag elements, the weight of an institution name term found in web page i can be calculated from the tag elements in which it appears and from its frequency and number of occurrences in the page, using the following formula:

Figure BDA0002734951210000102

where W_ij is the weight of institution name j in web page i, ω_i is the weight factor of web page i, ω_j is the semantic weight of institution name j, h_j is the level of institution name j in the tag element tree with h_j ∈ [0, 4], tf_ij is the term frequency of institution name j in web page i, and idf_j is the inverse document frequency of name term j,

Figure BDA0002734951210000111

N_i denotes all institution name entries found in web page i, and n_ij is the number of terms in web page i that contain institution name j.

The method disclosed above includes an entity clustering step, which helps to determine precisely the institution to which the security information belongs. Before entity clustering, at least the aliases of each candidate institution name and its parent/subordinate affiliations are looked up in the prior knowledge graph to form candidate institution entity pairs and triples; the head and tail entities of the triples form the entity set {(h_i, t_i)}. In this embodiment, the Path Ranking Algorithm (PRA) is used to find entities that have a specified relation. The main idea of PRA is to judge whether a relation type exists from the different paths that connect entities in the knowledge graph.

Specifically, paths in the knowledge graph that connect the two entities (h_i, t_i) of a triple are selected; each path is treated as a feature and its feature value is calculated, which yields the feature vector of the triple.

Specifically, the feature value of a path can be calculated as follows:

Figure BDA0002734951210000112

In the formula, P = R_1…R_l denotes a path, s is the start node, e is the tail node, e′ is an intermediate node, and h_{s,p′}(e′) denotes the probability that, under relation type R_l, the entity pair (s, e′) can be connected by path p′. Here,

Figure BDA0002734951210000113

denotes the probability that node e′ reaches the tail node e by a random walk under relation type R_l; this probability indicates whether the relation R_l exists between the entity pair (e′, e).
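To make the path feature concrete, the sketch below uses the recursion commonly given for PRA: the probability of reaching e along path P is the sum, over intermediate nodes e′, of the probability of reaching e′ along the shorter path P′ times the probability of stepping from e′ to e under the last relation. It is assumed, not confirmed, that this matches the formula in the figure, and that each step chooses uniformly among the outgoing edges of the given relation. The function name `path_feature`, the graph encoding, and the toy entities are hypothetical.

```python
from collections import defaultdict


def path_feature(graph, start, path):
    """Return {node: probability} for a random walk from `start` that
    follows the relation sequence `path`.

    `graph` maps (node, relation) to a list of target nodes. At every
    step the walker is assumed to pick uniformly among those targets.
    """
    dist = {start: 1.0}  # empty path: all probability mass on the start node
    for relation in path:
        nxt = defaultdict(float)
        for node, prob in dist.items():
            targets = graph.get((node, relation), [])
            for target in targets:
                # h_{s,P}(e) += h_{s,P'}(e') * P(e | e'; R_l)
                nxt[target] += prob / len(targets)
        dist = dict(nxt)
    return dist


# Toy knowledge graph with made-up entities and relation names.
toy_graph = {
    ("A", "subsidiary_of"): ["B", "C"],
    ("B", "alias"): ["D"],
    ("C", "alias"): ["D", "E"],
}
feature = path_feature(toy_graph, "A", ["subsidiary_of", "alias"])
```

In this toy graph the walk reaches D with probability 0.5 · 1 + 0.5 · 0.5 = 0.75 and E with probability 0.25; the value for the tail entity of a pair would be one component of that pair's feature vector.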

After the probability calculation for each feature vector is complete, the feature vectors are used to train a logistic regression classifier, and the probability predicted from the paths between two entities is used to judge whether a relation of the specified type exists between them. When the predicted probability is high, the two entities are likely to be linked by a specified relation such as alias or parent/subordinate institution, so they should be placed as an entity pair in the same institution name set. When the predicted probability is low, such a relation is unlikely, so the two entities should be placed in two different entity pairs.
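A minimal sketch of this classification step is shown below, assuming plain batch gradient descent on the logistic loss. The source does not specify the training procedure; the feature vectors, labels, learning rate, and the names `train_logistic` and `predict` are all illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the logistic loss."""
    dim = len(features[0])
    weights, bias = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for k in range(dim):
                grad_w[k] += err * x[k]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Probability that the specified relation holds for feature vector x."""
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Made-up path-feature vectors: label 1 means the relation exists.
train_x = [[0.9, 0.8], [0.8, 0.6], [0.1, 0.0], [0.0, 0.2]]
train_y = [1, 1, 0, 0]
weights, bias = train_logistic(train_x, train_y)
p_related = predict(weights, bias, [0.85, 0.7])
p_unrelated = predict(weights, bias, [0.05, 0.1])
```

A pair whose predicted probability exceeds a chosen cut-off (0.5 would be a natural default, though the source does not state one) would be treated as related and kept in the same institution name set.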

Prior knowledge from the network security knowledge graph is incorporated to infer the similarity between candidate organization entities, so that candidate entities related by alias or parent/subordinate affiliation are not wrongly clustered into different categories. For example, NBA and 美国篮球协会 both denote the same organization. If clustering relied only on the edit distance between organization names, the two would fall into different categories and the result would be ambiguous. Using PRA on the prior network security knowledge graph, it can be inferred that the two names have an alias relation and are therefore highly similar; this prior knowledge guides the clustering process so that they are clustered into the same category.
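One way to combine edit-distance clustering with prior knowledge is sketched below. It assumes that related pairs inferred from the knowledge graph act as must-link constraints and that names within a small edit distance are merged; the source describes the knowledge graph as constraining and guiding the clustering without fixing a mechanism. The threshold `max_distance`, the function names, and the sample names other than the NBA pair are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def cluster_names(names, related_pairs, max_distance=2):
    """Group names by edit distance, then merge pairs known to be related."""
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    def union(a, b):
        parent[find(a)] = find(b)

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if edit_distance(a, b) <= max_distance:
                union(a, b)
    # Prior knowledge: alias or parent/subordinate pairs must share a cluster.
    for a, b in related_pairs:
        if a in parent and b in parent:
            union(a, b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())


names = ["NBA", "美国篮球协会", "Acme Corp", "Acme Corp."]
clusters = cluster_names(names, related_pairs=[("NBA", "美国篮球协会")])
```

Without the related pair, the two NBA names would stay in separate clusters because their edit distance is large; with it, they end up together, while the two hypothetical "Acme" spellings are merged by edit distance alone.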

After the entity clustering of step 7, the result is a collection of candidate organization name sets, one per category, each containing several candidate names. The output must be the single most reliable, most probable term, which serves as the final organization name. After clustering, the weights of the candidate organization names calculated in step 6 are retained, and the weight of each category set is obtained as a weighted average of the weights of its candidate names. The set with the largest weight is selected, and the organization name with the largest weight in that set is output as the final inference result.
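The selection step can be sketched as follows. Because the source does not give the coefficients of the weighted average, this sketch assumes equal coefficients, that is, a plain mean of the name weights in each set. The function name `pick_institution` and the sample names and weights are hypothetical.

```python
def pick_institution(clusters, weights):
    """Pick the heaviest cluster, then the heaviest name inside it.

    `clusters` is a list of sets of names; `weights` maps each name to
    the weight computed for it in the earlier weighting step.
    """
    def cluster_weight(cluster):
        # Assumption: equal coefficients, so the average is a plain mean.
        return sum(weights[n] for n in cluster) / len(cluster)

    best_cluster = max(clusters, key=cluster_weight)
    return max(best_cluster, key=lambda n: weights[n])


sample_clusters = [{"Org A", "Org A Ltd"}, {"Org B"}]
sample_weights = {"Org A": 0.9, "Org A Ltd": 0.5, "Org B": 0.6}
result = pick_institution(sample_clusters, sample_weights)
```

Here the first set has mean weight 0.7 and the second 0.6, so the first set is chosen and its heaviest name, "Org A", is returned.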

The above are the embodiments listed for the present invention, but the invention is not limited to these optional embodiments. Those skilled in the art may combine them in any way to obtain further embodiments, and anyone may derive other forms of implementation in light of the invention. The specific embodiments above should not be construed as limiting the scope of protection of the invention, which is defined by the claims; the description may be used to interpret the claims.

Claims (10)

1. A knowledge-graph-based method for tracing the institution behind an IP address, characterized by comprising:
obtaining domain name information: for a valid IP address of the institution to be inferred, obtaining the domain name information corresponding to the IP address through reverse DNS resolution;
obtaining key domain name information: truncating and filtering the domain name information to obtain its key information;
obtaining analysis samples: searching with the key domain name information to obtain a number of corresponding web pages, and, after ranking and filtering the web pages, retaining some of them as analysis samples;
sample processing: analyzing the analysis samples and obtaining the text content present in them;
entity extraction: performing entity extraction on the text content with a named entity recognition model to identify all institution names in the text content;
weight calculation: calculating and assigning weights to the institution names according to the rank order of the analysis samples and the tag element of the web page in which the text content is located;
entity clustering: clustering entities by the edit distance between institution names, while using a prior knowledge graph to constrain and guide the clustering result, to obtain institution name sets of multiple categories;
selecting the institution name: from the weight of each institution name, calculating the weight of each category's institution name set by weighted averaging, selecting the institution name set with the largest weight, and taking the institution name with the largest weight in that set as the tracing inference result.
2. The knowledge-graph-based IP address institution tracing method according to claim 1, characterized in that: in the step of obtaining key domain name information, the domain name information is processed with regular expressions to remove its prefix and suffix, thereby obtaining the key information in the domain name.
3. The knowledge-graph-based IP address institution tracing method according to claim 1, characterized in that: when the analysis samples are processed, web page tags are removed by means of regular expressions and web document processing rules to obtain the text content.
4. The knowledge-graph-based IP address institution tracing method according to claim 1, wherein weights are calculated and assigned to institution names according to the rank order of the analysis samples, characterized in that: a weight is calculated from the rank of the web page, with one of the weight factors calculated as follows,
Figure FDA0002734951200000021
where ω_i is the weight of the analysis sample ranked i.
5. The knowledge-graph-based IP address institution tracing method according to claim 4, wherein weights are calculated and assigned to institution names according to the tag element of the web page in which the text content is located, characterized in that: each web page tag element is given a semantic weight ω_j ∈ [0, 10] according to its contribution to the topic of the web document, and this semantic weight reflects how important the text content is to the search result.
6. The knowledge-graph-based IP address institution tracing method according to claim 5, characterized in that: the weight of an institution name is calculated from the web page weight and the semantic weight of the web page tag element as follows
Figure FDA0002734951200000022
where W_ij is the weight of institution name j in web page i, ω_i is the weight factor of web page i, ω_j is the semantic weight of institution name j, a tree is built from the element tags, h_j is the level of institution name j in the tag element tree with h_j ∈ [0, 4], tf_ij is the term frequency of institution name j in web page i, and idf_j is the inverse document frequency of name term j,
Figure FDA0002734951200000023
N_i denotes all institution name entries found in web page i, and n_ij is the number of terms in web page i that contain institution name j.
7. The knowledge-graph-based IP address institution tracing method according to claim 1, characterized in that: before entity clustering, at least the aliases of each candidate institution name and its parent/subordinate affiliations are looked up in the prior knowledge graph to form candidate institution entity pairs and triples, and the head and tail entities of the triples form the entity set {(h_i, t_i)}.
8. The knowledge-graph-based IP address institution tracing method according to claim 7, characterized in that: paths in the knowledge graph that connect the two entities (h_i, t_i) of a triple are selected, each path is treated as a feature, and its feature value is calculated, which yields the feature vector of the triple.
9. The knowledge-graph-based IP address institution tracing method according to claim 8, characterized in that: the feature value of a path is calculated as follows
Figure FDA0002734951200000031
In the formula, P = R_1…R_l denotes a path, s is the start node, e is the tail node, e′ is an intermediate node, and h_{s,p′}(e′) denotes the probability that, under relation type R_l, the entity pair (s, e′) can be connected by path p′. Here,
Figure FDA0002734951200000032
denotes the probability that node e′ reaches the tail node e by a random walk under relation type R_l; this probability indicates whether the relation R_l exists between the entity pair (e′, e).
10. The knowledge-graph-based IP address institution tracing method according to claim 8, characterized in that: the feature vectors are used to train a logistic regression classifier, and the probability predicted from the paths between two entities is used to judge whether a relation of the specified type exists between them.
CN202011130373.2A 2020-10-21 2020-10-21 An IP address organization traceability method based on knowledge graph Active CN112364173B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011130373.2A CN112364173B (en) 2020-10-21 2020-10-21 An IP address organization traceability method based on knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011130373.2A CN112364173B (en) 2020-10-21 2020-10-21 An IP address organization traceability method based on knowledge graph

Publications (2)

Publication Number Publication Date
CN112364173A true CN112364173A (en) 2021-02-12
CN112364173B CN112364173B (en) 2022-03-18

Family

ID=74511364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011130373.2A Active CN112364173B (en) 2020-10-21 2020-10-21 An IP address organization traceability method based on knowledge graph

Country Status (1)

Country Link
CN (1) CN112364173B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180121601A1 (en) * 2016-10-28 2018-05-03 Edico Genome, Corp. Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing
CN109617728A (en) * 2018-12-14 2019-04-12 中国电子科技网络信息安全有限公司 A distributed IP-level network topology detection method based on multi-protocol
CN109885692A (en) * 2019-01-11 2019-06-14 平安科技(深圳)有限公司 Knowledge data storage method, device, computer equipment and storage medium
CN110113314A (en) * 2019-04-12 2019-08-09 中国人民解放军战略支援部队信息工程大学 Network safety filed knowledge mapping construction method and device for dynamic threats analysis
CN110188191A (en) * 2019-04-08 2019-08-30 北京邮电大学 A method and system for constructing entity-relationship graphs for online community texts
CN110362660A (en) * 2019-07-23 2019-10-22 重庆邮电大学 A kind of Quality of electronic products automatic testing method of knowledge based map
CN110674310A (en) * 2019-09-04 2020-01-10 东华大学 Knowledge graph-based industrial Internet of things identification method
US10630715B1 (en) * 2019-07-25 2020-04-21 Confluera, Inc. Methods and system for characterizing infrastructure security-related events
CN111177591A (en) * 2019-12-10 2020-05-19 浙江工业大学 Web data optimization method based on knowledge graph for visualization requirements
CN111193749A (en) * 2020-01-03 2020-05-22 北京明略软件系统有限公司 Attack tracing method and device, electronic equipment and storage medium
CN111247773A (en) * 2017-04-03 2020-06-05 力士塔有限公司 Method and apparatus for ultra-secure last-in-the-road communication
CN111581397A (en) * 2020-05-07 2020-08-25 南方电网科学研究院有限责任公司 A network attack source tracing method, device and device based on knowledge graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YIN MIN TUN 等: "A Two-Phase Approach for Stance Classification in Twitter Using Name Entity Recognition and Term Frequency Feature", 《2019 IEEE/ACIS 18TH INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION SCIENCE (ICIS)》 *
冯鑫 等: "基于知识实体的突发公共卫生事件数据平台构建研究", 《知识管理论坛》 *
周园春 等: "科技大数据知识图谱构建方法及应用研究综述", 《中国科学:信息科学》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113141378A (en) * 2021-05-18 2021-07-20 中国互联网络信息中心 Bad domain name identification method and device
CN113141378B (en) * 2021-05-18 2022-12-02 中国互联网络信息中心 A method and device for identifying bad domain names
WO2023040530A1 (en) * 2021-09-18 2023-03-23 华为技术有限公司 Webpage content traceability method, knowledge graph construction method and related device
CN114422170A (en) * 2021-12-08 2022-04-29 中国科学院信息工程研究所 Method and system for reversely acquiring domain name from IP address
CN114422170B (en) * 2021-12-08 2023-01-17 中国科学院信息工程研究所 A method and system for reversely obtaining a domain name from an IP address
CN114328937A (en) * 2022-03-10 2022-04-12 中国医学科学院医学信息研究所 Scientific research institution information processing method and device
CN115987803A (en) * 2022-12-23 2023-04-18 天翼安全科技有限公司 Organization mechanism determination method of autonomous system and related device
CN116011564A (en) * 2022-12-28 2023-04-25 桂林电子科技大学 An entity relationship completion method, system and application for electric equipment
CN117235200A (en) * 2023-09-12 2023-12-15 杭州湘云信息技术有限公司 Data integration method and device based on AI technology, computer equipment and storage medium
CN117235200B (en) * 2023-09-12 2024-05-10 杭州湘云信息技术有限公司 Data integration method and device based on AI technology, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112364173B (en) 2022-03-18

Similar Documents

Publication Publication Date Title
CN112364173B (en) An IP address organization traceability method based on knowledge graph
US11799823B2 (en) Domain name classification systems and methods
Moghimi et al. New rule-based phishing detection method
US9276956B2 (en) Method for detecting phishing website without depending on samples
Sonowal Phishing email detection based on binary search feature selection
Diesner et al. Using network text analysis to detect the organizational structure of covert networks
CN110177114A (en) The recognition methods of network security threats index, unit and computer readable storage medium
CN110855716B (en) Self-adaptive security threat analysis method and system for counterfeit domain names
CN108023868B (en) Malicious resource address detection method and device
CN110572359A (en) Phishing webpage detection method based on machine learning
He et al. Malicious domain detection via domain relationship and graph models
CN105138921A (en) Phishing site target domain name identification method based on page feature matching
Li et al. Phishing detection based on newly registered domains
Shyni et al. Phishing detection in websites using parse tree validation
Carragher et al. Detection and discovery of misinformation sources using attributed webgraphs
CN113868649B (en) Malicious outer chain detection method and device, electronic equipment and storage medium
CN119766581B (en) Malicious website community discovery method and system based on website fission
Teoh et al. Analyst intuition inspired high velocity big data analysis using PCA ranked fuzzy k-means clustering with multi-layer perceptron (MLP) to obviate cyber security risk
Saha et al. Mobile device and social media forensic analysis: impacts on cyber-crime
Chen et al. Phishing target identification based on neural networks using category features and images
Alshammery et al. Classifying illegal activities on tor network using hybrid technique
CN115580422B (en) A black link identification method, device, equipment and storage medium
Wang et al. TSMWD: a high-speed malicious web page detection system based on two-step classifiers
CN117220921A (en) Uncertainty reasoning-based malicious and false website traceability analysis method and device and electronic equipment
Wedyan et al. An Associative Classification Data Mining Approach for Detecting Phishing Websites

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant