CN116303443A

CN116303443A - An Automatic Extraction Method of Network Protocol Design Knowledge Based on RFC Documents

Info

Publication number: CN116303443A
Application number: CN202310181424.1A
Authority: CN
Inventors: 李平; 李丰; 陈婧婷; 陈晴方; 霍玮
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2023-02-20
Filing date: 2023-02-20
Publication date: 2023-06-23

Abstract

The invention discloses a method for automatically extracting network protocol design knowledge from RFC documents. The method includes: obtaining a chapter list of the RFC document and a line list for each chapter; obtaining the structured descriptions in the chapter list and the line lists to generate a key line list; parsing the structured descriptions in the key line list to obtain a formatted field list containing partial information and/or a list of undefined names of an automaton; supplementing the formatted field list containing partial information and/or the undefined name list of the automaton based on the content of the chapter to which the key lines of the structured description belong; and obtaining the network protocol design knowledge extraction result from the formatted field list containing complete information and the automaton transition list. The invention can automatically extract the relevant network protocol design information.

Description

An Automatic Extraction Method of Network Protocol Design Knowledge Based on RFC Documents

Technical Field

The invention relates to the field of computer technology, in particular to design comprehension techniques for network protocols, and specifically to a method for automatically extracting network protocol design knowledge from RFC documents.

Background

A network protocol is an "agreement" reached in advance between computers that communicate over a network; any parties that follow the same protocol can communicate. As basic infrastructure of the Internet, network protocols play a vital role, so their analysis is particularly important. Protocol analysis methods fall into two main categories, dynamic testing and static analysis, and both require domain knowledge to guide the construction of protocol message structures. RFC (Request for Comments) documents are a numbered series of standards that record Internet specifications, protocols, and procedures; the basic Internet communication protocols are all described in detail in their corresponding RFC documents. Protocol analysis requires researchers to read a large number of RFC documents to obtain domain knowledge: dynamic testing needs the structural composition of protocol messages, while static detection needs specific information such as the length and type of each message field. However, RFC documents have been drafted over a long time span and by many organizations, and they cover many kinds of protocols. As a result, their descriptions are not standardized and are hard to read and understand manually, which makes protocol analysis very difficult.

Summary of the Invention

In view of the above problems, the invention proposes a method for automatically extracting network protocol design knowledge from RFC documents. By analyzing and understanding both the structured representations and the natural language descriptions in an RFC document, the method automatically extracts the relevant network protocol design information.

A method for automatically extracting network protocol design knowledge based on RFC documents, the method comprising:

obtaining, based on the txt format of the RFC document, a chapter list of the RFC document and a line list for each chapter;

obtaining the structured descriptions in the chapter list and the line lists to generate a key line list, wherein the structured descriptions include a message structure and/or an automaton;

parsing the structured descriptions in the key line list to obtain a formatted field list containing partial information and/or a list of undefined names of the automaton, wherein the information in the formatted field list includes: chapter name, field name, field type, field length, field value, and field description;

where the structured description is a message structure, updating the field length, field value, and field type based on the content of the chapter to which the key lines of the structured description belong, to obtain a formatted field list containing complete information;

where the structured description is an automaton, obtaining the content of the chapter to which the key lines of the structured description belong, matching the names in the undefined name list against that chapter content, and storing the resulting automaton transition information in an automaton transition list;

combining the formatted field list containing complete information with the automaton transition list to obtain the network protocol design knowledge extraction result.

Further, the message structure includes one or more of: a figure form, a pseudo-C code form, and an ASN.1 form.

Obtaining the structured descriptions in the chapter list and the line lists includes:

where the message structure is in figure form, using the regular expression r'^\+-+\+$' to identify the start line of the structured description, then continuing to traverse the line list and taking the current line as the end line when a single blank line is encountered;

where the message structure is in pseudo-C code form, using the regular expression r'^\s*(struct|enum|typedef struct)\s+\w*\s*\{' to identify the start line and the regular expression r'\}[^;]*;' to identify the end line, with the curly braces between the start line and the end line correctly closed;

where the message structure is in ASN.1 form, using the regular expressions r'\w+\s+(OBJECT-GROUP|MODULE-COMPLIANCE|OBJECT IDENTIFIER|OBJECT-TYPE|SEQUENCE)' and r'\w+\s+::=\s+\w+\s\{' to identify the start line, then continuing to traverse the line list and taking the current line as the end line when it ends with a right curly brace and the next line is blank.

Further, parsing the structured descriptions in the key line list to obtain a formatted field list containing partial information includes:

for a message structure in figure form, obtaining the key lines that contain a "|" or ":" symbol, and splitting each such key line with that symbol as the separator to obtain several fields and their field names;

obtaining the field length of each field from any variable-length marker contained in the field and from the space placeholders that the field name occupies in the figure, and storing it in the formatted field list; wherein, if the "()" in the field name contains field length information, the field length is updated with that information before being stored in the formatted field list;

normalizing the field name and storing it in the formatted field list, where the normalization includes: removing redundant whitespace and the additional information enclosed in "()" and "[]" from the field name, and expanding abbreviations in the field name into full words;

obtaining the chapter name corresponding to the key line and storing it in the formatted field list;

and/or,

for a message structure in pseudo-C code form, determining the type of information described by the key lines from the first element obtained after splitting the current line on spaces, the information type being either structure information or enumeration information;

if the current line describes structure information, using the regular expression r'[\w\:]+\s+[\*\w\d\[\]]+;' to match the lines that describe field information, together with any field length given in "<>" and the chapter name of the matched line, to obtain the field name, field type, field length, and chapter name, and storing them in the formatted field list;

if the current line describes enumeration information, using the regular expression r'[\w\d]+\([\w\d\.]+\)' to match the lines that describe field information to obtain the field value information, then matching the field to which those values belong to obtain the field name, field type, and chapter name, and storing them in the formatted field list;

and/or,

for a message structure in ASN.1 form, obtaining the field type from the "SYNTAX" keyword in the key line list and storing it in the formatted field list;

determining, from the first line of the key line list, whether the key line text describes nested information;

if it is nested information, obtaining the field name from the start line of the key line text, which begins with the current field name, obtaining the parent field name from the content of UP_FIELD in the "::={UP_FIELD}" description on the end line, and storing the field name and the parent field name in the formatted field list;

if it is not nested information, taking the information in the first line as the parent field name for every field parsed subsequently, obtaining the field name from the "FIELD_NAME FIELD_TYPE," pattern when traversing each subsequent key line, and storing the field name and the parent field name in the formatted field list;

obtaining the chapter name corresponding to the key lines and storing it in the formatted field list.

Further, updating the field length includes:

obtaining the chapter corresponding to the chapter name and re-dividing its content into sentences to obtain the updated chapter content;

generating, from the field name, a pattern for matching field descriptions, matching the pattern and the field name against the updated chapter content, and, on a successful match, taking the current line and several following lines as the field description;

matching the updated chapter content against the field name and, on a successful match, performing part-of-speech analysis on the sentence containing the field name to find a clause of the form verb + cardinal number;

taking the word tagged as the cardinal number as the numeric value of the field length, and deriving the unit of the value from the token that immediately follows the cardinal number;

updating the field length in the formatted field list according to the value of the field length and its unit.
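
The length-update step can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: a regular expression stands in for the part-of-speech analysis described above, the names `find_field_length`, `WORD_NUMBERS`, and `UNIT_BITS` are hypothetical, and the result is normalized to bits, which the text does not prescribe.

```python
import re

# Hypothetical lookup for spelled-out cardinal numbers (deliberately incomplete).
WORD_NUMBERS = {"one": 1, "two": 2, "three": 3, "four": 4, "eight": 8, "sixteen": 16}
UNIT_BITS = {"octet": 8, "byte": 8, "bit": 1}

# A cardinal number followed (after spaces or a hyphen) by a unit word.
# This regex replaces the "verb + cardinal number" POS pattern of the text.
LENGTH_RE = re.compile(
    r"\b(\d+|" + "|".join(WORD_NUMBERS) + r")[\s-]*(octet|byte|bit)s?\b",
    re.IGNORECASE,
)


def find_field_length(field_name, sentences):
    """Return the field length in bits from the first sentence that names the
    field and contains a number + unit, or None if no such sentence exists."""
    for sentence in sentences:
        if field_name.lower() not in sentence.lower():
            continue
        match = LENGTH_RE.search(sentence)
        if match:
            raw = match.group(1).lower()
            value = int(raw) if raw.isdigit() else WORD_NUMBERS[raw]
            return value * UNIT_BITS[match.group(2).lower()]
    return None
```

A real implementation would need a tagger to tell a length from any other number in the sentence; the sketch simply takes the first number that is followed by a unit.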

Further, updating the field value includes:

matching the updated chapter content against the field name and, on a successful match, determining whether the sentence contains a modal verb or a comparison word;

if a modal verb or comparison word is present, performing part-of-speech analysis on the sentence containing the field name to find clauses of the form "MUST (NOT) be (set to) NUM", "less than NUM", "greater than NUM", "is equal to NUM", or "are not equal to NUM";

converting the modal verb or comparison word into the corresponding symbol, and extracting the value NUM from the matched clause;

updating the field value in the formatted field list based on the symbol and the value.
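
The clause forms listed above lend themselves to a small pattern table. The following Python sketch is an assumption-laden illustration: it uses regular expressions instead of part-of-speech analysis, and the names `CONSTRAINTS` and `extract_value_constraints` as well as the choice of symbols are hypothetical.

```python
import re

# Each entry maps one clause form from the text to a comparison symbol.
CONSTRAINTS = [
    (re.compile(r"\bMUST\s+NOT\s+be\s+(?:set\s+to\s+)?(\d+)", re.I), "!="),
    (re.compile(r"\bMUST\s+be\s+(?:set\s+to\s+)?(\d+)", re.I), "=="),
    (re.compile(r"\b(?:is|are)\s+not\s+equal\s+to\s+(\d+)", re.I), "!="),
    (re.compile(r"\b(?:is|are)\s+equal\s+to\s+(\d+)", re.I), "=="),
    (re.compile(r"\bless\s+than\s+(\d+)", re.I), "<"),
    (re.compile(r"\bgreater\s+than\s+(\d+)", re.I), ">"),
]


def extract_value_constraints(field_name, sentences):
    """Collect (symbol, value) pairs from sentences that mention the field."""
    found = []
    for sentence in sentences:
        if field_name.lower() not in sentence.lower():
            continue
        for pattern, symbol in CONSTRAINTS:
            for match in pattern.finditer(sentence):
                found.append((symbol, int(match.group(1))))
    return found
```

The sketch does not handle a negation placed in front of a comparison (for example "MUST NOT be less than 19"); such sentences would need the fuller analysis that the text describes.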

Further, updating the field type includes:

deriving the field type from the field description information in the structured description, which includes the protocol message field name, field type, and field length, and updating the formatted field list;

or,

deriving the field type from the field name and field length in the formatted field list, and updating the formatted field list;

or,

deriving the field type from the field name in the formatted field list, and updating the formatted field list;

or,

deriving the field type from the field length in the formatted field list, and updating the formatted field list.

Further, the automaton includes one or more of: a text list form and a dot-line diagram form.

Obtaining the structured descriptions in the chapter list and the line lists to generate the key line list includes:

where the automaton is in text list form, identifying the current line as the start line if it matches one of the regular expressions [r"State\s+Event\s+Action\s+New State", r"\|\sState\s\|"], and, during the subsequent traversal, identifying the current line as the end line if it is blank and the start line is not None;

where the automaton is in dot-line diagram form, identifying the current line as the start line if it matches one of the regular expressions [r'^[\|\+]+---->', r'<----[-\|\+]+$'], and, during the subsequent traversal, identifying the current line as the end line if it is blank and the start line is not None.

Further, parsing the structured descriptions in the key line list to obtain the undefined name list of the automaton includes:

splitting each key line on spaces to obtain a word list;

traversing the word list;

if a word that is entirely uppercase, or whose part of speech is a noun, can be found, adding that word to the undefined name list;

otherwise, using the regular expressions [r'(Event\d+):(\w*)', r'(\w*)[sS]tate:', r'State\(s\):(\w+)'] to match the state names and event names of the automaton and adding them to the undefined name list.
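
A minimal Python sketch of this name-collection step follows. It makes several assumptions: the noun check is omitted because it needs a part-of-speech tagger, the third pattern is written as `State\(s\):` on the assumption that the parentheses are literal, and `collect_undefined_names` is a hypothetical name.

```python
import re

# Fallback patterns quoted in the text (third one with assumed literal parentheses).
NAME_PATTERNS = [r"(Event\d+):(\w*)", r"(\w*)[sS]tate:", r"State\(s\):(\w+)"]


def collect_undefined_names(key_lines):
    """Gather candidate state and event names from automaton key lines."""
    names = []
    for line in key_lines:
        # All-uppercase words are taken as names directly.
        candidates = [w for w in line.split() if w.isalpha() and w.isupper() and len(w) > 1]
        if not candidates:
            # Otherwise fall back to the regular expressions.
            for pattern in NAME_PATTERNS:
                for match in re.finditer(pattern, line):
                    candidates.extend(g for g in match.groups() if g)
        for name in candidates:
            if name not in names:  # keep first occurrence, preserve order
                names.append(name)
    return names
```

For example, the line `Event1:ManualStart` contributes both `Event1` and `ManualStart`, while `ConnectState:` contributes `Connect`.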

Further, where the structured description is an automaton, obtaining the content of the chapter to which the key lines of the structured description belong, matching the names in the undefined name list against that chapter content, and storing the resulting automaton transition information in the automaton transition list includes:

obtaining the content of the chapter to which the key lines of the structured description belong;

re-dividing the chapter content into sentences to obtain the updated chapter content;

if the updated chapter content contains a sentence with a keyword that indicates a state transition, and a name from the undefined name list appears both before and after that keyword, determining that the sentence contains automaton transition information that changes the state, splitting the sentence into an automaton transition quadruple, and storing the quadruple in the automaton transition list, wherein the quadruple consists of: source state, event, action, and target state;

if the updated chapter content contains a sentence with a keyword that indicates a state transition, and a name from the undefined name list appears before or after that keyword but not both, determining that the sentence contains automaton transition information that does not change the state, splitting the sentence into an automaton transition quadruple, and storing the quadruple in the automaton transition list.
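
The two cases above can be sketched in Python as follows. This is a partial illustration only: the transition keywords are hypothetical (the text does not list them), event and action extraction is left out, and treating the single matched name as both source and target in the non-changing case is an assumption.

```python
import re

# Hypothetical transition keywords; the source text does not enumerate them.
TRANSITION_KEYWORDS = ("changes its state to", "transitions to", "moves to")


def _names_in(text, names):
    """Return the known names that occur as whole words in text."""
    return [n for n in names if re.search(rf"\b{re.escape(n)}\b", text)]


def classify_transitions(sentences, names):
    """Classify sentences as state-changing or not, based on where names occur
    relative to the transition keyword."""
    results = []
    for sentence in sentences:
        for keyword in TRANSITION_KEYWORDS:
            pos = sentence.find(keyword)
            if pos < 0:
                continue
            before = _names_in(sentence[:pos], names)
            after = _names_in(sentence[pos + len(keyword):], names)
            if before and after:
                results.append({"changes_state": True, "source": before[0], "target": after[0]})
            elif before or after:
                only = (before or after)[0]
                results.append({"changes_state": False, "source": only, "target": only})
            break  # one keyword per sentence is enough for this sketch
    return results
```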

Further, the method also includes:

generating detection rules and/or a fuzzing configuration file based on the network protocol design knowledge extraction result.

Compared with the prior art, the core of the method of the invention is to recognize the five common structured description forms that documents use for message structures and automata, and to extract from them key information such as message field names and automaton state names. On this basis, pattern matching and natural language extraction are combined to obtain richer and more comprehensive design knowledge from those structures and from the other natural language text in the document. The result is complete design knowledge, including protocol message field names, field types, field lengths, and automaton information. This knowledge can guide rule generation in static detection and input generation in fuzzing, provides domain knowledge support for code defect detection and vulnerability mining in protocol software, and improves detection and mining effectiveness.

Brief Description of the Drawings

Figure 1 is a flowchart of the method for automatically extracting network protocol design knowledge based on RFC documents.

Figure 2 is an example of a figure-form description.

Figure 3 is an example of a pseudo-C code description.

Figure 4 is an example of an ASN.1 description.

Figure 5 is an example of a text list description.

Figure 6 is an example of a dot-line diagram automaton description.

Figure 7 shows the representation of the fuzzing configuration file.

Figure 8 is an example of an automaton test sequence.

Figure 9 shows the first structured description in RFC 4271.

Figure 10 is an example of fuzzing configuration file generation.

Detailed Description

To make the above features and advantages of the invention easier to understand, the technical solution of the invention is further described below through specific embodiments.

The implementation of the invention is divided into five stages: a preprocessing stage, a structured description recognition stage, a key information extraction stage, a text understanding stage, and a knowledge expression stage.

Preprocessing stage:

1. RFC document preprocessing

First, the txt format of the RFC document is obtained from the RFC Editor website (the officially maintained site for editing RFC documents). The document is then traversed and its page headers and footers are removed; the lines near each header and footer generally contain the special byte 0x0c (form feed), which can be used for matching. Next, several regular expressions (for example r'^[0-9.]{1,8}.?+') are used to match the heading lines in the remaining text and split the document into chapters. The result is a chapter list in which each element is a chapter; each chapter is in turn a line list whose elements are the original lines of the document.
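
The preprocessing step can be sketched in Python as below. The sketch assumes that the form feed sits on its own line between the page footer and the next page header, and `HEADING_RE` is a simplified, hypothetical heading pattern rather than the expression quoted above; the function names are likewise illustrative.

```python
import re

FORM_FEED = "\x0c"  # the 0x0c byte found at RFC page breaks
# Hypothetical heading pattern: section number, whitespace, then title text.
HEADING_RE = re.compile(r"^[0-9]+(\.[0-9]+)*\.?\s+\S")


def strip_page_breaks(lines):
    """Drop each form-feed line, the footer above it, and the header below it
    (plus the blank lines in between)."""
    drop = set()
    for i, line in enumerate(lines):
        if FORM_FEED not in line:
            continue
        drop.add(i)
        j = i - 1
        while j >= 0 and not lines[j].strip():  # blank lines above the form feed
            drop.add(j)
            j -= 1
        if j >= 0:
            drop.add(j)  # page footer
        if line.replace(FORM_FEED, "").strip():
            continue  # header shares the form-feed line; nothing below to drop
        k = i + 1
        while k < len(lines) and not lines[k].strip():  # blank lines below
            drop.add(k)
            k += 1
        if k < len(lines):
            drop.add(k)  # page header
    return [line for idx, line in enumerate(lines) if idx not in drop]


def split_chapters(lines):
    """Split lines into chapters; each chapter is a list of original lines."""
    chapters, current = [], []
    for line in lines:
        if HEADING_RE.match(line) and current:
            chapters.append(current)
            current = []
        current.append(line)
    if current:
        chapters.append(current)
    return chapters
```

Because RFC body text is indented, anchoring the heading pattern at column 0 keeps ordinary sentences that start with a number from being mistaken for headings. Any text before the first heading ends up as the first element of the chapter list.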

Structured description recognition stage:

Structured descriptions fall into two broad categories: message structures and automata. Message structures come in three forms (figure, pseudo-C code, and ASN.1), and automata come in two forms (text list and dot-line diagram).

2. Identifying the structured description style of the message structure

Traverse the chapter list obtained in step 1 and the line list of each chapter, and try to match every form of structured description text according to the characteristics summarized for each form. Only the first instance is identified. If identification succeeds, the structured description style of the current RFC document is set; if it fails, the current RFC document contains nothing to extract.

The identification steps for the different message structure styles are as follows:

a. Figure form: For a figure drawn with "+-" symbols that describes the composition of message fields (see Figure 2), use the regular expression r'^\+-+\+$' to identify the start line and record its index in the current chapter. Continue traversing the line list of the chapter; when a single blank line is encountered (a line containing only the newline character '\n'), record the current line as the end line. Set the structured description style of the message structure of the current document to figure form.

b. Pseudo-C code form: For key line text that describes the composition of message fields in pseudo-C code (see Figure 3), use the regular expression r'^\s*(struct|enum|typedef struct)\s+\w*\s*\{' to identify the start line and the regular expression r'\}[^;]*;' to identify the end line. Whether the "{}" pairs in the key lines are correctly closed must also be checked; a stack is used to handle this. Set the structured description style of the message structure of the current document to pseudo-C code form.

c. ASN.1 form: For key line text that describes the composition of message fields in ASN.1 (see Figure 4), use the regular expressions r'\w+\s+(OBJECT-GROUP|MODULE-COMPLIANCE|OBJECT IDENTIFIER|OBJECT-TYPE|SEQUENCE)' and r'\w+\s+::=\s+\w+\s\{' to identify the start line. If the current line ends with a right curly brace "}" and the next line is blank, it is identified as the end line. Set the structured description style of the message structure of the current document to ASN.1 form.
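
The three recognition rules can be combined into one scan, sketched here in Python. The regular expressions are the ones quoted in the text; the function name `find_structure`, the style labels, stripping indentation before the figure test, and using a depth counter in place of an explicit stack are assumptions of this sketch.

```python
import re

FIGURE_START = re.compile(r"^\+-+\+$")
PSEUDO_C_START = re.compile(r"^\s*(struct|enum|typedef struct)\s+\w*\s*\{")
PSEUDO_C_END = re.compile(r"\}[^;]*;")
ASN1_STARTS = [
    re.compile(r"\w+\s+(OBJECT-GROUP|MODULE-COMPLIANCE|OBJECT IDENTIFIER|OBJECT-TYPE|SEQUENCE)"),
    re.compile(r"\w+\s+::=\s+\w+\s\{"),
]


def find_structure(lines):
    """Return (style, start_index, end_index) for the first structured
    description found in a chapter's line list, or None."""
    for i, raw in enumerate(lines):
        if FIGURE_START.match(raw.strip()):
            for j in range(i + 1, len(lines)):
                if not lines[j].strip():  # first blank line ends the figure
                    return ("figure", i, j)
            return ("figure", i, len(lines) - 1)
        if PSEUDO_C_START.match(raw):
            depth = 0  # counts unclosed "{"; equivalent to a stack of braces
            for j in range(i, len(lines)):
                depth += lines[j].count("{") - lines[j].count("}")
                if depth == 0 and PSEUDO_C_END.search(lines[j]):
                    return ("pseudo_c", i, j)
        if any(p.search(raw) for p in ASN1_STARTS):
            for j in range(i, len(lines)):
                next_blank = j + 1 >= len(lines) or not lines[j + 1].strip()
                if lines[j].rstrip().endswith("}") and next_blank:
                    return ("asn1", i, j)
    return None
```

Note that the figure pattern, as quoted, matches only a border made of one unbroken run of dashes such as `+-----------+`.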

3. Identifying the structured description style of the automaton information

The identification steps for the different automaton description styles are as follows:

a. Text list form: For the text list form (see Figure 5), if the current line matches one of the regular expressions [r"State\s+Event\s+Action\s+New State", r"\|\sState\s\|"], it is identified as the start line of the automaton description. Continue traversing the following lines of the current chapter; when the current line is blank and the start line is not None, it is identified as the end line. Set the structured description style of the automaton of the current document to text list form.

b. Dot-line diagram form: For the dot-line diagram form (see Figure 6), if the current line matches one of the regular expressions [r'^[\|\+]+---->', r'<----[-\|\+]+$'], it is identified as the start line of the dot-line diagram description of the automaton. Continue traversing the following lines of the current chapter; when the current line is blank and the start line is not None, it is identified as the end line. Set the structured description style of the automaton of the current document to dot-line diagram form.
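
Both automaton styles share the same start/end logic, which the following Python sketch captures. The regular expressions come from the text; the name `find_automaton`, the style labels, and stripping each line before matching are assumptions made for illustration.

```python
import re

TEXT_LIST_STARTS = [r"State\s+Event\s+Action\s+New State", r"\|\sState\s\|"]
DIAGRAM_STARTS = [r"^[\|\+]+---->", r"<----[-\|\+]+$"]


def find_automaton(lines):
    """Return (style, start_index, end_index) of the first automaton
    description in a chapter's line list, or None if there is none."""
    start, style = None, None
    for i, raw in enumerate(lines):
        line = raw.strip()
        if start is None:
            if any(re.search(p, line) for p in TEXT_LIST_STARTS):
                start, style = i, "text_list"
            elif any(re.search(p, line) for p in DIAGRAM_STARTS):
                start, style = i, "diagram"
        elif not line:
            # A blank line after a recorded start line ends the description.
            return (style, start, i)
    if start is not None:
        return (style, start, len(lines) - 1)
    return None
```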

4. Extracting the structured descriptions

Traverse again, from the beginning, the chapter list obtained in step 1 and the line list of each chapter. Using the characteristics of the structured description styles identified in steps 2 and 3, match the start line and end line of each structured description, record the chapter in which it appears, and add the key lines to the corresponding key line list in order. The result is a set of key line lists, one for each description style.

Key information extraction stage:

5. Extracting message structure information

Traverse the key line list of the message structure and parse it according to its description style to obtain the corresponding network protocol design information. Field description information, including the protocol message field name, field type, and field length, is stored in formatted fields in a normalized form. Each field is stored as a six-tuple <chapter name, field name, field type, field length, field value, field description>; the derivation of some field types, field lengths, and field descriptions is described in the next stage. The key line text is parsed according to the structured description style of the RFC document obtained in step 2, as follows:

a. Figure form: Traverse the key line list obtained in step 4 and parse each line. If the current line begins with "+-" or is blank, skip it, since such lines are usually separators. If the current line contains a "|" or ":" symbol, split the line with that symbol as the separator to obtain the fields described by the line. Then traverse those fields. First determine whether a field is variable-length, that is, whether the current line contains markers such as "variable", "/", or "~"; if so, set the size of the current field to -1 to indicate that its length is undetermined. Otherwise the field has a fixed length: the field name includes the spaces it occupies in the figure, so the actual length of the field can be computed from the length of that text, mainly as size = (len(field) + 3) // BIT_SBL_SIZE, where BIT_SBL_SIZE is the number of characters that represent one bit in the current figure style (for example, if "+-" represents one bit, BIT_SBL_SIZE is 2). Next, clean up the field name: remove redundant whitespace and the additional information enclosed in "()" or "[]". If the "()" contains information describing the field length (for example 2octets or 4bits), compute the length from it, update the size of the current field, and mark the size as unchangeable by setting a flag. Then expand any abbreviations in the field name into full words. Finally, update the field name, write the chapter containing the current key line into the six-tuple as the chapter name, and add the resulting preliminary field to the formatted field list.

Extracting the structured information in diagram form yields, at this point, the "section name", "field name" and "field length".
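The diagram-form rules above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `parse_diagram_line`, the dictionary keys, and the `PAREN_LEN` pattern for length notes are assumptions made for the example, and abbreviation expansion is left out. The size formula is taken from the description as stated.

```python
import re

BIT_SBL_SIZE = 2  # characters that draw one bit ("+-") in this diagram style
VARIABLE_MARKERS = ("variable", "/", "~")
# Hypothetical pattern for a length note such as "(2 octets)" or "(4bits)".
PAREN_LEN = re.compile(r"\((\d+)\s*(octets?|bytes?|bits?)\)", re.IGNORECASE)


def parse_diagram_line(line, section_name):
    """Return the preliminary fields described by one key line."""
    stripped = line.strip()
    if not stripped or stripped.startswith("+-"):
        return []  # blank line or separator line
    sep = "|" if "|" in stripped else (":" if ":" in stripped else None)
    if sep is None:
        return []
    # The description checks the whole line for variable-length markers.
    is_variable = any(m in stripped.lower() for m in VARIABLE_MARKERS)
    fields = []
    for cell in stripped.split(sep):
        if not cell.strip():
            continue  # empty text outside the outer separators
        size = -1 if is_variable else (len(cell) + 3) // BIT_SBL_SIZE
        size_locked = False
        note = PAREN_LEN.search(cell)
        if note:
            count = int(note.group(1))
            is_bits = note.group(2).lower().startswith("bit")
            size = count if is_bits else count * 8  # sizes kept in bits
            size_locked = True  # later stages must not overwrite this size
        name = re.sub(r"\(.*?\)|\[.*?\]", "", cell)  # drop "()" and "[]" notes
        name = " ".join(name.split())  # collapse placeholder whitespace
        fields.append({
            "section_name": section_name,
            "field_name": name,
            "field_size": size,
            "size_locked": size_locked,
        })
    return fields
```

Field type, value and description stay empty at this point; they are filled in during the text understanding stage.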

b. Pseudo-C code form: Traverse the key line list obtained in step 3 and parse each line. First use the first line to decide whether the key line text describes structure information or enumeration information, because structure information contains field definitions whereas enumeration information contains field values; the first element obtained by splitting the line on spaces is enough to decide. If the text describes structure information, traverse each line and use the regular expression r'[\w\:]+\s+[\*\w\d\[\]]+;' to match the lines that describe a field. Such lines generally have the form "TYPE FIELD_NAME;", so the type and field_name of the field can be identified. The field length may additionally be given with the "<>" symbol, in which case the size of the field is obtained as well. If the text describes enumeration information, use the regular expression r'[\w\d]+\([\w\d\.]+\)' to match the enumeration entries, which have the form "ENUM_NAME(VALUE)"; the values are extracted in the same way, and the field to which these enumeration values belong is matched at the end of the key lines (for example, "} HandshakeType;"). Finally, set the corresponding field name, field value and field type, and add the field to the formatted field list.

Extracting the structured information in pseudo-C code form yields, at this point, the "section name", "field name", "field length" and "field type".
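A short sketch of the pseudo-C rules, assuming TLS-style presentation syntax as input. The capture groups, the optional "<...>" group and the `CLOSING_RE` helper are additions made for this example; the description only gives the uncaptured patterns. The "<...>" length is kept as raw text because the description does not say how it is converted.

```python
import re

# Capture groups and the optional "<...>" length are additions for this sketch.
FIELD_RE = re.compile(r"([\w:]+)\s+([*\w\[\]]+)(?:<([^>]*)>)?;")
ENUM_RE = re.compile(r"([\w\d]+)\(([\w\d.]+)\)")
CLOSING_RE = re.compile(r"\}\s*(\w+)\s*;")  # e.g. "} HandshakeType;"


def parse_pseudo_c(lines, section_name):
    """Parse one struct or enum block given as a list of key lines."""
    kind = lines[0].split()[0]  # first space-separated element of line 1
    if kind == "enum":
        text = " ".join(lines)
        owner = CLOSING_RE.search(text)
        return [{
            "section_name": section_name,
            "field_name": owner.group(1) if owner else None,
            "field_type": "enum",
            "field_value": dict(ENUM_RE.findall(text)),  # name -> value text
        }]
    fields = []
    for line in lines[1:]:
        match = FIELD_RE.search(line)
        if match:
            fields.append({
                "section_name": section_name,
                "field_type": match.group(1),
                "field_name": match.group(2),
                "field_size": match.group(3),  # raw "<...>" text or None
            })
    return fields
```

For instance, a block ending in "} HandshakeType;" produces a single field named HandshakeType whose value is the mapping of enumeration names to their numbers.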

c. ASN.1 form: Traverse the key line list obtained in step 3 and parse each line. First use the first line to decide whether the key line text describes nested information: if it contains one of the ASN.1 keywords "OBJECT-GROUP", "MODULE-COMPLIANCE", "OBJECT IDENTIFIER" or "OBJECT-TYPE", the description is nested. A nested description expresses a relationship between two fields. The starting line of the key line text begins with the current field name, so the field name can be obtained from it. Traverse each line; if the "SYNTAX" keyword is present, that line states the type of the current field, which is easy to extract. Finally, the terminating line of the key lines contains a description of the form "::= {UP_FIELD}"; extract the content of UP_FIELD as the parent field of the current field, set it as the section_name of the current field, and add the current field to the formatted field list. If the description is not nested, it describes one field and the fields it contains. Take the information in the starting line as the parent field of every field parsed afterwards, that is, set it as the section_name of each subsequent field. Then traverse each line; a field line has the form "FIELD_NAME FIELD_TYPE,", so the corresponding information is obtained by a simple split, giving one field, which is added to the formatted field list.

Extracting the structured information in ASN.1 form yields, at this point, the "section name", "field name" and "field type".
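The two ASN.1 cases can be illustrated with the sketch below. The function name `parse_asn1` and the dictionary keys are hypothetical, and the whole content between the braces of "::= { ... }" is kept as the parent, which is one possible reading of "the content of UP_FIELD".

```python
import re

NESTED_KEYWORDS = ("OBJECT-GROUP", "MODULE-COMPLIANCE",
                   "OBJECT IDENTIFIER", "OBJECT-TYPE")
PARENT_RE = re.compile(r"::=\s*\{\s*([^}]*?)\s*\}")


def parse_asn1(lines):
    """Parse one ASN.1 block given as a list of key lines."""
    first = lines[0].strip()
    if any(keyword in first for keyword in NESTED_KEYWORDS):
        # Nested case: one field plus a reference to its parent.
        field = {"field_name": first.split()[0],
                 "field_type": None,
                 "section_name": None}
        for line in lines:
            text = line.strip()
            if "SYNTAX" in text:
                field["field_type"] = text.split("SYNTAX", 1)[1].strip()
            parent = PARENT_RE.search(text)
            if parent:
                field["section_name"] = parent.group(1)
        return [field]
    # Non-nested case: the first line names the parent of every member.
    parent_name = first.split()[0]
    fields = []
    for line in lines[1:]:
        parts = line.strip().rstrip(",").split(None, 1)
        if len(parts) == 2:  # "FIELD_NAME FIELD_TYPE,"
            fields.append({"field_name": parts[0],
                           "field_type": parts[1],
                           "section_name": parent_name})
    return fields
```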

6. Extract automaton information

If step 3 produced an automaton key line list, traverse it and parse each line. Based on characteristic features of automaton descriptions, identify names that may denote events, states or actions and add them to the automaton's undefined names. Specifically, split the current line on spaces to obtain a word list and traverse it: a word consisting entirely of capital letters may be a state name, and a word that is a noun may be an event or action name. These words are added to the automaton's undefined names.

If the automaton key line list is empty, use the regular expressions [r'(Event\d+):(\w*)', r'(\w*)[sS]tate:', r'State\(s\):(\w+)'] to match the state names and event names of the automaton, and add them to the automaton's undefined names.
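Both branches of step 6 can be sketched as below. The function name `candidate_names` and the `is_noun` callback are assumptions for the example: the callback stands in for the part-of-speech check, which the description does not tie to a specific tool at this step, and by default it rejects every word.

```python
import re

FALLBACK_PATTERNS = [
    r"(Event\d+):(\w*)",
    r"(\w*)[sS]tate:",
    r"State\(s\):(\w+)",
]


def candidate_names(key_lines, section_lines, is_noun=lambda word: False):
    """Collect possible state, event and action names (order preserved)."""
    names = []
    if key_lines:
        for line in key_lines:
            for word in line.split():
                all_caps = word.isalpha() and word.isupper()
                if all_caps or is_noun(word):
                    names.append(word)
    else:
        # No structured automaton description: fall back to the regexes.
        for line in section_lines:
            for pattern in FALLBACK_PATTERNS:
                for match in re.finditer(pattern, line):
                    names.extend(g for g in match.groups() if g)
    return list(dict.fromkeys(names))  # drop duplicates, keep first-seen order
```

Note that the second pattern only captures a name when it is written directly before "State:" or "state:" without a space, so the input lines are assumed to be normalized accordingly.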

Text understanding stage:

Steps 5 and 6 yield the formatted field list and the automaton's undefined name list, but the six-tuple of each field is still incomplete. This stage therefore completes the fields and derives the automaton transition information. Its input is the formatted field list together with the content of the section each field belongs to; its output is the completed formatted field list or the automaton transition list. For broader matching, the section content can be extended to a larger enclosing section or to all sections of the RFC document.

7. Derive the field length from the text description

Traverse the formatted field list obtained in step 5. For the section each field belongs to, the invention first re-divides the section into sentences: all line breaks in the section are replaced with spaces so that the lines are joined into complete sentences, and a natural language processing tool then splits the text into individual sentences, which are stored with the section.

Traverse each field in the formatted field list and each line of the input section, using the field name followed by "is", ":\n" or "" as the patterns for matching a field description. If the current line contains one of these patterns, or the current line is exactly the field name, the match succeeds. The current line and the following 5 lines are then taken as the text description of the field and added to the formatted field list; any trailing incomplete sentence is discarded. The field description only serves to assist in deriving the other field information.

Next, traverse every sentence in the field's section and in the field description. If a sentence contains the field name, it is taken as the starting point of the analysis. In such sentences the field length usually appears in the form "is", "are", "length of", "length" or "of length" followed by a number. The invention takes the sentences containing these keywords as input, uses NLP techniques to analyze their parts of speech, and picks out the sentences that contain "VBZ CD" (verb + cardinal number). The word tagged CD is the length of the field. To determine the unit of the extracted length, the invention checks whether the token that follows it is one of ['-octet', '-bit', 'octet', 'bit', 'octets', 'bits', 'bytes']. Field lengths are stored in bits, so for octets and bytes the value is multiplied by 8 before being stored. If no unit follows the value, the invention assumes that it is given in bytes and likewise stores the value multiplied by 8. The length obtained in this way is then written to the corresponding entry of the formatted field list.
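The unit handling above can be illustrated with a small sketch. It deliberately swaps the part-of-speech tagging ("VBZ CD") for a plain regular expression over the listed keywords, so it is a simplified stand-in, not the NLP pipeline of the invention. The names `LENGTH_RE`, `to_bits` and `derive_length` are hypothetical.

```python
import re

# Keyword followed by a number, optionally followed by "-unit" or " unit".
LENGTH_RE = re.compile(
    r"\b(?:is|are|length of|of length|length)\s+(\d+)(-\w+|\s+\w+)?",
    re.IGNORECASE,
)


def to_bits(value, next_token):
    """Convert a length to bits based on the token that follows it."""
    unit = next_token.lower().lstrip("-")
    if unit in ("bit", "bits"):
        return value
    # octet(s), byte(s), or no recognizable unit: treated as bytes.
    return value * 8


def derive_length(sentence, field_name):
    """Return the field length in bits, or None if nothing is found."""
    if field_name.lower() not in sentence.lower():
        return None
    match = LENGTH_RE.search(sentence)
    if match is None:
        return None
    following = (match.group(2) or "").strip()
    return to_bits(int(match.group(1)), following)
```

With this sketch, a sentence such as "The Length field is 2 octets long." is stored as 16 bits, and a bare number with no unit is also multiplied by 8, following the default described above.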

8. Derive field values

When an RFC describes field values in text, it generally uses modal verbs ("MUST", "MUST NOT") or comparison phrases ("is less than", "is equal to"). The invention therefore traverses every sentence in the section; a sentence that contains the field name together with a modal verb or a comparison phrase is taken as a sentence to extract from. Part-of-speech analysis is then used to generate the part-of-speech list of the sentence. If the list contains one of the part-of-speech combinations shown in Table 1, the modal verb or comparison phrase is converted into the corresponding symbol, and the value tagged CD is extracted. Finally, a list of values is generated from the symbol. For example, for field <= 3, values = [0, 1, 2, 3].

Table 1. NLP part-of-speech analysis list
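As an illustration of turning such phrases into symbols and value lists, the sketch below matches the phrases with regular expressions instead of part-of-speech tags. The phrase table is an assumption based on the phrases named in the text ("less than or equal to" is added so that the "<=" example can be reproduced), and the `max_value` bound used to enumerate open-ended ranges is hypothetical; the contents of Table 1 are not reproduced here.

```python
import re

# (pattern, symbol) pairs; more specific phrases are listed first.
PHRASES = [
    (r"MUST NOT be(?: set to)? (\d+)", "!="),
    (r"MUST be(?: set to)? (\d+)", "=="),
    (r"is less than or equal to (\d+)", "<="),
    (r"is less than (\d+)", "<"),
    (r"is greater than (\d+)", ">"),
    (r"is not equal to (\d+)", "!="),
    (r"is equal to (\d+)", "=="),
]

CHECKS = {
    "==": lambda v, n: v == n,
    "!=": lambda v, n: v != n,
    "<": lambda v, n: v < n,
    "<=": lambda v, n: v <= n,
    ">": lambda v, n: v > n,
}


def find_constraint(sentence, field_name):
    """Return (symbol, number) for the first phrase found, else None."""
    if field_name not in sentence:
        return None
    for pattern, symbol in PHRASES:
        match = re.search(pattern, sentence)
        if match:
            return symbol, int(match.group(1))
    return None


def enumerate_values(symbol, number, max_value):
    """List all values in 0..max_value that satisfy the constraint."""
    return [v for v in range(max_value + 1) if CHECKS[symbol](v, number)]
```

For the example in the text, `enumerate_values("<=", 3, 255)` gives the list 0, 1, 2, 3.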

9. Derive the field type

There are three methods for inferring the field type. The first uses the text description obtained in step 5 to derive the type of the current field directly: if an explicit type such as "octet", "string", "boolean" or "integer" appears in the text description, the field type is set to that type. In addition, once the field name and field length have been obtained reasonably completely, the type of the current field can be inferred from these two pieces of information. The second method therefore infers the type from the field name: if the field name contains "domain name" or "owner name", the field type can be set to "dnsname"; if it contains "time" or "ttl", the type can be set to "timestamp"; and if it contains "ip" or "ipv6", the type can be set to "ip". The third method infers the type from the field length: for a length of 8, 16, 32 or 64, the field type can be set to "byte", "word", "dword" or "qword" respectively. The priority of these methods decreases from the first to the third, so inference from the field length has the lowest priority.
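The priority order of the three methods can be sketched as a single function. The function name `infer_type` is hypothetical, and matching "time", "ttl", "ip" and "ipv6" as whole words (rather than substrings) is a choice made for this example to avoid accidental hits inside longer words.

```python
import re

TEXT_TYPES = ("octet", "string", "boolean", "integer")
SIZE_TYPES = {8: "byte", 16: "word", 32: "dword", 64: "qword"}


def infer_type(field_name, description, size_bits):
    """Infer a field type; earlier rules take priority over later ones."""
    # 1. Explicit type keyword in the text description.
    lowered_description = description.lower()
    for keyword in TEXT_TYPES:
        if keyword in lowered_description:
            return keyword
    # 2. Hints in the field name.
    lowered_name = field_name.lower()
    words = set(re.findall(r"[a-z0-9]+", lowered_name))
    if "domain name" in lowered_name or "owner name" in lowered_name:
        return "dnsname"
    if words & {"time", "ttl"}:
        return "timestamp"
    if words & {"ip", "ipv6"}:
        return "ip"
    # 3. Field length in bits (lowest priority); None if nothing applies.
    return SIZE_TYPES.get(size_bits)
```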

10. Derive automaton transition information

Step 6 yields a set of undefined automaton names. These names are used as keywords to match possible automaton transition sentences in the paragraph text; each match identifies one automaton transition (original state, event, action, destination state), which is stored in the automaton transition list.

Traverse the section list obtained in step 1, join the lines of each section, and split the paragraphs into sentences with NLTK (the Natural Language Toolkit, a third-party library). The resulting natural-language sentences are then matched one by one. If a sentence contains a keyword that may indicate a state transition, such as "receive", "transfer", "change" or "remain", check whether an undefined automaton name appears both before and after the keyword. If so, the sentence describes a transition that changes the state; if a name appears on only one side, the sentence may describe a transition that does not change the state. The sentence is then split further according to English grammar, its parts are stored in the automaton transition four-tuple, and the four-tuple is added to the automaton transition list.
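The keyword test can be sketched for a single, already split sentence as follows. The function name `find_transition` and the result keys are hypothetical, and the sketch only fills the original and destination states; splitting out the event and action by English grammar is not attempted here.

```python
TRANSITION_KEYWORDS = ("receive", "transfer", "change", "remain")


def find_transition(sentence, names):
    """Look for a candidate name before and after a transition keyword."""
    lowered = sentence.lower()
    for keyword in TRANSITION_KEYWORDS:
        position = lowered.find(keyword)
        if position < 0:
            continue
        before = sentence[:position]
        after = sentence[position + len(keyword):]
        source = next((n for n in names if n in before), None)
        target = next((n for n in names if n in after), None)
        if source and target:
            # A name on both sides: the transition changes the state.
            return {"state": source, "new_state": target,
                    "changes_state": True}
        if source or target:
            # A name on one side only: possibly a self-transition.
            only = source or target
            return {"state": only, "new_state": only,
                    "changes_state": False}
    return None
```

For example, a sentence of the form "In the Connect state, the local system changes its state to OpenSent." with the names Connect and OpenSent is reported as a state-changing transition from Connect to OpenSent.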

Knowledge representation stage:

The formatted field list and the automaton transition list completed in steps 7, 8, 9 and 10 constitute the final network protocol design information. The formatted field list can be converted to JSON and written to a file for later use. Two downstream uses are currently supported, detection rule generation and fuzzing configuration file generation, as follows:
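A minimal sketch of the JSON conversion, assuming the six-tuple is held in a named tuple. The key names (`section_name`, `field_size` and so on) are illustrative; the description does not fix the JSON schema.

```python
import json
from collections import namedtuple

# Hypothetical in-memory form of the six-tuple.
Field = namedtuple(
    "Field",
    "section_name field_name field_type field_size field_value field_description",
)


def fields_to_json(fields):
    """Serialize a list of Field tuples to a JSON string."""
    return json.dumps([f._asdict() for f in fields], indent=2,
                      ensure_ascii=False)


marker = Field("4.1. Message Header Format", "Marker", None, 128, None, "")
print(fields_to_json([marker]))
```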

11. Detection rule generation

Using the field information in the formatted field list (field name, field length, field value), a series of detection rules of the form "chk_bf(cond, op)" can be generated, where "cond" is the condition of the rule, generally corresponding to the value range or length limit of a field, and "op" is the operation that is correct when the condition holds. For example, "((hold>=3||hold==0), use(hold))" means that before the hold field is used, it must be checked that its value is 0 or greater than or equal to 3.
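The rule text can be assembled from the extracted constraints with a few lines of string formatting. This is only a sketch: the function name `make_rule`, the `(operator, number)` input format, and joining alternative conditions with "||" are assumptions made for the example, following the Hold Time case in which the value must be 0 or at least 3.

```python
def make_rule(variable, alternatives):
    """Build a chk_bf rule whose condition accepts any listed alternative."""
    condition = "||".join(
        f"{variable}{operator}{number}" for operator, number in alternatives
    )
    return f"chk_bf(({condition}),use({variable}))"


print(make_rule("hold", [(">=", 3), ("==", 0)]))
```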

12. Fuzzing configuration file generation

Using the field information in the formatted field list (field name, field length, field type, field value), the invention can automatically generate test configuration files for BooFuzz (a general-purpose network protocol fuzzing framework). The mutation primitive for each field is chosen according to the field length or field type. For example, s_bytes() represents a field of several bytes; as shown in Figure 7, the first field is "Marker", and a default value shorter than the field is padded with "\x00". s_byte() represents a one-byte field; in Figure 7 the field "Optional Parameters Length" is 1 byte, its value is given by the byte string in the "value" parameter, and the values to fuzz are given by the "fuzz_values" parameter.
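The choice of primitive can be sketched as a function that emits one line of configuration text per field. The sketch only produces strings and does not import BooFuzz; the keyword arguments written into the output (`value`, `size`, `padding`, `name`) are assumed to match the installed BooFuzz version and should be checked against its documentation. The function name `primitive_for` and the dictionary keys are hypothetical.

```python
def primitive_for(field):
    """Return one line of BooFuzz-style configuration text for a field."""
    name = field["field_name"]
    bits = field["field_size"]
    if bits == 8:
        # One-byte field.
        value = field.get("field_value", 0)
        return f"s_byte(value={value!r}, name={name!r})"
    if bits > 0 and bits % 8 == 0:
        # Fixed multi-byte field, padded with \x00 up to its size.
        return (f's_bytes(value=b"", size={bits // 8}, '
                f'padding=b"\\x00", name={name!r})')
    # Variable-length (-1) or non-byte-aligned field: no fixed size given.
    return f's_bytes(value=b"", name={name!r})'


print(primitive_for({"field_name": "Marker", "field_size": 128}))
print(primitive_for({"field_name": "Optional Parameters Length",
                     "field_size": 8}))
```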

Using the automaton transition information (original state, destination state), automaton sequences for fuzzing can be generated. As shown in Figure 8, the automaton test sequence is obtained from the transition record "State Machine Record(state='Connect', event='['CollisionDetectEstablishedState',…,'Event 12','DelayOpenTimer_Expires']', action='None', new_state='OpenSent')".

Embodiment:

An embodiment is described below, taking the extraction of network protocol design knowledge from RFC 4271 (the BGP protocol) as an example.

In step 1, RFC 4271 is preprocessed to remove unimportant information such as redundant page headers and footers, giving the preprocessed file "rfc4271.txt.preprocessed". Section matching is then performed on the preprocessed file to divide it into sections, giving the section list "sections".

In step 2, the section list "sections" obtained in step 1 is traversed. A diagram-form message structure description, as shown in Figure 9, is recognized in section "4.1. Message Header Format" of RFC 4271, so the structured description style of the RFC 4271 message structure is set to diagram form and the traversal ends.

In step 3, no structured description style is recognized for the automaton information, so it is left empty.

In step 4, based on the diagram description style obtained in step 2, step 3.a is entered to extract the structured descriptions in diagram form: the starting line and terminating line of each structured description are identified, the section it belongs to is recorded, and the result is added to the message-structure key line list.

In step 5, step 5.a is first selected according to the RFC structured description style in order to extract the key information from the message-structure key line list described in diagram form. The key line list is traversed, each line is parsed, and the resulting field information is stored in the formatted field list.

In step 6, because the structured description style of the automaton information is empty, the more general regular expressions [r'(Event\d+):(\w*)', r'(\w*)[sS]tate:', r'State\(s\):(\w+)'] are applied directly while traversing the key lines to match the state names and event names of the automaton, which are added to the automaton's undefined names.

In step 7, the formatted field list obtained in step 5 is traversed, and the field length is further inferred from the text description in the field information and from the section the field belongs to. The following example shows the text description and the corresponding field as displayed during inference: "[+]Field:Error Code Description:Error Code:This 1-octet unsigned integer indicates the type of NOTIFICATION."

In step 8, the field values are inferred: NLP part-of-speech analysis is used to extract the numerical values from the text description as candidate values for the field.

In step 9, the type of each formatted field is inferred according to the priority of the three inference methods, and the field type is updated. For example, in "[+]Current fieldname:Withdrawn Routes,field size:-1\n[+]Field:Withdrawn Routes Description:Withdrawn Routes:This is a variable-length field that contains a list of IP address prefixes for the routes that are being withdrawn from service.", the text description of the field "Withdrawn Routes" contains keywords such as "IP address", so the field is considered to describe IP addresses and its field type is set to "ip"; its length is "-1", which indicates multiple IP addresses.

In step 10, the undefined automaton names obtained in step 6 are used to traverse the sections, find descriptions of the automaton, identify automaton transitions and add them to the automaton transition list. The following is an example of a transition being added to the automaton transition list: "[+]Added unchanged state machine transition:StateMachine Record(state='Established', event='['CollisionDetectEstablishedState']', action='None', new_state='Established')".

In steps 11 and 12, the formatted field list obtained in steps 7, 8 and 9 and the automaton transition list obtained in step 10 are used to generate the corresponding JSON files, and detection rules and fuzzing configuration files are then generated as needed according to the intended application. Figure 10 shows an example of the fuzzing configuration file obtained from RFC 4271:

An automaton test sequence as shown in Figure 10 can be generated in the same way.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit it. Appropriate modifications or equivalent substitutions made to the technical solution of the invention by those of ordinary skill in the art shall fall within the scope of protection of the invention, which is defined by the claims.

Claims

1. An automatic network protocol design knowledge extraction method based on RFC documents is characterized by comprising the following steps:

based on a txt format of an RFC document, acquiring a chapter list of the RFC document and a row list in each chapter;

obtaining structured descriptions in the chapter list and the line list to generate a key line list; wherein the structured description comprises: message structure and/or automaton;

Analyzing the structured description in the key row list to obtain a formatted field list containing partial information and/or an undefined name list of the automaton; wherein the information of the formatted field list includes: chapter name, field type, field length, field value and field description;

under the condition that the structural description is a message structure, updating the field length, the field value and the field type based on the chapter content of the corresponding key row of the structural description so as to obtain a formatted field list containing complete information;

under the condition that the structured description is an automaton, acquiring chapter contents to which a corresponding key row of the structured description belongs, matching the chapter contents by using names in the undefined name list, and storing the acquired automaton migration information into an automaton migration list;

and integrating the formatted field list containing the complete information and the automaton migration list to obtain a network protocol design knowledge extraction result.

2. The method of claim 1, wherein the message structure comprises: one or more of a diagram (image-text) form, a pseudo-C code form, and an ASN.1 form.

The obtaining the structured descriptions in the chapter list and the row list includes:

when the message structure is in an image-text form, after the initial line of the structural description is matched and identified by using a regular expression r '- + $';

when the message structure is in a pseudo-C code form, a regular expression r ' +\s is used (struct|enum|typef struct) \s + \w \s \{ ' to match and identify a starting line, and the regular expression r ' \ [); ] x; ' the match identifies the ending line and the curly brackets between the starting line and ending line are correctly closed;

when the message structure is in an ASN.1 form, the regular expressions r'\w+\s+(OBJECT-GROUP|MODULE-COMPLIANCE|OBJECT IDENTIFIER|OBJECT-TYPE|SEQUENCE)' and r'\w+\s+\w+\s{' are used for matching and identifying the starting line, the line list is continuously traversed, and when the current line ends with a right bracket and the next line is blank, the current line is taken as the terminating line.

3. The method of claim 2, wherein parsing the structured description in the critical row list to obtain a formatted field list containing partial information comprises:

for a message structure in an image-text form, obtaining the key rows that contain a "|" symbol or a ":" symbol, and cutting each key row using the corresponding "|" symbol or ":" symbol as a separator so as to obtain a plurality of fields and the field names of the fields;

based on the variable length field symbol contained in the field and the space placeholder of the field contained in the field name in the diagram, obtaining the field length of each field, and storing the field length into a formatted field list; if the "()" symbol in the field name contains field length information, updating the field length by using the field length information, and storing the updated field length into a formatted field list;

after the field names are formatted and updated, storing the field names into a formatted field list; the formatting update includes: removing redundant blank characters in the field names, removing extra information contained in "()" symbols and "[ ]" symbols, and restoring abbreviations contained in the field names into full word descriptions;

acquiring a chapter name corresponding to the key row and storing the chapter name into a formatting field list;

and/or,

aiming at a message structure in a pseudo-C code form, judging the information type described by the current line based on the value of the first element after the current line in the key line list; the information types include: structured information or enumeration information;

if the information type described by the current line is structured information, using the regular expression r'[\w\:]+\s+[\*\w\d\[\]]+;' to match the lines describing field information, obtaining the field length where it is given with the "< >" symbol, and obtaining the chapter name corresponding to the field information, so that the field name, the field type, the field length and the chapter name are respectively acquired and stored into a formatted field list;

if the information type described by the current line is enumeration information, using the regular expression r'[\w\d]+\([\w\d.]+\)' to match the lines describing field information so as to obtain field value information, and, after obtaining a field name, a field type and a chapter name by matching the field corresponding to the field value information, storing them into the formatted field list;

and/or,

aiming at the message structure in the ASN.1 form, obtaining a field type based on a SYNTAX keyword in a key row list, and storing the field type into a formatted field list;

judging whether text description of the key line is nested information or not based on information of a first line in the key line list;

if the information is nested information, obtaining the field name from the starting line of the key line text, which begins with the current field name, obtaining the upper-level field name from the content of UP_FIELD in the "::= {UP_FIELD}" description present in the terminating line of the key lines, and storing the field name and the upper-level field name into a formatted field list;

if the information is not nested information, using the information of the first row as the upper-level field name of each subsequent field, obtaining the field name from the form "FIELD_NAME FIELD_TYPE," when each subsequent key row is traversed, and storing the field name and the upper-level field name in a formatted field list;

and acquiring the chapter names corresponding to the key rows and storing the chapter names into a formatting field list.

4. The method of claim 2, wherein updating the field length comprises:

acquiring chapters corresponding to the chapter names, and re-dividing the chapter contents according to sentences to obtain updated chapter contents;

generating, from the field name, a pattern for matching the field description, matching the pattern against the updated chapter contents, and taking the current line and a plurality of lines after the current line as the field description after successful matching;

matching the updated chapter content according to the field names, and performing part-of-speech analysis on sentences containing the field names under the condition that the matching is successful to obtain sentences containing a verb and a cardinal number;

taking the word corresponding to the cardinal number as the numerical value of the field length, and obtaining a unit of the numerical value based on the token that follows the cardinal number;

and updating the field length in the formatted field list according to the value of the field length and the unit of the value.

5. The method of claim 2, wherein the updating the field value comprises:

matching the updated chapter content according to the field names, and judging whether a modal verb or a comparison relation word exists in the sentence under the condition of successful matching;

in the case of the presence of a modal verb or a comparison relation word, performing part-of-speech analysis on the sentence containing the field name to obtain a sentence containing "MUST (NOT) be (set to) NUM", "less than NUM", "greater than NUM", "is equal to NUM" or "not equal to NUM";

converting the modal verbs or the comparison relation words into corresponding symbols, and extracting the values NUM from "MUST (NOT) be (set to) NUM", "less than NUM", "greater than NUM", "is equal to NUM" or "not equal to NUM";

and updating the field value in the formatted field list based on the symbol and the value.
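The conversion of modal verbs and comparative words into symbols might look like the sketch below. The symbol set, the regular expressions, and the function name are assumptions chosen for illustration, and plain pattern matching replaces the part-of-speech analysis.

```python
import re

# Phrase forms listed in the claim, each paired with an assumed symbol.
CONSTRAINT_PATTERNS = [
    (re.compile(r"\bMUST NOT be(?: set to)?\s+(\d+)", re.I), "!="),
    (re.compile(r"\bMUST be(?: set to)?\s+(\d+)", re.I), "=="),
    (re.compile(r"\bis not equal to\s+(\d+)", re.I), "!="),
    (re.compile(r"\bis equal to\s+(\d+)", re.I), "=="),
    (re.compile(r"\bless than\s+(\d+)", re.I), "<"),
    (re.compile(r"\bgreater than\s+(\d+)", re.I), ">"),
]


def extract_constraint(field_name, sentence):
    """Return (symbol, value) for a sentence that names the field, else None."""
    if field_name.lower() not in sentence.lower():
        return None
    for pattern, symbol in CONSTRAINT_PATTERNS:
        match = pattern.search(sentence)
        if match:
            return symbol, int(match.group(1))
    return None
```

Negated comparisons such as "MUST NOT be greater than NUM" are not handled by this sketch; they would need the modal verb and the comparative word to be combined into a single symbol.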

6. The method of claim 2, wherein the updating the field type comprises:

deducing the field type and updating the formatted field list based on field description information in the structured description, the field description information comprising the field name, the field type and the field length of the protocol message;

or,

deducing the field type and updating the formatted field list according to the field name and the field length in the formatted field list;

or,

deducing the field type and updating the formatted field list according to the field names in the formatted field list;

or,

deducing the field type and updating the formatted field list according to the field length in the formatted field list.
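One possible way to combine the four alternatives is an ordered fallback, sketched below. The claim lists them only as alternatives, so the ordering, every keyword rule, the width threshold, and the returned type labels are hypothetical and serve only to show the shape of such a deduction.

```python
def infer_field_type(name=None, length_bits=None, declared_type=None):
    """Try each information source in turn; all rules are illustrative."""
    # 1. The structured description already states the type.
    if declared_type:
        return declared_type
    lowered = name.lower() if name else ""
    # 2. Field name and field length together (hypothetical rule).
    if name and length_bits is not None:
        if "addr" in lowered and length_bits == 32:
            return "ipv4_address"
    # 3. Field name alone (hypothetical keyword rules).
    if name:
        if "flag" in lowered:
            return "bitfield"
        if "length" in lowered or "count" in lowered:
            return "unsigned_int"
    # 4. Field length alone (hypothetical width rule).
    if length_bits is not None:
        return "unsigned_int" if length_bits <= 64 else "bytes"
    return None
```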

7. The method of claim 1, wherein the form of the automaton comprises one or more of a text list and a dotted line graph.

Obtaining the structured descriptions in the chapter list and the row list to generate the key row list includes:

in the case of the automaton form being a text list, if the current line matches one of the regular expressions [r'State\s+event\s+action\s+New State', r'\s State\s\I'], identifying the current line as the start line; and, in the subsequent traversal, if the current line is blank and the start line is not None, identifying the current line as the terminating line;

in the case of the automaton form being a dotted line graph, if the current line matches one of the regular expressions [ r ' < - [/i + ] + - - - >, r ' < - - - [ -/i + ] + $ ' ], identifying the current line as the start line; and, in the subsequent traversal, if the current line is blank and the start line is not None, identifying the current line as the terminating line.
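The start-line and terminating-line detection for the text-list form can be sketched as a single pass over the lines. Only the first regular expression is taken from the claim text; the `re.IGNORECASE` flag, the function name, and the decision to return the header row together with the table body are assumptions.

```python
import re

# First pattern as quoted in the claim; IGNORECASE is an added assumption.
START_PATTERNS = [
    re.compile(r"State\s+event\s+action\s+New State", re.IGNORECASE),
]


def find_key_rows(lines):
    """Return the rows from the start line up to (not including) the
    first blank line that follows it; empty list if no start line."""
    start = None
    for index, line in enumerate(lines):
        if start is None:
            if any(p.search(line) for p in START_PATTERNS):
                start = index              # start line identified
        elif not line.strip():
            return lines[start:index]      # blank line terminates the table
    return lines[start:] if start is not None else []
```

Tracking `start` as `None` until a header is seen mirrors the "start line is not None" condition in the claim: blank lines before the table are ignored, and only a blank line after the header ends it.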

8. The method of claim 7, wherein parsing the structured descriptions in the key row list to obtain the undefined name list of the automaton comprises:

cutting the key rows by using the spaces to obtain a word list;

traversing the word list;

if all-capitalized words or words whose part of speech is a noun can be obtained, adding the all-capitalized words and the words whose part of speech is a noun to the undefined name list;

if no all-capitalized word and no word whose part of speech is a noun can be obtained, using the regular expressions [ r '(event\d+): (\w)', r '(\w) [ sS ] State:', r 'state\s (s\w+)' ] to match the state names and event names of the automaton, and adding the matched names to the undefined name list.
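A minimal sketch of the name collection, assuming that the noun test is left out (it would need a part-of-speech tagger) and that a single hypothetical fallback pattern stands in for the regular expressions quoted in the claim. The function name and both patterns are illustrative.

```python
import re

ALL_CAPS_RE = re.compile(r"[A-Z][A-Z0-9_]+")
# Hypothetical fallback; it is not one of the patterns quoted in the claim.
STATE_FALLBACK_RE = re.compile(r"\bstate\s+(\w+)", re.IGNORECASE)


def collect_undefined_names(key_rows):
    """Collect candidate state/event names, keeping first-seen order."""
    names = []
    for row in key_rows:
        # Cut the row on spaces, then trim surrounding punctuation.
        words = [w.strip(".,:;()|") for w in row.split(" ")]
        found = [w for w in words if ALL_CAPS_RE.fullmatch(w)]
        if not found:
            # No all-capitalized word in this row: fall back to the regex.
            found = STATE_FALLBACK_RE.findall(row)
        for name in found:
            if name not in names:
                names.append(name)
    return names
```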

9. The method of claim 8, wherein, in the case that the structured description is an automaton, acquiring the chapter content to which the key rows corresponding to the structured description belong, matching the chapter content against the names in the undefined name list, and storing the obtained automaton transition information in the automaton transition list comprises:

acquiring chapter contents to which the corresponding key rows of the structured description belong;

dividing the chapter content again according to sentences to obtain updated chapter content;

if a sentence containing a keyword representing a state transition exists in the updated chapter content, and the part before the keyword and the part after the keyword each contain a name in the undefined name list, judging that the sentence contains automaton transition information that changes the state, dividing the sentence into an automaton transition information quadruple, and storing the quadruple in the automaton transition list; the automaton transition information quadruple comprises: original state, event, action, and destination state;

if a sentence containing a keyword representing a state transition exists in the updated chapter content, and the part before the keyword or the part after the keyword contains a name in the undefined name list, judging that the sentence contains automaton transition information that does not change the state, dividing the sentence into an automaton transition information quadruple, and storing the quadruple in the automaton transition list.
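The two sentence-level cases can be sketched as below. The keyword list, the function name, and the substring-based name lookup are assumptions; the claim does not say how the event and the action are separated out of the sentence, so the sketch leaves both as `None`, and it models a transition that does not change the state by repeating the same state as origin and destination.

```python
import re

# Assumed transition keywords; the claim does not enumerate them.
TRANSITION_RE = re.compile(
    r"\b(?:changes to|moves to|transitions to|enters)\b", re.IGNORECASE
)


def extract_transition(sentence, names):
    """Return (original_state, event, action, destination_state) or None."""
    match = TRANSITION_RE.search(sentence)
    if not match:
        return None
    before, after = sentence[:match.start()], sentence[match.end():]
    src = next((n for n in names if n in before), None)
    dst = next((n for n in names if n in after), None)
    if src and dst:
        return (src, None, None, dst)      # state-changing transition
    if src or dst:
        same = src or dst
        return (same, None, None, same)    # state is left unchanged
    return None
```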

10. The method of any one of claims 1-9, further comprising:

and generating detection rules and/or fuzz testing configuration files based on the network protocol design knowledge extraction result.