CN112463804B

CN112463804B - A Data Processing Method of Image Database Based on KDTree

Info

Publication number: CN112463804B
Application number: CN202110139298.4A
Authority: CN
Inventors: 王浩; 秦拯; 陈嘉欣; 欧露
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2021-02-02
Filing date: 2021-02-02
Publication date: 2021-06-15
Anticipated expiration: 2041-02-02
Also published as: CN112463804A

Abstract

The invention discloses a KDTree-based image database data processing method, comprising the following steps: step 1, traversing and integrating map annotation information based on a KDTree to obtain an annotation set S={s1,s2,...,sn}; step 2, performing sensitive information detection based on word similarity on the annotation set S and grading the sensitivity of the map annotation content; step 3, performing corresponding desensitization processing according to the sensitivity level of the map annotation content. The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedious work, low efficiency, proneness to error, and omissions of existing manual processing.

Description

A Data Processing Method of Image Database Based on KDTree

Technical Field

The invention relates to the field of geographic information data processing, and in particular to a KDTree-based image database data processing method.

Background Art

In recent years, the rapid development of computer technology has driven the development of geographic information system technology. A Geographic Information System (GIS) draws on many disciplines, including geography, surveying and mapping, and computer science and technology. Using the computer as a tool and geospatial data as the object of study, it integrates the distinctive visualization and geographic analysis functions of maps with database operations to provide decision-making information for many industries and departments, such as geography, planning, and management.

At present, with the development of the mobile Internet, demand for geographic-information-based services in daily travel keeps growing. The wide use of electronic maps brings convenience to work and life, but it also introduces security risks. Among these, protecting the security of map annotation information is a problem worth studying. Map annotation information may contain sensitive content, such as national strategic resources, restricted military areas, and national defense facilities. In response, the state has issued a number of laws, regulations, and policies, such as the Supplementary Provisions on the Representation of Public Map Content (Trial) and the Provisions on the Public Representation of Basic Geographic Information (Trial), which specify what may and may not be shown on public maps and strengthen the protection of geographic information security at the legal and policy level. Institutions currently use intranet isolation to protect data, and require that sensitive data on the intranet be desensitized before geographic information is published to the external network.

In real scenarios, the annotation content on a map is stored in an attribute table. To make the rendering attractive and complete, some annotations are composed of multiple single characters; for example, a place name is represented in the attribute table by several records, each holding a single character. As a result, a keyword search on the map easily misses part of the annotation content. In this situation, existing desensitization work has to rely on manually inspecting the annotation content of each area of the map and then reviewing, identifying, and processing it. This approach still misses content, and because maps at the city level and above cover large areas with complex content, manual processing is tedious and inefficient. There is therefore an urgent need for a computer-based method that integrates the annotation information in a map and desensitizes it, which is of great significance for maintaining geographic information security.

For annotation content composed of multiple single characters, the common approach is lexical analysis: the received string of consecutive characters is segmented into individual words, and the resulting words are matched against a sensitive-word thesaurus to detect sensitive information. However, routine add, delete, or modify operations on the characters of an annotation may leave those characters out of order or duplicated in the attribute table, and simple word segmentation cannot handle this case.

Although the annotation content is arranged irregularly in the attribute table, every character on the map is associated with a position coordinate, and the characters of an annotation to be integrated are strongly correlated in their spatial distribution: they are closely spaced, run from top to bottom or from left to right, and lie almost on a straight line. The invention uses the position information of map annotations to traverse and integrate the annotation content in geospatial data and to desensitize sensitive information.

Glossary: KDTree (k-dimensional tree): a tree data structure that stores instance points in k-dimensional space so that they can be retrieved quickly.

jieba: a very popular open-source Chinese word segmentation package, characterized by high performance, accuracy, and extensibility. It mainly supports Python, and versions exist for other languages.

word2vec: a group of related models used to produce word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and predicts the words at adjacent positions; under the bag-of-words assumption in word2vec, the order of the words is not important. After training, the word2vec model maps each word to a vector, taken from the hidden layer of the neural network, that can be used to represent the relationships between words.

Summary of the Invention

The invention proposes an image database data processing method based on a KDTree (k-dimensional tree). The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedious work, low efficiency, proneness to error, and omissions of existing manual processing.

To achieve the above object, the technical solution of the invention is as follows:

A KDTree-based image database data processing method, comprising the following steps:

Step 1. Traverse and integrate the map annotation information based on a KDTree to obtain an annotation set S={s₁,s₂,…,s_n} composed of n annotations, where s_n denotes the n-th annotation;

Step 2. Perform sensitive information detection based on word similarity on the annotation set S, and grade the sensitivity of the map annotation content;

Step 3. Perform corresponding desensitization processing according to the sensitivity level of the map annotation content.

As a further improvement, step 1 comprises the following steps:

1.1 Extract the position coordinates on the map and the character content of all single characters in the map's attribute table to form a data set, and construct a KDTree from the two-dimensional coordinates of each character;

1.2 Sort all single characters in descending order of ordinate to obtain a preliminary queue Q; create a flag array vis[] that records whether each character in Q has been processed, initialized to 0; traverse Q until it is empty;

1.3 Process the characters one by one in their order in the preliminary queue Q:

If the current character point p has not been processed, that is, vis[p]=0, execute step 1.4 and set vis[p] to 1;

If the current character point p has been processed, that is, vis[p]=1, skip to the next point in Q;

1.4 In the constructed KDTree, find the points whose distance from the current character point p lies within [0, ε] to obtain the neighbor node set of p, where ε is the integration range parameter, taken as 1.5 to 2 times the character width of the current character point p; in the neighbor node set, search in order of increasing distance from p for a character point q that satisfies the integration condition with the current character point p; if one is found, replace the current character point with q and set vis[q] to 1; the character point q is the point corresponding to the character q in the KDTree;

1.5 Repeat step 1.4 until no character point in the neighbor node set can be integrated with the current character point; the character points integrated together then form one annotation;

1.6 When no character point in the neighbor node set can be integrated with the current character point, process the next unprocessed character in the order of the preliminary queue Q;

1.7 Repeat steps 1.3 to 1.6 until all characters in the preliminary queue Q have been processed, yielding the annotations on the map.
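The queue-and-flag logic of steps 1.2 to 1.7 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Char` record, the brute-force `neighbors` helper (standing in for the KDTree range query of step 1.4), and the `can_merge` callback (which by default accepts any unprocessed neighbor) are hypothetical names introduced here.

```python
import math
from dataclasses import dataclass

@dataclass
class Char:
    text: str     # character content from the attribute table
    x: float      # abscissa on the map
    y: float      # ordinate on the map
    width: float  # character width, used to derive epsilon

def neighbors(chars, p, eps):
    """Indices of characters within distance eps of chars[p], nearest first.
    Brute force for brevity; a KDTree range query would replace this."""
    found = []
    for q, c in enumerate(chars):
        if q != p:
            d = math.hypot(c.x - chars[p].x, c.y - chars[p].y)
            if d <= eps:
                found.append((d, q))
    return [q for _, q in sorted(found)]

def integrate(chars, eps_factor=1.5, can_merge=lambda field, q: True):
    """Greedy chaining: returns a list of annotations, each a list of indices."""
    # Step 1.2: preliminary queue Q, descending ordinate, plus the vis[] flags.
    queue = sorted(range(len(chars)), key=lambda i: -chars[i].y)
    vis = [0] * len(chars)
    annotations = []
    for start in queue:                      # step 1.3
        if vis[start]:
            continue
        vis[start] = 1
        field, p = [start], start
        while True:                          # steps 1.4 and 1.5
            eps = eps_factor * chars[p].width
            nxt = next((q for q in neighbors(chars, p, eps)
                        if not vis[q] and can_merge(field, q)), None)
            if nxt is None:
                break                        # step 1.6: chain is complete
            vis[nxt] = 1
            field.append(nxt)
            p = nxt                          # q becomes the current point
        annotations.append(field)
    return annotations
```

With five characters laid out as two horizontal rows, `integrate` groups the upper row into one annotation and the lower row into another; a stricter `can_merge` would add the straight-line and spacing checks.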

As a further improvement, the integration condition in step 1.4 is as follows:

The character point q has not been processed, that is, vis[q]=0;

Case 1. When the integrated field contains only the current character point p, p is integrated with the character point q in the neighbor node set that is closest to p; when the neighbor node set contains only p, the character point p forms an annotation by itself;

Case 2. When the integrated field contains two or more characters, that is, when a field composed of multiple characters is to be integrated with the character point q, determine whether all characters in the new field s formed by q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between each pair of adjacent characters satisfies:

$$R = \max_{1 \le i < \mathrm{Len}} d_{i,i+1} - \min_{1 \le i < \mathrm{Len}} d_{i,i+1} \le \gamma$$

where Len is the number of characters in the new field s, $d_{i,j}$ is the distance between the i-th and the j-th character points in the new field s, max and min take the maximum and minimum values respectively, and γ is 0.2 to 0.5 times the character width in the new field s;

If both of the above conditions are satisfied, the field composed of multiple characters is integrated with the character point q; if at least one is not satisfied, the field is not integrated with the character point q.
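The two conditions of Case 2 can be checked as in the sketch below. The text does not say how "on the same straight line" is decided, so the perpendicular-distance tolerance `line_tol_factor` is an assumption, as are the function names and the use of a non-strict comparison against γ.

```python
import math

def adjacent_range(points):
    """Range R: largest minus smallest distance between consecutive points."""
    d = [math.dist(points[i], points[i + 1]) for i in range(len(points) - 1)]
    return max(d) - min(d)

def collinear(points, tol):
    """True if every point lies within tol of the line through the first
    and last point (an assumed way to test 'same straight line')."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        return False
    return all(abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) / length <= tol
               for x, y in points)

def can_integrate(field, q, char_width, gamma_factor=0.3, line_tol_factor=0.2):
    """Case 2: may point q be appended to a field of two or more points?
    gamma is taken as gamma_factor * char_width (0.2 to 0.5 in the text)."""
    new_field = field + [q]
    gamma = gamma_factor * char_width
    return (collinear(new_field, line_tol_factor * char_width)
            and adjacent_range(new_field) <= gamma)
```

For a field at (0, 0) and (1.2, 0) with character width 1, a candidate at (2.4, 0) is accepted, while candidates at (3.4, 0) (uneven spacing) or (2.4, 1.0) (off the line) are rejected.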

As a further improvement, duplicate characters are eliminated before integration: if the character boxes of p and q intersect and their character content is identical, p and q are duplicates, and q is deleted from the attribute table.
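A small sketch of this de-duplication rule follows. It assumes axis-aligned character boxes given as `(xmin, ymin, xmax, ymax)` and treats touching boxes as intersecting; both choices, and the function names, are illustrative assumptions.

```python
def boxes_intersect(a, b):
    """Axis-aligned boxes as (xmin, ymin, xmax, ymax); touching counts."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def deduplicate(records):
    """records: list of (text, box). Keeps the first record of each group of
    duplicates (same text, intersecting boxes) and drops the later ones."""
    kept = []
    for text, box in records:
        if any(text == t and boxes_intersect(box, b) for t, b in kept):
            continue  # duplicate of an earlier record
        kept.append((text, box))
    return kept
```

Two records with the same character but far-apart boxes are both kept, since only overlapping copies count as duplicates.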

As a further improvement, in step 1.7, for a horizontally distributed annotation, the characters in the annotation are arranged from left to right in ascending order of abscissa.
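The ordering rule can be illustrated as below. Deciding that an annotation is "horizontally distributed" by comparing its extent in x with its extent in y is an assumption of this sketch, as is the top-to-bottom ordering used for the remaining case.

```python
def order_annotation(chars):
    """chars: list of (text, x, y). Returns the annotation text in reading order."""
    xs = [c[1] for c in chars]
    ys = [c[2] for c in chars]
    if max(xs) - min(xs) >= max(ys) - min(ys):
        # Treated as horizontal: ascending abscissa, i.e. left to right.
        ordered = sorted(chars, key=lambda c: c[1])
    else:
        # Treated as vertical: descending ordinate, i.e. top to bottom.
        ordered = sorted(chars, key=lambda c: -c[2])
    return "".join(c[0] for c in ordered)
```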

As a further improvement, step 2 comprises the following steps:

2.1 Segment the annotation content using Chinese word segmentation:

Each annotation s_i in the annotation set S is converted into multiple word vectors using Chinese word segmentation and word vector construction; segmentation yields s_i={a₁,a₂,…,a_m}, where a₁…a_m are the m feature words obtained;

2.2 Convert the feature words and sensitive words into word vectors:

Use word2vec to convert the feature words into word vectors, denoting the word vector of the j-th feature word a_j as A_j; likewise, convert every sensitive word b_k in the sensitive-word thesaurus into a word vector, denoted B_k; the degree of similarity between a feature word and a sensitive word is quantified as the similarity between their vectors, namely the cosine of the angle between the two vectors in the inner product space, with values in [0,1]; the closer the similarity is to 1, the more similar the two words:

$$\mathrm{sim}(A_j, B_k) = \cos\langle A_j, B_k\rangle = \frac{A_j \cdot B_k}{\lVert A_j\rVert\,\lVert B_k\rVert}$$

where $\mathrm{sim}(A_j, B_k)$ denotes the similarity between A_j and B_k;
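The similarity used in step 2.2 is the ordinary cosine similarity, sketched below in plain Python. The vectors in the usage example are toy values, not word2vec output, and the function name is introduced here for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy 2-dimensional vectors standing in for word vectors.
same = cosine_similarity([1.0, 0.0], [1.0, 0.0])        # identical direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated direction
```

For arbitrary vectors the cosine can be negative, so an implementation that relies on the stated [0,1] range may need to clamp negative values to 0.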

2.3 Compute the sensitivity of the feature words:

In the sensitive-word thesaurus, each sensitive word b_k has a sensitivity level $L_k$; the larger the value of $L_k$, the more sensitive the word. Traverse the thesaurus and, for a feature word a_j, define its maximum sensitivity as

$$T_{\max}(a_j) = \max_{b_k \in D}\ \mathrm{sim}(A_j, B_k)\cdot L_k$$

where $D$ denotes the sensitive-word thesaurus: the product of the similarity and the sensitivity level is computed for every sensitive word vector against the feature word vector, and the maximum of these products is the maximum sensitivity of the feature word;

Set a threshold parameter θ. The feature word a_j is considered sensitive only when $T_{\max}(a_j)$ is greater than θ; otherwise it is not considered a sensitive word and its sensitivity is recorded as 0. The sensitivity of the feature word is therefore

$$T(a_j) = \begin{cases} T_{\max}(a_j), & T_{\max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$
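The thresholded maximum described in step 2.3 can be written compactly. In this sketch the similarities are assumed to be precomputed, so the input is a list of (similarity, level) pairs for one feature word against every sensitive word; the function name and the numbers in the example are hypothetical.

```python
def feature_sensitivity(pairs, theta):
    """pairs: iterable of (similarity, level), one per sensitive word.
    Returns the largest similarity * level if it exceeds theta, else 0.0."""
    t_max = max((sim * level for sim, level in pairs), default=0.0)
    return t_max if t_max > theta else 0.0

# A feature word close to a level-3 sensitive word is kept as sensitive;
# a word that is only weakly similar falls below the threshold.
strong = feature_sensitivity([(0.9, 3), (0.4, 1)], theta=1.0)
weak = feature_sensitivity([(0.2, 3), (0.4, 1)], theta=1.0)
```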

2.4 Compute the sensitivity of the annotation content:

Define the sensitivity $T(s_i)$ of the i-th annotation $s_i$ as:

$$T(s_i) = \sum_{j=1}^{m} T(a_j)$$

where the annotation $s_i$ contains m feature words, $a_j$ is the j-th feature word of $s_i$, and $T(s_i)$, the sensitivity of the annotation $s_i$, is the sum of the sensitivities of its m feature words; once the sensitivities of the feature words and of the annotation content have been obtained, the annotation content is divided by sensitivity into four levels: high, medium, low, and non-sensitive.
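Summing the feature-word sensitivities and mapping the result to the four levels might look as follows. The text does not give the boundaries between levels, so the cutoffs `low`, `medium`, and `high` below are purely hypothetical defaults, as are the function names.

```python
def annotation_sensitivity(word_sensitivities):
    """Sum of the sensitivities of the feature words of one annotation."""
    return sum(word_sensitivities)

def grade(score, low=1.0, medium=2.0, high=3.0):
    """Map a sensitivity score to one of four levels (hypothetical cutoffs)."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    if score >= low:
        return "low"
    return "non-sensitive"
```

In practice the cutoffs would be tuned against the sensitivity levels assigned in the sensitive-word thesaurus.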

As a further improvement, step 3 comprises the following steps:

3.1 Build a whitelist of geographic annotation information; each time non-sensitive data misidentified by the algorithm is found manually, add it to the whitelist to improve fault tolerance;

3.2 After the sensitive data has been filtered through the whitelist, annotation content at the high sensitivity level is deleted directly; for annotations at the medium and low sensitivity levels, the sensitive feature words are extracted and processed with a desensitization method chosen at random from deletion, replacement, and generalization, after which the sensitivity of the desensitized annotation is recomputed; if the annotation still does not meet the requirements for publication after a preset number of iterations, it is deleted entirely. For replacement, the non-sensitive word in the non-sensitive thesaurus that is most similar to the current sensitive feature word is chosen as the substitute; generalization abstracts the specific content described by the annotation so that the description covers more non-sensitive information.
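The control flow of steps 3.1 and 3.2 can be sketched as a small loop. Everything named here is an assumption made for illustration: `word_score` returns a non-negative sensitivity per word, `replace` and `generalize` are caller-supplied functions, and an annotation counts as publishable when its summed score is below `limit`.

```python
import random

def desensitize(words, word_score, replace, generalize, level,
                whitelist=frozenset(), limit=1.0, max_iters=5, rng=random):
    """Returns the desensitized word list, or None if the annotation is deleted."""
    if "".join(words) in whitelist:
        return list(words)            # known false positive: publish unchanged
    if level == "high":
        return None                   # high sensitivity: delete directly
    words = list(words)
    for _ in range(max_iters):
        if sum(word_score(w) for w in words) < limit:
            return words or None      # publishable (None if nothing is left)
        # Treat the highest-scoring word as the sensitive feature word.
        i = max(range(len(words)), key=lambda k: word_score(words[k]))
        op = rng.choice(("delete", "replace", "generalize"))
        if op == "delete":
            del words[i]
        elif op == "replace":
            words[i] = replace(words[i])
        else:
            words[i] = generalize(words[i])
    # Still too sensitive after the preset number of iterations: delete.
    return words if words and sum(word_score(w) for w in words) < limit else None
```

Passing a seeded `random.Random` instance as `rng` makes the random choice of method reproducible, which helps when auditing results.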

Advantages of the invention:

The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedious work, low efficiency, proneness to error, and omissions of existing manual processing.

Brief Description of the Drawings

Fig. 1 is the workflow diagram of the invention.

Detailed Description

As shown in Fig. 1, a KDTree-based image database data processing method comprises the following steps:

(1) Traverse and integrate map annotation information based on a KDTree;

(2) Detect sensitive information based on word similarity;

(3) Desensitize based on feature word sensitivity.

The details are as follows:

(1) A method for traversing and integrating map annotation information based on a KDTree:

The method exploits the spatial correlation of map annotation information. Considering the points formed by all single characters, it greedily assumes that, within a certain range, the point closest to the current character's point satisfies the condition for the two to be integrated into a word; when that point does not satisfy the condition, other points within the range are considered. The procedure is as follows:

Step 1: Extract the position coordinates on the map and the character content of all single characters in the attribute table, and construct a KDTree from the two-dimensional coordinates of each character. A KDTree is a data structure that partitions k-dimensional data space and is commonly used for range search and nearest-neighbor search.

Step 2: Because people usually read from top to bottom and from left to right, all single characters are sorted in descending order of ordinate before the annotation content is integrated, so that characters visited later are appended in order to the field integrated so far; this gives the preliminary queue Q. Create a flag array vis[] that records whether each character has been processed, initialized to 0. Traverse Q until it is empty:

Step 3: If the current point p has been processed, that is, vis[p]=1, skip to the next point in the queue. If vis[p]=0, find in the KDTree the points whose distance from the current character point p lies within [0, ε] to obtain the neighbor node set of p, and search that set in order of increasing distance from p for a character point q that satisfies the integration condition with p; if one is found, replace the current character point with q and set vis[q] to 1.

Step 4: Repeat Step 3 until no character point in the neighbor node set can be integrated with the current character point; the character points integrated together then form one annotation, and the next unprocessed character is processed in the order of the preliminary queue Q.

Step 5: Repeat Steps 3 and 4 until all characters in the preliminary queue Q have been processed, yielding the annotations on the map.

In addition, when an unprocessed point is integrated into an already integrated field, all characters in the resulting new field s must lie on the same straight line and the distances between adjacent characters must be similar; that is, their positions on the map must satisfy the conditions for forming one annotation. Otherwise no integration takes place. For a field whose ordinate barely changes while its abscissa varies widely, the character with the smaller abscissa is placed toward the front of the annotation content when merging, that is, in left-to-right order.

Once Q is empty, all characters have been processed, and the annotation set obtained after integration is denoted S={s₁,s₂,…,s_n}.
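The KDTree construction of Step 1 and the range query used in Step 3 can be illustrated with a small pure-Python 2-d tree. This is a teaching sketch under assumptions: the nested-tuple node layout and the function names are introduced here, and a production system would more likely use an existing spatial index library.

```python
import math

def build_kdtree(points, depth=0):
    """points: list of (x, y, payload). Returns (point, left, right) or None.
    The splitting axis alternates between x (even depth) and y (odd depth)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def range_search(node, center, radius, depth=0, out=None):
    """Collect (distance, point) pairs for all points within radius of center."""
    if out is None:
        out = []
    if node is None:
        return out
    point, left, right = node
    axis = depth % 2
    d = math.hypot(point[0] - center[0], point[1] - center[1])
    if d <= radius:
        out.append((d, point))
    diff = center[axis] - point[axis]
    # Descend into a side only if the search disc can reach it.
    if diff - radius <= 0:
        range_search(left, center, radius, depth + 1, out)
    if diff + radius >= 0:
        range_search(right, center, radius, depth + 1, out)
    return out

def near_to_far(tree, center, radius):
    """Neighbor node set ordered by increasing distance, as Step 3 requires."""
    return [p for _, p in sorted(range_search(tree, center, radius),
                                 key=lambda t: t[0])]
```

Querying around a character's own coordinates returns that character first (distance 0), followed by its neighbors from nearest to farthest.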

(2) A method for detecting sensitive information based on word similarity:

To protect data security, sensitive information detection based on word similarity is performed on the annotation set S. It consists of the following four steps:

Step 1: Segment the annotation content using Chinese word segmentation:

First, each annotation s_i in the annotation set S is converted into multiple word vectors using Chinese word segmentation and word vector construction. The invention uses jieba to divide a map annotation into words, giving s_i={a₁,a₂,…,a_m}, where a₁…a_m are the m feature words obtained.
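To show what the segmentation step produces without depending on an external package, the sketch below uses forward maximum matching against a small vocabulary. This is a simple stand-in technique, not jieba's algorithm; the vocabulary and function name are assumptions. With jieba installed, `jieba.lcut(text)` would take the place of `segment`.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary
    word of up to max_len characters, falling back to a single character."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical vocabulary; the result corresponds to s_i = {a_1, ..., a_m}.
vocab = {"湖南省", "武警", "医院"}
features = segment("湖南省武警医院", vocab)
```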

Step 2: Convert the feature words and sensitive words into word vectors:

The invention uses word2vec to convert the feature words into word vectors, denoting the word vector of each feature word a_j as A_j. Likewise, every sensitive word b_k in the sensitive-word thesaurus is converted into a word vector, denoted B_k. The degree of similarity between a feature word and a sensitive word can then be quantified as the similarity between their vectors, namely the cosine of the angle between the two vectors in the inner product space,

$$\mathrm{sim}(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j\rVert\,\lVert B_k\rVert},$$

with values in [0,1]; the closer the similarity is to 1, the more similar the two words.

Step 3: Compute the sensitivity of the feature words:

In the sensitive-word thesaurus, each sensitive word b_k has a sensitivity level $L_k$; the larger the value of L, the more sensitive the word. Traverse the thesaurus and, for a feature word a_j, define its maximum sensitivity as

$$T_{\max}(a_j) = \max_{b_k \in D}\ \mathrm{sim}(A_j, B_k)\cdot L_k$$

where $D$ denotes the sensitive-word thesaurus: the product of the similarity and the sensitivity level is computed for every sensitive word vector against the feature word vector, and the maximum of these products is the maximum sensitivity of the feature word $a_j$.

Set a threshold parameter θ. The feature word $a_j$ is considered sensitive only when $T_{\max}(a_j)$ is greater than θ; otherwise it is not considered a sensitive word and its sensitivity is recorded as 0. The sensitivity of the feature word $a_j$ is therefore

$$T(a_j) = \begin{cases} T_{\max}(a_j), & T_{\max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$

Step 4: Compute the sensitivity of the annotation content:

Because an annotation may consist of several feature words, measuring its sensitivity requires considering the sensitivity of all of them. Moreover, feature words near the end of the annotation content tend to have a greater influence on the sensitivity of the annotation. For example, in "Hunan Provincial Armed Police Hospital", although "armed police" is a sensitive word, the hospital is open to the public and the annotation is not sensitive. In contrast, in annotations such as "XX nuclear power plant" or "XX military base", the noun at the end is decisive for the sensitivity of the annotation. Therefore, taking the position of the feature words into account, the sensitivity of an annotation is defined as:

$$T(s_i) = \sum_{j=1}^{m} \frac{j}{\sum_{t=1}^{m} t}\, T(a_j)$$

where the annotation $s_i$ contains m feature words, $a_j$ is the j-th feature word of the annotation, $T(a_j)$ is its sensitivity, and $\sum_{t=1}^{m} t$ is the cumulative sum of 1,2,…,m. Once the sensitivities of the feature words and of the annotation content have been obtained, the annotation content can be divided by sensitivity into four levels: high, medium, low, and non-sensitive.
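The effect of position weighting can be seen in a short numeric sketch. It assumes that the j-th of m feature words receives the weight j / (1 + 2 + … + m), which is one reading of the definition above; the per-word sensitivities in the example are invented for illustration.

```python
def weighted_sensitivity(word_sensitivities):
    """Position-weighted sum: the j-th of m feature words (1-based) gets the
    weight j / (1 + 2 + ... + m), so words near the end count more."""
    m = len(word_sensitivities)
    if m == 0:
        return 0.0
    total = m * (m + 1) / 2  # 1 + 2 + ... + m
    return sum(j / total * t
               for j, t in enumerate(word_sensitivities, start=1))

# "Hunan Province" / "armed police" / "hospital": sensitive word in the middle.
hospital = weighted_sensitivity([0.0, 3.0, 0.0])
# "XX" / "military base": sensitive word at the end.
base = weighted_sensitivity([0.0, 3.0])
```

With the same per-word sensitivity of 3.0, the annotation whose sensitive word comes last scores 2.0, while the one with the sensitive word in the middle of three scores 1.0.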

(3) A desensitization method based on feature word sensitivity.

After the sensitive data in the annotation content has been detected by method (2), it is processed according to its sensitivity level.

Step 1: Some annotation content that is not itself sensitive may be misidentified as sensitive by the algorithm because it contains feature words similar to sensitive words. To handle this, a whitelist of geographic annotation information can be built: each time non-sensitive data misidentified by the algorithm is found manually, it is added to the whitelist, continuously improving the fault tolerance of the model.

Step 2: To preserve the usability of the map data, not all sensitive data should be deleted. After the sensitive data identified by the algorithm has been filtered through the whitelist, desensitization methods can be applied to reduce its sensitivity to the non-sensitive level so that it meets the requirements for publication. Annotation content at the high sensitivity level is deleted directly, because its leakage would readily threaten geographic information security. For annotations at the medium and low sensitivity levels, the sensitive feature words are extracted and processed with a desensitization method chosen at random, such as deletion, replacement, or generalization, and the annotation sensitivity is recomputed; if the annotation still does not meet the requirements for publication after a certain number of iterations, it is deleted.

For replacement, the non-sensitive word in the non-sensitive thesaurus that is most similar to the current sensitive feature word is chosen as the substitute. Generalization abstracts the specific content described by the annotation so that the description covers more non-sensitive information; for example, "PLA Logistics Base" becomes "Warehouse" after generalization and replacement.
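Choosing the replacement word amounts to an arg-max over the non-sensitive thesaurus. In the sketch below the `similarity` callback is an assumption (in the described method it would be the cosine similarity of word2vec vectors), and the lookup table of scores is invented for illustration.

```python
def best_replacement(word, candidates, similarity):
    """Return the candidate most similar to `word`, or None if there is none.
    `similarity(word, candidate)` is any function returning a number."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: similarity(word, c))

# Toy similarity scores standing in for word-vector similarities.
toy_scores = {("后勤基地", "仓库"): 0.8, ("后勤基地", "公园"): 0.1}
toy_similarity = lambda w, c: toy_scores.get((w, c), 0.0)
choice = best_replacement("后勤基地", ["公园", "仓库"], toy_similarity)
```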

Claims

1. A KDTree-based image database data processing method is characterized by comprising the following steps:

step one, traversing and integrating map labeling information based on KDTree to obtain a labeling set S = { S } formed by n labels₁,s₂,…,s_n}; sn represents the nth label;

the method specifically comprises the following steps:

1.1, extracting position coordinates and character contents of all single characters on a map in an attribute table of the map to form a data set, and constructing a KDTree according to a two-dimensional coordinate of each character;

1.2, arranging all the single characters in descending order of ordinate to obtain a preliminary queue Q; creating a mark array vis[] for recording whether each single character in the queue Q has been processed, initializing the mark array to 0, and traversing the queue Q until the queue Q is empty;

1.3, processing the single characters in sequence according to their order in the preliminary queue Q;

if the current single character point p has not been processed, namely vis[p] = 0, executing step 1.4 and setting vis[p] to 1;

if the current single character point p has been processed, namely vis[p] = 1, skipping to the next point in the queue Q;

1.4 searching the constructed KDTree for points whose distance from the current single character point p lies within the threshold range [0, ε], to obtain a neighbor node set of the current single character point p, wherein ε is the integration range parameter and is taken as 1.5 to 2 times the character width of the current single character point p; searching the neighbor node set, in order of increasing distance from point p, for a single character point q that meets the integration condition with the current single character point p; if such a single character point q is found, replacing the current single character point with q and setting vis[q] to 1; the single character point q is the point corresponding to the single character q in the KDTree;

1.5 repeating step 1.4 until the neighbor node set contains no single character point that can be integrated with the current single character point, and taking the integrated single characters as one label;

1.6 when the neighbor node set contains no single character point that can be integrated with the current single character point, processing the next unprocessed single character in the order of the preliminary queue Q;

1.7 repeating steps 1.3 to 1.6 until all the single characters in the preliminary queue Q have been processed, thereby obtaining each label on the map;

step two, performing sensitive information detection based on word similarity on the label set S, and grading the sensitivity of the map label content;

step three, performing corresponding desensitization processing according to the sensitivity level of the map label content.
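Steps 1.1 to 1.7 can be illustrated with the following Python sketch. It is a simplified illustration under stated assumptions, not the claimed method itself: the k-d tree is a small pure-Python stand-in, the function names and the tuple layout `(x, y, width, text)` are hypothetical, the integration condition is reduced to "nearest unprocessed neighbor within ε", and all labels are assumed to be horizontal.

```python
import math


def build_kdtree(points, idxs=None, depth=0):
    """Build a 2-D k-d tree over point indices as nested tuples (or None)."""
    if idxs is None:
        idxs = list(range(len(points)))
    if not idxs:
        return None
    axis = depth % 2
    idxs = sorted(idxs, key=lambda i: points[i][axis])
    mid = len(idxs) // 2
    return (idxs[mid], axis,
            build_kdtree(points, idxs[:mid], depth + 1),
            build_kdtree(points, idxs[mid + 1:], depth + 1))


def radius_query(node, points, center, eps, out):
    """Append to `out` the indices of all points within eps of center."""
    if node is None:
        return
    i, axis, left, right = node
    if math.dist(points[i], center) <= eps:
        out.append(i)
    diff = center[axis] - points[i][axis]
    if diff <= eps:       # the query ball may reach into the left half
        radius_query(left, points, center, eps, out)
    if diff >= -eps:      # the query ball may reach into the right half
        radius_query(right, points, center, eps, out)


def integrate_labels(chars, eps_factor=1.5):
    """chars: list of (x, y, width, text). Returns the integrated labels."""
    points = [(c[0], c[1]) for c in chars]
    tree = build_kdtree(points)                                     # step 1.1
    queue = sorted(range(len(chars)), key=lambda i: -chars[i][1])   # step 1.2
    vis = [0] * len(chars)
    labels = []
    for p in queue:                                                 # step 1.3
        if vis[p]:
            continue
        vis[p] = 1
        field, current = [p], p
        while True:                                          # steps 1.4, 1.5
            eps = eps_factor * chars[current][2]
            near = []
            radius_query(tree, points, points[current], eps, near)
            near = [q for q in near if not vis[q]]
            if not near:
                break                                               # step 1.6
            q = min(near, key=lambda j: math.dist(points[current], points[j]))
            vis[q] = 1
            field.append(q)
            current = q
        field.sort(key=lambda i: chars[i][0])   # left to right (horizontal)
        labels.append("".join(chars[i][3] for i in field))
    return labels
```

For two rows of characters spaced 1.2 units apart with character width 1, for example `[(0, 10, 1, "A"), (1.2, 10, 1, "B"), (0, 0, 1, "C"), (1.2, 0, 1, "D")]`, the sketch groups each row into its own label. A production system would more likely use an existing spatial index and the full integration condition of claim 2.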

2. The KDTree-based image database data processing method according to claim 1, wherein in step 1.4, the integration condition is as follows:

the single character point q has not been processed, namely vis[q] = 0;

in the first case, when the integrated field contains only the current single character point p, the current single character point p is integrated with the single character point q in the neighbor node set that is closest to the current single character point p; when the neighbor node set contains only the single character point p, the single character point p forms a label by itself;

in the second case, when the integrated field contains two or more characters, that is, when a field formed by a plurality of characters is to be integrated with the single character point q, judging whether all the characters in the new field s formed by the single character point q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between every two adjacent characters satisfies the following condition:

$$R = \max_{j = i + 1} d(s_i, s_j) - \min_{j = i + 1} d(s_i, s_j) \le \gamma, \quad 1 \le i < j \le \mathrm{Len}$$

wherein Len represents the number of single characters contained in the new field s, $d(s_i, s_j)$ represents the distance between the i-th single character point $s_i$ and the j-th single character point $s_j$ of the new field s, max and min respectively represent taking the maximum value and the minimum value, and γ is 0.2 to 0.5 times the character width in the new field s;

if both integration conditions are satisfied, the field formed by the plurality of characters is integrated with the single character point q; if at least one of the integration conditions is not satisfied, the field formed by the plurality of characters is not integrated with the single character point q.
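The two checks in claim 2 (collinearity and uniform spacing) might be sketched as follows. This is an illustration under assumptions: the function names, the cross-product collinearity test with tolerance `tol`, and the reading of the spacing condition as "range of adjacent distances no greater than γ" are choices made for the example, and `gamma_factor` is a hypothetical parameter within the stated 0.2 to 0.5 interval.

```python
import math


def is_collinear(points, tol=1e-6):
    """True if all points lie on one straight line (cross-product test)."""
    if len(points) < 3:
        return True
    (x0, y0), (x1, y1) = points[0], points[1]
    return all(abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) <= tol
               for x, y in points[2:])


def spacing_range(points):
    """Range R: largest minus smallest distance between adjacent points."""
    gaps = [math.dist(a, b) for a, b in zip(points, points[1:])]
    return max(gaps) - min(gaps)


def can_integrate(field, q, char_width, gamma_factor=0.3, tol=1e-6):
    """Decide whether point q may join the ordered list of points `field`."""
    candidate = field + [q]
    if len(candidate) < 3:
        return True  # first case: a lone character joins its nearest neighbor
    gamma = gamma_factor * char_width
    return is_collinear(candidate, tol) and spacing_range(candidate) <= gamma
```

With a field `[(0, 0), (1, 0)]` and character width 1, the point `(2, 0)` is accepted, whereas `(2, 1)` fails the collinearity check and `(3, 0)` fails the spacing check. Coordinates digitized from a real map would likely need a looser `tol`.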

3. The KDTree-based image database data processing method according to claim 2, wherein before integration, interference from duplicate characters is first eliminated: if the character boxes corresponding to p and q intersect and their character contents are the same, p and q are duplicate characters, and q is deleted from the attribute table to achieve deduplication.
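The duplicate test in claim 3 can be illustrated with a short Python sketch. It assumes, purely for illustration, that each character box is an axis-aligned rectangle `(xmin, ymin, xmax, ymax)` and that the attribute table is a list of `(text, box)` pairs; both the representation and the function names are hypothetical.

```python
def boxes_intersect(a, b):
    """Axis-aligned rectangles (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def deduplicate(chars):
    """chars: list of (text, box). Keep the first of each duplicate group."""
    kept = []
    for text, box in chars:
        # q duplicates an earlier p if the text matches and the boxes intersect.
        if any(text == t and boxes_intersect(box, b) for t, b in kept):
            continue
        kept.append((text, box))
    return kept
```

The pairwise comparison is quadratic in the number of characters; in practice the neighbor search of the KDTree could restrict the comparison to nearby characters.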

4. The KDTree-based image database data processing method according to claim 1, wherein in step 1.7, for horizontally distributed labels, the single characters in each label are arranged in ascending order of abscissa, that is, from left to right.

5. The KDTree-based image database data processing method of claim 1, wherein the second step comprises the steps of:

2.1: performing word segmentation on the marked content by adopting a Chinese word segmentation technology:

for each piece of label content $s_i$ of the label set S, converting the label content $s_i$ into a plurality of word vectors by adopting a Chinese word segmentation technology and a word vector construction technology; segmentation yields $s_i = \{a_1, a_2, \ldots, a_m\}$, wherein $a_1, \ldots, a_m$ are the m feature words obtained after segmentation;

2.2: converting the characteristic words and the sensitive words into word vectors by adopting a word vector construction technology:

converting the feature words into word vectors by using word2vec, the word vector obtained from the j-th feature word $a_j$ being denoted $A_j$; similarly, converting every sensitive word in the sensitive word bank into a word vector, the vector of the k-th sensitive word being denoted $B_k$; the similarity between a feature word and a sensitive word is quantified as the similarity between the feature word vector and the sensitive word vector, namely the cosine of the angle between the two vectors in the inner product space, with value range [0, 1], a similarity closer to 1 indicating that the two words are more similar:

$$\mathrm{sim}(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert}$$

wherein $\mathrm{sim}(A_j, B_k)$ represents the similarity between $A_j$ and $B_k$;

2.3 sensitivity calculation of feature words:

in the sensitive word bank, each sensitive word $c_k$ corresponds to one sensitivity level $l_k$; the larger the value, the higher the degree of sensitivity of the sensitive word; traversing the sensitive word bank, the maximum sensitivity $\mathrm{sen}_{\max}(a_j)$ of the feature word $a_j$ is defined as

$$\mathrm{sen}_{\max}(a_j) = \max_{c_k \in C} \left( \mathrm{sim}(A_j, B_k) \cdot l_k \right)$$

wherein C represents the sensitive word bank and $\mathrm{sim}(A_j, B_k)$ is the similarity between the feature word vector $A_j$ and the sensitive word vector $B_k$; that is, the product of the similarity between each sensitive word vector and the feature word vector and the corresponding sensitivity level is calculated, and the maximum value is taken as the maximum sensitivity of the feature word;

setting a threshold parameter θ: when $\mathrm{sen}_{\max}(a_j)$ is larger than θ, the feature word $a_j$ is sensitive; otherwise the feature word $a_j$ is not considered a sensitive word and its sensitivity is recorded as 0, that is, the sensitivity of the feature word is

$$\mathrm{sen}(a_j) = \begin{cases} \mathrm{sen}_{\max}(a_j), & \mathrm{sen}_{\max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$

2.4 sensitivity calculation for annotated content:

defining the sensitivity $\mathrm{Sen}(s_i)$ of the i-th annotation $s_i$ as:

$$\mathrm{Sen}(s_i) = \sum_{j=1}^{m} \mathrm{sen}(a_j)$$

in the formula, the annotation $s_i$ contains m feature words, $a_j$ represents the j-th feature word of the annotation $s_i$, and $\mathrm{sen}(a_j)$ is the sensitivity of that feature word; the sensitivity of the annotation $s_i$ is thus the accumulated sum of the sensitivities of the m feature words it contains; after the sensitivities of the feature words and of the annotation contents are obtained, the annotation contents are divided according to sensitivity into 4 levels: high sensitivity, medium sensitivity, low sensitivity, and non-sensitive.
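The sensitivity calculation of steps 2.2 to 2.4 can be sketched in Python as below. This is a minimal sketch under assumptions: the word vectors are toy tuples rather than word2vec output, the function names are hypothetical, and the cut-off values in `grade` are invented for illustration because the claim does not state where the four levels are divided.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def word_sensitivity(vec, bank, theta):
    """bank: list of (sensitive_vector, level). Max of similarity * level,
    kept only when it exceeds the threshold theta, otherwise 0."""
    best = max((cosine(vec, b) * level for b, level in bank), default=0.0)
    return best if best > theta else 0.0


def label_sensitivity(word_vecs, bank, theta):
    """Sum of the sensitivities of the feature words of one annotation."""
    return sum(word_sensitivity(v, bank, theta) for v in word_vecs)


def grade(sen, low=0.3, medium=0.8, high=1.5):
    """Map a sensitivity to one of 4 levels; the cut-offs are hypothetical."""
    if sen >= high:
        return "high"
    if sen >= medium:
        return "medium"
    if sen >= low:
        return "low"
    return "non-sensitive"
```

With a toy bank `[((1.0, 0.0), 1.0), ((0.0, 1.0), 0.5)]` and θ = 0.4, a feature word whose vector is `(1.0, 0.0)` scores 1.0 and one whose vector is `(0.0, 1.0)` scores 0.5, so an annotation containing both has sensitivity 1.5.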

6. The KDTree-based image database data processing method of claim 1, wherein the third step comprises the steps of:

3.1, constructing a whitelist of geographic annotation information, and, each time non-sensitive data wrongly identified as sensitive by the algorithm is found manually, adding that data to the whitelist, thereby improving fault tolerance;

3.2 after the sensitive data has been screened by the whitelist, directly deleting annotation content at the high sensitivity level; for annotations at the medium and low sensitivity levels, extracting the sensitive feature words therein, randomly selecting one of the desensitization means of deletion, replacement, and generalization for processing, then recalculating the sensitivity of the desensitized annotation, and deleting the corresponding annotation entirely if it still does not meet the requirements for public release after a preset number of iterations; wherein, when replacement is selected, the non-sensitive word with the maximum similarity to the current sensitive feature word is selected as the replacement, and when generalization is selected, the specific content described by the annotation is abstracted so that the description range includes more non-sensitive information.