
CN112463804B - A Data Processing Method of Image Database Based on KDTree - Google Patents

Info

Publication number
CN112463804B
Authority
CN
China
Prior art keywords
word
words
sensitive
sensitivity
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN202110139298.4A
Other languages
Chinese (zh)
Other versions
CN112463804A (en)
Inventor
王浩
秦拯
陈嘉欣
欧露
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University
Priority to CN202110139298.4A
Publication of CN112463804A
Application granted
Publication of CN112463804B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2246 Trees, e.g. B+trees
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G06F21/6254 Protecting personal data, e.g. for financial or medical purposes by anonymising data, e.g. decorrelating personal data from the owner's identification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Remote Sensing (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a KDTree-based data processing method for an image database, comprising the following steps: step 1, traversing and integrating map annotation information based on KDTree to obtain an annotation set S={s1,s2,…,sn}; step 2, performing sensitive information detection based on word similarity on the annotation set S and grading the sensitivity of the map annotation content; step 3, performing the corresponding desensitization according to the sensitivity level of the map annotation content. The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedious, inefficient, error-prone, and omission-prone nature of existing manual processing.

Description

A Data Processing Method of Image Database Based on KDTree

Technical Field

The invention relates to the field of geographic information data processing, and in particular to a KDTree-based data processing method for an image database.

Background

In recent years, rapid advances in computer technology have driven the development of geographic information system technology. A Geographic Information System (GIS) draws on geography, surveying and mapping, computer science, and other disciplines. Using the computer as a tool and geospatial data as its object of study, it integrates the map's distinctive visualization and geographic analysis functions with database operations to provide decision-making information for many industries and departments, such as geography, planning, and management.

With the development of the mobile Internet, demand for services based on geographic information in everyday travel keeps growing. The wide use of electronic maps brings convenience to work and life but also introduces security risks. The security of map annotation information is one issue worth studying. Map annotations may contain sensitive content, such as national strategic resources, restricted military areas, and national defense facilities. The state has issued a number of laws, regulations, and policies, such as the "Supplementary Provisions on the Representation of Public Map Content (Trial)" and the "Provisions on the Public Representation of Basic Geographic Information (Trial)", which clearly specify what may and may not be shown on public maps and strengthen geographic information security at the legal and policy level. Organizations currently use intranet isolation to protect data, and require that sensitive data on the intranet be desensitized before geographic information is published to the external network.

In practice, annotation content on a map is stored in an attribute table. To make the display attractive and complete, some annotations are composed of multiple single characters. For example, a place name may be represented in the attribute table by several records that each hold a single character, so a keyword search on the map easily misses part of the annotation content. In that case, existing desensitization work has to rely on manually inspecting, reviewing, and processing the annotations in every area of the map. This approach still misses content, and for maps at the city level and above the area is large and the content complex, so manual processing is tedious and inefficient. A computer-based method that integrates the annotation information in a map and desensitizes it is therefore urgently needed and is of great significance for geographic information security.

For annotations composed of multiple single characters, the usual approach is lexical analysis: the character sequence is converted into a word sequence by splitting the received string of consecutive characters into individual words, and the resulting words are matched against a sensitive-word lexicon to detect sensitive information. However, routine add, delete, or modify operations on the characters of an annotation can leave those characters out of order or duplicated in the attribute table, which simple word segmentation cannot handle.

Although annotation content is arranged irregularly in the attribute table, every character on the map is associated with a position coordinate, and the characters of an annotation to be integrated are strongly correlated in position: they are closely spaced, run top to bottom or left to right, and lie almost on a straight line. The invention uses this position information to traverse and integrate the annotation content in geospatial data and to desensitize sensitive information.

Definitions of terms: KDTree (k-dimensional tree): a tree data structure that stores instance points in k-dimensional space so that they can be retrieved quickly.
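As an illustration of the data structure just defined, the following is a minimal sketch of a two-dimensional KD-tree with a radius query, written in plain Python. The names `Node`, `build`, and `within` are hypothetical and do not come from the patent; the patent does not say which KDTree implementation it uses.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class Node:
    """One KD-tree node: a stored point, its index, and the axis it splits on."""

    def __init__(self, point: Point, index: int, axis: int,
                 left: Optional["Node"], right: Optional["Node"]) -> None:
        self.point, self.index, self.axis = point, index, axis
        self.left, self.right = left, right


def build(points: List[Point], depth: int = 0,
          ids: Optional[List[int]] = None) -> Optional[Node]:
    """Recursively split on alternating axes (x, y, x, ...) at the median point."""
    if ids is None:
        ids = list(range(len(points)))
    if not ids:
        return None
    axis = depth % 2
    ids = sorted(ids, key=lambda i: points[i][axis])
    mid = len(ids) // 2
    return Node(points[ids[mid]], ids[mid], axis,
                build(points, depth + 1, ids[:mid]),
                build(points, depth + 1, ids[mid + 1:]))


def within(node: Optional[Node], center: Point, eps: float,
           out: List[Tuple[float, int]]) -> None:
    """Append (distance, index) for every stored point within eps of center."""
    if node is None:
        return
    d = math.dist(node.point, center)
    if d <= eps:
        out.append((d, node.index))
    diff = center[node.axis] - node.point[node.axis]
    # Always search the side that contains the center; search the other
    # side only if the splitting line is no farther away than eps.
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    within(near, center, eps, out)
    if abs(diff) <= eps:
        within(far, center, eps, out)
```

Sorting the collected `(distance, index)` pairs gives the neighbors from nearest to farthest, which is the order in which a range search around a character point would typically be consumed.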

jieba word segmentation: a very popular open-source Chinese word segmentation package, characterized by high performance, accuracy, and extensibility. It mainly supports Python; versions for other languages also exist.

word2vec: a group of related models used to produce word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. The network takes a word as input and has to predict the words at neighboring positions; under the bag-of-words assumption in word2vec, the order of the words is not important. After training, a word2vec model maps each word to a vector that can represent the relationships between words; this vector is the hidden layer of the neural network.

Summary of the Invention

The invention proposes a data processing method for an image database based on KDTree (k-dimensional tree). The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedious, inefficient, error-prone, and omission-prone nature of existing manual processing.

To achieve the above object, the technical solution of the invention is as follows:

A KDTree-based data processing method for an image database, comprising the following steps:

Step 1: traverse and integrate the map annotation information based on KDTree to obtain an annotation set $S=\{s_1,s_2,\dots,s_n\}$ composed of n annotations, where $s_n$ denotes the n-th annotation;

Step 2: perform sensitive information detection based on word similarity on the annotation set S, and grade the sensitivity of the map annotation content;

Step 3: perform the corresponding desensitization according to the sensitivity level of the map annotation content.

In a further improvement, step 1 comprises the following steps:

1.1 Extract the map position coordinates and character content of all single characters in the map's attribute table to form a data set, and build a KDTree from the two-dimensional coordinates of each character;

1.2 Sort all single characters in descending order of vertical coordinate to obtain an initial queue Q; create a flag array vis[] that records whether each character in Q has been processed, initialized to 0; traverse Q until it is empty;

1.3 Process the characters in the order in which they appear in the initial queue Q:

If the current single-character point p has not been processed, i.e. vis[p]=0, execute step 1.4 and set vis[p] to 1;

If the current single-character point p has already been processed, i.e. vis[p]=1, skip to the next point in Q;

1.4 In the constructed KDTree, find the points whose distance from the current point p lies in the range [0, ε] to obtain the neighbor set of p, where ε is the integration-range parameter, taken as 1.5 to 2 times the character width of the current point p. In the neighbor set, search in order of increasing distance from p for a single-character point q that satisfies the conditions for integration with p. If one is found, replace the current point with q and set vis[q] to 1. The point q is the point in the KDTree that corresponds to the single character q;

1.5 Repeat step 1.4 until the neighbor set contains no point that can be integrated with the current point; the points integrated so far then form one annotation;

1.6 When the neighbor set contains no point that can be integrated with the current point, process the next unprocessed character in the order of the initial queue Q;

1.7 Repeat steps 1.3 to 1.6 until all characters in the initial queue Q have been processed, yielding the annotations on the map.
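Steps 1.2 to 1.7 can be sketched in Python as below. This is a hedged illustration, not the patented implementation: `merge_annotations`, the `Char` tuple layout, and the `can_merge` callback are hypothetical names, the integration conditions are left to the caller, and the range query of step 1.4 is replaced by a brute-force scan so that the example stays self-contained.

```python
import math
from typing import Callable, List, Tuple

Char = Tuple[float, float, str]  # (x, y, character); layout is an assumption


def merge_annotations(chars: List[Char], eps: float,
                      can_merge: Callable[[List[Char], Char], bool]) -> List[str]:
    """Greedily chain nearby characters into annotations."""
    # Step 1.2: initial queue Q in descending y order; vis marks processed characters.
    order = sorted(range(len(chars)), key=lambda i: -chars[i][1])
    vis = [0] * len(chars)
    annotations = []
    for start in order:                      # step 1.3
        if vis[start]:
            continue
        vis[start] = 1
        field = [chars[start]]
        cur = start
        while True:                          # steps 1.4 and 1.5
            # Brute-force stand-in for the KDTree range query within [0, eps].
            near = sorted(
                (math.dist(chars[cur][:2], chars[i][:2]), i)
                for i in range(len(chars))
                if i != cur and math.dist(chars[cur][:2], chars[i][:2]) <= eps)
            nxt = next((i for _, i in near
                        if not vis[i] and can_merge(field, chars[i])), None)
            if nxt is None:
                break                        # step 1.6: move on in Q
            vis[nxt] = 1
            field.append(chars[nxt])
            cur = nxt                        # q becomes the current point
        annotations.append("".join(c for _, _, c in field))
    return annotations
```

The sketch appends characters in visiting order and uses a single `eps` for all characters; a fuller version would derive `eps` from each character's width and reorder horizontally laid out annotations from left to right.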

In a further improvement, the integration conditions in step 1.4 are as follows:

The single-character point q has not been processed, i.e. vis[q]=0;

Case 1: when the integrated field contains only the current single-character point p, p is integrated with the point q in the neighbor set that is closest to p; when the neighbor set contains only p, p forms an annotation by itself;

Case 2: when the integrated field already contains two or more characters, i.e. when a field composed of several characters is to be integrated with the single-character point q, determine whether all characters in the new field s formed by q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between each pair of adjacent characters satisfies:

$$R = \max_{1 \le i < Len,\; j = i+1} d_{i,j} \;-\; \min_{1 \le i < Len,\; j = i+1} d_{i,j} \;\le\; \gamma$$

where $Len$ is the number of single characters in the new field s, $d_{i,j}$ is the distance between the i-th and the j-th single-character points in s, $\max$ and $\min$ take the maximum and minimum respectively, and $\gamma$ is 0.2 to 0.5 times the character width in the new field s;

If both of the above conditions are satisfied, the multi-character field is integrated with the single-character point q; if at least one is not satisfied, it is not.
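The Case 2 test can be sketched as follows. The function names and the default factors are hypothetical; in particular, the text only says the characters must lie "on the same straight line", so the collinearity tolerance `line_tol_factor` is an assumption added to make the check usable with real coordinates. The caller is assumed to have checked vis[q]=0 already.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def is_collinear(pts: List[Point], tol: float) -> bool:
    """True if every point lies within tol of the line through the first and last point."""
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    length = math.hypot(x1 - x0, y1 - y0)
    if length == 0:
        return False
    # Perpendicular distance of (x, y) from the line, via the 2-D cross product.
    return all(abs((x1 - x0) * (y - y0) - (y1 - y0) * (x - x0)) / length <= tol
               for x, y in pts)


def spacing_range(pts: List[Point]) -> float:
    """Range R: largest minus smallest distance between adjacent characters."""
    gaps = [math.dist(a, b) for a, b in zip(pts, pts[1:])]
    return max(gaps) - min(gaps)


def can_integrate(field: List[Point], q: Point, char_width: float,
                  gamma_factor: float = 0.3,
                  line_tol_factor: float = 0.2) -> bool:
    """Decide whether point q may join the already integrated field."""
    if len(field) < 2:
        return True  # Case 1: the nearest unprocessed neighbor is accepted directly.
    s = field + [q]  # the candidate new field s
    return (is_collinear(s, line_tol_factor * char_width)
            and spacing_range(s) <= gamma_factor * char_width)
```

With a character width of 1.0, a third character at the same spacing on the same line is accepted, while one that doubles the gap or leaves the line is rejected.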

In a further improvement, the interference of duplicate characters is eliminated before integration: if the character boxes of p and q intersect and the two characters have the same content, p and q are duplicates, and q is deleted from the attribute table to remove the duplicate.
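A minimal sketch of this duplicate test is shown below, assuming each character box is an axis-aligned rectangle; the `CharBox` type and function names are hypothetical, and the patent does not specify how character boxes are represented.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    """A character and its axis-aligned bounding box (assumed representation)."""
    text: str
    xmin: float
    ymin: float
    xmax: float
    ymax: float


def boxes_intersect(a: CharBox, b: CharBox) -> bool:
    """Two axis-aligned boxes intersect when they overlap on both axes."""
    return (a.xmin <= b.xmax and b.xmin <= a.xmax
            and a.ymin <= b.ymax and b.ymin <= a.ymax)


def is_duplicate(p: CharBox, q: CharBox) -> bool:
    """Duplicates have the same content and intersecting boxes."""
    return p.text == q.text and boxes_intersect(p, q)


def deduplicate(chars: List[CharBox]) -> List[CharBox]:
    """Keep the first of each group of duplicates, preserving input order."""
    kept: List[CharBox] = []
    for c in chars:
        if not any(is_duplicate(k, c) for k in kept):
            kept.append(c)
    return kept
```

The pairwise scan is quadratic; on a large attribute table the candidate pairs could instead be limited to the neighbors returned by the KDTree range query.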

In a further improvement, in step 1.7, for horizontally laid out annotations, the single characters of the annotation are arranged from left to right in ascending order of horizontal coordinate.

In a further improvement, step 2 comprises the following steps:

2.1 Segment the annotation content using Chinese word segmentation:

Each annotation $s_i$ in the annotation set S is converted into multiple word vectors using Chinese word segmentation and word-vector construction, giving $s_i=\{a_1,a_2,\dots,a_m\}$, where $a_1,\dots,a_m$ are the m feature words obtained by segmentation;

2.2 Convert the feature words and sensitive words into word vectors using word-vector construction:

word2vec is used to convert the feature words into word vectors; the word vector of the j-th feature word $a_j$ is denoted $A_j$. Likewise, every sensitive word $b_k$ in the sensitive-word lexicon is converted into a word vector, denoted $B_k$. The degree of similarity between a feature word and a sensitive word is quantified as the similarity between their word vectors, namely the cosine of the angle between the two vectors in the inner product space. Its value range is [0, 1], and the closer the similarity is to 1, the more similar the two words are:

$$\mathrm{sim}(A_j, B_k) = \cos(A_j, B_k) = \frac{A_j \cdot B_k}{\lVert A_j \rVert \, \lVert B_k \rVert}$$

where $\mathrm{sim}(A_j, B_k)$ denotes the similarity between $A_j$ and $B_k$;
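The similarity of step 2.2 can be computed as in the sketch below, in which plain Python lists stand in for word2vec vectors; `cosine_similarity` is a hypothetical helper name. In general the cosine of two real vectors lies in [-1, 1]; the sketch returns it unclipped and returns 0.0 for a zero vector, which is an assumption of this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)
```

Identical directions give 1.0 and orthogonal vectors give 0.0, regardless of vector length.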

2.3 Sensitivity calculation for feature words:

In the sensitive-word lexicon, each sensitive word $b_k$ has a sensitivity level $L_k$; the larger $L_k$ is, the more sensitive the word. Traversing the lexicon, the maximum sensitivity $Sen_{\max}(a_j)$ of the feature word $a_j$ is defined as

$$Sen_{\max}(a_j) = \max_{b_k \in D} \; \mathrm{sim}(A_j, B_k) \cdot L_k$$

where $D$ denotes the sensitive-word lexicon and $\mathrm{sim}(A_j, B_k)$ is the cosine similarity of step 2.2. The product of similarity and sensitivity level is computed for each sensitive word vector and the feature word vector, and the maximum of these products is the maximum sensitivity of the feature word;

A threshold parameter θ is set. The feature word $a_j$ is treated as sensitive only when $Sen_{\max}(a_j)$ is greater than θ; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0. The sensitivity of the feature word is therefore

$$Sen(a_j) = \begin{cases} Sen_{\max}(a_j), & Sen_{\max}(a_j) > \theta \\ 0, & \text{otherwise} \end{cases}$$
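The calculation of step 2.3 can be sketched as follows. The lexicon layout (a dictionary mapping each sensitive word to a vector and a level) and the names `word_sensitivity` and `_cosine` are assumptions made for illustration; real vectors would come from a trained word2vec model.

```python
import math
from typing import Dict, Sequence, Tuple


def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 for a zero vector (assumption of this sketch)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def word_sensitivity(word_vec: Sequence[float],
                     lexicon: Dict[str, Tuple[Sequence[float], float]],
                     theta: float) -> float:
    """Maximum of similarity times level over the lexicon, zeroed unless above theta."""
    best = max((_cosine(word_vec, vec) * level for vec, level in lexicon.values()),
               default=0.0)
    return best if best > theta else 0.0
```

Because the similarity is weighted by the level, a moderately similar word with a high level can outrank a near-identical word with a low level.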

2.4 Sensitivity calculation for annotation content:

The sensitivity $Sen(s_i)$ of the i-th annotation $s_i$ is defined as:

[Formula image missing from this text: definition of $Sen(s_i)$]

where the annotation $s_i$ contains m feature words and j indexes the j-th feature word of $s_i$; $Sen(s_i)$, the sensitivity of annotation $s_i$, is the accumulated sum of the sensitivities of the m feature words it contains. Once the sensitivities of the feature words and of the annotation content have been obtained, the annotation content is divided by sensitivity into four levels: high, medium, low, and non-sensitive.

In a further improvement, step 3 comprises the following steps:

3.1 Build a whitelist of geographic annotation information. Whenever non-sensitive data that the algorithm has misidentified is found manually, add it to the whitelist to improve fault tolerance;

3.2 After the sensitive data has been filtered through the whitelist, annotation content at the high sensitivity level is deleted directly. For annotations at the medium and low sensitivity levels, the sensitive feature words are extracted and processed with a randomly chosen desensitization technique (deletion, replacement, or generalization), and the sensitivity of the desensitized annotation is then recalculated. If the annotation still does not meet the disclosure requirements after a preset number of iterations, it is deleted entirely. For replacement, the non-sensitive word with the highest similarity to the current sensitive feature word is selected from a non-sensitive-word lexicon as the substitute. Generalization abstracts the specific content described by the annotation so that the description covers more non-sensitive information.
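The control flow of steps 3.1 and 3.2 might look like the sketch below. Everything about its interface is hypothetical: the scoring and level functions are passed in by the caller, the level names "high" and "none" are placeholders, replacement and generalization are modeled as simple lookup tables, and "meets the disclosure requirements" is assumed to mean that the recalculated level is non-sensitive.

```python
import random
from typing import Callable, Dict, List, Optional, Set


def desensitize(words: List[str],
                word_score: Callable[[str], float],
                annotation_score: Callable[[List[str]], float],
                classify: Callable[[float], str],
                whitelist: Set[str],
                replacements: Dict[str, str],
                generalizations: Dict[str, str],
                max_rounds: int = 3,
                rng: Optional[random.Random] = None) -> Optional[List[str]]:
    """Return the desensitized word list, or None if the annotation is deleted."""
    rng = rng or random.Random()
    if "".join(words) in whitelist:
        return list(words)            # 3.1: known false positive, publish as is
    level = classify(annotation_score(words))
    if level == "none":
        return list(words)
    if level == "high":
        return None                   # 3.2: high sensitivity, delete outright
    current = list(words)
    for _ in range(max_rounds):       # medium or low: iterate up to the preset count
        result = []
        for w in current:
            if word_score(w) <= 0:
                result.append(w)      # non-sensitive words are kept unchanged
                continue
            op = rng.choice(["delete", "replace", "generalize"])
            if op == "replace" and w in replacements:
                result.append(replacements[w])
            elif op == "generalize" and w in generalizations:
                result.append(generalizations[w])
            # otherwise the sensitive word is dropped (deletion)
        current = result
        if classify(annotation_score(current)) == "none":
            return current
    return None                       # still sensitive after max_rounds: delete
```

Passing a seeded `random.Random` makes a run reproducible, which helps when the outcome of a desensitization pass has to be audited.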

Advantages of the invention:

The invention uses the position information of map annotations to automate the traversal and integration of annotation content in geospatial data and the desensitization of sensitive information, overcoming the tedious, inefficient, error-prone, and omission-prone nature of existing manual processing.

Description of the Drawings

Fig. 1 is the workflow diagram of the invention.

Detailed Description

As shown in Fig. 1, a KDTree-based data processing method for an image database comprises the following steps:

(1) traversing and integrating map annotation information based on KDTree;

(2) detecting sensitive information based on word similarity;

(3) desensitizing based on the sensitivity of feature words.

The details are as follows:

(1) A KDTree-based method for traversing and integrating map annotation information:

The method exploits the positional correlation of map annotation information. It considers the points formed by all single characters and greedily assumes that, within a given range, the point closest to the current character's point satisfies the conditions for the two to be integrated into a word. When that point does not satisfy the conditions, the other points within the range are considered. The procedure is as follows:

Step 1: extract the map position coordinates and character content of all single characters in the attribute table, and build a KDTree from the two-dimensional coordinates of each character. A KDTree is a data structure that partitions k-dimensional data space and is commonly used for range search and nearest-neighbor search.

Step 2: because people usually read from top to bottom and from left to right, all single characters are sorted in descending order of vertical coordinate before the annotation content is integrated, so that characters visited later are appended in order to the field integrated so far. This gives the initial queue Q. Create a flag array vis[] that records whether each character has been processed, initialized to 0. Traverse queue Q until it is empty:

Step 3: if the current point p has already been processed, i.e. vis[p]=1, skip to the next point in the queue. If vis[p]=0, find in the KDTree the points whose distance from the current point p lies in the range [0, ε] to obtain the neighbor set of p, and search the neighbor set in order of increasing distance from p for a single-character point q that satisfies the conditions for integration with p. If one is found, replace the current point with q and set vis[q] to 1;

Step 4: repeat step 3 until the neighbor set contains no point that can be integrated with the current point; the points integrated so far then form one annotation, and the next unprocessed character is processed in the order of the initial queue Q;

Step 5: repeat steps 3 and 4 until all characters in the initial queue Q have been processed, yielding the annotations on the map.

In addition, when an unprocessed point is integrated into an already integrated field, all characters in the resulting new field s must lie on the same straight line and the distances between each pair of adjacent characters must be similar; that is, the positions on the map must satisfy the conditions for forming one annotation, otherwise no integration takes place. For fields whose vertical coordinate barely changes while the horizontal coordinate changes substantially, the character with the smaller horizontal coordinate is placed toward the front of the annotation content when merging, i.e. in left-to-right order.

队列Q为空后,所有单字处理完毕,整合后得到的标注集记为S={s1,s2,…,sn}。After the queue Q is empty, all words are processed, and the label set obtained after integration is recorded as S={s 1 ,s 2 ,…,s n }.

(2)一种基于词语相似度的敏感信息检测方法:(2) A sensitive information detection method based on word similarity:

对于标注集S,为了保护数据安全,需要进行基于词语相似度的敏感信息检测,主要包括以下四步:For the label set S, in order to protect data security, sensitive information detection based on word similarity is required, which mainly includes the following four steps:

步骤一:对标注内容采用中文分词技术进行分词:Step 1: Use Chinese word segmentation technology to segment the marked content:

首先,针对标注集S的每条标注内容si,采用中文分词技术和词向量构建技术将其转换为多个词向量。本发明使用jieba分词将地图标注划分成多个词语,得到si={a1,a2,…,am},a1…am为划分后得到的m个特征词。First, for each tag content si of tag set S, it is converted into multiple word vectors by using Chinese word segmentation technology and word vector construction technology. The present invention uses jieba word segmentation to divide the map annotation into multiple words, and obtains s i ={a 1 ,a 2 ,...,am }, where a 1 ... am are m characteristic words obtained after division.

步骤二:对特征词和敏感词采用词向量构建技术转换为词向量:Step 2: Convert feature words and sensitive words into word vectors using word vector construction technology:

The present invention uses word2vec to convert the feature words into word vectors; the word vector of each feature word $a_j$ is denoted $A_j$. Likewise, every sensitive word $b_k$ in the sensitive thesaurus is converted into a word vector, denoted $B_k$. The similarity between a feature word and a sensitive word can then be quantified as the similarity between their word vectors, namely the cosine of the angle between the two vectors in the inner product space, $\mathrm{sim}(A_j,B_k)$. Its value range is [0, 1], and the closer the similarity is to 1, the more similar the two words are.

$$\mathrm{sim}(A_j,B_k)=\cos\langle A_j,B_k\rangle=\frac{A_j\cdot B_k}{\lVert A_j\rVert\,\lVert B_k\rVert}$$
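As a minimal illustration of the similarity measure, the following sketch computes the cosine of the angle between two word vectors given as plain Python lists. The function name and the toy vectors are assumptions; in practice the vectors would come from a trained word2vec model. The cosine of arbitrary real vectors can be negative, so the sketch clamps negative values to 0 as one assumed way of keeping the result within the stated range [0, 1].

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, clamped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention: a zero vector is similar to nothing
    return max(0.0, dot / (norm_u * norm_v))

# Toy two-dimensional "word vectors" (hypothetical values).
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0.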

Step 3: Sensitivity calculation of feature words:

In the sensitive thesaurus, each sensitive word $b_k$ corresponds to a sensitivity level $L_k$; the larger the value of $L_k$, the more sensitive the word. Traversing the sensitive thesaurus, the maximum sensitivity of a feature word $a_j$ is defined as

$$\mathrm{Sen}_{\max}(a_j)=\max_{b_k\in D}\bigl(\mathrm{sim}(A_j,B_k)\cdot L_k\bigr)$$

where $D$ denotes the sensitive thesaurus: the product of the similarity between each sensitive word vector and the feature word vector and the corresponding sensitivity level is computed, and the maximum of these products is taken as the maximum sensitivity of the feature word $a_j$.

A threshold parameter θ is set. The feature word $a_j$ is regarded as sensitive only when $\mathrm{Sen}_{\max}(a_j)$ is greater than θ; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0. That is, the sensitivity of the feature word $a_j$ is

$$\mathrm{Sen}(a_j)=\begin{cases}\mathrm{Sen}_{\max}(a_j), & \mathrm{Sen}_{\max}(a_j)>\theta\\ 0, & \text{otherwise}\end{cases}$$
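The feature word sensitivity of Step 3 can be sketched as follows. This is a hedged illustration under stated assumptions: the sensitive thesaurus is represented as a list of hypothetical `(vector, level)` pairs, the function names are invented for the example, and the similarity is the cosine clamped to [0, 1].

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors, clamped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / norms) if norms else 0.0

def max_sensitivity(feature_vec, lexicon):
    """Largest product of similarity and level over all thesaurus entries."""
    return max((cosine(feature_vec, vec) * level for vec, level in lexicon),
               default=0.0)

def word_sensitivity(feature_vec, lexicon, theta):
    """Keep the maximum sensitivity only if it exceeds the threshold theta."""
    score = max_sensitivity(feature_vec, lexicon)
    return score if score > theta else 0.0

# Hypothetical thesaurus: two sensitive word vectors with levels 3 and 1.
lexicon = [([1.0, 0.0], 3), ([0.0, 1.0], 1)]
```

Because the level multiplies the similarity, a feature word that is only moderately similar to a high-level sensitive word can still outrank one that closely matches a low-level word; the threshold θ then discards weak matches altogether.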

Step 4: Sensitivity calculation of label content:

Since a label may consist of several feature words, the sensitivity of a label must take into account the sensitivity of all of its feature words. Moreover, feature words located near the end of the label content usually have a greater influence on the sensitivity of the label. For example, in "湖南省武警医院" (Hunan Provincial Armed Police Hospital), although "武警" (armed police) is a sensitive word, the hospital is open to the public, so the label is not sensitive. By contrast, in labels such as "a certain nuclear power plant" or "a certain military base", the noun at the end plays a decisive role in the sensitivity of the label. Therefore, taking the positions of the feature words into account, the sensitivity of a label is defined as:

$$\mathrm{Sen}(s_i)=\sum_{j=1}^{m}\frac{j}{\sum_{t=1}^{m}t}\,\mathrm{Sen}(a_j)$$

In the formula, the label $s_i$ contains m feature words, j indicates that $a_j$ is the j-th feature word of the label, $\mathrm{Sen}(a_j)$ is its sensitivity, and $\sum_{t=1}^{m}t$ is the cumulative sum of 1, 2, …, m. Once the sensitivities of the feature words and of the label content have been obtained, the label content can be divided by sensitivity into four levels: high, medium, low, and non-sensitive.
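A short sketch of the label sensitivity follows, reading the position weight of the j-th feature word as j divided by 1 + 2 + … + m, which is how the accompanying description of the formula is understood here. The function names and, in particular, the cut-off values used for grading are hypothetical; the text does not state the boundaries between the four levels.

```python
def label_sensitivity(word_scores):
    """Position-weighted sensitivity of a label.

    word_scores[j-1] is the sensitivity of the j-th feature word; later words
    receive larger weights j / (1 + 2 + ... + m).
    """
    m = len(word_scores)
    if m == 0:
        return 0.0
    total_weight = m * (m + 1) / 2  # 1 + 2 + ... + m
    weighted = sum(j * s for j, s in enumerate(word_scores, start=1))
    return weighted / total_weight

def grade(score, low=0.5, medium=1.5, high=2.5):
    """Map a label score to one of four levels (cut-offs are hypothetical)."""
    if score >= high:
        return "high"
    if score >= medium:
        return "medium"
    if score >= low:
        return "low"
    return "non-sensitive"
```

For three feature words with a single sensitive word of sensitivity 3, the score is 1.0 when that word is in the middle and 1.5 when it is at the end, which reflects the greater influence of words near the tail of the label.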

(3) A desensitization method based on the sensitivity of feature words.

After the sensitive data in the label content has been detected by method (2), it is processed according to its sensitivity level.

Step 1: Some label content that is not itself sensitive may contain feature words similar to sensitive words and may therefore be misidentified by the algorithm as sensitive data. To handle this, a whitelist of geographic label information can be built: whenever non-sensitive data misidentified by the algorithm is found manually, it is added to the whitelist, continuously improving the fault tolerance of the model of the present invention.

Step 2: To preserve the usability of the map data, not all sensitive data should be deleted. After the sensitive data identified by the algorithm has been filtered through the whitelist, desensitization can be applied to lower its sensitivity to the non-sensitive level so that it meets the requirements for public release. Label content at the high sensitivity level is deleted directly, because its disclosure would seriously threaten geographic information security. For labels at the medium and low sensitivity levels, the sensitive feature words are extracted, a desensitization operation such as deletion, replacement, or generalization is selected at random and applied, and the label sensitivity is recalculated; if the label still does not meet the release requirements after a certain number of iterations, it is deleted.

For replacement, the non-sensitive word most similar to the current sensitive feature word is selected from the non-sensitive thesaurus as the substitute. Generalization abstracts the specific content described by the label so that the description covers more non-sensitive information; for example, "解放军后勤基地" (PLA Logistics Base) is converted to "仓库" (Warehouse) after generalization and replacement.
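The desensitization procedure of method (3) can be sketched as a short loop. This is an illustrative sketch under several assumptions, not the invention's implementation: `word_sens` and `grade` are hypothetical callables that score a word and grade a list of words, the dictionaries `substitutes` and `generalized` stand in for the nearest non-sensitive word lookup and for generalization, generalization is simplified to a per-word lookup, and the most sensitive remaining word is treated first.

```python
import random

def desensitize(words, word_sens, grade, whitelist, substitutes, generalized,
                max_iters=5, rng=random):
    """Return the desensitized word list, or None if the label must be deleted."""
    if "".join(words) in whitelist:
        return list(words)          # manually confirmed false positive
    level = grade(words)
    if level == "none":
        return list(words)          # already fit for release
    if level == "high":
        return None                 # high sensitivity: delete outright
    current = list(words)
    for _ in range(max_iters):
        # pick the most sensitive remaining feature word
        idx = max(range(len(current)), key=lambda i: word_sens(current[i]))
        action = rng.choice(["delete", "replace", "generalize"])
        if action == "delete":
            del current[idx]
        elif action == "replace":
            current[idx] = substitutes.get(current[idx], current[idx])
        else:
            current[idx] = generalized.get(current[idx], current[idx])
        if not current:
            return None             # nothing left of the label
        if grade(current) == "none":
            return current          # sensitivity recomputed and acceptable
    return None                     # still sensitive after max_iters: delete
```

Passing a seeded `random.Random` instance as `rng` makes a run repeatable, which is convenient when auditing why a particular label was altered or dropped.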

Claims (6)

1. A KDTree-based image database data processing method, characterized by comprising the following steps:
step one, traversing and integrating map label information based on a KDTree to obtain a label set $S=\{s_1,s_2,\dots,s_n\}$ formed by n labels, where $s_n$ denotes the n-th label;
step one specifically comprises the following steps:
1.1 extracting the position coordinates and character content of all single characters on the map from the attribute table of the map to form a data set, and constructing a KDTree from the two-dimensional coordinates of each character;
1.2 sorting all single characters in descending order of ordinate to obtain a preliminary queue Q; creating a mark array vis[] that records whether each single character in the queue Q has been processed, initializing it to 0, and traversing the queue Q until it is empty;
1.3 processing the single characters in the order in which they are arranged in the preliminary queue Q;
if the current single-character point p has not been processed, namely vis[p] = 0, executing step 1.4 and setting vis[p] to 1;
if the current single-character point p has been processed, namely vis[p] = 1, skipping to the next point in the queue Q;
1.4 searching the constructed KDTree for points whose distance from the current single-character point p lies within [0, ε] to obtain the neighbor node set of the current single-character point p, where ε is the integration range parameter and is 1.5 to 2 times the character width of the current single-character point p; searching the neighbor node set, in order of increasing distance from p, for a single-character point q that satisfies the integration condition with the current single-character point p; if such a point q is found, replacing the current single-character point with q and setting vis[q] to 1; the single-character point q is the point corresponding to the single character q in the KDTree;
1.5 repeating step 1.4 until the neighbor node set contains no single-character point that can be integrated with the current single-character point, and taking the integrated single characters as one label;
1.6 when the neighbor node set contains no single-character point that can be integrated with the current single-character point, processing the next unprocessed single character in the order of the preliminary queue Q;
1.7 repeating steps 1.3 to 1.6 until all single characters in the preliminary queue Q have been processed, thereby obtaining every label on the map;
step two, performing sensitive information detection based on word similarity on the label set S, and grading the sensitivity of the map label content;
step three, performing the corresponding desensitization according to the sensitivity level of the map label content.
2. The KDTree-based image database data processing method according to claim 1, wherein in step 1.4 the integration conditions are as follows:
the single-character point q has not been processed, namely vis[q] = 0;
in the first case, when the integrated field contains only the current single-character point p, the current single-character point p is integrated with the single-character point q in the neighbor node set that is closest to p; when the neighbor node set contains only the single-character point p, the point p forms a label by itself;
in the second case, when the integrated field contains two or more characters, that is, when a field formed by several characters is to be integrated with the single-character point q, it is judged whether all characters of the new field s formed by q and the integrated field lie on the same straight line, and whether the range R of the array formed by the distances between every two adjacent characters satisfies:
$$R=\max_{\substack{1\le i<\mathrm{Len}\\ j=i+1}} d(s_i,s_j)-\min_{\substack{1\le i<\mathrm{Len}\\ j=i+1}} d(s_i,s_j)\le\gamma$$
where Len is the number of single characters contained in the new field s, $d(s_i,s_j)$ is the distance between the i-th single-character point $s_i$ and the j-th single-character point $s_j$ of the new field s, max and min denote taking the maximum and the minimum, respectively, and γ is 0.2 to 0.5 times the character width in the new field s;
if both integration conditions are satisfied, the field formed by the several characters is integrated with the single-character point q; if at least one integration condition is not satisfied, the field is not integrated with the single-character point q.
3. The KDTree-based image database data processing method according to claim 2, wherein interference from duplicate characters is eliminated before integration: if the character boxes corresponding to p and q intersect and their character contents are identical, p and q are duplicate characters, and q is deleted from the attribute table to achieve deduplication.
4. The KDTree-based image database data processing method according to claim 1, wherein in step 1.7, for horizontally distributed labels, the single characters in the label are arranged in ascending order of abscissa, that is, from left to right.
5. The KDTree-based image database data processing method according to claim 1, wherein step two comprises the following steps:
2.1 segmenting the label content using Chinese word segmentation:
for each label $s_i$ of the label set S, converting the label content into a plurality of word vectors using Chinese word segmentation and word vector construction; obtaining $s_i=\{a_1,a_2,\dots,a_m\}$, where $a_1,\dots,a_m$ are the m feature words obtained by segmentation;
2.2 converting the feature words and sensitive words into word vectors using word vector construction:
converting the feature words into word vectors using word2vec, the word vector of the j-th feature word $a_j$ being denoted $A_j$; likewise, converting all sensitive words $b_k$ in the sensitive thesaurus into word vectors, denoted $B_k$; quantifying the similarity between a feature word and a sensitive word as the similarity between the feature word vector and the sensitive word vector, namely the cosine of the angle between the two vectors in the inner product space, whose value range is [0, 1], a similarity closer to 1 indicating that the two words are more similar:
$$\mathrm{sim}(A_j,B_k)=\frac{A_j\cdot B_k}{\lVert A_j\rVert\,\lVert B_k\rVert}$$
where $\mathrm{sim}(A_j,B_k)$ represents the similarity between $A_j$ and $B_k$;
2.3 calculating the sensitivity of the feature words:
in the sensitive thesaurus, each sensitive word $b_k$ corresponds to a sensitivity level $L_k$, and the larger the value of $L_k$, the more sensitive the word; traversing the sensitive thesaurus and, for each feature word $a_j$, defining its maximum sensitivity $\mathrm{Sen}_{\max}(a_j)$ as
$$\mathrm{Sen}_{\max}(a_j)=\max_{b_k\in D}\bigl(\mathrm{sim}(A_j,B_k)\cdot L_k\bigr)$$
where $D$ represents the sensitive thesaurus; the product of the similarity between each sensitive word vector and the feature word vector and the sensitivity level is calculated, and the maximum is taken as the maximum sensitivity of the feature word;
setting a threshold parameter θ; when $\mathrm{Sen}_{\max}(a_j)$ is greater than θ, the feature word $a_j$ is sensitive; otherwise it is not regarded as a sensitive word and its sensitivity is recorded as 0, that is, the sensitivity of the feature word is
$$\mathrm{Sen}(a_j)=\begin{cases}\mathrm{Sen}_{\max}(a_j), & \mathrm{Sen}_{\max}(a_j)>\theta\\ 0, & \text{otherwise}\end{cases}$$
2.4 calculating the sensitivity of the label content:
defining the sensitivity $\mathrm{Sen}(s_i)$ of the i-th label $s_i$ as:
$$\mathrm{Sen}(s_i)=\sum_{j=1}^{m}\frac{j}{\sum_{t=1}^{m}t}\,\mathrm{Sen}(a_j)$$
where the label $s_i$ contains m feature words and j denotes the j-th feature word of the label $s_i$; $\mathrm{Sen}(s_i)$ is the sensitivity of the label $s_i$, namely the weighted accumulation of the sensitivities of the m feature words it contains; after the sensitivities of the feature words and of the label content are obtained, the label content is divided by sensitivity into four levels: high, medium, low, and non-sensitive.
6. The KDTree-based image database data processing method according to claim 1, wherein step three comprises the following steps:
3.1 constructing a whitelist of geographic label information, and adding non-sensitive data to the whitelist whenever such data misidentified by the algorithm is found manually, thereby improving the fault tolerance;
3.2 after the sensitive data has been filtered through the whitelist, directly deleting label content at the high sensitivity level; for labels at the medium and low sensitivity levels, extracting the sensitive feature words, randomly selecting one of the desensitization operations of deletion, replacement, and generalization for processing, and then recalculating the sensitivity of the desensitized label; if the sensitivity still does not meet the release requirements after a preset number of iterations, deleting the corresponding label entirely; wherein, when replacement is selected, the non-sensitive word most similar to the current sensitive feature word is used as the substitute, and when generalization is selected, the specific content described by the label is abstracted so that the description covers more non-sensitive information.
CN202110139298.4A 2021-02-02 2021-02-02 A Data Processing Method of Image Database Based on KDTree Expired - Fee Related CN112463804B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110139298.4A CN112463804B (en) 2021-02-02 2021-02-02 A Data Processing Method of Image Database Based on KDTree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110139298.4A CN112463804B (en) 2021-02-02 2021-02-02 A Data Processing Method of Image Database Based on KDTree

Publications (2)

Publication Number Publication Date
CN112463804A CN112463804A (en) 2021-03-09
CN112463804B true CN112463804B (en) 2021-06-15

Family

ID=74802248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110139298.4A Expired - Fee Related CN112463804B (en) 2021-02-02 2021-02-02 A Data Processing Method of Image Database Based on KDTree

Country Status (1)

Country Link
CN (1) CN112463804B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297704A (en) * 2021-12-23 2022-04-08 中国电信股份有限公司 Data desensitization method and device, storage medium and electronic equipment
CN115774769A (en) * 2022-11-17 2023-03-10 北京中知智慧科技有限公司 Sensitive word checking processing method and device
CN117351263A (en) * 2023-09-04 2024-01-05 兰州交通大学 A micromap review and discrimination method based on deep learning

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446288A (en) * 2018-10-18 2019-03-08 重庆邮电大学 One kind being based on the internet Spark concerning security matters map detection algorithm

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5968109A (en) * 1996-10-25 1999-10-19 Navigation Technologies Corporation System and method for use and storage of geographic data on physical media
US9633474B2 (en) * 2014-05-20 2017-04-25 Here Global B.V. Method and apparatus for generating a composite indexable linear data structure to permit selection of map elements based on linear elements
US10244278B2 (en) * 2015-12-28 2019-03-26 The Nielsen Company (Us), Llc Methods and apparatus to perform identity matching across audience measurement systems
CN106874415B (en) * 2017-01-23 2019-08-06 国网山东省电力公司电力科学研究院 Environmental sensitive area database construction method and server based on generalized information system
CN108257119B (en) * 2018-01-08 2020-09-01 浙江大学 Near-shore sea area floating hazardous chemical detection early warning method based on near-ultraviolet image processing
CN109257385A (en) * 2018-11-16 2019-01-22 重庆邮电大学 A kind of location privacy protection strategy based on difference privacy

Also Published As

Publication number Publication date
CN112463804A (en) 2021-03-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20210615