CN106203171A

CN106203171A - Big data platform security index system and method

Info

Publication number: CN106203171A
Application number: CN201610578952.0A
Authority: CN
Inventors: 陈天莹; 向雷; 何剑
Original assignee: China Electronic Technology Cyber Security Co Ltd
Current assignee: China Electronic Technology Cyber Security Co Ltd
Priority date: 2016-06-03
Filing date: 2016-07-21
Publication date: 2016-12-07

Abstract

A security index system for a big data platform receives search keyword information submitted by a client and searches the search domain by keyword to identify matching documents, which a server provides. The system includes an index generation logic module, which processes the documents provided by the server to generate a corresponding set of inverted index files containing a feature index, together with a key server, a search request analysis module, a search engine pool, a search result generation module, a security index module, and a metadata engine module. The invention also discloses a big data platform security index method.

Description

Big data platform security index system and method

Technical field

The invention relates to a system and method for a big data platform, and in particular to a security index system and method for a big data platform.

Background art

Big data platforms face many security risks during storage, processing, and transmission. The most effective countermeasure at present is to encrypt the data, which removes the opportunity for others to pry into private content. Search engine technology has largely solved the problem of efficiently locating the information a user needs in a sea of information. However, search engine implementations on big data platforms are usually not secure. A common example is an inverted index file stored in plaintext: an attacker who steals the inverted index file by illegal means can combine it with techniques such as language models to reconstruct the file contents with high fidelity. Even though the original file contents are stored encrypted on the server, privacy is still leaked.

To meet the security and privacy protection requirements of a big data platform, the index files of its search engine also need to be encrypted. This is not easy to achieve. First, the index is a very large file, and encryption and decryption are time-consuming. Second, because a big data platform is continuously updated, the index file is also modified frequently, and each modification necessarily involves encryption and decryption. Third, the rapid growth of a big data platform means that the index file is bound to become huge, which also slows the system down. Finally, an encrypted index file cannot respond to business requests directly, and the additional decryption work further lengthens the user's waiting time.

Summary of the invention

To solve the above problems, the present invention provides a big data platform security index system and method.

A big data platform security index system receives search keyword information submitted by a client and searches the search domain by keyword to identify documents matching the keyword, and a server provides the documents matching the keyword. The big data platform security index system comprises an index generation logic module that processes the documents provided by the server to generate a corresponding inverted index file set containing a feature index, a key server, a search request analysis module, a search engine pool, and a search result generation module. The big data platform security index system further comprises a security index module, which intelligently segments the inverted index file set generated by the index generation logic module according to a single feature, encrypts it, and stores it securely as ciphertext in the distributed file system Hadoop, and a metadata engine module, which manages the index segment files generated by the security index module. The metadata engine module is an index engine that manages the segment files generated by the security index module and can quickly locate a segment according to a feature; it comprises a metadata index module containing a complete index of the features in the segments of the search domain, and a metadata index extension module that is indexed by the metadata index module and in turn indexes the security index. The security index module comprises a cache module, which writes the inverted index file set generated by the index generation logic module into an index cache and supports index cache updates, and an optimizer, which analyzes the index cache data in the cache and generates persistence tasks on demand, and analyzes the security index segments in the metadata engine to generate corresponding segment optimization tasks.

The optimizer comprises an analyzer, which checks the index cache data in the cache that needs to be persisted and generates persistence tasks, and analyzes the status information of the security index segments recorded in the metadata engine to generate segment optimization tasks; a task queue module, which builds a task queue from the tasks produced by the analyzer; and an executor, which processes the tasks recorded in the task queue module.

A big data platform security index method, which uses the big data platform security index system to implement security indexing, comprises the following steps:

s1. A user submits the keyword to be queried through a client; the big data platform security index system receives the keyword information submitted by the client and searches the search domain by keyword to identify documents matching the keyword;

s2. The server provides the documents matching the keyword to the big data platform security index system;

s3. The index generation logic module in the big data platform security index system processes the documents matching the keyword provided by the server to generate a corresponding inverted index file set containing a feature index, the inverted file set containing the feature term and the document retrieval credential docID;

s4. The security index module in the big data platform security index system intelligently segments the inverted file set generated in s3 according to a single feature, each segment having a uniform size and specification, encrypts each segment, and stores it as ciphertext in the distributed file system Hadoop. First, the cache module in the security index module writes the inverted file set into the index cache and supports updating the index cache when a new document is encountered; second, the optimizer in the security index module analyzes the index cache data in the cache module and generates index persistence tasks and segment optimization tasks on demand;

s5. The metadata engine module in the big data platform security index system manages the index segment files generated in s4 and, given the feature to be searched for, quickly locates the segment containing that feature.

With the big data platform security index system and method of the present invention, intelligent encryption and decryption of a multi-level index is achieved without affecting business response, saving time and effort; at the same time, the index can be continuously optimized and updated, which improves business response speed.

Brief description of the drawings

Fig. 1 is a schematic diagram of the architecture of the big data platform security index system of the present invention;

Fig. 2 is a schematic structural diagram of the security index module in the big data platform security index system of the present invention;

Fig. 3 is a schematic diagram of the document generation logic of the big data platform security index method of the present invention;

Fig. 4 is a schematic diagram of index segmentation and encrypted storage in the big data platform security index method of the present invention;

Fig. 5 is a schematic diagram of the multi-level index of the big data platform security index method of the present invention;

Fig. 6 is a schematic flow chart of updating the index cache in the big data platform security index method of the present invention;

Fig. 7 is a schematic flow chart of index persistence in the big data platform security index method of the present invention;

Fig. 8 is a schematic flow chart of new-feature persistence within the index persistence method of the big data platform security index method of the present invention;

Fig. 9 is a schematic flow chart of old-feature persistence within the index persistence method of the big data platform security index method of the present invention;

Fig. 10 is a schematic flow chart of the segment optimization method of the big data platform security index method of the present invention.

Detailed description of the embodiments

For a better understanding of the present invention, it is described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, a big data platform security index system of the present invention receives search keyword information submitted by a client and searches the search domain by keyword to identify documents matching the keyword, and a server provides the documents matching the keyword. The big data platform security index system comprises an index generation logic module that processes the documents provided by the server to generate a corresponding inverted index file set containing a feature index, a key server, a search request analysis module, a search engine pool, and a search result generation module. The big data platform security index system further comprises a security index module, which intelligently segments the inverted index file set generated by the index generation logic module according to a single feature, encrypts it, and stores it securely as ciphertext in the distributed file system Hadoop, and a metadata engine module, which manages the index segment files generated by the security index module. The metadata engine module is an index engine that manages the segment files generated by the security index module and can quickly locate a segment according to a feature; it comprises a metadata index module containing a complete index of the features in the segments of the search domain, and a metadata index extension module that is indexed by the metadata index module and in turn indexes the security index. The security index module comprises a cache module, which writes the inverted index file set generated by the index generation logic module into an index cache and supports index cache updates, and an optimizer, which analyzes the index cache data in the cache and generates persistence tasks on demand, and analyzes the security index segments in the metadata engine to generate corresponding segment optimization tasks.

As shown in Fig. 2, the optimizer comprises an analyzer, which checks the index cache data in the cache that needs to be persisted and generates persistence tasks, and analyzes the status information of the security index segments recorded in the metadata engine to generate segment optimization tasks; a task queue module, which builds a task queue from the tasks produced by the analyzer; and an executor, which processes the tasks recorded in the task queue module.

s3. The index generation logic module in the big data platform security index system processes the documents matching the keyword provided by the server to generate a corresponding inverted index file set containing a feature index, the inverted file set containing the feature term and the document retrieval credential docID. As shown in Fig. 3, after document analysis is performed on the text data, each document is processed into a file containing the feature term and the document retrieval credential docID; the processed files are then sorted and the inverted files are generated. The feature term is not limited to a keyword; it may be a phrase, a number, a code, or any similar value to be searched for within a document;

s4. The security index module in the big data platform security index system intelligently segments the inverted file set generated in s3 according to a single feature, each segment having a uniform size and specification, encrypts each segment, and stores it as ciphertext in the distributed file system Hadoop. First, the cache module in the security index module writes the inverted file set into the index cache and supports updating the index cache when a new document is encountered; second, the optimizer in the security index module analyzes the index cache data in the cache module and generates index persistence tasks and segment optimization tasks on demand. As shown in Fig. 4, the inverted index file is sliced and intelligently segmented according to a single feature, and each segment is encrypted and stored as ciphertext in Hadoop; storing the index in encrypted segments keeps disk read/write cost low and query efficiency stable;

s5. The metadata engine module in the big data platform security index system manages the index segment files generated in s4 and, given the feature to be searched for, quickly locates the segment containing that feature. As shown in Fig. 5, the metadata index extension in the metadata engine indexes the encrypted security index in Hadoop, and the metadata index indexes the metadata index extension.

As shown in Fig. 6, updating the index through the cache module comprises the following steps:

g1. Start;

g2. Perform feature analysis on the document and extract the document features;

g3. Generate inverted files according to the document features;

g4. Generate the inverted file list;

g5. Determine whether the document feature already exists; if it exists, go to g6; if it does not, go to g9;

g6. Locate the index segment according to the feature;

g7. Append the new data to the index segment in accordance with the size and specification of the index segment;

g8. Determine whether any files remain in the inverted file list; if so, go to g4; if not, go to g10;

g9. Create a new index segment according to the document feature and write the file data, then go to g8;

g10. End.
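The cache update loop in steps g4 to g9 can be sketched in Python as follows. This is only an illustration under stated assumptions: the cache is modelled as a plain dict from term to a list of segments, `capacity` is a hypothetical segment size limit, and opening a fresh segment when the current one is full is one possible reading of "in accordance with the size and specification of the index segment".

```python
def update_index_cache(cache, inverted_files, capacity=4):
    """Apply a list of (term, doc_ids) inverted files to the in-memory cache.

    `cache` maps each term to a list of segments, each a list of docIDs.
    `capacity` is a hypothetical per-segment limit chosen for this sketch.
    """
    for term, doc_ids in inverted_files:        # g4/g8: walk the inverted file list
        if term not in cache:                   # g5: is the feature already indexed?
            cache[term] = [[]]                  # g9: no, so open a new segment
        segments = cache[term]                  # g6: locate the feature's segments
        for doc_id in doc_ids:                  # g7: append within the size limit
            if len(segments[-1]) >= capacity:
                segments.append([])
            segments[-1].append(doc_id)
    return cache


cache = {"index": [["d1", "d2", "d3"]]}
update_index_cache(cache, [("index", ["d4", "d5"]), ("secure", ["d5"])])
```

In this run the existing term "index" fills its first segment and spills into a second one, while the new term "secure" gets its own segment.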

As shown in Fig. 7, generating index persistence tasks through the cache module according to the index persistence requirements comprises the following steps:

c1. Start;

c2. Analyze the index cache data in the cache;

c3. Generate persistence tasks per feature;

c4. Generate the list of features to be processed from the features that need to be persisted;

c5. Determine whether the feature already exists; if so, go to c8; if not, go to c6;

c6. Persist the new feature;

c7. Determine whether the feature list is empty; if so, go to c9; if not, go to c4;

c8. Persist the old feature, then go to c6;

c9. End.
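A minimal Python sketch of the dispatch loop in steps c3 to c9 is given below. The names `run_persistence`, `persist_new` and `persist_old` are hypothetical, and the two persistence routines are passed in as callables. Note one deliberate assumption: the step list sends c8 on to c6, but this sketch returns to the empty-list check (c7) after either branch, on the reading that a feature is meant to be persisted only once.

```python
from collections import deque


def run_persistence(cache, known_terms, persist_new, persist_old):
    """Persist every cached term, choosing the new- or old-feature path."""
    pending = deque(sorted(cache))              # c3/c4: one task per cached feature
    while pending:                              # c7: stop when the list is empty
        term = pending.popleft()
        if term in known_terms:                 # c5: has the feature been seen before?
            persist_old(term, cache[term])      # c8: update its latest segment
        else:
            persist_new(term, cache[term])      # c6: create its first segment
            known_terms.add(term)


log = []
known = {"index"}
run_persistence(
    {"index": ["d4"], "secure": ["d5"]},
    known,
    persist_new=lambda term, ids: log.append(("new", term, ids)),
    persist_old=lambda term, ids: log.append(("old", term, ids)),
)
```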

As shown in Fig. 8, new-feature persistence comprises the following steps:

x1. Start;

x2. Create a new empty segment template;

x3. Write the data section into the newly created segment;

x4. Update the head section;

x5. Encrypt the segment;

x6. Notify the metadata engine.
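Steps x2 to x6 can be illustrated with the Python sketch below. Everything concrete in it is an assumption made for illustration: the head/data layout as JSON, the segment naming scheme, a dict standing in for the distributed file system, and above all `toy_cipher`, which is a placeholder XOR stream and not a secure cipher. A real deployment would use a vetted cipher with keys obtained from the key server.

```python
import hashlib
import json


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder XOR stream (NOT secure); the same call encrypts and decrypts."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def persist_new_feature(term, doc_ids, key, store, notify):
    segment = {"head": {"term": None, "count": 0}, "data": []}    # x2: empty template
    segment["data"].extend(doc_ids)                               # x3: write data
    segment["head"] = {"term": term, "count": len(doc_ids)}       # x4: update head
    blob = toy_cipher(key, json.dumps(segment).encode("utf-8"))   # x5: encrypt
    segment_id = f"{term}-0"          # hypothetical naming scheme
    store[segment_id] = blob          # dict standing in for the file system
    notify(term, segment_id)                                      # x6: notify
    return segment_id


store, notices = {}, []
seg_id = persist_new_feature(
    "index", ["doc1", "doc2"], b"demo-key", store,
    notify=lambda term, sid: notices.append((term, sid)),
)
```

Only the encrypted blob is written to `store`; the plaintext segment exists only inside the function.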

As shown in Fig. 9, old-feature persistence comprises the following steps:

l1. Start;

l2. Locate the latest segment containing the feature;

l3. Retrieve the file corresponding to the index segment;

l4. Decrypt the file;

l5. Update the file;

l6. Encrypt the file;

l7. Store the encrypted file in Hadoop;

l8. Start the deletion countdown for the old segment;

l9. Notify the metadata engine.
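The old-feature path (l2 to l9) might look like the following Python sketch. As before, `toy_cipher` is an insecure placeholder, the dict `store` stands in for Hadoop, and the JSON head/data layout and segment naming are hypothetical. The sketch also assumes that the updated content is written as a new segment file while the old file is kept until its deletion countdown expires, which is one reading of step l8.

```python
import hashlib
import json


def toy_cipher(key, data):
    """Placeholder XOR stream (NOT secure); the same call encrypts and decrypts."""
    stream = b"".join(
        hashlib.sha256(key + i.to_bytes(4, "big")).digest()
        for i in range(len(data) // 32 + 1)
    )
    return bytes(a ^ b for a, b in zip(data, stream))


def persist_old_feature(term, new_doc_ids, key, store, latest, pending_delete, notify):
    old_id = latest[term]                                     # l2: latest segment
    blob = store[old_id]                                      # l3: fetch its file
    segment = json.loads(toy_cipher(key, blob))               # l4: decrypt
    segment["data"].extend(new_doc_ids)                       # l5: update
    segment["head"]["count"] = len(segment["data"])
    new_blob = toy_cipher(key, json.dumps(segment).encode("utf-8"))   # l6: encrypt
    new_id = f"{term}-{int(old_id.rsplit('-', 1)[1]) + 1}"    # hypothetical naming
    store[new_id] = new_blob                                  # l7: store
    pending_delete.append(old_id)                             # l8: deletion countdown
    latest[term] = new_id
    notify(term, new_id)                                      # l9: notify
    return new_id


key = b"demo-key"
first = {"head": {"term": "index", "count": 1}, "data": ["doc1"]}
store = {"index-0": toy_cipher(key, json.dumps(first).encode("utf-8"))}
latest, pending, notices = {"index": "index-0"}, [], []
new_id = persist_old_feature(
    "index", ["doc2"], key, store, latest, pending,
    notify=lambda term, sid: notices.append((term, sid)),
)
```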

As shown in Fig. 10, the optimizer analyzes the index cache data and generates segment optimization tasks through the following steps:

d1. Start;

d2. Analyze the security index;

d3. Generate optimization tasks according to the analysis results of d2;

d4. Determine whether a segment needs to be deleted; if so, go to d5; if not, go to d6;

d5. Delete the expired segment;

d6. Determine whether a segment needs to be split; if so, go to d7; if not, go to d9;

d7. Split the segment into multiple new segments;

d8. Start the deletion countdown for the old segment;

d9. Determine whether segments need to be merged; if so, go to d10; if not, go to d12;

d10. Merge the segments;

d11. Start the deletion countdown for the old segments;

d12. Notify the metadata engine.
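One optimization pass over steps d4 to d11 is sketched below in Python. The decision rules are assumptions, since the description does not state them: a segment is deleted once its hypothetical `expires_at` time has passed, split when it holds more than `capacity` postings, and merged pairwise with another segment of the same term when both are at most half full. The function returns the old segment IDs that enter the deletion countdown, which a caller could then report to the metadata engine (d12).

```python
def optimize_segments(segments, now, capacity=4):
    """One pass over {segment_id: {"term", "doc_ids", "expires_at"}}.

    Mutates `segments` in place and returns the IDs of old segments that
    enter the deletion countdown.
    """
    countdown = []

    # d4/d5: drop segments whose retention period has passed.
    for seg_id in [s for s, seg in segments.items() if seg["expires_at"] <= now]:
        del segments[seg_id]

    # d6-d8: split any segment holding more than `capacity` postings.
    for seg_id in [s for s, seg in segments.items() if len(seg["doc_ids"]) > capacity]:
        seg = segments[seg_id]
        ids = seg["doc_ids"]
        for n, start in enumerate(range(0, len(ids), capacity)):
            segments[f"{seg_id}.{n}"] = {**seg, "doc_ids": ids[start:start + capacity]}
        countdown.append(seg_id)

    # d9-d11: merge pairs of same-term segments that are at most half full.
    by_term = {}
    for seg_id, seg in segments.items():
        if seg_id not in countdown:
            by_term.setdefault(seg["term"], []).append(seg_id)
    for term, ids in by_term.items():
        small = [s for s in ids if len(segments[s]["doc_ids"]) * 2 <= capacity]
        for a, b in zip(small[0::2], small[1::2]):
            segments[f"{a}+{b}"] = {
                "term": term,
                "doc_ids": segments[a]["doc_ids"] + segments[b]["doc_ids"],
                "expires_at": max(segments[a]["expires_at"], segments[b]["expires_at"]),
            }
            countdown += [a, b]
    return countdown


segments = {
    "a": {"term": "index", "doc_ids": ["d1"], "expires_at": 5},
    "b": {"term": "index", "doc_ids": ["d2", "d3", "d4", "d5", "d6"], "expires_at": 99},
    "c": {"term": "secure", "doc_ids": ["d7"], "expires_at": 99},
    "d": {"term": "secure", "doc_ids": ["d8", "d9"], "expires_at": 50},
}
countdown = optimize_segments(segments, now=10)
```

Old segments stay in `segments` until their countdown ends, so queries that are already in flight can still read them.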

Briefly, the big data platform security index system provided by the present invention is directed at searching a set of documents (or files) within a search domain to find documents relevant to a user. A search typically involves obtaining a set of keywords from the user to direct the search and then identifying all documents within the search domain that match those keywords. The resulting candidate document set contains every potentially relevant document in the search domain. A ranking algorithm can then be applied to the candidate documents to predict each document's relevance to the user, and the candidates are usually presented to the user in descending order of predicted relevance. Implementations of this type of search typically use an inverted index structure that associates keywords with documents.

The index generation logic module processes the documents submitted by the server. In the present invention, the output of the index generation logic is described as an inverted file set composed of terms and docIDs. A feature (term) is not limited to a keyword; it may be a phrase, a number, a code, or any similar value to be searched for within a document. A token (docID) is the credential used to retrieve a document, usually a string indicating the file path, and the server is required to attach the corresponding token when it submits a document. The security index module "intelligently" cuts a large index file into small segments and stores them as ciphertext in the distributed file system. The inverted file set submitted by the index generation logic module is taken over by the cache module, which first writes it into the index cache and then notifies the metadata engine module. This effectively reduces how often the index is updated, which is very important for providing a stable and reliable search service and is also the key to implementing real-time search. Cache is expensive and very limited, so the cached index is later persisted to Hadoop, which is one of the optimizer's most important jobs.
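To make the term/docID structure concrete, the following Python sketch builds an inverted file set from a handful of documents. The whitespace tokenisation and the example file paths are assumptions made for illustration; the description leaves the document analysis step open.

```python
from collections import defaultdict


def build_inverted_files(documents):
    """Map each term to the sorted list of docIDs whose text contains it.

    `documents` maps docID -> text. Splitting on whitespace is an assumption
    of this sketch, not something the description prescribes.
    """
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort terms and docIDs so the output is deterministic.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {
    "/data/a.txt": "secure index for big data",
    "/data/b.txt": "encrypted index segment",
}
inverted = build_inverted_files(docs)
```

Here the docID is a file path string, matching the description of the token as a string that indicates where the document can be retrieved.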

A large index file is cut into a series of inverted files, which are packed into segments according to certain rules and then encrypted and stored in the distributed file system. The large index file does not physically exist; it is replaced entirely by a carefully designed set of segment files. First, each segment file is designed to store only a single feature, which is very useful for loading index fragments selectively. In addition, segment files of uniform size and specification make the encryption and decryption workload controllable.
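The single-feature, uniform-size segmentation can be sketched as follows. `SEGMENT_CAPACITY` and the `Segment` fields are hypothetical; the description does not give the actual segment size or layout, and encryption of each segment is left out of this sketch.

```python
from dataclasses import dataclass, field

SEGMENT_CAPACITY = 3  # hypothetical maximum number of postings per segment


@dataclass
class Segment:
    term: str                 # the single feature this segment stores
    seq: int                  # position of the segment within the term's series
    doc_ids: list = field(default_factory=list)


def split_into_segments(term, doc_ids, capacity=SEGMENT_CAPACITY):
    """Cut one term's posting list into fixed-capacity, single-term segments."""
    return [
        Segment(term, seq, doc_ids[start:start + capacity])
        for seq, start in enumerate(range(0, len(doc_ids), capacity))
    ]


segments = split_into_segments("index", [f"doc{i}" for i in range(7)])
```

Because every segment is bounded by the same capacity, the cost of decrypting one segment to answer a query is bounded as well, which is the controllability the paragraph above refers to.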

The metadata engine module manages the very large number of segment files, for example quickly locating the segment files for a given feature on a given topic within a given time period, and it also provides the meta-information that the optimizer's analyzer and executor need for segment optimization. It consists of the metadata index and the metadata index extension. In practice, the metadata engine module maintains the status information of the segment files and the meta-information used to locate the segment files associated with a feature under specific constraints. The status information of a segment file includes what the optimizer needs, such as the segment's storage saturation, its size, and the time span of the indexed feature. The constraints for locating a segment include the topic tag at indexing time, the range of indexing times, whether the segment is the latest one, and similar information.
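A small Python sketch of locating segments under these constraints is shown below. The `SegmentMeta` record and the `locate_segments` function are hypothetical names, and a linear scan over a list stands in for the metadata index that a real engine would use.

```python
from dataclasses import dataclass


@dataclass
class SegmentMeta:
    segment_id: str
    term: str
    topic: str          # topic tag assigned at indexing time
    start: int          # earliest indexing time covered by the segment
    end: int            # latest indexing time covered by the segment
    saturation: float   # fraction of the segment's capacity in use


def locate_segments(catalog, term, topic=None, time_range=None, latest_only=False):
    """Return metadata of segments for `term` that satisfy the given constraints."""
    hits = [m for m in catalog if m.term == term]
    if topic is not None:
        hits = [m for m in hits if m.topic == topic]
    if time_range is not None:
        lo, hi = time_range
        # Keep segments whose time span overlaps the requested range.
        hits = [m for m in hits if m.start <= hi and m.end >= lo]
    hits.sort(key=lambda m: m.end)
    return hits[-1:] if latest_only else hits


catalog = [
    SegmentMeta("seg-1", "index", "security", 0, 10, 1.0),
    SegmentMeta("seg-2", "index", "security", 11, 20, 0.4),
    SegmentMeta("seg-3", "index", "storage", 5, 15, 0.7),
]
newest = locate_segments(catalog, "index", topic="security", latest_only=True)
early = locate_segments(catalog, "index", time_range=(0, 4))
```

The `saturation` field is not used for lookup here; it represents the status information that the optimizer would read when planning segment optimization.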

The metadata index module is a complete index of the features found in the segment files of the search domain. It is structured to support various types of search and can be used independently of the metadata index extension. A flag within the metadata engine indicates whether information is available in the metadata index extension, that is, an index relevant to the requirement. This flag exists for every feature, so as to provide control over how and when the indexes in the metadata index extension are used. This design is necessary to balance response time against the fast and nearly unbounded content growth of a big data platform. For example, the optimizer's requests are far less urgent than search request analysis, and searches over stale data have even lower timeliness requirements. These requests are served by different indexes, which may sometimes be split further into several indexes, so that the response time for each type of request meets its requirement.

The metadata index extension is essentially a container of indexes, which actually record the data the metadata engine has to manage; indexes are split according to index size, topic tag, and time span. The core of the whole architecture can be described as an index tree: the metadata index indexes the metadata index extension, and the metadata index extension in turn indexes the security index. The main difference is that the security index is designed to be stored in encrypted segments. Because a B+ tree has the advantages of low disk read/write cost and stable query efficiency, the metadata index and the metadata index extension are usually implemented as B+ tree indexes.
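The two-level lookup (metadata index, then metadata index extension, then security index segments) can be illustrated with the Python sketch below. It substitutes a sorted key array with binary search for a B+ tree, which is a simplification and not the structure named in the description; the extension names "recent" and "archive" and the segment IDs are invented for the example.

```python
import bisect


class SortedIndex:
    """Sorted key/value lookup via binary search; a stand-in for a B+ tree."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self.keys = [k for k, _ in pairs]
        self.values = [v for _, v in pairs]

    def get(self, key):
        pos = bisect.bisect_left(self.keys, key)
        if pos < len(self.keys) and self.keys[pos] == key:
            return self.values[pos]
        return None


# Level 2: each extension maps a term to the IDs of its encrypted segments.
extensions = {
    "recent": SortedIndex({"index": ["seg-7", "seg-9"], "secure": ["seg-8"]}),
    "archive": SortedIndex({"index": ["seg-1"]}),
}

# Level 1: the metadata index maps a term to the extensions that cover it.
metadata_index = SortedIndex({"index": ["recent", "archive"], "secure": ["recent"]})


def find_segments(term):
    """Resolve a term to encrypted segment IDs by walking both index levels."""
    found = []
    for name in metadata_index.get(term) or []:
        found.extend(extensions[name].get(term) or [])
    return found
```

For example, `find_segments("index")` walks both extensions, and only the segments it returns would need to be fetched and decrypted.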

The optimizer consists of an analyzer, a task queue module, and an executor. The analyzer generates persistence tasks by checking the index cache data in the cache that needs to be persisted, and generates segment optimization tasks by analyzing the status information of the security index segments recorded in the metadata engine module. The executor processes the tasks in the task queue, which fall into two categories: index persistence and security index segment optimization.
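The analyzer, task queue, and executor can be sketched in Python as below. The thresholds in `analyze` (a segment larger than `capacity`, or with saturation under 0.25) are hypothetical triggers chosen for the example, and the handlers are supplied by the caller rather than implemented here.

```python
from collections import deque


def analyze(cache, segment_states, capacity=4):
    """Return ('persist', term) and ('optimize', segment_id) tasks in order."""
    tasks = [("persist", term) for term, ids in sorted(cache.items()) if ids]
    tasks += [
        ("optimize", seg_id)
        for seg_id, state in sorted(segment_states.items())
        if state["size"] > capacity or state["saturation"] < 0.25
    ]
    return tasks


def run_optimizer(cache, segment_states, handlers):
    queue = deque(analyze(cache, segment_states))    # task queue module
    done = []
    while queue:                                     # executor
        kind, target = queue.popleft()
        handlers[kind](target)
        done.append((kind, target))
    return done


log = []
handlers = {
    "persist": lambda term: log.append(f"persist:{term}"),
    "optimize": lambda seg_id: log.append(f"optimize:{seg_id}"),
}
done = run_optimizer(
    {"index": ["d1"], "idle": []},
    {"seg-1": {"size": 6, "saturation": 1.0}, "seg-2": {"size": 2, "saturation": 0.5}},
    handlers,
)
```

Keeping the queue between the analyzer and the executor means tasks can be produced and processed at different rates, which fits the description of optimizer work as less urgent than serving searches.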

The cache module can be used in many ways to support updating the index cache and to support persisting the index cache held in the cache module.

With the big data platform security index system and method of the present invention, intelligent encryption and decryption of a multi-level index is achieved without affecting business response, saving time and effort; at the same time, the index can be continuously optimized and updated, which improves business response speed. Storing the inverted index files in encrypted segments keeps disk read/write cost low and query efficiency stable, makes encryption and decryption fast, and does not affect system stability. Index cache updates, persistence of new and old features, and segment optimization ensure the integrity and stability of the index files, avoid redundant duplication of features, and improve business response speed.

The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent replacement or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the scope of protection of the present invention.

Claims

1. A big data platform security index system, which receives search keyword information submitted by a client and searches the search domain by keyword to identify documents matching the keyword, a server providing the documents matching the keyword, the big data platform security index system comprising an index generation logic module that processes the documents provided by the server to generate a corresponding inverted index file set containing a feature index, a key server, a search request analysis module, a search engine pool, and a search result generation module, characterized in that:

the big data platform security index system further comprises a security index module, which intelligently segments the inverted index file set containing the feature index generated by the index generation logic module according to a single feature, encrypts it, and stores it securely as ciphertext in the distributed file system Hadoop, and a metadata engine module, which manages the index segment files generated by the security index module;

the metadata engine module is an index engine that manages the segment files generated by the security index module and can quickly locate a segment according to a feature, and comprises a metadata index module containing a complete index of the features in the segments of the search domain, and a metadata index extension module that is indexed by the metadata index module and in turn indexes the security index;

the security index module comprises a cache module, which writes the inverted index file set generated by the index generation logic module into an index cache and supports index cache updates, and an optimizer, which analyzes the index cache data in the cache and generates persistence tasks on demand, and analyzes the security index segments in the metadata engine to generate corresponding segment optimization tasks.

2. The big data platform security indexing system according to claim 1, characterized in that,

The optimizer comprises an analyzer that checks the cache for index cache data that needs to be persisted and generates persistence tasks, and that analyzes the status information of the security index segments recorded in the metadata engine to generate segment optimization tasks; a task queue module that builds a task queue from the tasks produced by the analyzer; and an executor that processes the tasks recorded in the task queue module.
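
By way of non-limiting illustration, the cooperation of the analyzer, the task queue module, and the executor could be sketched in Python as follows. The `Task` type, the `analyze` and `run` functions, the dictionary-based cache and segment status, and the `max_segment_bytes` threshold are assumptions introduced only for this example; the disclosure does not specify them.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    kind: str    # "persist" or "optimize" (hypothetical task kinds)
    target: str  # a feature name or a segment id


def analyze(cache, segment_status, max_segment_bytes=1024):
    """Analyzer: derive tasks from the cache and the recorded segment status.

    cache maps feature -> list of cached docIDs; segment_status maps
    segment id -> size in bytes. Both shapes are assumptions.
    """
    tasks = [Task("persist", feature)
             for feature, postings in cache.items() if postings]
    tasks += [Task("optimize", seg_id)
              for seg_id, size in segment_status.items()
              if size > max_segment_bytes]
    return tasks


def run(tasks, handlers):
    """Task queue module plus executor: process tasks in FIFO order."""
    queue = deque(tasks)
    done = []
    while queue:
        task = queue.popleft()
        handlers[task.kind](task.target)  # dispatch to the matching handler
        done.append(task)
    return done
```

A caller would pass one handler per task kind, for example a function that persists a cached feature and a function that optimizes a segment.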

3. A big data platform security indexing method, characterized in that the method uses a big data platform security index system to implement security indexing and comprises the following steps:

s1. The user submits the keyword to be queried through the client; the big data platform security index system receives the keyword information submitted by the client and searches the search domain according to the keyword to identify documents matching the keyword;

s2. The server provides documents matching keywords to the big data platform security indexing system;

s3. The index generation logic module in the big data platform security index system processes the keyword-matching documents provided by the server to generate a corresponding inverted index file set containing feature indexes, the inverted files containing the feature term and the document extraction credential docID;

s4. The security index module in the big data platform security index system intelligently segments the inverted file set generated in s3 by single feature, with a uniform size specification for every segment, and encrypts each segment for storage as an encrypted file in the distributed file system Hadoop; first, the cache module in the security index module writes the inverted file set into the index cache while supporting index cache updates; second, the optimizer in the security index module analyzes the index cache data in the cache module and generates index persistence tasks and segment optimization tasks on demand;

s5. The metadata engine module in the big data platform security index system manages the index segment files generated in s4, searches for and locates file features as needed, and quickly reads the segment in which a located feature resides.

4. The big data platform security indexing method according to claim 3, characterized in that,

Updating the index through the cache module includes the following steps:

g1. Start;

g2. Perform feature analysis on the document and extract document features;

g3. Generate an inverted file according to the features of the document;

g4. Generate an inverted file list;

g5. Determine whether the document feature already exists; if the feature already exists, go to g6; if the feature does not exist, go to g9;

g6. Locate the index segment according to the feature;

g7. Add new data to the index segment according to the size and specification of the index segment;

g8. Determine whether any files remain in the inverted file list; if so, go to g4; if not, go to g10;

g9. Create a new index segment according to the feature of the document and write the file data, then go to g8;

g10. End.
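
By way of non-limiting illustration, the update flow above could be sketched in Python as follows. The whitespace tokenizer in `extract_features`, the in-memory `segments` dictionary, and the `SEGMENT_CAPACITY` value are assumptions made for this example, as is the choice to open a further segment when the latest one is full.

```python
SEGMENT_CAPACITY = 3  # hypothetical size specification: max docIDs per segment


def extract_features(text):
    """g2: toy feature analysis, using lowercase whitespace tokens."""
    return sorted(set(text.lower().split()))


def update_index(segments, doc_id, text, capacity=SEGMENT_CAPACITY):
    """Add one document to the cached index.

    segments maps feature -> list of segments, and each segment is a
    list of docIDs. This structure is an assumption for the sketch.
    """
    for feature in extract_features(text):      # g3/g4: one entry per feature
        if feature in segments:                 # g5: does the feature exist?
            latest = segments[feature][-1]      # g6: locate the index segment
            if len(latest) < capacity:          # g7: respect the segment size
                latest.append(doc_id)
            else:
                segments[feature].append([doc_id])
        else:
            segments[feature] = [[doc_id]]      # g9: create a new segment
    return segments
```

Calling `update_index` once per incoming document keeps every segment within the assumed capacity while grouping docIDs by single feature.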

5. The big data platform security indexing method according to claim 3, characterized in that,

Generating index persistence tasks through the cache module according to the index persistence requirements comprises the following steps:

c1. Start;

c2. Analyze the index cache data in the cache;

c3. Generate persistence tasks according to features;

c4. Generate a feature list to be processed according to the features that need to be persisted;

c5. Determine whether the feature already exists; if so, go to c8; if not, go to c6;

c6. Persist new features;

c7. Determine whether the feature list is empty; if so, go to c9; if not, go to c4;

c8. Persist the old features, then go to c6;

c9. End.

6. The big data platform security indexing method according to claim 5, characterized in that,

The persistence of the new feature comprises the following steps:

x1. Start;

x2. Create a new empty segment template;

x3. Write the data section into the newly created segment;

x4. Update the head data;

x5. Encrypt the segment;

x6. Notify the metadata engine module.
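
By way of non-limiting illustration, steps x2 to x6 could be sketched in Python as follows. The JSON segment layout with `head` and `data` fields, the segment naming, the `store` dictionary standing in for Hadoop, and the `notify` callback standing in for the metadata engine module are all assumptions. The `toy_cipher` function is a deliberately insecure placeholder; the disclosure does not name a cipher, and a real deployment would use a vetted algorithm with keys obtained from the key server.

```python
import json


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR transform, NOT secure; stands in for a real cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def persist_new_feature(feature, doc_ids, key, store, notify):
    """Persist a feature that has no segment yet."""
    # x2: new empty segment template (hypothetical layout)
    segment = {"head": {"feature": feature, "count": 0}, "data": []}
    # x3: write the data section
    segment["data"].extend(doc_ids)
    # x4: update the head data to describe the data section
    segment["head"]["count"] = len(segment["data"])
    # x5: encrypt the serialized segment
    blob = toy_cipher(json.dumps(segment).encode("utf-8"), key)
    segment_id = f"{feature}-0"   # hypothetical naming scheme
    store[segment_id] = blob      # stand-in for writing to Hadoop
    # x6: notify the metadata engine module
    notify(feature, segment_id)
    return segment_id
```

Because the XOR placeholder is its own inverse, applying `toy_cipher` again with the same key recovers the serialized segment, which is convenient for testing the flow only.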

7. The big data platform security indexing method according to claim 5, characterized in that,

The persistence of the old features comprises the following steps:

l1. Start;

l2. Locate the latest segment containing the feature;

l3. Extract the file corresponding to the index segment;

l4. Decrypt the file;

l5. Update the file;

l6. Encrypt the file;

l7. Store the encrypted file in Hadoop;

l8. Start the countdown for deleting the old segment;

l9. Notify the metadata engine module.
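
By way of non-limiting illustration, steps l2 to l9 could be sketched in Python as follows. The `store` dictionary standing in for Hadoop, the `latest` map standing in for the metadata engine lookup, the versioned segment naming, the `pending_delete` list modeling the deletion countdown, and the JSON segment layout are assumptions. The `toy_cipher` function is an insecure placeholder for whatever cipher is actually used.

```python
import json


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """Placeholder XOR transform, NOT secure; stands in for a real cipher."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def persist_old_feature(feature, new_doc_ids, key, store, latest,
                        pending_delete, notify):
    """Append docIDs to the latest encrypted segment of an existing feature."""
    old_id = latest[feature]                        # l2: locate latest segment
    blob = store[old_id]                            # l3: extract its file
    segment = json.loads(toy_cipher(blob, key))     # l4: decrypt
    segment["data"].extend(new_doc_ids)             # l5: update
    segment["head"]["count"] = len(segment["data"])
    new_blob = toy_cipher(json.dumps(segment).encode("utf-8"), key)  # l6
    version = int(old_id.rsplit("-", 1)[1]) + 1     # hypothetical versioning
    new_id = f"{feature}-{version}"
    store[new_id] = new_blob                        # l7: store encrypted file
    pending_delete.append(old_id)                   # l8: queue old segment
    latest[feature] = new_id
    notify(feature, new_id)                         # l9: notify metadata engine
    return new_id
```

In this sketch the old segment stays readable until a separate cleanup removes the ids queued in `pending_delete`, so that searches already in progress are not disturbed; that rationale is an assumption about why a countdown is used.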

8. The big data platform security indexing method according to claim 3, characterized in that,

The optimizer analyzing the index cache data and generating segment optimization tasks comprises the following steps:

d1. Start;

d2. Analyze the security index;

d3. Generate optimization tasks according to the analysis results of d2;

d4. Determine whether a segment needs to be deleted; if so, go to d5; if not, go to d6;

d5. Delete the expired segment;

d6. Determine whether to split the segment; if so, go to d7; if not, go to d9;

d7. Split the segment into multiple new segments;

d8. Start the countdown for deleting the old segment;

d9. Determine whether to merge segments; if so, go to d10; if not, go to d12;

d10. Merge the segments;

d11. Start the countdown for deleting the old segment;

d12. Notify the metadata engine module.
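
By way of non-limiting illustration, the delete, split, and merge decisions above could be sketched in Python as follows. The `SegmentInfo` record, the `max_size` and `min_size` thresholds, and the rule of merging only undersized segments of the same feature are assumptions; the claim does not state the criteria. The sketch only plans the actions and does not perform them.

```python
from dataclasses import dataclass


@dataclass
class SegmentInfo:
    seg_id: str
    feature: str
    size: int            # hypothetical size measure, e.g. number of docIDs
    expired: bool = False


def plan_optimizations(segments, max_size=100, min_size=10):
    """Check each segment for deletion, then splitting; plan merges last."""
    plan = []
    live = []
    for seg in segments:
        if seg.expired:                          # d4/d5: delete expired
            plan.append(("delete", [seg.seg_id]))
        elif seg.size > max_size:                # d6/d7: split oversized
            plan.append(("split", [seg.seg_id]))
        else:
            live.append(seg)
    small = {}
    for seg in live:                             # d9: find merge candidates
        if seg.size < min_size:
            small.setdefault(seg.feature, []).append(seg.seg_id)
    for ids in small.values():
        if len(ids) > 1:                         # d10: merge same-feature group
            plan.append(("merge", ids))
    return plan
```

An executor would carry out each planned action, start the deletion countdown for the superseded segments, and then notify the metadata engine module.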