WO2021134594A1 - Data processing method and apparatus

Info

Publication number: WO2021134594A1
Application number: PCT/CN2019/130773
Authority: WO
Inventors: 张景芳; 张尧; 黄焰
Original assignee: 华为技术有限公司
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2021-07-08
Also published as: CN114902636B; CN114902636A

Abstract

Embodiments of the present invention provide a data processing method and apparatus. The method comprises: obtaining the metadata of data in a first data system; and converting the metadata of the data in the first data system from a first format into a second format, wherein the first format is the format of the metadata of the data in the first data system, and the second format is a metadata format that can be identified by the first data system and a second data system. The method provided by the embodiments of the present invention is used for defining, for a plurality of data systems, a unified metadata format that can be identified by the plurality of data systems, thereby simplifying data transmission between different data systems.

Description

Data processing method and apparatus

Technical field

This application relates to the storage field, and in particular to a data processing method and apparatus.

Background

A data lake is a repository or system that stores data in its original format, without structuring the data in advance. A data lake can store structured data (such as tables in a relational database), semi-structured data (such as CSV, logs, XML, and JSON), unstructured data (such as emails, documents, and PDFs), and binary data (such as images, audio, and video). To manage the data in a data lake, all data entering the data lake must be accompanied by its metadata, and the metadata of the different types of data is organized into a catalog service. When the data is later used, it can be analyzed based on the catalog service, and the analyzed data is provided to users.

Currently, the catalog services of different data lakes are independent of one another and come from different vendors. For example, in the security field, each province or city purchases data lake products from a different vendor, and data lake products from different vendors have different catalog services. Therefore, when a higher-level unit (such as a province or the central government) wants to integrate the data of lower-level units (such as cities or provinces), it cannot integrate the data directly because it cannot recognize the catalog services of the cities or provinces. One current implementation is as follows: if data lake A (for example, a province) and data lake B (for example, a city) have different catalog services but data lake A needs to obtain data from data lake B, data lake A notifies data lake B of the data it wants to access. Data lake B places the metadata of that data on an exchange platform. The exchange platform converts the metadata into the form used by a catalog service that data lake A can recognize, and notifies data lake A to obtain the metadata from the exchange platform. Data lake A then obtains the data in data lake B based on the metadata.

In the above implementation, the metadata format defined in the catalog service of data lake A must be converted directly into the metadata format defined in the catalog service of data lake B. As a result, when data lake B needs to obtain data from multiple data lakes with different catalog services, the exchange platform needs to know the catalog services of all of those data lakes, which makes metadata format conversion relatively complex.

Summary of the invention

The present invention provides a data processing method and apparatus. By defining a metadata format that both data systems can recognize, the invention simplifies communication between the two data systems.

A first aspect of the present invention provides a data processing method. The method includes: obtaining metadata of data in a first data system; and converting the metadata of the data in the first data system from a first format into a second format, where the first format is the format of the metadata of the data in the first data system, and the second format is a metadata format that both the first data system and a second data system can recognize.

In the data processing method provided by the embodiments of the present invention, a unified metadata format is defined for different data systems. When the first data system needs to obtain data from the second data system, the second data system first converts its metadata into the metadata format defined by the catalog template. After obtaining the metadata in the unified format, the first data system converts it into the metadata format of the first data system. In this way, any data system that can recognize the unified metadata format can exchange data with other data lakes, which simplifies communication between different data systems.

In a possible implementation of the first aspect, the method further includes predefining a metadata template, where the metadata format used by the metadata template is the second format.

In a possible implementation of the first aspect, the method further includes: sending the metadata converted into the second format to a messaging platform, so that the second data system obtains the metadata converted into the second format from the messaging platform.

The second data system sends the metadata to the messaging platform, so the first data system can obtain the metadata of the second data system from the messaging platform without actively fetching it from the second data system.

In a possible implementation of the first aspect, sending the metadata converted into the second format to the messaging platform includes: determining whether the metadata converted into the second format satisfies a preset publishing rule;

and, after determining that the metadata converted into the second format satisfies the preset publishing rule, sending the metadata converted into the second format to the messaging platform.

By setting metadata publishing rules in the second data system, only metadata that satisfies the publishing rules is published to the messaging platform. In this way, data security can be ensured without permission authentication.

In a possible implementation, the method further includes:

sending a publishing metadata queue creation request to the messaging platform, so that the messaging platform creates a publishing metadata queue;

and sending the metadata converted into the second format to the messaging platform includes:

writing the metadata converted into the second format into the publishing metadata queue of the messaging platform.

The first data system and the second data system form a data federation, and a publishing metadata queue is created. The publishing metadata queue serves as the data transmission channel between the first data system and the second data system, which streamlines the data transmission process.

In a possible implementation, the method further includes: obtaining the address of the publishing metadata queue sent by the messaging platform, and sending the address of the publishing metadata queue to the second data system.

In a possible implementation, the method further includes: sending an original metadata queue creation request to the messaging platform, so that the messaging platform creates an original metadata queue;

and converting the metadata in the first data system from the first format into the second format includes:

converting the metadata of the first data system in the original metadata queue from the first format into the second format.

Creating the original metadata queue provides a buffer for the metadata to be converted, which helps ensure data reliability.

A second aspect of the present invention provides a data processing method. The method includes: obtaining metadata of data of a first data system; and converting the metadata of the data of the first data system from a second format into a third format, where the second format is a metadata format that both the first data system and a second data system can recognize, and the third format is the format of the metadata of the data in the second data system.

In a possible implementation, the method includes predefining a metadata template, where the metadata format used by the metadata template is the second format.

In a possible implementation, obtaining the metadata of the first data system includes:

obtaining the metadata of the data of the first data system from a messaging platform.

Obtaining the metadata of the first data system through the messaging platform allows the second data system to obtain that metadata without accessing the first data system.

In a possible implementation, the method further includes: determining whether the metadata satisfies a preset receiving rule;

and, after determining that the converted metadata satisfies the preset receiving rule, converting the format of the metadata of the data of the first data system from the first format into the second format.

By setting receiving rules, data that the second data system does not need can be filtered out, which helps ensure data security.

In a possible implementation, the method further includes: receiving the address, sent by the first data system, of the publishing metadata queue in the messaging platform;

and obtaining the metadata of the data of the first data system from the messaging platform includes:

obtaining the metadata from the publishing metadata queue according to the address of the publishing metadata queue.

A third aspect of the present invention provides a data processing apparatus. The data processing apparatus includes a plurality of modules, and the modules are configured to perform the functions of the steps in the data processing method of the first aspect. The beneficial effects achieved are the same as those of the methods provided in the first aspect and are not repeated here.

A fourth aspect of the present invention provides a data processing apparatus. The data processing apparatus includes a plurality of modules, and the modules are configured to perform the functions of the steps in the data processing method of the second aspect. The beneficial effects achieved are the same as those of the methods provided in the first aspect and are not repeated here.

A fifth aspect of the present invention provides a server. The server includes a processor and a memory, the memory stores program instructions, and the processor executes the program instructions to implement the functions of the method provided in the first aspect.

A sixth aspect of the present invention provides a server. The server includes a processor and a memory, the memory stores program instructions, and the processor executes the program instructions to implement the functions of the method provided in the second aspect.

A seventh aspect of the present invention provides a storage medium. The storage medium stores program instructions, and the program instructions are executed by a processor to implement the functions of the method provided in the first aspect.

An eighth aspect of the present invention provides a storage medium. The storage medium stores program instructions, and the program instructions are executed by a processor to implement the functions of the method provided in the second aspect.

Description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings needed for describing the embodiments or the prior art are briefly introduced below.

FIG. 1 is a schematic structural diagram of a data lake according to an embodiment of the present invention.

FIG. 2 shows the format of the metadata in the catalog template defined in an embodiment of the present invention.

FIG. 3 is a structural diagram of a federated system in an embodiment of the present invention.

FIG. 4 is a flowchart of a method by which data lakes register with each other to form a federation in an embodiment of the present invention.

FIG. 5 is a flowchart of a method for transmitting data between data lakes in an embodiment of the present invention.

FIG. 6 shows the metadata format of a Hive database in an embodiment of the present invention.

FIG. 7 is a schematic diagram of the metadata of a Hive database in one data lake after conversion into the metadata format defined by the catalog template in an embodiment of the present invention.

FIG. 8 is a schematic diagram of metadata in the format defined by the catalog template after conversion into the metadata format of another data lake in an embodiment of the present invention.

FIG. 9 is a schematic diagram of an application scenario in which multiple data lakes form a data federation in an embodiment of the present invention.

FIG. 10 is a functional module diagram of a first data processing apparatus and a second data processing apparatus in an embodiment of the present invention.

FIG. 11 is a structural diagram of the server on which the federation service runs in an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, rather than all, of the embodiments of the present invention.

Metadata: also called intermediary data or relay data, metadata is data that describes data. It mainly consists of information describing data attributes and supports functions such as indicating storage location, historical data, resource lookup, and file records. Metadata is a kind of electronic catalog: to compile the catalog, the attributes of the data must be described, which in turn assists data retrieval.

Metadata format: defines the attributes included in the metadata, how each attribute is expressed, and the order in which the attributes are arranged.

The data lake provided by the embodiments of the present invention can be deployed in a server cluster or in a cloud environment. FIG. 1 is a schematic structural diagram of a data lake deployed in a cloud environment.

The data lake mainly includes two parts: computing resources 10 and storage resources 20. The storage resources 20 generally include multiple storage devices 201, and data stored in the data lake is generally distributed across the storage devices 201. The computing resources 10 are generally virtual machines 101 in the cloud environment, and the virtual machines 101 are deployed on computing servers (not shown). When data is stored in the data lake, a virtual machine 101 determines the storage device in which the data is stored, obtains the metadata of the data, and records the metadata in the catalog service 102.

If the data lake is deployed in a server cluster, its computing resources are at least one computing server and its storage resources are storage servers.

Many vendors currently offer data lake products, and the catalog services in different vendors' products differ. In some scenarios, however, data needs to flow between data lakes. For example, when a higher-level unit (such as a province or the central government) wants to integrate the data of lower-level units (such as cities or provinces), it cannot integrate the data directly because it cannot recognize the catalog services of the cities or provinces. One current implementation is as follows: if data lake A (for example, a province) and data lake B (for example, a city) have different catalog services but data lake A needs to obtain data from data lake B, data lake A notifies data lake B of the data it wants to access. Data lake B checks whether data lake A has permission to access that data. If so, data lake B places the data and its metadata on an exchange platform. The exchange platform converts the metadata into metadata that data lake A can recognize and notifies data lake A to obtain the metadata from the exchange platform. Data lake A then obtains the data from data lake B based on the obtained metadata.

In addition, because data lake A must actively fetch data from data lake B, data lake A needs to know in advance what data is in data lake B. When a data lake is large, this is difficult to know in advance.

Furthermore, each time data lake A obtains data from data lake B, data lake B must authenticate the permissions of data lake A, which reduces data access efficiency.

The data processing method provided by the embodiments of the present invention uses a catalog template. The catalog template defines a unified metadata format for data lakes that use different catalog services, and every data lake can recognize the metadata format defined in the catalog template. When data lake A needs to obtain data from data lake B, data lake B first converts the metadata in its catalog service into the metadata format defined by the catalog template. After obtaining the metadata in the template-defined format, data lake A converts it into the metadata format of its own catalog service. In this way, any data lake that can recognize the metadata format defined in the catalog template can exchange data with other data lakes, which simplifies communication between different data lakes.
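The simplification can be illustrated with a rough count. The source gives no figures; the comparison below assumes that each direction of conversion between two formats needs its own converter. For N data lakes with mutually different catalog formats:

```latex
\[
\underbrace{N(N-1)}_{\text{direct pairwise converters}}
\qquad\text{versus}\qquad
\underbrace{2N}_{\text{converters to and from the catalog template}}
\]
```

Under that assumption, five data lakes would need 20 direct converters but only 10 template converters. The two counts are equal at N = 3 and diverge as N grows.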

In addition, data lake A and data lake B form a data federation. Once the federation is formed, whenever data lake B generates new metadata, it publishes that metadata to the messaging platform. After receiving the metadata published by data lake B, the messaging platform pushes it to data lake A, so data lake A does not need to actively fetch metadata from data lake B.
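The push model described above can be sketched in a few lines. This is a toy, in-memory illustration only: the class `MessagingPlatform` and its `subscribe` and `publish` methods are hypothetical names, not an interface defined by the source, and a real messaging platform would run as a separate cluster.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Metadata = Dict[str, str]


class MessagingPlatform:
    """Toy in-memory stand-in for the messaging platform (hypothetical API)."""

    def __init__(self) -> None:
        self.publishing_queue: Deque[Metadata] = deque()
        self.subscribers: List[Callable[[Metadata], None]] = []

    def subscribe(self, callback: Callable[[Metadata], None]) -> None:
        # A federated data lake registers once to receive pushed metadata.
        self.subscribers.append(callback)

    def publish(self, metadata: Metadata) -> None:
        # Buffer the metadata, then push everything buffered to subscribers.
        self.publishing_queue.append(metadata)
        while self.publishing_queue:
            item = self.publishing_queue.popleft()
            for callback in self.subscribers:
                callback(item)


received_by_lake_a: List[Metadata] = []
platform = MessagingPlatform()
platform.subscribe(received_by_lake_a.append)  # data lake A subscribes once
platform.publish({"name": "new_table"})        # data lake B publishes new metadata
```

Data lake A never queries data lake B here; it only reacts to what the platform delivers, which is the point of the federation.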

In addition, metadata publishing rules are set in data lake B, and only metadata that satisfies the publishing rules is published to the messaging platform. Metadata subscription rules are likewise set in data lake A, and only metadata that satisfies the subscription rules is stored in data lake A. In this way, data security can be ensured without permission authentication.
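A minimal sketch of the two filtering stages follows. The source does not say what a rule looks like, so the rules here are modeled as simple predicates, and the attributes `sensitivity` and `type` are invented for illustration.

```python
from typing import Callable, Dict, Iterable, List

Metadata = Dict[str, str]
Rule = Callable[[Metadata], bool]


def passes(metadata: Metadata, rules: Iterable[Rule]) -> bool:
    """Return True only if the metadata satisfies every rule."""
    return all(rule(metadata) for rule in rules)


# Hypothetical rules; the attribute names are assumptions, not from the source.
publish_rules: List[Rule] = [lambda md: md.get("sensitivity") == "public"]
subscribe_rules: List[Rule] = [lambda md: md.get("type") == "hive_table"]

candidates: List[Metadata] = [
    {"name": "t1", "type": "hive_table", "sensitivity": "public"},
    {"name": "t2", "type": "hive_table", "sensitivity": "internal"},
    {"name": "v1", "type": "video", "sensitivity": "public"},
]

# Data lake B side: only metadata passing the publishing rules leaves the lake.
published = [md for md in candidates if passes(md, publish_rules)]
# Data lake A side: only metadata passing the subscription rules is stored.
stored = [md for md in published if passes(md, subscribe_rules)]
```

Because each side filters independently, data lake B controls what it exposes and data lake A controls what it accepts, without a per-request permission check.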

A data lake contains many types of data, for example Hive and HBase databases (structured data) and video and audio (unstructured data), so the catalog service sets a different metadata format for each type of data. Different vendors describe the metadata of each data type differently when building the catalog service of a data lake. The embodiments of the present invention therefore provide a catalog template for converting the metadata formats in different vendors' catalog services into a unified metadata format.

FIG. 2 is a schematic diagram of the metadata format of a Hive database in the catalog template defined in an embodiment of the present invention.

The template shown in FIG. 2 defines the expression and order of each metadata attribute. For example, suppose the catalog service in data lake A represents the table name in Hive metadata as TalX and the column as ColX, and the catalog service in data lake B represents the table name as TalY and the column as ColX. After conversion into the format defined by the catalog template, the table name is represented as Name and the column as Column.
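The renaming in this example can be sketched as a pair of lookup tables. The attribute names TalX, TalY, ColX, Name and Column come from the text above; the helper functions and the sample values are assumptions made for illustration.

```python
from typing import Dict

# Attribute names are from the example above; everything else is illustrative.
LAKE_A_TO_TEMPLATE: Dict[str, str] = {"TalX": "Name", "ColX": "Column"}
LAKE_B_TO_TEMPLATE: Dict[str, str] = {"TalY": "Name", "ColX": "Column"}


def rename_keys(record: Dict[str, str], mapping: Dict[str, str]) -> Dict[str, str]:
    """Rename attributes per the mapping, emitting them in the mapping's order."""
    return {new: record[old] for old, new in mapping.items() if old in record}


def invert(mapping: Dict[str, str]) -> Dict[str, str]:
    """Turn a lake-to-template mapping into a template-to-lake mapping."""
    return {new: old for old, new in mapping.items()}


# Data lake B converts its own metadata into the template format ...
lake_b_record = {"TalY": "orders", "ColX": "order_id"}
unified = rename_keys(lake_b_record, LAKE_B_TO_TEMPLATE)
# ... and data lake A converts the template format into its own.
lake_a_record = rename_keys(unified, invert(LAKE_A_TO_TEMPLATE))
```

Each data lake only needs the mapping between its own format and the template; neither side needs to know the other's attribute names.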

FIG. 3 is a schematic diagram of two data lakes that form a federation. In this embodiment, data lake 30 and data lake 31 each provide both a catalog service 301 and a federation service 302. In a cloud environment, the catalog service 301 and the federation service 302 can be provided by different virtual machines or by different processes in the same virtual machine. In a server cluster, they can be provided by different servers or by different processes on the same server. Messaging platform 33 provides queue services for data lake 30 and data lake 31: it provides an original metadata queue 331 and a publishing metadata queue 332 to facilitate metadata exchange between data lake 30 and data lake 31. The messaging platform can be a virtual machine cluster composed of multiple virtual machines or a server cluster composed of multiple servers.

Referring also to FIG. 4, the flowchart shows how data lake 30 registers with data lake 31 when data lake 30 needs to form a data federation with data lake 31 in order to obtain data from data lake 31.

Step S401: Federation service 302 of data lake 30 sends a registration request to the virtual machine that provides the federation service in data lake 31 (this may also be a server or a process, and is hereinafter referred to as the federation service).

In this embodiment, the federation service application is installed in each data lake. When data lake 30 needs to obtain data from data lake 31, a user starts the application of federation service 302, selects data lake 31 as the data lake to federate with, and sends a registration request to data lake 31 through the registration function provided by the federation service. The registration request carries the address of the federation service of data lake 30.

Step S402: After federation service 312 of data lake 31 receives the registration request, it sends an original metadata queue creation request and a publishing metadata queue creation request to messaging platform 33, and also sends the addresses of federation service 312 and federation service 302 to messaging platform 33.

Step S403: Messaging platform 33 receives the original metadata queue creation request and the publishing metadata queue creation request, and creates the original metadata queue and the publishing metadata queue respectively.

The original metadata queue and the publishing metadata queue are processes running in messaging platform 33. When creating them, the messaging platform allocates a certain amount of storage space to each queue. The original metadata queue stores metadata newly generated in data lake 31, and the publishing metadata queue stores the metadata that data lake 31 needs to publish. How metadata is written to the original metadata queue and the publishing metadata queue is described below.

Step S404: After the original metadata queue and the publishing metadata queue are created, messaging platform 33 returns the addresses of both queues to federation service 312. Federation service 312 sends the address of the publishing metadata queue to federation service 302 and returns the address of the original metadata queue to catalog service 311. Federation service 302 and catalog service 311 can then use these addresses to access the publishing metadata queue and the original metadata queue respectively. A data federation is thereby formed between data lake 30 and data lake 31.
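Steps S401 to S404 can be sketched as a small in-memory handshake. All class names, method names, and the `queue://` address scheme below are hypothetical; the sketch also simplifies S402 by passing the owner's address directly when each queue is created.

```python
from typing import Dict, List, Optional


class MessagingPlatform:
    """Toy stand-in for messaging platform 33 (hypothetical API)."""

    def __init__(self) -> None:
        self.queues: Dict[str, List[dict]] = {}

    def create_queue(self, kind: str, owner: str) -> str:
        # S403: create a queue and return its address (invented scheme).
        address = f"queue://{owner}/{kind}"
        self.queues[address] = []
        return address


class CatalogService:
    def __init__(self) -> None:
        self.original_queue_address: Optional[str] = None


class FederationService:
    def __init__(self, address: str, platform: MessagingPlatform,
                 catalog: CatalogService) -> None:
        self.address = address
        self.platform = platform
        self.catalog = catalog
        self.publishing_queue_address: Optional[str] = None

    def request_federation(self, peer: "FederationService") -> None:
        # S401: the requesting lake sends a registration request to its peer.
        peer.handle_registration(self)

    def handle_registration(self, requester: "FederationService") -> None:
        # S402: ask the platform to create both queues.
        original = self.platform.create_queue("original", self.address)
        publishing = self.platform.create_queue("publishing", self.address)
        # S404: hand each address to the party that will use it.
        self.catalog.original_queue_address = original
        requester.publishing_queue_address = publishing


platform = MessagingPlatform()
fed_312 = FederationService("lake31/federation", platform, CatalogService())
fed_302 = FederationService("lake30/federation", platform, CatalogService())
fed_302.request_federation(fed_312)
```

After the handshake, the catalog service of data lake 31 knows where to write new metadata, and the federation service of data lake 30 knows where published metadata will appear.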

下面将结合图5形成联邦的数据湖之间传输数据的过程。The following will combine Figure 5 to form the process of data transmission between the data lakes of the federation.

如图5所示，在数据湖30及数据湖31形成数据联邦之后，数据湖30及数据湖31之间即可进行元数据的传输，数据传输的过程如图5所示。As shown in FIG. 5, after the data lake 30 and the data lake 31 form a data federation, metadata can be transmitted between the data lake 30 and the data lake 31, and the data transmission process is shown in FIG. 5.

步骤S501，提供目录服务311的虚拟机(也可以是进程或者服务器，以下均称为目录服务)侦测数据湖31中的目录服务中产生的新的元数据。In step S501, a virtual machine (which may also be a process or a server, which is referred to as a directory service hereinafter) that provides the directory service 311 detects new metadata generated in the directory service in the data lake 31.

当用户在数据湖31中生成新的表或文件时，所述目录服务311获取新生成的表或文件的元数据，并进行存储。在本发明实施例中，在所述目录服务中增加一个插件，用于侦测所述目录服务是否获取了新的元数据。When a user generates a new table or file in the data lake 31, the directory service 311 obtains and stores the metadata of the newly generated table or file. In the embodiment of the present invention, a plug-in is added to the directory service to detect whether the directory service has acquired new metadata.

步骤S502，所述目录服务311将新产生的元数据根据所述原始元数据队列的地址，发送元数据写入请求至所述消息平台33。In step S502, the directory service 311 sends a metadata write request to the message platform 33 based on the newly generated metadata according to the address of the original metadata queue.

When the data federation was established, the federation service 312 sent the address of the original metadata queue to the directory service 311. Therefore, when the directory service 311 generates new metadata, it can generate a metadata write request and send the request to the message platform 33.

Step S503: The message platform 33 writes the newly generated metadata into the original metadata queue.

Step S504: The message platform 33 notifies the federation service 312 of the data lake 31 that new metadata has been written to the original metadata queue.

Step S505: The federation service 312 obtains the metadata from the original metadata queue and then converts its format into the metadata format defined in the catalog template.

For example, suppose the metadata is that of a Hive database. In the data lake 31 the metadata is represented as shown in FIG. 6; after conversion into the Hive metadata format defined in the catalog template shown in FIG. 2, it is as shown in FIG. 7. The correspondence between each attribute in the metadata template and each attribute in the Hive metadata is predefined in the federation service 312. During conversion, the metadata template shown in FIG. 2 is read first; then, for each attribute in the Hive metadata that corresponds to an attribute in the metadata template, its value is obtained and filled into the corresponding position of the metadata template. For example, the attribute "Name" in the Hive metadata and the attribute "table name" in the metadata template denote the same attribute, so the value "Hive-table1" of the attribute "Name" is read from the Hive metadata and written into the attribute "table name" defined by the metadata template. Performing similar operations for the other attributes converts the metadata of the Hive database in the data lake 31 from the format shown in FIG. 6 into the format defined by the metadata template, as shown in FIG. 7. In addition, the values of some attributes may be simplified so that only the key information is extracted. For example, the DB attribute in FIG. 6 has a relatively complex representation in the data lake 31 that also includes a global identifier of the metadata of the database to which the table belongs; when converting to the metadata template format, only the database identifier "hive_db" needs to be extracted.
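The attribute-by-attribute conversion described above can be sketched as follows. This is an illustrative sketch only: apart from "Name", "table name", "Hive-table1" and "hive_db", every attribute name is hypothetical, and the `@` separator used to represent the complex DB value is an assumption, since FIG. 6 is not reproduced here.

```python
# Hypothetical correspondence between Hive attributes and template attributes.
# Only "Name" -> "table name" is taken from the description.
HIVE_TO_TEMPLATE = {
    "Name": "table name",
    "Owner": "creator",
    "DB": "database",
}


def simplify_db(value: str) -> str:
    """Keep only the database identifier (assumed to precede an '@')."""
    return value.split("@", 1)[0]


# Attributes whose values are simplified before being written to the template.
SIMPLIFIERS = {"DB": simplify_db}


def to_template(hive_metadata: dict) -> dict:
    """Fill a fresh metadata template from Hive metadata."""
    # Start from an empty template that lists every template attribute.
    template = {attr: None for attr in HIVE_TO_TEMPLATE.values()}
    for hive_attr, template_attr in HIVE_TO_TEMPLATE.items():
        if hive_attr in hive_metadata:
            value = hive_metadata[hive_attr]
            simplify = SIMPLIFIERS.get(hive_attr)
            template[template_attr] = simplify(value) if simplify else value
    return template


hive_record = {"Name": "Hive-table1", "Owner": "Zhang San", "DB": "hive_db@cluster1"}
converted = to_template(hive_record)
```

Keeping the correspondence in a table separate from the conversion routine means that supporting another source system would only require a different mapping, which matches the idea of predefining the correspondence in the federation service.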

Step S506: Determine, according to the publishing rule, whether the metadata can be published to the federated data lake 30 of the data lake 31.

In this embodiment of the present invention, the metadata to be published to the federated data lake 30 is filtered, and only data that satisfies the publishing rule is published to the federated data lake. The publishing rule is set by the user based on the metadata attributes in the catalog template, and consists of a rule and an action.

For example, the rule of the publishing rule is set to: creator = Zhang San & generation time > 2019/9/18 & table type = finance, and the action is set to publish, which means that data satisfying the rule can be published.

In this rule, creator, generation time, and table type are all attributes defined in the metadata template. According to the rule, when Zhang San creates a finance-related table after September 18, 2019, the metadata of that table is synchronized to the other data lakes in the federation.

As another example, the rule of the publishing rule is set to: metadata source = XX company, and the action is set to not publish. In this case, metadata that comes from XX company is not synchronized to the federated data lake 30 of the data lake 31.
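The two example publishing rules can be expressed as rule and action pairs over template attributes, as in the sketch below. The description does not say how multiple rules are combined, so the first-match-wins ordering and the default of not publishing when no rule matches are assumptions made here; the function names are hypothetical.

```python
from datetime import date


def rule_finance_by_zhang_san(m: dict) -> bool:
    # creator = Zhang San & generation time > 2019/9/18 & table type = finance
    return (m.get("creator") == "Zhang San"
            and m.get("generation time", date.min) > date(2019, 9, 18)
            and m.get("table type") == "finance")


def rule_from_xx_company(m: dict) -> bool:
    # metadata source = XX company
    return m.get("metadata source") == "XX company"


# Each publishing rule pairs a rule (predicate) with an action.
PUBLISHING_RULES = [
    (rule_from_xx_company, "do not publish"),
    (rule_finance_by_zhang_san, "publish"),
]


def publishing_action(metadata: dict, rules=PUBLISHING_RULES,
                      default: str = "do not publish") -> str:
    """Return the action of the first matching rule (assumed semantics)."""
    for predicate, action in rules:
        if predicate(metadata):
            return action
    return default


finance_table = {
    "creator": "Zhang San",
    "generation time": date(2019, 10, 1),
    "table type": "finance",
}
action = publishing_action(finance_table)
```

Because the rules refer only to attributes of the metadata template, the same rule set can be applied regardless of which source system produced the metadata.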

Step S507: When it is determined according to the publishing rule that the metadata can be published to the data lake 30, the federation service 312 sends the metadata to the message platform 33.

When the data lake 30 registered as a federated data lake of the data lake 31, the data lake 31 requested the message platform 33 to create the publishing metadata queue, and the address of the publishing metadata queue was returned to the federation service 312 of the data lake 31. Therefore, when the federation service 312 sends the metadata to the message platform 33, it can carry the address of the publishing metadata queue.

Step S508: The message platform 33 stores the received metadata in the publishing metadata queue.

Step S509: The message platform 33 notifies the federation service 302 of the data lake 30 to obtain data from the publishing metadata queue.

During registration of the federated data lake, the federation service 312 passed the address of the federation service 302 to the message platform. The message platform can therefore use that address to notify the federation service 302 to obtain the metadata.

Step S510: The federation service 302 of the data lake 30 obtains the metadata from the publishing metadata queue according to the address of the publishing metadata queue.

During registration of the federated data lake, the federation service 312 sent the address of the publishing metadata queue to the federation service 302, so the federation service 302 can obtain the metadata from the publishing metadata queue according to that address.
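Steps S507 to S510 amount to a write, a notification, and a read against the publishing metadata queue. The sketch below models this with an in-memory queue and a callback; the class `MessagePlatform`, its methods, and the use of a Python callback to stand in for the address of the federation service 302 are all assumptions for illustration.

```python
from collections import deque


class MessagePlatform:
    """In-memory stand-in for the message platform 33 (hypothetical API)."""

    def __init__(self):
        self._queues = {}
        self._subscribers = {}

    def create_queue(self, address, subscriber=None):
        # `subscriber` plays the role of the registered address of the
        # federation service that must be notified about new entries.
        self._queues[address] = deque()
        self._subscribers[address] = subscriber

    def write(self, address, metadata):
        self._queues[address].append(metadata)   # S508: store in the queue
        notify = self._subscribers[address]
        if notify is not None:
            notify(address)                      # S509: notify the subscriber

    def read(self, address):
        queue = self._queues[address]
        return queue.popleft() if queue else None


platform = MessagePlatform()
received = []


def federation_service_302_on_notify(address):
    item = platform.read(address)                # S510: fetch by queue address
    if item is not None:
        received.append(item)


platform.create_queue("publishing-metadata", federation_service_302_on_notify)
# S507: federation service 312 sends metadata, carrying the queue address.
platform.write("publishing-metadata", {"table name": "Hive-table1"})
```

In this sketch the notification is synchronous; a deployed message platform would more likely deliver it asynchronously, which the description leaves open.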

Step S511: The federation service 302 determines, according to a subscription rule, whether the metadata is data required by the data lake B.

The subscription rule is likewise set based on the metadata attributes in the catalog template and likewise consists of a rule and an action. If the obtained metadata matches the configured rule, the corresponding action is performed.

For example, the rule of the subscription rule may be set to: metadata source = XX, and the action set to not receive, which means that if metadata obtained from the data lake A comes from XX, the metadata is not received.

By setting subscription rules, data that the data lake 30 does not need can be filtered out.
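A subscription rule can be sketched in the same rule and action form on the receiving side. Here the rule is written declaratively as a set of attribute values that must all match; receiving by default when no rule matches is an assumption, as are the names `SUBSCRIPTION_RULES` and `should_receive`.

```python
# Each subscription rule pairs attribute conditions with an action.
SUBSCRIPTION_RULES = [
    ({"metadata source": "XX"}, "do not receive"),
]


def should_receive(metadata: dict, rules=SUBSCRIPTION_RULES) -> bool:
    """Apply the first rule whose conditions all match the metadata."""
    for conditions, action in rules:
        if all(metadata.get(attr) == value for attr, value in conditions.items()):
            return action == "receive"
    # Assumption: metadata that matches no rule is received.
    return True


incoming = [
    {"table name": "t1", "metadata source": "XX"},
    {"table name": "t2", "metadata source": "YY"},
]
accepted = [m for m in incoming if should_receive(m)]
```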

Step S512: The federation service 302 converts the metadata into the metadata format used in the data lake 30.

The federation service 302 defines the correspondence between each attribute of the metadata in the data lake 30 and each attribute of the metadata in the metadata template. Using this correspondence, metadata in the metadata template format can be converted into the metadata format of the data lake 30; the converted metadata is shown in FIG. 8.
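The conversion on the receiving side is the same table lookup in the opposite direction. In the short sketch below, the local attribute names `tableName`, `owner` and `dbName` are hypothetical, since FIG. 8 is not reproduced here, and dropping template attributes that have no local counterpart is an assumption.

```python
# Hypothetical correspondence from template attributes to the attribute
# names used by the directory service of the receiving data lake.
TEMPLATE_TO_LOCAL = {
    "table name": "tableName",
    "creator": "owner",
    "database": "dbName",
}


def from_template(template_metadata: dict, mapping=TEMPLATE_TO_LOCAL) -> dict:
    """Rename template attributes to local ones; unmapped attributes are dropped."""
    return {
        local_attr: template_metadata[template_attr]
        for template_attr, local_attr in mapping.items()
        if template_attr in template_metadata
    }


local_record = from_template(
    {"table name": "Hive-table1", "creator": "Zhang San", "database": "hive_db"}
)
```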

Step S513: The federation service 302 sends the converted metadata to the directory service 301 of the data lake 30.

Step S514: The directory service 301 of the data lake 30 stores the converted metadata in the directory service of the data lake 30.

In practical applications, the data lake 31 may be the data lake of a lower-level unit, such as a city or a province, and the data lake 30 may be the data lake of a higher-level unit, such as a province or the central authority. Because the data lake 30 can obtain data directly from the data lake 31, data of lower-level units can be integrated conveniently.

FIG. 9 shows an application scenario of an embodiment of the present invention. The data lake B forms a directory service federation with the data lake D and with the data lake E, so the data lake B can obtain metadata from the data lake D and the data lake E. The data lake A forms a directory service federation with the data lake B and with the data lake C, so the data lake A can in turn obtain the metadata of the data lake D and the data lake E. If the data lakes D and E belong to municipal units, the data lakes B and C belong to provincial units, and the data lake A belongs to a central unit, then a higher-level unit can conveniently integrate the metadata of its lower-level units, while the directory services of units at the same level remain isolated from one another.
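The hierarchy of FIG. 9 can be modeled as a small directed graph in which each data lake lists the lower-level lakes federated with it. The sketch below computes which lakes' metadata can reach a given lake; it assumes that a lake republishes upward the metadata it receives from below, which is consistent with the statement that the data lake A can obtain the metadata of the data lakes D and E. The names `FEDERATION` and `visible_sources` are hypothetical.

```python
# Higher-level lake -> lower-level lakes federated with it (per FIG. 9).
FEDERATION = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": [],
    "D": [],
    "E": [],
}


def visible_sources(lake: str, federation=FEDERATION) -> list:
    """Return every lower-level lake whose metadata can reach `lake`."""
    seen = []
    stack = list(federation.get(lake, []))
    while stack:
        child = stack.pop()
        if child not in seen:
            seen.append(child)
            # Assumption: metadata received from below is passed upward.
            stack.extend(federation.get(child, []))
    return sorted(seen)


sources_of_a = visible_sources("A")
sources_of_b = visible_sources("B")
```

Lakes at the same level never appear in each other's result, which reflects the isolation between directory services of units at the same level.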

The foregoing embodiments use data lakes only as an example; the present invention is equally applicable to other data systems, such as databases and data warehouses. Databases and data warehouses also have metadata, so their implementation is essentially the same as for data lakes and is not described again here.

FIG. 10 is a functional block diagram of a first data processing device 1001 and a second data processing device 2001. The first data processing device 1001 and the second data processing device 2001 are mainly used to implement the functions of the federation service 302 and the federation service 312.

The first data processing device 1001 includes a definition module 1002, a creation module 1003, an acquisition module 1004, a conversion module 1005, and a publishing module 1006. The second data processing device includes a definition module 2001, an acquisition module 2002, and a conversion module 2003.

The definition module 1002 and the definition module 2002 are used to define, in the first data processing device 1001 and the second data processing device 2001 respectively, the metadata format of the catalog template shown in FIG. 2. For the definition method, refer to the description of FIG. 2.

The creation module 1003 is used to create the original metadata queue and the publishing metadata queue in the message platform after receiving a registration request from the second data processing device 2001. For the creation process, refer to the description of FIG. 4; details are not repeated here.

The acquisition module 1004 is used to acquire the metadata of the data in the data lake 30. For the acquisition method, refer to the description of steps S501 to S505 in FIG. 5; details are not repeated here.

The conversion module 1005 is used to convert the metadata of the data in the data lake 30 into the metadata format defined in the catalog template. For the conversion method, refer to the description of step S505 in FIG. 5; details are not repeated here.

The publishing module 1006 is used to publish the metadata that has been converted into the format defined in the catalog template to the publishing metadata queue of the message platform. For details, refer to the description of steps S507 and S508 in FIG. 5; details are not repeated here.

After receiving the notification from the message platform, the acquisition module 2003 of the second data processing device 2001 obtains, from the publishing metadata queue of the message platform, the metadata that has been converted into the format defined by the catalog template. For the specific process, refer to the description of step S510; details are not repeated here.

The conversion module 2004 is used to filter the obtained metadata according to a preset filtering rule and, when it determines that the metadata is data that the data lake 31 can receive, to convert the obtained metadata from the format defined by the catalog template into the metadata format of the data in the data lake 31 and store the converted metadata in the data directory of the data lake 31. For details, refer to the description of steps S511 to S514; details are not repeated here.

When any of the above modules or units is implemented in software, the software exists in the form of computer program instructions stored in a memory, and a processor can execute the program instructions to implement the above method flows. The processor may include, but is not limited to, at least one of the following computing devices that run software: a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a microcontroller unit (MCU), or an artificial intelligence processor. Each such computing device may include one or more cores for executing software instructions to perform operations or processing. The processor may be a separate semiconductor chip, or may be integrated with other circuits into a single semiconductor chip. For example, it may form a system on chip (SoC) together with other circuits (such as codec circuits, hardware acceleration circuits, or various bus and interface circuits), or it may be integrated into an ASIC as a built-in processor of the ASIC; the ASIC with the integrated processor may be packaged separately or together with other circuits. In addition to the cores for executing software instructions, the processor may further include necessary hardware accelerators, such as a field programmable gate array (FPGA), a programmable logic device (PLD), or a logic circuit that implements dedicated logic operations.

When the above modules are implemented as hardware circuits, the hardware circuits may be implemented by a general-purpose central processing unit (CPU), a microcontroller unit (MCU), a microprocessor unit (MPU), a digital signal processor (DSP), or a system on chip (SoC). They may of course also be implemented by an application-specific integrated circuit (ASIC) or a programmable logic device (PLD). The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof, and it may run necessary software or execute the above method flows without relying on software.

FIG. 11 is a hardware structure diagram of a server. The server is the server on which the federation service of the data lake 30 or the data lake 31 runs. The server 1101 includes a processing unit 1102, a storage unit 1103, and a communication unit 1104.

The processing unit 1102 may be implemented in various forms. For example, it may be a central processing unit (CPU) or a graphics processing unit (GPU), and it may be a single-core processor or a multi-core processor; if it is implemented as a multi-core processor, the number of processors is not limited.

The storage unit 1103 may be a dynamic random access memory (DRAM) or a storage class memory (SCM). The storage unit 1103 is used to store program code and data for the processing unit to call and execute in order to implement the relevant functions. In this embodiment of the present invention, the storage unit may store program code corresponding to the federation service; the processor calls the program code to perform the functions performed by the federation service of the data lake 30 or by the federation service of the data lake 31 in the method shown in FIG. 5.

The communication unit 1104 is used to communicate with the message platform and other servers for data transmission.

The foregoing is merely specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

A data processing method, characterized in that the method includes:

Acquiring metadata of data in the first data system;

Converting the metadata of the data in the first data system from a first format to a second format, where the first format is the format of the metadata of the data in the first data system, and the second format is a metadata format that can be recognized by both the first data system and the second data system.

The data processing method according to claim 1, further comprising:

A metadata template is predefined, and the metadata format adopted by the metadata template is the second format.

The data processing method according to claim 1, wherein the method further comprises:

The metadata converted into the second format is sent to the messaging platform, so that the second data system obtains the metadata converted into the second format from the messaging platform.

The data processing method according to claim 3, wherein the sending the metadata converted into the second format to the messaging platform comprises:

Judging whether the metadata converted into the second format meets a preset publishing rule;

When it is determined that the metadata converted into the second format meets a preset publishing rule, the metadata converted into the second format is sent to the messaging platform.

The data processing method according to claim 3, further comprising:

Sending a publishing metadata queue creation request to the messaging platform, so that the messaging platform can establish a publishing metadata queue;

Sending the metadata converted into the second format to the messaging platform includes:

The metadata converted into the second format is written into the publishing metadata queue of the messaging platform.

The data processing method according to claim 5, further comprising:

Obtain the address of the publishing metadata queue sent by the messaging platform, and send the address of the publishing metadata queue to the second data system.

The data processing method according to claim 5, further comprising:

Sending an original metadata queue creation request to the message platform, so that the message platform establishes an original metadata queue;

Converting the format of the metadata in the first data system from the first format to the second format includes:

Converting the metadata of the first data system in the original metadata queue from the first format to the second format.

A data processing method, characterized in that the method includes:

Acquiring metadata of the data of the first data system;

Converting the format of the metadata of the data of the first data system from a second format to a third format, where the second format is a metadata format that can be recognized by both the first data system and the second data system, and the third format is a format of metadata of data in the second data system.

The data processing method according to claim 8, further comprising: predefining a metadata template, wherein the metadata format adopted by the metadata template is the second format.

The data processing method according to claim 8, wherein:

The acquiring metadata of the first data system includes:

The metadata of the data of the first data system is obtained from the message platform.

The data processing method according to claim 8, further comprising:

Judging whether the metadata meets preset receiving rules;

When it is determined that the converted metadata satisfies the preset receiving rule, the format of the metadata of the data of the first data system is converted from the first format to the second format.

The data processing method according to claim 10, further comprising:

Receiving the address of the publishing metadata queue in the message platform sent by the first data system;

The obtaining the metadata of the data of the first data system from the message platform includes:

Acquiring the metadata from the publishing metadata queue according to the address of the publishing metadata queue.

A data processing device, characterized in that the device comprises:

An acquisition module, used to acquire metadata of data in the first data system;

The conversion module is used to convert the metadata of the data in the first data system from a first format to a second format, where the first format is the format of the metadata of the data in the first data system, and the second format is a metadata format that can be recognized by both the first data system and the second data system.

The data processing device according to claim 13, further comprising:

The definition module is used to predefine a metadata template, and the metadata format adopted by the metadata template is the second format.

The data processing device according to claim 13, wherein the device further comprises:

The publishing module is configured to send the metadata converted into the second format to the message platform, so that the second data system obtains the metadata converted into the second format from the message platform.

The data processing device according to claim 15, wherein the publishing module is further configured to:

The data processing device of claim 15, further comprising:

A creation module, configured to send a publishing metadata queue creation request to the message platform, so that the message platform establishes a publishing metadata queue;

The publishing module is specifically configured to write the metadata converted into the second format into the publishing metadata queue of the messaging platform.

The data processing device according to claim 17, wherein the creation module is specifically configured to:

The data processing device according to claim 17, wherein the creation module is further configured to:

The conversion module is specifically configured to: convert the metadata of the first data system in the original metadata queue from the first format to the second format.

A data processing device, characterized in that the device comprises:

An acquisition module for acquiring metadata of the data of the first data system;

A conversion module for converting the format of the metadata of the data of the first data system from a second format to a third format, where the second format is a metadata format that can be recognized by both the first data system and the second data system, and the third format is the metadata format of the data in the second data system.

The data processing device according to claim 20, further comprising:

The data processing device according to claim 20, wherein the acquisition module is specifically configured to:

The data processing device according to claim 20, wherein the conversion module is further configured to:

Judging whether the metadata meets preset receiving rules;