
WO2025001683A1 - Data processing method and device, task scheduling method, and storage medium - Google Patents

Data processing method and device, task scheduling method, and storage medium

Info

Publication number
WO2025001683A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
task
user
scheduling
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/095492
Other languages
French (fr)
Chinese (zh)
Inventor
关蕊
张宁
何文
樊林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Technology Development Co Ltd
Publication of WO2025001683A1

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/248Presentation of query results
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt

Definitions

  • the present invention relates to the field of data processing, and in particular to a data processing method, a task scheduling method, a device and a storage medium.
  • ETL tools can be used to synchronize data from different data sources into a certain data warehouse and analyze it in the data warehouse, but this method requires pre-synchronization of data, which is not only complex and time-consuming, but the synchronized data has become historical data and cannot perform real-time data calculation and analysis.
  • the present invention provides a data processing method, a task scheduling method, a device and a storage medium to solve the deficiencies in the related art.
  • a data processing method which is applied to a big data platform, wherein the big data platform includes at least two different data sources; the method comprises:
  • the SQL associated query statement includes data source information of a plurality of first data tables to be processed, wherein at least two of the plurality of first data tables to be processed are from different data sources;
  • the data processing method is implemented based on a PySpark framework.
  • the method further includes:
  • a column is added to the query result, and the processing result is displayed in the column.
  • the method further includes:
  • if the target database does not include a data table corresponding to the data table name, a data table corresponding to the data table name is created;
  • if the target database includes a data table corresponding to the data table name, an output mode is determined according to a selection operation of the user, and the query result is written into the data table of the target database in the output mode.
  • a task scheduling method which is applied to a big data platform, and the method includes:
  • the candidate data tasks also include a data integration task; the method also includes:
  • a data integration task corresponding to the data integration task name is created according to the data source information and the data destination information.
  • the method further comprises:
  • the modified target data task is saved.
  • the input operation includes a first run time of a scheduled task
  • the executing each target data task in the scheduled task according to the execution order includes:
  • if the running state is an error state, stopping the execution of the scheduled task and returning the state of the scheduled task as an error state;
  • the method further comprises:
  • the corresponding scheduling task template is displayed based on the user's selection operation
  • a scheduling task matching the configuration information is generated based on the scheduling task template.
  • a data processing device which is applied to a big data platform, wherein the big data platform includes at least two different data sources; the device comprises:
  • the acquisition unit is used to display the data processing task page, and obtain the SQL associated query statement in response to the user's input operation on the data processing task page, wherein the SQL associated query statement includes data source information of a plurality of first data tables to be processed, and at least two of the plurality of first data tables to be processed are from different data sources;
  • a conversion unit configured to obtain a second data table to be processed according to the data source information, replace the data source information of the first data table to be processed in the SQL association query statement with data table information corresponding to the second data table to be processed, and obtain an executable SQL query statement;
  • the execution unit is used to execute the executable SQL query statement to obtain the query result.
  • a computer-readable storage medium is provided, and when an executable computer program in the storage medium is executed by a processor, any of the above-mentioned methods can be implemented.
  • the present invention obtains the data table to be processed according to the data source information in the SQL associated query statement; according to the data table information of the data table to be processed, the SQL associated query statement is converted into an executable SQL query statement, and the executable SQL query statement is executed to obtain the query result.
  • by converting the SQL associated query statement that the user inputs in the agreed format, an executable SQL query statement is obtained, and the data located in different data sources can be merged together in real time using the SQL language.
  • FIG. 1 is a flow chart showing a data processing method according to an embodiment of the present invention.
  • FIG. 2 is a schematic diagram of a data processing task page according to an embodiment of the present invention.
  • FIG. 3 is a schematic diagram of an operation flow of a data processing task according to an embodiment of the present invention.
  • FIG. 4 is a schematic diagram showing a page for adding a calculated column according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram showing a newly added column in a query result according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram showing output to a library page according to an embodiment of the present invention.
  • FIG. 7 is a flow chart showing a task scheduling method according to an embodiment of the present invention.
  • FIG. 8 is a schematic diagram showing a task scheduling sequence according to an embodiment of the present invention.
  • FIG. 9A is a schematic diagram of a first data integration page according to an embodiment of the present invention.
  • FIG. 9B is a schematic diagram of a second data integration page according to an embodiment of the present invention.
  • FIG. 9C is a schematic diagram of a third data integration page according to an embodiment of the present invention.
  • FIG. 10A is a schematic diagram of a task scheduling page according to an embodiment of the present invention.
  • FIG. 10B is a schematic diagram showing a data integration task list according to an embodiment of the present invention.
  • FIG. 10C is a schematic diagram showing a data processing task list according to an embodiment of the present invention.
  • FIG. 11 is a schematic diagram showing an operation sequence according to an embodiment of the present invention.
  • FIG. 12 is a schematic diagram showing a template reference process according to an embodiment of the present invention.
  • FIG. 13 is a schematic diagram of a data processing device according to an embodiment of the present invention.
  • Heterogeneity of data sources: Different data sources have different storage structures and query syntaxes, requiring data conversion and syntax conversion.
  • Data security: During cross-data source queries, data security and privacy need to be guaranteed, and operations such as data permission verification and data encryption are required.
  • ETL tools can usually be used to synchronize data from different data sources into a certain data warehouse for analysis.
  • this method requires pre-synchronization of data, and the synchronization process is complex and time-consuming.
  • the synchronized data has become historical data, and real-time data calculation and analysis cannot be performed.
  • the present invention provides a data processing method, which can bring together data from different data sources based on the PySpark framework. That is, when performing data processing tasks, if the current data analysis involves data from different data sources, this method can be used to fuse the required data from different data sources, which is referred to as cross-source fusion.
  • the present invention can be applied to a big data platform.
  • the big data platform can include at least two different data sources.
  • FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1 , the method includes the following steps 101 to 104 .
  • step 101 a data processing task page is displayed, and an SQL associated query statement is obtained in response to an input operation of a user on the data processing task page.
  • the user needs to enter the SQL association query statement in the agreed format, that is, identify the table name in the format of "data source.data table".
  • the data source and the data table are referred to as data source information, and the data source information is used to indicate the data table in the specified data source.
  • the data table to be processed in the SQL associated query statement input by the user is called the first data table to be processed
  • the SQL associated query statement includes data source information of multiple first data tables to be processed, wherein at least two of the multiple first data tables to be processed are from different data sources.
  • FIG. 2 is a schematic diagram of a data processing task page according to an embodiment of the present invention. As shown in FIG. 2, a user can enter an SQL associated query statement in area 201.
  • the SQL query statement entered by the user complies with the following specifications:
  • the data source, data table, and connection conditions can be determined according to actual needs.
  • after obtaining the SQL associated query statement input by the user, the SQL associated query statement can be verified, and after the verification is passed, SQL parsing is performed to obtain the data source information.
  • the data table in the data source is read according to the indication of the data source information to generate a distributed data set (Dataframe) grouped by column names.
  • the data table to be processed generated by reading the data table in the data source is called the second data table to be processed.
  • step 103 the data source information of the first data table to be processed in the SQL association query statement is replaced with the data table information corresponding to the second data table to be processed, so as to obtain an executable SQL query statement.
  • the executable SQL query statement can be obtained by replacing the data source information representing the first data table to be processed in the SQL association query statement with the data table information corresponding to the second data table to be processed.
  • step 104 the executable SQL query statement is executed to obtain a query result.
  • FIG. 3 is a schematic diagram of the operation flow of a data processing task according to an embodiment of the present invention.
  • assuming that the data source mysql2 in MySQL and the data source ch3 in ClickHouse have been added on the big data platform, and taking the user information table user_info and the browse details table browse_details as examples, the associated query statement can be written as: select * from mysql2.user_info left join ch3.browse_details on mysql2.user_info.id = ch3.browse_details.user_id.
  • the SQL associated query statement is obtained, and SQL verification is performed.
  • SQL parsing is performed, and the data source, table name and field name can be parsed according to the rules; based on the parsed information, in the PySpark framework, the data table in the corresponding data source is read to generate the corresponding Dataframe, such as user_info in mysql2, to obtain the second data table to be processed, represented by df1, and the browse_details in ch3 is read to obtain the second data table to be processed, represented by df2.
  • the data source information of the first to-be-processed data table in the SQL associated query statement input by the user is replaced with the data table information corresponding to the second to-be-processed data table, that is:
  • mysql2.user_info is replaced with df1 and ch3.browse_details is replaced with df2, to obtain an executable SQL query statement, namely select * from df1 left join df2 on df1.id = df2.user_id.
  • the executable SQL query statement is executed to obtain a result data set, that is, the query result.
  • although cross-data source query and cross-database query in the present invention both involve accessing multiple data sources or multiple databases in one query, they differ from each other in the following ways:
  • Cross-data source query usually refers to accessing multiple different database types or different data storage systems in one query, such as relational databases, NoSQL databases, text files, Hadoop clusters, etc.
  • the underlying storage structure, query syntax, etc. of these data sources may be different. Therefore, cross-data source query requires some technical means to connect these different data sources, and then perform unified data query and processing.
  • Cross-database query refers to accessing multiple databases of the same type in one query, such as querying in the same MySQL instance.
  • the data storage structure and query syntax are the same, so cross-database query is relatively simple.
  • MySQL-10.10.111.111 and MySQL-47.22.22.22 belong to different data sources of the same type.
  • the MySQL-10.10.111.111 data source includes databases such as bdmp, datax, information_schema, and mysql111;
  • each database includes multiple data tables.
  • the MySQL-47.22.22.22 data source includes a database such as mysql222.
  • the bdmp database includes a user_info data table
  • the datax database includes a df data table.
  • if a query involves the user_info data table under the bdmp database and the df data table under the datax database, the query involves different databases under the same data source and is a cross-database query.
  • the bdmp database includes a user_info data table
  • the mysql222 database includes a test data table.
  • if the test data table and the user_info data table are involved in one query, it indicates that the user_info data table in the MySQL-10.10.111.111 data source and the test data table in the MySQL-47.22.22.22 data source need to be fused across sources using the method provided by the present invention.
  • the present invention can be applied to the following scenarios:
  • the access method of the present invention can be used to access the user's data.
  • the present invention can be used to analyze the latest data stored by the user, which can improve the accuracy of the analysis result.
  • after obtaining the query results, the method may also include: determining the processing conditions for processing the query results based on the user's input operation; processing the query results based on the processing conditions to obtain processing results; adding a column to the query results and displaying the processing results in the column.
  • the data processing task page includes adding a calculation column control.
  • the processing condition in this embodiment may be a user-defined calculation rule or a user-defined screening condition.
  • the add calculated column page shown in FIG. 4 may be displayed; the add calculated column page includes a calculation rule input box and a determination control, and in response to detecting that the determination control is triggered, the python code entered by the user in the calculation rule input box is run.
  • Functions that are inconvenient or impossible to implement with SQL statements can be implemented through custom calculated columns, by obtaining the column name of the newly added column entered by the user and the written python code, and executing the code segment as a function after the code format is verified to pass, to obtain the newly added column.
  • FIG. 5 is a schematic diagram of adding a new column to the query results according to an embodiment of the present invention. As shown in FIG. 5, the query results are displayed in area 501 of the data processing task page, and a custom column is added to the list corresponding to the query results, namely the col1 column on the far right.
  • the method may further include: determining a target database for storing the query result and a data table name for storing the query result based on the configuration operation of the user; if the target database does not include a data table corresponding to the data table name, creating a data table corresponding to the data table name; if the target database includes a data table corresponding to the data table name, determining an output mode according to a user's selection operation, and writing the query result into the data table of the target database in the output mode.
  • the data processing task page includes an output to library control, which is used to determine the location where the query results are stored.
  • FIG. 6 is a schematic diagram of an output to library page according to an embodiment of the present invention.
  • the output to library page shown in FIG. 6 is displayed, and the configuration information entered by the user in the output to library page, namely the target source and table name, and the output mode, is obtained.
  • the data processing task page as shown in FIG2 also includes a download control, and in response to the download control being triggered, the query results are output to a csv file.
  • the current big data platform needs to determine the execution order of each data task in the scheduling task according to the running script input by the developer, which is costly and error-prone for the developer.
  • the present invention provides a task scheduling method, based on which the user can drag and drop the icons corresponding to multiple data tasks in the target area to set the execution order of each data task in the scheduling task, which reduces the workload of developers and improves development efficiency.
  • FIG. 7 is a flow chart of a task scheduling method according to an embodiment of the present invention. As shown in FIG. 7 , the method includes the following steps 701 - 705 .
  • step 701 a task scheduling page is displayed, and a scheduling task is created based on the user's input operation on the task scheduling page.
  • in step 702, a plurality of target data tasks of the scheduling task are determined from a plurality of pre-created candidate data tasks based on a user's selection operation.
  • the candidate data tasks include data processing tasks, and the data processing tasks are implemented using any of the data processing methods described above.
  • step 703 icons of each target data task are displayed on the task scheduling page.
  • step 704 based on the user's movement operation on each icon, the execution order of each target data task in the scheduled task is determined.
  • step 705 each target data task in the scheduled task is executed according to the execution order.
  • the scheduling task in this embodiment can be an offline scheduling task.
  • the icon of the target data task can be in the form of a card, that is, each offline scheduling task can contain multiple target data tasks, each target data task has a dependency relationship, and each target data task can be executed serially or in parallel. After each target data task card is displayed on the task scheduling page, the execution order of the target data tasks can be defined by dragging the task card.
  • FIG. 8 is a schematic diagram of the task scheduling sequence according to an embodiment of the present invention.
  • when a task card is placed behind another task card, execution is sequential, and the next task is executed after the previous one is completed; when multiple task cards are dragged to the same vertical position, execution is parallel, and multiple tasks can be executed at the same time; the number of tasks executed in parallel is not fixed, for example, n tasks can be executed at the same time.
  • Task 4 needs to wait until tasks 1-n are all executed before it can be executed; and task 9 can start execution as soon as task 8 is completed.
  • Parallel application scenarios may include: a large number of tasks need to be processed in the same scheduling task, there is no dependency between the tasks executed in parallel, and a quick response to task requests is required.
  • the candidate data tasks in this embodiment may also include data integration tasks, shell script tasks, python script tasks, etc., which can be added according to actual scheduling needs, and the present invention is not limited to this.
  • the data processing task in the present invention refers to querying and performing correlation calculations on data across various data sources by editing SQL statements; it supports directly outputting data analysis results to a database and downloading them, and supports custom calculation columns to implement complex logical functions. This has been explained in the data processing method above and will not be repeated here.
  • the data integration task in the present invention is data synchronization between various data sources.
  • the data sources may include databases, file systems, service interfaces, and message queues.
  • Data can be integrated between various data sources through integration tasks.
  • the main application scenarios of data integration tasks are data synchronization, data integration (aggregation), data migration, and data exchange.
  • the shell script task in the present invention is code that implements a certain function through a set of shell syntax or instructions.
  • the .sh file can be uploaded and added to the scheduling task for execution.
  • the python script task in the present invention is a code written in python language to implement a certain function.
  • for periodic or non-periodic scheduling, it can be added to the scheduling task for execution by uploading the .py file.
  • the method may also include: determining the data integration task name, data source information and data destination information based on the user's operation information on the data integration page; and creating a data integration task corresponding to the data integration task name based on the data source information and the data destination information.
  • FIGS. 9A, 9B and 9C are schematic diagrams of three data integration pages according to an embodiment of the present invention. The data integration task name, label information and description information entered by the user are obtained from the data integration page shown in FIG. 9A, the data source information selected by the user is obtained from the data source page shown in FIG. 9B, and the data destination information selected by the user is obtained from the data destination page shown in FIG. 9C.
  • a data integration task corresponding to the data integration task name is created based on the data source information and the data destination information.
  • the task scheduling page of this embodiment includes a reference data integration task control and a reference data processing task control.
  • the target data integration task is selected from the candidate data integration tasks through the reference data integration task control, and the target data processing task is selected from the candidate data processing tasks through the reference data processing task control.
  • in response to detecting that the reference data integration task control is triggered, a pre-created data integration task list corresponding to the reference data integration task control is displayed; in response to detecting that a data integration task in the data integration task list is triggered, a corresponding data integration task card is displayed in the target area of the task scheduling page.
  • in response to detecting that the reference data processing task control is triggered, a pre-created data processing task list corresponding to the reference data processing task control is displayed; in response to detecting that a data processing task in the data processing task list is triggered, a corresponding data processing task card is displayed in the target area of the task scheduling page.
  • FIG. 10A is a schematic diagram of a task scheduling page according to an embodiment of the present invention. The task scheduling page shown in FIG. 10A is displayed, and a scheduling task is created based on the basic information entered by the user on the task scheduling page and the selected first run time and run cycle (for example, every hour).
  • the task scheduling page shown in FIG. 10A includes a log package download control, a reference data integration task control, and a reference data processing task control.
  • before any target data task is referenced, the log package download control cannot be triggered; after the target data tasks are referenced, triggering this control packages and downloads the running log content of all target data tasks to the local computer.
  • This function is generally used to check the task running status or to troubleshoot problems when a task is abnormal.
  • in response to the reference data integration task control being triggered, a list of data integration tasks that have been created, as shown in FIG. 10B, can be displayed, and the required data integration tasks (multiple selections are allowed) can be selected to generate data task cards. By default, data integration tasks are displayed in reverse chronological order. When there are many data integration tasks, the search box can be used to find the target integration task.
  • a list of data processing tasks that have been created as shown in FIG. 10C can be displayed, and the required data processing tasks (multiple selections are allowed) can be selected to generate data task cards.
  • the search box can be used to find the target processing task.
  • based on the user's selection operation on the icon, a development page of the target data task corresponding to the icon is displayed; in response to detecting the user's modification operation on the development page, the modified target data task is saved.
  • the data task card includes a viewing control; in response to detecting that the viewing control is triggered, the development page corresponding to the data task card is displayed; in response to detecting that the user modifies the target data task on the development page, the modified target data task is saved.
  • the data tasks in this embodiment are all stored on the big data platform. Compared with the related art that requires inputting commands to view the running status of data tasks, this embodiment allows data tasks to be modified by clicking on the viewing control, which is convenient for user operation.
  • the data task card may further include a manual run control; the method further includes: in response to detecting that the manual run control is triggered, running the data task corresponding to the data task card and displaying the run result. Without uploading the run script, the data task can be run independently.
  • the input operation includes a first run time of a scheduled task
  • the executing each target data task in the scheduled task according to the execution order includes:
  • if the running state is an error state, stopping the execution of the scheduled task and returning the state of the scheduled task as an error state;
  • FIG. 11 is a schematic diagram of the operating sequence according to an embodiment of the present invention.
  • the basic information of the offline scheduling task configured by the user, the operating cycle, the first operating time, all the target data tasks contained in each scheduling task, and the execution order of the target data tasks are stored in the specified data table; when the current time reaches the first operating time of the task, the round of scheduling tasks is executed: the target data tasks are traversed and executed, and the operating status of each target data task is obtained.
  • if the operating status of a target data task is an error, the scheduling task is terminated, the execution is no longer continued, and an error status is returned; when a target data task is completed normally, the next target data task is executed, until all target data tasks are completed and the completed status is returned.
  • in the scheduling task template page, in response to detecting that any scheduling task template control is triggered, the scheduling task template corresponding to the scheduling task template control is displayed;
  • a scheduling task matching the configuration information is generated based on the scheduling task template.
  • Scheduling task templates can be used to easily generate offline scheduling tasks by referencing templates during data analysis, thus reducing duplication of work.
  • the system-provided templates are based on common indicators of big data analysis, such as daily active users, monthly active users, online rate, conversion rate, etc. Users can select them as needed.
  • FIG. 12 is a schematic diagram of the template reference process according to an embodiment of the present invention.
  • the user can modify the task name, description, set the first run time and run cycle, etc.; the scheduling task template is preset with task cards, and task cards can be added or deleted according to the user's selection operation.
  • the present invention enables users to implement a full set of development processes for data analysis on a big data platform.
  • Users can create and edit data tasks, then enter the scheduling task page to reference previously created tasks, configure the operating cycle and first operating time, and implement periodic/non-periodic operation of tasks; the operating status of all scheduled tasks can be seen on the scheduling task list page; click on a scheduled task to enter the scheduling task details page, where all target data task cards can be seen, and the execution order between target data tasks can be adjusted by dragging the task cards, and the target data task card can be clicked to directly jump to the task details page for viewing, editing, manual operation, and viewing logs, etc.
  • FIG. 13 is a schematic diagram of a data processing device according to an embodiment of the present invention.
  • the data processing device provided by the present invention is applied to a big data platform, and the big data platform includes at least two different data sources; the device includes:
  • the acquisition unit 1301 is used to display a data processing task page, and acquire an SQL associated query statement in response to an input operation of a user on the data processing task page, wherein the SQL associated query statement includes data source information of multiple first data tables to be processed, wherein at least two of the multiple first data tables to be processed are from different data sources;
  • the conversion unit 1302 is used to obtain a second data table to be processed according to the data source information, replace the data source information of the first data table to be processed in the SQL association query statement with data table information corresponding to the second data table to be processed, and obtain an executable SQL query statement;
  • the execution unit 1303 is used to execute the executable SQL query statement to obtain the query result.
  • the present invention also provides a task scheduling device, which is applied to a big data platform, and the device comprises:
  • a creation unit used to display a task scheduling page and create a scheduling task based on the user's input operation on the task scheduling page;
  • a determining unit configured to determine a plurality of target data tasks of the scheduling task from a plurality of pre-created candidate data tasks based on a selection operation of a user, wherein the candidate data tasks include a data processing task, and the data processing task is implemented by using the data processing method described in the above embodiment;
  • a display unit used to display icons of each target data task on the task scheduling page
  • a moving unit used to determine the execution order of each target data task in the scheduling task based on the user's moving operation on each icon;
  • the scheduling unit is used to execute each target data task in the scheduled task according to the execution order.
  • the device implementing this embodiment has a display device, and the display device can be: electronic paper, mobile phone, tablet computer, television, notebook computer, digital photo frame, navigator, or any other product or component with a display function.
  • the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.
  • "plurality" refers to two or more, unless otherwise clearly defined.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to the field of data processing, and provides a data processing method and device, a task scheduling method, and a storage medium. According to embodiments of the present invention, when an SQL association query statement is obtained, a data table to be processed is obtained on the basis of data source information in the SQL association query statement; on the basis of data table information of said data table, the SQL association query statement is converted into an executable SQL query statement; and the executable SQL query statement is executed to obtain a query result. An SQL association query statement which is input by a user and conforms to a preset specification is converted to obtain an executable SQL query statement, and thus, data in different data sources can be fused together in real time simply by utilizing an SQL language.

Description

A data processing method, task scheduling method, device and storage medium

Technical Field

The present invention relates to the field of data processing, and in particular to a data processing method, a task scheduling method, a device and a storage medium.

Background Art

With the advent of the big data era, enterprises have begun to use big data platforms to uniformly store, manage and analyze the data generated during production and operations. When storing data, the same enterprise can store data in different data sources according to demand, but when analyzing the data, the data stored in the different data sources must be merged together for query and analysis. ETL tools can generally be used to synchronize data from different data sources into a certain data warehouse and analyze it there, but this approach requires pre-synchronization of the data, which is not only complex and time-consuming, but also means the synchronized data has already become historical data, so real-time data calculation and analysis cannot be performed.

Summary of the Invention

The present invention provides a data processing method, a task scheduling method, a device and a storage medium to remedy the deficiencies in the related art.

According to a first aspect of the embodiments of the present invention, there is provided a data processing method applied to a big data platform, wherein the big data platform includes at least two different data sources; the method comprises:

displaying a data processing task page, and acquiring an SQL associated query statement in response to an input operation of a user on the data processing task page, wherein the SQL associated query statement includes data source information of a plurality of first data tables to be processed, and at least two of the plurality of first data tables to be processed are from different data sources;

acquiring a second data table to be processed according to the data source information;

replacing the data source information of the first data table to be processed in the SQL associated query statement with the data table information corresponding to the second data table to be processed, to obtain an executable SQL query statement;

executing the executable SQL query statement to obtain a query result.

In some implementations, the data processing method is implemented based on the PySpark framework.

In some implementations, after the query result is obtained, the method further includes:

determining, based on an input operation of the user, a processing condition for processing the query result;

processing the query result based on the processing condition to obtain a processing result;

adding a column to the query result, and displaying the processing result in the column.
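
As an illustration of the calculated-column step above, the following is a minimal PySpark sketch, assuming the query result is a DataFrame; the browse_time column name and the labeling rule are assumptions made for the example rather than details from this disclosure.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    def add_calculated_column(result_df, new_col_name):
        # The processing condition here is a stand-in user-defined rule:
        # label each row by whether the (assumed) browse_time value exceeds 60.
        rule = F.udf(lambda t: "active" if t is not None and t > 60 else "inactive",
                     StringType())
        # Add one column to the query result and show the processing result in it.
        return result_df.withColumn(new_col_name, rule(result_df["browse_time"]))

    # Usage sketch: enriched = add_calculated_column(query_result_df, "col1")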

In some implementations, after the query result is obtained, the method further includes:

determining, based on the configuration operation of the user, a target database for storing the query result and a data table name for storing the query result;

if the target database does not include a data table corresponding to the data table name, creating a data table corresponding to the data table name;

if the target database includes a data table corresponding to the data table name, determining an output mode according to a selection operation of the user, and writing the query result into the data table of the target database in the output mode.
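
To make the output-to-library step concrete, here is a hedged PySpark sketch; the JDBC URL and credentials are placeholders, not the platform's actual configuration. Spark's JDBC writer creates the target table when it does not yet exist, which matches the branch above.

    def write_query_result(result_df, table_name, mode="append"):
        # mode is the user-selected output mode, e.g. "append" or "overwrite";
        # if the named table does not exist, the JDBC writer creates it before writing.
        (result_df.write
            .format("jdbc")
            .option("url", "jdbc:mysql://target-host:3306/target_db")  # placeholder target database
            .option("dbtable", table_name)
            .option("user", "db_user")          # placeholder credentials
            .option("password", "db_password")
            .mode(mode)
            .save())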

According to a second aspect of the embodiments of the present invention, a task scheduling method is provided, which is applied to a big data platform, and the method includes:

displaying a task scheduling page, and creating a scheduling task based on the user's input operation on the task scheduling page;

determining a plurality of target data tasks of the scheduling task from a plurality of pre-created candidate data tasks based on a user's selection operation, wherein the candidate data tasks include data processing tasks, and the data processing tasks are implemented using any of the above data processing methods;

displaying icons of each target data task on the task scheduling page;

determining the execution order of each target data task in the scheduling task based on the user's movement operation on each icon;

executing each target data task in the scheduling task according to the execution order.
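
As a rough illustration of how the drag-and-drop layout could map to an execution order, consider the sketch below; the position field and the rule that cards dropped at the same position form one parallel group are illustrative assumptions, not details from this disclosure.

    from collections import defaultdict

    def execution_order(cards):
        # cards: e.g. [{"task": "task1", "position": 0}, {"task": "task2", "position": 0},
        #              {"task": "task4", "position": 1}]
        # Cards dropped at the same position form one parallel group; groups are
        # executed one after another in ascending position order.
        groups = defaultdict(list)
        for card in cards:
            groups[card["position"]].append(card["task"])
        return [groups[p] for p in sorted(groups)]

    # execution_order(...) -> [["task1", "task2"], ["task4"]]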

In some implementations, the candidate data tasks also include a data integration task; the method also includes:

determining, based on the user's operation information on the data integration page, the data integration task name, the data source information and the data destination information;

creating a data integration task corresponding to the data integration task name according to the data source information and the data destination information.
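
A minimal sketch of what the recorded data integration task might look like; all field names and values here are assumptions for illustration, not the platform's actual schema.

    # Hypothetical definition assembled from the data integration page inputs.
    integration_task = {
        "name": "sync_user_info",                                        # data integration task name
        "source": {"datasource": "mysql2", "table": "user_info"},        # data source information
        "destination": {"datasource": "ch3", "table": "user_info_copy"}, # data destination information
    }

    def create_integration_task(task):
        # The platform would persist this definition and later run the actual
        # source-to-destination synchronization when the task is scheduled.
        return dict(task, id=abs(hash(task["name"])) % 10000)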

In some implementations, the method further includes:

based on the user's selection operation on the icon, displaying the development page of the target data task corresponding to the icon;

in response to detecting a modification operation of the user on the development page, saving the modified target data task.

In some implementations, the input operation includes a first run time of the scheduling task;

executing each target data task in the scheduling task according to the execution order includes:

in response to the current time reaching the first run time of the scheduling task, executing each target data task in the scheduling task according to the execution order, and obtaining the running status of each target data task;

if the running status is an error status, stopping the execution of the scheduling task and returning the status of the scheduling task as an error status;

if all target data tasks have been executed in the execution order of the target data tasks, returning the status of the scheduling task as a completed status.
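
A hedged Python sketch of this execution loop follows; the task objects, their run() method and the status strings are assumptions for illustration, and parallel groups (as in FIG. 8) would need an extra layer not shown here.

    import time
    from datetime import datetime

    def run_scheduling_task(first_run_time: datetime, ordered_tasks):
        while datetime.now() < first_run_time:   # wait until the first run time is reached
            time.sleep(1)
        for task in ordered_tasks:               # traverse the target data tasks in order
            status = task.run()                  # obtain the running status of each task
            if status == "error":                # stop on an error state and report it
                return "error"
        return "completed"                       # all target data tasks finished normally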

In some implementations, the method further includes:

in the scheduling task template page, displaying the corresponding scheduling task template based on the user's selection operation;

obtaining the user's configuration information for the scheduling task template;

generating a scheduling task matching the configuration information based on the scheduling task template.

According to a third aspect of the embodiments of the present invention, there is provided a data processing device applied to a big data platform, wherein the big data platform includes at least two different data sources; the device comprises:

an acquisition unit, used to display the data processing task page and obtain the SQL associated query statement in response to the user's input operation on the data processing task page, wherein the SQL associated query statement includes data source information of a plurality of first data tables to be processed, and at least two of the plurality of first data tables to be processed are from different data sources;

a conversion unit, configured to obtain a second data table to be processed according to the data source information, replace the data source information of the first data table to be processed in the SQL associated query statement with the data table information corresponding to the second data table to be processed, and obtain an executable SQL query statement;

an execution unit, used to execute the executable SQL query statement to obtain the query result.

According to a fourth aspect of the embodiments of the present invention, a computer-readable storage medium is provided; when an executable computer program in the storage medium is executed by a processor, any of the above methods can be implemented.

According to the above embodiments, after obtaining the SQL associated query statement, the present invention obtains the data table to be processed according to the data source information in the SQL associated query statement; according to the data table information of the data table to be processed, the SQL associated query statement is converted into an executable SQL query statement, and the executable SQL query statement is executed to obtain the query result. By converting the SQL associated query statement, input by the user in the agreed format, into an executable SQL query statement, data located in different data sources can be merged together in real time using only the SQL language.

It is to be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

FIG. 1 is a flow chart showing a data processing method according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a data processing task page according to an embodiment of the present invention.

FIG. 3 is a schematic diagram of an operation flow of a data processing task according to an embodiment of the present invention.

FIG. 4 is a schematic diagram showing a page for adding a calculated column according to an embodiment of the present invention.

FIG. 5 is a schematic diagram showing a newly added column in a query result according to an embodiment of the present invention.

FIG. 6 is a schematic diagram showing an output to library page according to an embodiment of the present invention.

FIG. 7 is a flow chart showing a task scheduling method according to an embodiment of the present invention.

FIG. 8 is a schematic diagram showing a task scheduling sequence according to an embodiment of the present invention.

FIG. 9A is a schematic diagram of a first data integration page according to an embodiment of the present invention.

FIG. 9B is a schematic diagram of a second data integration page according to an embodiment of the present invention.

FIG. 9C is a schematic diagram of a third data integration page according to an embodiment of the present invention.

FIG. 10A is a schematic diagram of a task scheduling page according to an embodiment of the present invention.

FIG. 10B is a schematic diagram showing a data integration task list according to an embodiment of the present invention.

FIG. 10C is a schematic diagram showing a data processing task list according to an embodiment of the present invention.

FIG. 11 is a schematic diagram showing an operation sequence according to an embodiment of the present invention.

FIG. 12 is a schematic diagram showing a template reference process according to an embodiment of the present invention.

FIG. 13 is a schematic diagram of a data processing device according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments will be described in detail herein, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Instead, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.

Data in different data sources can be stored in different forms and rely on different database management systems, but in some scenarios a cross-data-source query needs to be implemented, that is, data from different data sources must be brought together for analysis. Cross-data-source queries present some technical difficulties, mainly in the following aspects:

Heterogeneity of data sources: Different data sources have different storage structures and query syntaxes, requiring data conversion and syntax conversion.

Data security: During cross-data-source queries, data security and privacy need to be guaranteed, and operations such as data permission verification and data encryption are required.

Performance issues: Cross-data-source queries require data transmission and communication between different data sources, which may affect query performance.

Therefore, in order to bring together data from different data sources for analysis, ETL tools can usually be used to synchronize data from different data sources into a certain data warehouse for analysis. However, this approach requires pre-synchronization of the data, and the synchronization process is complex and time-consuming. In addition, the synchronized data has already become historical data, so real-time data calculation and analysis cannot be performed.

In view of this, the present invention provides a data processing method that can bring together data from different data sources based on the PySpark framework. That is, when performing a data processing task, if the current data analysis involves data from different data sources, this method can be used to fuse the required data from the different data sources, referred to for short as cross-source fusion.

The present invention can be applied to a big data platform. A data source is added to the big data platform through information such as the IP, port, database name and table name of the database in the data source, so that the big data platform can include at least two different data sources.

The following embodiments describe the present invention with reference to the accompanying drawings.

FIG. 1 is a flow chart of a data processing method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps 101 to 104.

In step 101, a data processing task page is displayed, and an SQL associated query statement is obtained in response to an input operation of a user on the data processing task page.

When the data tables to be processed are located in different data sources, the user needs to enter the SQL associated query statement in the agreed format, that is, the table name is identified in the format of "data source.data table". In this embodiment, the data source and the data table are referred to as data source information, and the data source information is used to indicate the data table in the specified data source.

In order to distinguish them from the data tables to be processed mentioned later, the data tables to be processed in the SQL associated query statement input by the user are called first data tables to be processed. The SQL associated query statement includes data source information of a plurality of first data tables to be processed, wherein at least two of the plurality of first data tables to be processed are from different data sources.

在一个可选的实施方式中,图2是根据本发明实施例示出的数据处理任务页面的示意图,如图2所示,用户可以在区域201中输入SQL关联查询语 句。In an optional implementation, FIG. 2 is a schematic diagram of a data processing task page according to an embodiment of the present invention. As shown in FIG. 2 , a user can enter an SQL associated query in area 201. sentence.

For example, the SQL association query statement entered by the user follows the specification below:

select * from (data source.data table) (left/right) join (data source.data table) on join condition

where the data sources, data tables and the join condition can be determined according to actual needs.

When the data tables queried by the user belong to different data sources, the user writes the table name of each first to-be-processed data table in the "data source.data table" format when entering the SQL association query statement. For the user, although the method is implemented with the PySpark framework, SQL can be written instead of Python code, which lowers the requirements on the operator.

Because the table names in the SQL association query statement entered by the user are in the "data source.data table" format, the statement cannot be executed directly and has to be converted through the subsequent steps 102 and 103.

In step 102, a second to-be-processed data table is obtained according to the data source information.

After the SQL association query statement entered by the user is obtained, it can be validated; once validation passes, the SQL is parsed to obtain the data source information. In the PySpark framework, after the data source information is obtained, the data table indicated by that information is read from the data source to generate a distributed data set (DataFrame) grouped by column names. To distinguish it from the previously mentioned to-be-processed data table, the to-be-processed data table generated by reading the data table in the data source is called the second to-be-processed data table in this embodiment.
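
As an illustration of how step 102 might be realized, the following is a minimal PySpark sketch that reads one table of a registered data source into a DataFrame over JDBC. The helper name load_table, the structure of the source dictionary and the connection details are assumptions made for this sketch rather than the platform's actual API, and the appropriate JDBC driver is assumed to be available on the classpath.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cross-source-query").getOrCreate()

def load_table(source: dict, table: str):
    # Read one table of a registered data source into a DataFrame via JDBC.
    # `source` is assumed to hold the JDBC URL and credentials captured when
    # the data source was added to the big data platform.
    return (spark.read.format("jdbc")
            .option("url", source["jdbc_url"])      # e.g. jdbc:mysql://mysql-host:3306/bdmp
            .option("dbtable", table)
            .option("user", source["user"])
            .option("password", source["password"])
            .load())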

In step 103, the data source information of the first to-be-processed data tables in the SQL association query statement is replaced with the data table information corresponding to the second to-be-processed data tables, yielding an executable SQL query statement.

That is, replacing the data source information that denotes a first to-be-processed data table in the SQL association query statement with the data table information of the corresponding second to-be-processed data table produces an executable SQL query statement.

In step 104, the executable SQL query statement is executed to obtain a query result.
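
One hedged way to realize steps 103 and 104 is to register each DataFrame as a temporary view named after its bare table name, strip the "data source." prefix from the user's statement, and run the rewritten SQL on the Spark session. The sketch below continues the previous one (it reuses the spark session defined there); it assumes the table names are unique within the query and that validation and parsing (step 102) have already succeeded.

import re

def to_executable_sql(sql: str, tables: dict) -> str:
    # Register each DataFrame under its bare table name and replace every
    # "data_source.table" reference in the SQL with that view name.
    for qualified_name, df in tables.items():    # e.g. {"mysql2.user_info": df1}
        table = qualified_name.split(".", 1)[1]
        df.createOrReplaceTempView(table)        # view "user_info" backed by df1
        sql = re.sub(re.escape(qualified_name), table, sql)
    return sql

def run_query(sql: str, tables: dict):
    # Step 104: execute the now-executable SQL and return the result DataFrame.
    return spark.sql(to_executable_sql(sql, tables))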

The present invention uses the PySpark framework for cross-source fusion analysis and data processing, submits the task to the Yarn cluster manager for execution, and uses Yarn to uniformly manage and schedule the resources of all nodes in the cluster. Besides the Yarn cluster manager, the Standalone cluster manager or the Mesos cluster manager can also be used; this embodiment does not limit the choice. With the data processing method provided by the present invention, data scattered across different data sources can be integrated for query and analysis, which improves query efficiency and the accuracy of data analysis, and also enables functions such as data sharing and collaborative analysis, helping to optimize data management and application development.

The following embodiment illustrates the data processing method of the present invention with a specific example.

Assume that MySQL serves as the business database and stores the basic user information and order information tables, while ClickHouse serves as a column-oriented database; ClickHouse can store large volumes of data and query them quickly, so it is suitable for storing the detailed records of all user browsing. Suppose the user's current requirement is to compute a store's daily user conversion rate for the current month, with the logic: conversion rate = number of users who placed orders / number of users who browsed.

According to this requirement, the order data in MySQL and the browsing data in ClickHouse need to be joined and analyzed using the user information as the association field. To perform the association analysis, the user writes the SQL with table names in the "data source.data table" format as needed.

FIG. 3 is a schematic diagram of the operation flow of a data processing task according to an embodiment of the present invention. As shown in FIG. 3, assume that the data source mysql2 in MySQL and the data source ch3 in ClickHouse have already been added to the big data platform. Taking the user information table user_info and the browsing detail table browse_details as an example, the association query can be written as:

select * from mysql2.user_info left join ch3.browse_details on user_info.id = browse_details.user_id;

On the data processing task page, after the user writes the SQL association query statement according to the rules, the statement is obtained in response to the user clicking the run control and is validated; once validation passes, the SQL is parsed, and the data sources, table names and field names can be extracted according to the rules. Based on the parsed information, in the PySpark framework the data tables in the corresponding data sources are read to generate the corresponding DataFrames: reading user_info in mysql2 yields a second to-be-processed data table denoted df1, and reading browse_details in ch3 yields a second to-be-processed data table denoted df2.

The data source information of the first to-be-processed data tables in the SQL association query statement entered by the user is then replaced with the data table information of the corresponding second to-be-processed data tables, that is, mysql2.user_info is replaced with df1 and ch3.browse_details is replaced with df2, giving an executable SQL query statement: select * from df1 left join df2 on df1.id = df2.user_id.

In the PySpark framework, the resulting executable SQL query statement is executed to obtain a result data set, i.e. the query result.
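
Putting the pieces together for this example, an end-to-end sketch could look as follows; it reuses the load_table and run_query helpers sketched above, and the connection descriptors (hosts, credentials) are placeholder assumptions.

mysql2 = {"jdbc_url": "jdbc:mysql://mysql-host:3306/bdmp",
          "user": "user", "password": "secret"}
ch3 = {"jdbc_url": "jdbc:clickhouse://ch-host:8123/default",
       "user": "user", "password": "secret"}

df1 = load_table(mysql2, "user_info")        # second to-be-processed table from MySQL
df2 = load_table(ch3, "browse_details")      # second to-be-processed table from ClickHouse

user_sql = ("select * from mysql2.user_info left join ch3.browse_details "
            "on user_info.id = browse_details.user_id")
result = run_query(user_sql, {"mysql2.user_info": df1,
                              "ch3.browse_details": df2})
result.show()                                # the query result of step 104

The daily conversion rate of the store example would then be computed on top of this result (or on a further join with the order table) with an ordinary aggregation query.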

It should be noted that although both the cross-data-source query and the cross-database query of the present invention involve accessing multiple data sources or multiple databases within one query, they differ as follows:

A cross-data-source query usually accesses multiple different database types or data storage systems in one query, for example relational databases, NoSQL databases, text files, Hadoop clusters and so on. The underlying storage structures and query syntaxes of these data sources may all differ, so a cross-data-source query requires technical means to connect these different data sources before the data can be queried and processed in a unified way.

A cross-database query, by contrast, accesses multiple databases of the same type in one query, for example querying within the same MySQL instance. In this case the storage structure and query syntax are identical, so cross-database queries are relatively simple.

If two databases are deployed on different machines with different IPs and ports, they belong to different data sources. For example, MySQL-10.10.111.111 and MySQL-47.22.22.22 are different data sources of the same type. The MySQL-10.10.111.111 data source contains multiple databases, such as bdmp, datax, information_schema and mysql111, and each database contains multiple data tables; the MySQL-47.22.22.22 data source likewise contains multiple databases, such as mysql222.

In one example, the bdmp database contains the user_info data table and the datax database contains the df data table. If a query involves the user_info table under bdmp and the df table under datax, the query involves different databases under the same data source and is therefore a cross-database query.

In another example, the bdmp database contains the user_info data table and the mysql222 database contains the test data table. When a query involves both the test table and the user_info table, the user_info table in the MySQL-10.10.111.111 data source and the test table in the MySQL-47.22.22.22 data source need to be fused across data sources using the method provided by the present invention.

The present invention can be applied in the following scenarios:

1. When giving a demonstration to a user or performing a preliminary analysis, synchronizing the user's data to the big data platform would take a long time, so the user's data can instead be accessed in the way provided by the present invention.

2. For data security reasons, when the user has not granted data synchronization permission, the user's data can be accessed in the way provided by the present invention.

3. When the user's data is updated at irregular intervals, failing to synchronize the updated data would make the analysis results inaccurate; with the present invention the latest data stored by the user can be analyzed, improving the accuracy of the analysis results.

In some embodiments, after the query result is obtained, the method may further include: determining, based on an input operation of the user, a processing condition for processing the query result; processing the query result based on the processing condition to obtain a processing result; and adding a column to the query result and displaying the processing result in that column.

As shown in FIG. 2, the data processing task page includes an add-calculated-column control. The processing condition in this embodiment may be a user-defined calculation rule or a user-defined filter condition.

In one embodiment, in response to detecting that the add-calculated-column control is triggered, the add-calculated-column page shown in FIG. 4 may be displayed. The page includes a calculation rule input box and a confirm control; in response to detecting that the confirm control is triggered, the Python code entered by the user in the calculation rule input box is run. Custom calculated columns can implement functionality that is inconvenient or impossible to express in SQL: the column name of the new column entered by the user and the Python code the user wrote are obtained, and after the code format passes validation the code segment is executed as a function to produce the new column.
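
Since the description says the user's Python snippet is executed as a function to produce the new column, one hedged way to sketch this in PySpark is to wrap the validated snippet into a callable and apply it as a user-defined function; the column name col1, the sample rule, the helper names and the reuse of the result DataFrame from the earlier sketch are illustrative assumptions.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# The code typed into the calculation-rule input box, received as a string
# (assumed to have already passed the format check described above).
user_rule = "return 'known' if row['user_id'] is not None else 'unknown'"

scope = {}
exec("def _user_rule(row):\n    " + user_rule, scope)   # wrap the snippet as a function

rule_udf = F.udf(lambda row: scope["_user_rule"](row.asDict()), StringType())
result_with_col1 = result.withColumn("col1", rule_udf(F.struct(*result.columns)))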

FIG. 5 is a schematic diagram of a new column added to the query result according to an embodiment of the present invention. As shown in FIG. 5, the query result is displayed in area 501 of the data processing task page, and a custom column, namely the rightmost column col1, is added to the list corresponding to the query result.

In some embodiments, after the query result is obtained, the method may further include: determining, based on a configuration operation of the user, a target database for storing the query result and the name of the data table in which the query result is to be stored; if the target database does not contain a data table corresponding to that table name, creating the data table corresponding to the table name; and if the target database does contain a data table corresponding to that table name, determining an output mode according to a selection operation of the user and writing the query result into the data table of the target database in that output mode.

In other words, the data processing task page includes an output-to-library control used to determine where the query result is stored.

FIG. 6 is a schematic diagram of the output-to-library page according to an embodiment of the present invention. In response to detecting that the output-to-library control is triggered, the output-to-library page shown in FIG. 6 is displayed, and the configuration information entered by the user on that page, namely the target source, the table name and the output mode, is obtained. Whether the table name configured by the user exists in the database is checked: if the table does not exist, it is created and the query result is stored in it; if the table exists, the result is stored according to the selected output mode. If the append mode is selected, the query result is appended to the table; if the overwrite mode is selected, the table is deleted, recreated, and the data is written into it.
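
A hedged sketch of the output-to-library step is given below: Spark's JDBC writer creates the table when it does not exist, appends rows in "append" mode, and drops and recreates the table in "overwrite" mode, which matches the two output modes described above. The helper name and the target dictionary are assumptions.

def output_to_db(df, target: dict, table: str, mode: str = "append"):
    # Write the query result into the configured table of the target database;
    # mode is "append" or "overwrite", as selected on the output-to-library page.
    (df.write.format("jdbc")
       .option("url", target["jdbc_url"])
       .option("dbtable", table)
       .option("user", target["user"])
       .option("password", target["password"])
       .mode(mode)
       .save())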

In some embodiments, as shown in FIG. 2, the data processing task page further includes a download control; in response to the download control being triggered, the query result is exported to a csv file.
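
For the download control, one simple possibility is to collect the result on the driver and export it as a csv file, for example as below; toPandas() assumes the result fits in driver memory, and the helper name is illustrative.

def download_as_csv(df, path: str = "query_result.csv"):
    # Materialize the query result and write it out as a CSV file for download.
    df.toPandas().to_csv(path, index=False)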

Current big data platforms need to determine the execution order of the data tasks in a scheduling task according to a run script written by a developer, which is costly and error-prone for the developer. In view of this, the present invention provides a task scheduling method with which the user can arrange the execution order of the data tasks in a scheduling task by dragging the icons corresponding to the data tasks within a target area, reducing the developer's workload and improving development efficiency.

The following embodiments describe the task scheduling method with reference to the accompanying drawings.

FIG. 7 is a flow chart of a task scheduling method according to an embodiment of the present invention. As shown in FIG. 7, the method includes the following steps 701 to 705.

In step 701, a task scheduling page is displayed, and a scheduling task is created based on an input operation of the user on the task scheduling page.

In step 702, multiple target data tasks of the scheduling task are determined from multiple pre-created candidate data tasks based on a selection operation of the user.

The candidate data tasks include data processing tasks, and the data processing tasks are implemented using any of the data processing methods described above.

In step 703, the icon of each target data task is displayed on the task scheduling page.

In step 704, the execution order of the target data tasks in the scheduling task is determined based on the user's movement operations on the icons.

In step 705, the target data tasks in the scheduling task are executed according to the execution order.

The scheduling task in this embodiment may be an offline scheduling task. In one implementation, the icon of a target data task may take the form of a card: each offline scheduling task may contain multiple target data tasks, the target data tasks have dependency relationships between them, and they may be executed either serially or in parallel. After the target data task cards are displayed on the task scheduling page, the execution order of the target data tasks can be defined by dragging the task cards.

FIG. 8 is a schematic diagram of the task scheduling order according to an embodiment of the present invention. As shown in FIG. 8, when one task card is placed behind another, the tasks are executed sequentially: the next task starts after the previous one finishes. When multiple task cards are dragged to the same vertical position, the tasks are executed in parallel and may run at the same time; the number of parallel tasks is not fixed, for example n tasks may run simultaneously. Note that parallel arrangements also cover two cases: task 4 can only start after tasks 1 to n have all finished, whereas task 9 can start as soon as task 8 has finished.
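
The layout produced by dragging the cards can be represented, for instance, as an ordered list of columns in which the cards at the same vertical position form one parallel group; the sketch below is such an illustrative representation and is only an assumption about the data structure. The second case, where task 9 starts as soon as task 8 alone has finished, would require a finer-grained dependency graph rather than flat columns.

from concurrent.futures import ThreadPoolExecutor

# Hypothetical plan derived from the card layout: each inner list is one
# column of cards, and a column starts only after the previous one finishes.
execution_plan = [
    ["task1", "task2", "task3"],   # one column: run in parallel
    ["task4"],                     # next column: waits for all of task1..task3
]

def run_column(tasks, run_task):
    # Run one column of cards concurrently and wait for all of them to finish.
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        return list(pool.map(run_task, tasks))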

Scenarios suited to parallel execution include: a large number of tasks need to be processed within the same scheduling task, the tasks executed in parallel have no dependencies on one another, and task requests need to be responded to quickly. Executing multiple independent target data tasks in parallel improves the task processing efficiency of the whole system, shortens the task processing time, and increases the system's throughput and response speed.

Besides data processing tasks, the candidate data tasks in this embodiment may also include data integration tasks, shell script tasks, python script tasks and so on, which can be added according to actual scheduling needs; the present invention does not limit this.

A data processing task in the present invention refers to querying and performing association computations on data across data sources by editing SQL statements; it supports outputting the analysis results directly to a database and downloading them, and supports custom calculated columns for complex logic. The details of data processing tasks have already been described in the data processing method above and are not repeated here.

A data integration task in the present invention is data synchronization between data sources. The data sources may include databases, file systems, service interfaces and message queues, and data can be fused between data sources through integration tasks. The main application scenarios of data integration tasks are data synchronization, data consolidation (aggregation), data migration and data exchange.

A shell script task in the present invention is code consisting of shell syntax or commands that implements some function; when periodic or non-periodic scheduling is required, a .sh file can be uploaded and added to a scheduling task for execution.

A python script task in the present invention is code written in the Python language that implements some function; when periodic or non-periodic scheduling is required, a .py file can be uploaded and added to a scheduling task for execution.

The following embodiment introduces the data integration task.

In some embodiments, the method may further include: determining, based on operation information of the user on a data integration page, a data integration task name, data source information and data destination information; and creating a data integration task corresponding to the data integration task name according to the data source information and the data destination information.

FIG. 9A, FIG. 9B and FIG. 9C are schematic diagrams of three data integration pages according to embodiments of the present invention. The data integration task name, label information and description information entered by the user on the data integration page shown in FIG. 9A are obtained, the data source information selected by the user on the data source page shown in FIG. 9B is obtained, and the data destination information selected by the user on the data destination page shown in FIG. 9C is obtained. A data integration task corresponding to the data integration task name is then created according to the data source information and the data destination information.

A data integration task is created on the big data platform to synchronize data from one data source to another. Taking synchronizing a data table from pg into a ClickHouse database as an example, the data integration task is configured as follows: first, fill in the basic information, i.e. the task name and description, on the data integration page shown in FIG. 9A; second, select the data source, i.e. select the pg table to be synchronized on the data source page shown in FIG. 9B and tick the data fields to be synchronized; then select the data destination, i.e. select the destination table in ClickHouse and the corresponding fields on the data destination page shown in FIG. 9C.

The task scheduling page of this embodiment includes a reference-data-integration-task control and a reference-data-processing-task control. Target data integration tasks are selected from the candidate data integration tasks through the reference-data-integration-task control, and target data processing tasks are selected from the candidate data processing tasks through the reference-data-processing-task control.

In response to detecting that the reference-data-integration-task control is triggered, a pre-created data integration task list corresponding to the control is displayed; in response to detecting that a data integration task in the list is triggered, the corresponding data integration task card is displayed in the target area of the task scheduling page.

In response to detecting that the reference-data-processing-task control is triggered, a pre-created data processing task list corresponding to the control is displayed; in response to detecting that a data processing task in the list is triggered, the corresponding data processing task card is displayed in the target area of the task scheduling page.

In one implementation, FIG. 10A is a schematic diagram of the task scheduling page according to an embodiment of the present invention. The task scheduling page shown in FIG. 10A is displayed, and a scheduling task is created based on the basic information entered by the user on the page and the selected first run time and run cycle (for example, every hour).

The task scheduling page shown in FIG. 10A contains a log-package-download control, the reference-data-integration-task control and the reference-data-processing-task control.

When the task scheduling detail page does not reference any data task, the log-package-download control cannot be triggered. After target data tasks have been referenced, triggering this control packages the run logs of all target data tasks and downloads them locally; this function is generally used to check how tasks are running or to troubleshoot when a task behaves abnormally.

In response to the reference-data-integration-task control being clicked, the list of already created data integration tasks shown in FIG. 10B can be displayed; selecting the required data integration tasks (multiple selection is allowed) generates the data task cards. By default the data integration tasks are displayed in reverse chronological order; when there are many of them, the search box can be used to find the target integration task.

In response to the reference-data-processing-task control being clicked, the list of already created data processing tasks shown in FIG. 10C can be displayed; selecting the required data processing tasks (multiple selection is allowed) generates the data task cards. By default the data processing tasks are displayed in reverse chronological order; when there are many of them, the search box can be used to find the target processing task.

The task cards are dragged according to their dependencies to adjust the order of the data tasks, and when it is detected that the user clicks the save control, the creation of the offline scheduling task is completed.

In some embodiments, based on a selection operation of the user on an icon, the development page of the target data task corresponding to the icon is displayed; in response to detecting a modification operation of the user on the development page, the modified target data task is saved.

When the icon is a data task card, the data task card includes a view control; in response to detecting that the view control is triggered, the development page corresponding to the data task card is displayed; in response to detecting that the user modifies the target data task on the development page, the modified target data task is saved.

The data tasks in this embodiment are all stored on the big data platform. Compared with the related art, in which commands have to be entered to inspect how a data task is running, this embodiment allows a data task to be modified by clicking the view control, which is convenient for the user.

In some embodiments, the data task card may further include a manual run control, and the method further includes: in response to detecting that the manual run control is triggered, running the data task corresponding to the data task card and displaying the run result. A data task can thus be run on its own without uploading a run script.

In some embodiments, the input operation includes the first run time of the scheduling task.

Executing the target data tasks in the scheduling task according to the execution order includes:

in response to the current time reaching the first run time of the scheduling task, executing the target data tasks in the scheduling task according to the execution order and obtaining the run status of each target data task;

if the run status is an error status, stopping the execution of the scheduling task and returning the status of the scheduling task as an error status;

if all the target data tasks have been executed in the execution order of the target data tasks, returning the status of the scheduling task as a completed status.

FIG. 11 is a schematic diagram of the run sequence according to an embodiment of the present invention. As shown in FIG. 11, the basic information of the offline scheduling task configured by the user, the run cycle, the first run time, all the target data tasks contained in each scheduling task, and the execution order of the target data tasks are stored in a specified data table. When the current time reaches the first run time of the task, that round of the scheduling task is executed: the target data tasks are executed in turn and the run status of each is obtained; when the run status of a target data task is an error, the scheduling task ends, execution stops, and an error status is returned; when a target data task completes normally, the next target data task is executed, until all target data tasks have finished running and a completed status is returned.
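
The run sequence of FIG. 11 can be paraphrased as a simple loop over the ordered target data tasks, for example as in the sketch below; the field names and status strings are assumptions made for illustration.

from datetime import datetime

def run_scheduled_task(schedule: dict, run_task) -> str:
    # Execute one round of an offline scheduling task.  `schedule` is assumed
    # to hold the first run time and the target data tasks already sorted
    # into their configured execution order.
    if datetime.now() < schedule["first_run_time"]:
        return "waiting"                  # the round is not due yet
    for task in schedule["ordered_tasks"]:
        status = run_task(task)           # e.g. "finished" or "error"
        if status == "error":
            return "error"                # stop the round and report the error state
    return "finished"                     # every target data task completed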

In some embodiments, the big data platform may further include a scheduling task template page containing multiple preset scheduling task template controls, and the method further includes:

on the scheduling task template page, in response to detecting that any scheduling task template control is triggered, displaying the scheduling task template corresponding to that control;

obtaining the user's configuration information for the scheduling task template; and

generating, based on the scheduling task template, a scheduling task matching the configuration information.

Scheduling task templates allow the user to quickly generate offline scheduling tasks by referencing a template during data analysis, reducing repetitive work. Templates can be produced in two ways: user-generated templates and built-in system templates. After creating an offline scheduling task, the user can click the generate-template button to save it to the template list so that it can be referenced directly the next time a similar task is needed; the built-in system templates are based on indicators commonly used in big data analysis, such as daily active users, monthly active users, online rate and conversion rate, from which the user can choose as required.

As an example, taking referencing a template to compute the conversion rate, FIG. 12 is a schematic diagram of the template referencing process according to an embodiment of the present invention. As shown in FIG. 12, after referencing a scheduling task template the user can modify the task name and description and set the first run time and run cycle; the template comes with preset task cards, which can be added or removed according to the user's selection operations. The user can also enter a task card's detail page to modify the task content: for a data integration task, the data source and data destination can be modified and the target data source specified; for a data processing task, the SQL code can be modified and run for debugging, and new calculated columns, output-to-library settings and so on can be edited as needed, similarly to the preceding embodiments, which is not repeated here.

FIG. 13 is a schematic diagram of a data processing device according to an embodiment of the present invention. As shown in FIG. 13, the data processing device provided by the present invention is applied to a big data platform containing at least two different data sources, and the device includes:

an acquisition unit 1301, configured to display a data processing task page and obtain an SQL association query statement in response to an input operation of a user on the data processing task page, the SQL association query statement including data source information of multiple first to-be-processed data tables, at least two of which come from different data sources;

a conversion unit 1302, configured to obtain second to-be-processed data tables according to the data source information and replace the data source information of the first to-be-processed data tables in the SQL association query statement with the data table information of the corresponding second to-be-processed data tables to obtain an executable SQL query statement; and

an execution unit 1303, configured to execute the executable SQL query statement to obtain a query result.

The specific operations performed by each of the above units are described in the foregoing embodiments and are not repeated here.

The present invention further provides a task scheduling device applied to a big data platform, the device including:

a creation unit, configured to display a task scheduling page and create a scheduling task based on an input operation of the user on the task scheduling page;

a determination unit, configured to determine multiple target data tasks of the scheduling task from multiple pre-created candidate data tasks based on a selection operation of the user, the candidate data tasks including data processing tasks implemented using the data processing method of the foregoing embodiments;

a display unit, configured to display the icon of each target data task on the task scheduling page;

a movement unit, configured to determine the execution order of the target data tasks in the scheduling task based on the user's movement operations on the icons; and

a scheduling unit, configured to execute the target data tasks in the scheduling task according to the execution order.

The apparatus implementing this embodiment has a display device, which may be any product or component with a display function, such as electronic paper, a mobile phone, a tablet computer, a television, a notebook computer, a digital photo frame or a navigator.

It should be noted that, in the accompanying drawings, the sizes of layers and regions may be exaggerated for clarity of illustration. It will be understood that when an element or layer is referred to as being "on" another element or layer, it may be directly on the other element or layer, or intervening layers may be present. Likewise, when an element or layer is referred to as being "under" another element or layer, it may be directly under the other element or layer, or more than one intervening layer or element may be present. It will further be understood that when a layer or element is referred to as being "between" two layers or elements, it may be the only layer between them, or more than one intervening layer or element may also be present. Like reference numerals refer to like elements throughout.

In the present invention, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance. The term "multiple" means two or more, unless explicitly defined otherwise.

Other embodiments of the present invention will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. The present invention is intended to cover any variations, uses or adaptations that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, and the true scope and spirit of the present invention are indicated by the following claims.

It should be understood that the present invention is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims (11)

1. A data processing method, applied to a big data platform containing at least two different data sources, the method comprising:
displaying a data processing task page, and obtaining an SQL association query statement in response to an input operation of a user on the data processing task page, the SQL association query statement comprising data source information of multiple first to-be-processed data tables, wherein at least two of the multiple first to-be-processed data tables come from different data sources;
obtaining second to-be-processed data tables according to the data source information;
replacing the data source information of the first to-be-processed data tables in the SQL association query statement with the data table information corresponding to the second to-be-processed data tables to obtain an executable SQL query statement; and
executing the executable SQL query statement to obtain a query result.

2. The method according to claim 1, wherein the data processing method is implemented based on the PySpark framework.

3. The method according to claim 1, wherein after the query result is obtained, the method further comprises:
determining, based on an input operation of the user, a processing condition for processing the query result;
processing the query result based on the processing condition to obtain a processing result; and
adding a column to the query result and displaying the processing result in the column.

4. The method according to claim 1, wherein after the query result is obtained, the method further comprises:
determining, based on a configuration operation of the user, a target database for storing the query result and a data table name for storing the query result;
if the target database does not contain a data table corresponding to the data table name, creating the data table corresponding to the data table name; and
if the target database contains a data table corresponding to the data table name, determining an output mode according to a selection operation of the user, and writing the query result into the data table of the target database in the output mode.

5. A task scheduling method, applied to a big data platform, the method comprising:
displaying a task scheduling page, and creating a scheduling task based on an input operation of a user on the task scheduling page;
determining multiple target data tasks of the scheduling task from multiple pre-created candidate data tasks based on a selection operation of the user, the candidate data tasks comprising data processing tasks implemented using the data processing method according to any one of claims 1 to 4;
displaying an icon of each target data task on the task scheduling page;
determining the execution order of the target data tasks in the scheduling task based on the user's movement operations on the icons; and
executing the target data tasks in the scheduling task according to the execution order.

6. The method according to claim 5, wherein the candidate data tasks further comprise data integration tasks, and the method further comprises:
determining a data integration task name, data source information and data destination information based on operation information of the user on a data integration page; and
creating a data integration task corresponding to the data integration task name according to the data source information and the data destination information.

7. The method according to claim 5, further comprising:
displaying, based on a selection operation of the user on an icon, a development page of the target data task corresponding to the icon; and
saving the modified target data task in response to detecting a modification operation of the user on the development page.

8. The method according to claim 5, wherein the input operation comprises a first run time of the scheduling task, and executing the target data tasks in the scheduling task according to the execution order comprises:
in response to the current time reaching the first run time of the scheduling task, executing the target data tasks in the scheduling task according to the execution order and obtaining the run status of each target data task;
if the run status is an error status, stopping the execution of the scheduling task and returning the status of the scheduling task as an error status; and
if all the target data tasks have been executed in the execution order of the target data tasks, returning the status of the scheduling task as a completed status.

9. The method according to claim 5, further comprising:
displaying, on a scheduling task template page, a corresponding scheduling task template based on a selection operation of the user;
obtaining the user's configuration information for the scheduling task template; and
generating, based on the scheduling task template, a scheduling task matching the configuration information.

10. A data processing device, applied to a big data platform containing at least two different data sources, the device comprising:
an acquisition unit, configured to display a data processing task page and obtain an SQL association query statement in response to an input operation of a user on the data processing task page, the SQL association query statement comprising data source information of multiple first to-be-processed data tables, wherein at least two of the multiple first to-be-processed data tables come from different data sources;
a conversion unit, configured to obtain second to-be-processed data tables according to the data source information, and replace the data source information of the first to-be-processed data tables in the SQL association query statement with the data table information corresponding to the second to-be-processed data tables to obtain an executable SQL query statement; and
an execution unit, configured to execute the executable SQL query statement to obtain a query result.

11. A computer-readable storage medium, wherein, when an executable computer program in the storage medium is executed by a processor, the method according to any one of claims 1 to 4 or claims 5 to 9 is implemented.