
CN116414816A - A massive data processing method and system based on heterogeneous data sources - Google Patents


Info

Publication number
CN116414816A
CN116414816A (application CN202310350188.1A)
Authority
CN
China
Prior art keywords
data
query
mongodb
mysql
day
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310350188.1A
Other languages
Chinese (zh)
Inventor
师莎
盛振宇
汪飞
王钢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CLP Cloud Digital Intelligence Technology Co Ltd
Original Assignee
CLP Cloud Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CLP Cloud Digital Intelligence Technology Co Ltd filed Critical CLP Cloud Digital Intelligence Technology Co Ltd
Priority to CN202310350188.1A
Publication of CN116414816A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; database structures therefor; file system structures therefor
    • G06F 16/20: Information retrieval of structured data, e.g. relational data
    • G06F 16/21: Design, administration or maintenance of databases
    • G06F 16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/24: Querying
    • G06F 16/245: Query processing
    • G06F 16/2455: Query execution
    • G06F 16/25: Integrating or interfacing systems involving database management systems
    • G06F 16/252: Integrating or interfacing between a Database Management System and a front-end application
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract



The invention relates to a massive data processing method and system based on heterogeneous data sources. The method comprises steps of writing service call data into storage, querying service call data, cleaning data files on a schedule, and issuing intelligent early warnings through shell scripts. The method supports rapid expansion of the service system, greatly increases interface concurrency, and sustains on the order of a hundred million interface calls per day, making it suitable for storing and serving massive data; the system can be expanded without external users noticing, and interfaces remain callable during expansion. The method uses multiple data-source engines that cooperate on queries, supports cross-data-source queries, fits a variety of query scenarios, and improves real-time query and multi-dimensional analysis capability. It also cleans useless stored data and files promptly and automatically, so operations staff no longer need to inspect and clean manually, reducing both labor and machine costs.


Description

A massive data processing method and system based on heterogeneous data sources

Technical Field

The invention belongs to the technical field of data processing over mixed data sources, and in particular relates to a massive data processing method and system based on heterogeneous data sources.

Background Art

Many application systems need call statistics while running. For example, when a data service system reports on user interface calls, it needs the number of published services, the interface failure rate, interface call latency, interface call volume (daily, weekly, monthly, yearly), the number of supported applications, and so on, all of which must be computed from the interface call data. At present, most existing systems adopt an architecture of mongodb + mysql + manual data clearing by operations staff: mongodb stores the call data, scheduled tasks process the mongodb data into statistics, the statistics are stored in a single mysql table, and operations staff manually clean the useless data files out of mongodb on a fixed (yearly) schedule.

However, under this mongodb + mysql + manual-clearing scheme, if an emergency causes the number of calling users to surge, daily service call volume reaches the tens of millions or hundreds of millions, and per-second concurrency on a single interface reaches the thousands (for example, personal-information query interfaces offered by a city of over ten million residents), the following defects appear:

1. The daily interface call volume of the above scheme only scales to the tens of thousands; it cannot sustain a surge that pushes call volume to hundreds of millions per day.

2. MongoDB stores the data files and is hard to scale out: expanding a mongo cluster is complex, the service must stop during expansion, and the machine and labor costs are large. When concurrency spikes, it is difficult to expand on short notice in a way users do not notice.

3. Interface call statistics sit in a single mysql table. With 10,000 interfaces open to the outside, more than 130,000 rows are generated per day, up to 4 million per month and 50 million per year. Once the single table exceeds 50 million rows it hits the disk I/O bottleneck, leading to slow queries, low efficiency, interface call timeouts, and even unavailability.

4. The mysql and mongodb storage middleware do not support cross-data-source queries, are unfriendly to complex queries and statistics, and their query efficiency cannot satisfy complex OLAP needs.

5. Manual file cleaning works when the daily data file stays under 1 GB: on a typical file server with a 1 TB disk, a once-a-year manual cleanup is enough. But at 50 GB or more of new data per day, the data disk fills in little more than ten days; operations staff must check the disk frequently and delete files by hand, which costs labor, and a late cleanup makes the service unavailable.

Summary of the Invention

Terminology

Presto: Presto is Facebook's open-source, fully in-memory, parallel, distributed SQL query engine, suited to interactive analytical queries.

Catalog: a Catalog is a data source. Each data-source connection has a name; a Catalog may contain several Schemas and references its data source through a connector. The show catalogs command lists all data sources connected to Presto.

Schema: the equivalent of a database; a Schema contains several tables. The show schemas from 'catalog_name' command lists all Schemas under a Catalog.

Kafka: Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all of a site's consumer action-stream data.

To overcome the defects of the existing mongodb + mysql + manual-clearing architecture, the invention proposes a new massive data processing method based on heterogeneous data sources. The method aims to solve the following problems:

1. For cases where emergencies push the daily call volume of a single service to the tens of millions or even hundreds of millions, propose a data storage scheme that is easy to scale out, solving the difficulty of expansion.

2. Optimize the storage scheme to solve the database I/O blocking and lock-ups that make the service platform unavailable once a single mysql table exceeds 50 million rows.

3. Improve single-table query speed and efficiency.

4. For multiple data sources, enable cross-data-source queries to satisfy complex OLAP needs.

5. Propose automatic cleaning rules and algorithms that clean data files intelligently, replacing manual cleaning and saving labor cost.

Overall, the invention achieves these goals through the following technical ideas and strategies:

(1) Efficient expansion: change the primary storage engine, keep mongodb only as temporary storage middleware, and store the final data in Elasticsearch. Because Elasticsearch is expanded by adding machine nodes to the cluster, scaling out becomes much easier.

(2) Split mysql into tables intelligently by specific rules and algorithms.

(3) To serve single-table queries, complex OLAP, and other query types, store data across several data-source engines (mysql, mongodb, Elasticsearch, redis) that cooperate to complete queries, improving query efficiency and sharing query load.

(4) Introduce the Presto engine to support cross-data-source query and analysis over mysql and Elasticsearch.

(5) Clean mongodb, the mysql statistics tables, and Elasticsearch on a schedule, using a mix of scheduled jobs, shell scripts, and database scheduled tasks.

(6) Introduce intelligent early warning: for any server or database pressure the measures above still cannot absorb, inspect daily and automatically send warning messages to the administrator.

Specifically, the invention provides a massive data processing method based on heterogeneous data sources, comprising:

S1. Writing service call data into storage, the storage comprising mongodb and Elasticsearch;

S2. Querying service call data;

S3. Cleaning data files on a schedule;

S4. Issuing intelligent early warnings through a shell script.

Further, in some embodiments, the service call data in step S1 includes the API service call time, apiId, application name, parameters, elapsed time, error information, and a json string recording the error code.

Further, writing the service call data into storage in step S1 includes:

S11. Asynchronously writing the detailed service call data into mongodb for statistical queries, creating a new mongodb collection each day: a scheduled task starts at 23:59 every day, creates the new collection, and the day's data is written into that day's table;

S12. Asynchronously writing the detailed service call data into Elasticsearch for user queries, rolling over to a new index each month (a template is created, and each month a new index is generated from the template and the rollover policy);

S13. Every 10 seconds, aggregating the mongodb data by year, month, day, and hour into call volume, average elapsed time, failure count, and call count, and saving the statistics into the mysql statistics table;

S14. Creating a new mysql statistics table every year: at the end of each year a scheduled task creates the new table, the new table is back-filled with the previous month's data, and the next month's data is written into it. At 23:00:00 on the last day of each year a new table named dytj_<current year> is created and the December data is synchronized into it; on February 1 of each year, the previous December's data is cleaned out of the new table on schedule.
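The naming conventions in S11 and S14 are purely date-driven, so the scheduled tasks can derive their target names mechanically. A minimal sketch of that logic (the svc_call_ collection prefix is our assumption; only the dytj_ table prefix appears in the text):

```shell
#!/bin/sh
# Per-day mongodb collection name (S11): assumed "svc_call_" prefix + YYYYMMDD.
daily_collection() {
  # $1 = date in YYYY-MM-DD form
  echo "svc_call_$(echo "$1" | tr -d '-')"
}

# Per-year mysql statistics table name (S14): dytj_ + year.
yearly_table() {
  # $1 = date in YYYY-MM-DD form
  echo "dytj_${1%%-*}"
}

daily_collection 2023-04-05   # svc_call_20230405
yearly_table 2023-12-31       # dytj_2023
```

A real 23:59 job would feed in the next day's date and issue the corresponding createCollection / CREATE TABLE statements.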

Further, in some embodiments, when database and/or server pressure surges during operation and resource space runs short, the method expands capacity as follows:

(1) For mysql, through the mysql table-splitting strategy: when space runs short, add machines to mysql and store historical data and current data in separate databases;

(2) For mongodb, which only stores temporary data as temporary storage middleware: when space runs short, keep only 1 day of data and delete the intermediate-table data;

(3) For Elasticsearch, estimate the server resources to add, compute the number of machines required from the call volume, and scale out horizontally;

(4) For the application servers, estimate the server resources to add from the total concurrency and the concurrency a single server configuration supports, and scale out horizontally.
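The horizontal-scaling estimates in (3) and (4) reduce to a ceiling division of total load by per-node capacity. A sketch (the figures in the example are placeholders, not from the embodiment):

```shell
#!/bin/sh
# Machines needed = ceil(total concurrency / concurrency one node supports).
nodes_needed() {
  total=$1
  per_node=$2
  echo $(( (total + per_node - 1) / per_node ))
}

nodes_needed 8000 1500   # 6
nodes_needed 3000 1500   # 2
```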

Further, querying the service call data in step S2 includes:

(1) Single-table queries

① Querying API service call statistics: query the mysql statistics table to obtain statistics at year, month, day, and hour granularity;

② Querying API service call records: query Elasticsearch to obtain the API service name, call time, application name, data content, parameters, elapsed time, and error code;

(2) Cross-data-source queries

① Configure Presto's catalogs: locate the catalog directory under the Presto installation directory, add connector files there by creating mysql.properties, mongodb.properties, and Elasticsearch.properties, and configure the connector information (address, user name, password, and so on);

② Write the mixed-query sql, which is written like ordinary sql except that a queried table is named catalog.schema.tableName (the catalog being the name of the properties file), completing the cross-data-source query.
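As an illustration of steps ① and ②, a minimal mysql connector file and a cross-source query might look like the following; the host, credentials, schema, and table/column names are placeholders, not taken from the patent:

```properties
# etc/catalog/mysql.properties -- the file name becomes the catalog name
connector.name=mysql
connection-url=jdbc:mysql://127.0.0.1:3306
connection-user=presto
connection-password=secret
```

With an analogous elasticsearch.properties in place, a mixed query addresses each table as catalog.schema.tableName:

```sql
-- Hypothetical join of a mysql statistics table with an Elasticsearch index.
SELECT t.api_id, t.call_count, e.error_msg
FROM mysql.stats.dytj_2023 AS t
JOIN elasticsearch.default.svc_call AS e
  ON t.api_id = e.api_id
WHERE t.stat_day = DATE '2023-04-05';
```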

Further, in the single-table query above, querying API service call records also includes performing aggregation analysis on the retrieved data;

the aggregation analysis includes error-data distribution analysis, error-type distribution analysis, and analysis of the time periods with the most errors.

Further, cleaning data files on a schedule in step S3 includes:

(1) mongodb data cleaning

Set a daily scheduled deletion task with a data lifetime of 2 days, deleting each day the mongodb data from 2 days earlier;

(2) Elasticsearch data cleaning

① When an Elasticsearch index is created, also set a scheduled deletion task with a data lifetime of 1 year, so data is deleted automatically on expiry;

② Write a detection shell script by hand, deploy it on the server, and check the remaining disk space daily on a schedule: when less than 10% of the disk remains, delete files by importance and index creation time until 80% of the disk is free;
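The detection script in ② is a threshold rule: start cleaning when free space falls below 10%, and clean until 80% of the disk is free. The decision logic can be sketched in pure shell (function names are ours; a real script would read df output and delete old indices inside a loop):

```shell
#!/bin/sh
# Given the free-space percentage, decide what the daily cron run should do.
disk_action() {
  free=$1
  if [ "$free" -lt 10 ]; then
    echo "clean-until-80-free"   # delete by importance and index age
  else
    echo "ok"
  fi
}

disk_action 7    # clean-until-80-free
disk_action 45   # ok
```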

(3) mysql data cleaning

① Set a yearly scheduled deletion task with a data lifetime of 3 years, automatically cleaning up statistics tables more than 3 years old on expiry;

② Write a detection shell script by hand, deploy it on the server, and check the remaining disk space daily on a schedule: when less than 10% of the disk remains, delete files by importance and data storage time until more than 10% of the disk is free;

③ Create a mysql database event that checks single-table row counts daily: if a single table exceeds 100 million rows, write a record into the alarm table and notify the platform maintainers to inspect and act.
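Step ③ maps naturally onto a MySQL scheduled event. The sketch below is an assumed implementation (the alarm table name and columns are hypothetical), using the row estimates in information_schema rather than a full COUNT(*):

```sql
-- Requires the event scheduler: SET GLOBAL event_scheduler = ON;
CREATE EVENT IF NOT EXISTS chk_big_tables
ON SCHEDULE EVERY 1 DAY
DO
  INSERT INTO alarm_table (tbl_name, row_cnt, created_at)
  SELECT table_name, table_rows, NOW()
  FROM information_schema.tables
  WHERE table_schema = DATABASE()
    AND table_rows > 100000000;  -- threshold: 100 million rows
```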

Further, issuing intelligent early warnings through a shell script in step S4 includes:

S41. Writing the early-warning shell script and creating the file xtyj.sh under /home/opt;

S42. Querying the disk space; if disk usage reaches 85%, sending an XT_YJ_TOPIC message to Kafka whose content includes the server ip, cpu, and disk usage;

S43. Setting a scheduled task for the script, executed daily at 1:00;

S44. The back-end code listening on the Kafka topic XT_YJ_TOPIC and, on receiving a message, sending a warning to the system administrator;

S45. The system administrator, on receiving the warning, checking the server anomaly and handling it accordingly.
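The core of S41 and S42 can be sketched as follows. To keep the example self-contained, the Kafka send is replaced by echoing the message that would be produced (a real xtyj.sh might pipe it into kafka-console-producer.sh); the 85% threshold and the XT_YJ_TOPIC name come from the text, the message shape is our assumption:

```shell
#!/bin/sh
# Build the XT_YJ_TOPIC warning when disk usage reaches 85%.
warn_message() {
  usage=$1
  ip=$2
  if [ "$usage" -ge 85 ]; then
    echo "XT_YJ_TOPIC {\"ip\":\"$ip\",\"disk_used_pct\":$usage}"
  else
    echo "no-alert"
  fi
}

warn_message 91 10.0.0.8   # emits the warning message
warn_message 60 10.0.0.8   # no-alert
```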

With the measures above in place, any server or database pressure the system still cannot absorb is caught by the daily inspection, which automatically sends a warning message to the system administrator.

In a second aspect, the invention also provides a massive data processing system based on heterogeneous data sources, the system comprising:

a storage module for writing service call data into storage, the storage comprising mongodb and Elasticsearch;

a query module for querying service call data;

a cleaning module for cleaning data files on a schedule;

an early-warning module for issuing intelligent early warnings through a shell script.

In addition, the invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the massive data processing method above.

In summary, the massive data processing method of the invention has the following advantages:

(1) It supports rapid expansion of the service system, greatly increases interface concurrency, sustains on the order of a hundred million interface calls per day, and can store and serve massive data; the system can be expanded without external users noticing, and interfaces remain callable during expansion.

(2) It uses multiple data-source engines that cooperate on queries, supports cross-data-source queries, fits a variety of query scenarios, and improves real-time query and multi-dimensional analysis capability.

(3) It cleans useless stored data and files promptly and automatically, removing the need for manual inspection and cleanup and reducing labor and machine costs.

(4) It provides an intelligent early-warning function: when database or server pressure surges, warnings are sent to the platform maintainers by email, SMS, and similar channels.

Brief Description of the Drawings

To explain the technical solutions of the embodiments more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those skilled in the art can derive further drawings from them without creative effort.

Figure 1 is the overall implementation flowchart of the method.

Figure 2 is a block diagram of the service-call data storage scheme.

Figure 3 is a flowchart of the service-call data storage process.

Figure 4 is a block diagram of the service-call data query scheme.

Figure 5 is a block diagram of the scheduled data cleaning scheme.

Figure 6 is a flowchart of the intelligent early-warning process.

Figure 7 is a schematic diagram of the composition of the data processing system.

Detailed Description

To make the purpose, technical solution, and advantages of the invention clearer, the technical solution is described below clearly and completely with specific embodiments and the corresponding drawings. Obviously, the described embodiments are only some of the embodiments of the invention, not all of them; the invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or varied from different viewpoints and for different applications without departing from the spirit of the invention.

At the same time, it should be understood that the protection scope of the invention is not limited to the specific embodiments below, and that the terms used in the embodiments serve to describe those specific embodiments rather than to limit the protection scope of the invention.

Embodiment: a massive data processing method based on heterogeneous data sources

As shown in Figure 1, the method comprises the following steps:

S1. Writing service call data into storage

Assumed background:

The service platform exposes the following numbers of interfaces, split into high-concurrency and low-concurrency interfaces:

Figure BDA0004161218650000081

mysql data volume:

statistics rows generated per day = total number of services * 24 * 1.5 ≈ 110,000 rows;

statistics rows generated per year = rows per day * 365 ≈ 40 million rows.

mongodb data volume:

detailed data volume (per day) = number of high-concurrency interfaces * average daily calls * record size + number of low-concurrency interfaces * average daily calls * record size

= 50,000,000 * 10 * 1 KB + 10,000 * 3,000 * 1 KB

≈ 500 GB.

Elasticsearch data volume:

monthly data volume = daily volume * 30 ≈ 15 TB;

quarterly data volume = daily volume * 90 ≈ 45 TB;

yearly data volume = daily volume * 365 ≈ 180 TB.
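The daily figure above can be checked with integer arithmetic (1 KB per record, decimal units):

```shell
#!/bin/sh
# Check the mongodb daily-volume estimate: records are ~1 KB each.
high_kb=$(( 50000000 * 10 ))   # high-concurrency term, in KB
low_kb=$(( 10000 * 3000 ))     # low-concurrency term, in KB
daily_gb=$(( (high_kb + low_kb) / 1000000 ))
echo "daily ~${daily_gb} GB"                    # 530 GB, i.e. ~500 GB as stated
echo "monthly ~$(( daily_gb * 30 / 1000 )) TB"  # ~15 TB
```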

As shown in Figures 2 and 3, the service-call data storage flow comprises the following steps:

S11. Service call data is written into mongodb asynchronously

The detailed service call data, such as the API service call time, apiId, application name, parameters, elapsed time, error information, and the json string recording the error code, is written into mongodb asynchronously for statistical queries.

mongodb is split into collections by day

A new mongodb collection is created each day: a scheduled task starts at 23:59 every day and creates the new collection; the day's data is written into that day's table, and only 2 days of data are kept.

S12.服务调用数据异步写入ElasticsearchS12. Service call data is asynchronously written to Elasticsearch

将上述服务调用数据的具体信息异步写入Elasticsearch,用于用户查询。Asynchronously write the specific information of the above service call data into Elasticsearch for user query.

Elasticsearch generates a new index each month

An index template is created, and a new index is rolled over for Elasticsearch each month based on the template and policy.
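A sketch of what such a template-plus-policy configuration could look like, expressed as the request bodies one would send to the Elasticsearch ILM and index-template APIs. The index pattern, alias, and policy name are illustrative assumptions; the text states only that a template and policy drive the monthly rollover (the 1-year deletion matches step S3 below):

```python
# Illustrative ILM policy: roll over after ~1 month, delete after 1 year.
ilm_policy = {
    "policy": {
        "phases": {
            "hot": {"actions": {"rollover": {"max_age": "30d"}}},
            "delete": {"min_age": "365d", "actions": {"delete": {}}},
        }
    }
}

# Illustrative index template binding new indexes to the policy.
index_template = {
    "index_patterns": ["svc-call-*"],  # assumed index name pattern
    "template": {
        "settings": {
            "index.lifecycle.name": "svc-call-policy",       # assumed policy name
            "index.lifecycle.rollover_alias": "svc-call",    # assumed write alias
        }
    },
}
```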

S13. Save statistics to MySQL

Every 10 seconds, the MongoDB data is aggregated by year, month, day, and hour to compute the interface call volume, average elapsed time, failure count, and call count, and the statistics are saved to the MySQL statistics tables:

(1) insertAndUpdateHourData synchronizes hourly statistics;

(2) insertAndUpdateDayData synchronizes daily statistics;

(3) insertAndUpdateMonthData synchronizes monthly statistics;

(4) insertAndUpdateYearData synchronizes yearly statistics.
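A minimal sketch of the aggregation step these jobs perform, shown at the hourly granularity: raw call records are grouped into the per-interface, per-hour statistics that get upserted into MySQL. The record field names ('apiId', 'ts', 'cost_ms', 'success') are assumed; a production job would use a MongoDB aggregation pipeline and SQL upserts rather than this in-memory grouping:

```python
from collections import defaultdict

def aggregate_hourly(records):
    """Group raw call records into per-(apiId, hour) statistics.

    Each record is a dict with 'apiId', 'ts' (an 'YYYY-MM-DD HH:MM:SS'
    string), 'cost_ms', and 'success' -- assumed names for the fields
    listed in step S11.
    """
    stats = defaultdict(lambda: {"calls": 0, "failures": 0, "total_ms": 0})
    for r in records:
        key = (r["apiId"], r["ts"][:13])  # truncate to 'YYYY-MM-DD HH'
        s = stats[key]
        s["calls"] += 1
        s["failures"] += 0 if r["success"] else 1
        s["total_ms"] += r["cost_ms"]
    for s in stats.values():  # average elapsed time is derived last
        s["avg_ms"] = s["total_ms"] / s["calls"]
    return dict(stats)
```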

S14. MySQL statistics tables are sharded by year

A new MySQL data table is created each year: a scheduled task runs at 23:00:00 on the last day of the year, creates a table named dytj_&lt;current year&gt;, and synchronizes the December data into it; the following month's data is then written to the new table. On February 1 each year, a scheduled task purges the previous year's December data from the new table.
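The yearly-sharding rule above reduces to naming and date logic, sketched below. The dytj_ prefix comes from the text; the actual CREATE TABLE, data sync, and DELETE would run inside the scheduled tasks:

```python
from datetime import date

def yearly_table_name(d: date) -> str:
    """dytj_<year>: the statistics table named after the given date's year."""
    return f"dytj_{d.year}"

def is_year_end_task_day(d: date) -> bool:
    """The table-creation task runs on the last day of the year."""
    return d.month == 12 and d.day == 31

def is_cleanup_day(d: date) -> bool:
    """Feb 1: purge last year's December rows from the new table."""
    return d.month == 2 and d.day == 1
```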

Capacity expansion

While the system is running, if database or server load spikes and resource space runs short even after the data-file cleanup scheme has been applied, the intelligent early-warning function alerts platform maintainers by email, SMS, etc., indicating which application on which server is short of resources. Different expansion or mitigation schemes are applied for MySQL, MongoDB, Elasticsearch, and the application servers.

(1) MySQL

For MySQL, under the table-sharding strategy, when resource space runs short, machines are added and historical data and current data are stored in separate databases;

(2) MongoDB

MongoDB stores only temporary data and acts as intermediate storage middleware; when resource space runs short, only 1 day of data is kept and the remaining data in the intermediate collections is deleted;

(3) Elasticsearch

For Elasticsearch, the additional server resources required are estimated and the cluster is scaled out horizontally; the number of machines to add is computed from the call volume:

Daily data volume = Σ (daily call volume of each service in the statistics table) × 1 KB;

If 1 year of data must be retained, the required storage capacity is:

1-year data volume = daily data volume × 365.

(4) Application servers

For the application servers, the additional server resources required are estimated from the total concurrency and the concurrency supported by a single server configuration, and the tier is scaled out horizontally:

Total concurrency (per second) = hourly call volume from the statistics table / 3600.
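The two sizing formulas above can be combined into a small estimator. Only the 1 KB record size, the ×365 retention, and the /3600 conversion come from the text; the usable capacity per Elasticsearch node is an assumed parameter:

```python
import math

def es_nodes_needed(total_daily_calls: int, per_node_tb: float = 10.0) -> int:
    """Machines needed to hold 1 year of Elasticsearch data.

    Daily volume = total daily call volume * 1 KB; 1-year volume = daily * 365.
    per_node_tb is an assumed usable capacity per node (not given in the text).
    """
    yearly_tb = total_daily_calls * 1 * 365 / 1024 ** 3  # KB -> TB
    return math.ceil(yearly_tb / per_node_tb)

def total_concurrency_per_sec(hourly_calls: int) -> float:
    """Total concurrency (per second) = hourly call volume / 3600."""
    return hourly_calls / 3600
```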

S2. Query service call data

As shown in Figure 4, service call data queries include single-table queries and cross-data-source queries. The flow is as follows:

(1) Single-table query (simple query)

① To query API call statistics, the MySQL statistics tables are queried directly for statistics in the year, month, day, and hour dimensions;

② To query API call detail records, Elasticsearch is queried for the API service name, call time, application name, data content, parameters, elapsed time, and error code;

Aggregation analysis over the queried data is also supported, e.g. error-data distribution, error-type distribution, and the time periods with the most errors.

(2) Cross-data-source query (complex query)

Complex business queries require joint queries across MySQL, Elasticsearch, and MongoDB; Presto is used to perform these cross-data-source queries.

① Configure the Presto catalogs: locate the catalog directory under the Presto installation directory, add the connector files mysql.properties, mongodb.properties, and Elasticsearch.properties, and configure the connector information, including address, user name, password, etc.;

② Write the federated SQL to complete the cross-data-source query. Federated SQL is written like ordinary SQL, except that each table is referenced as catalog.schema.tableName, where catalog is the name of the properties file, as shown in the table below:

Data source type | catalog | Table name
mysql | mysql | mysql.&lt;database&gt;.&lt;table&gt;
mongodb | mongodb | mongodb.&lt;database&gt;.&lt;table&gt;
Elasticsearch | Elasticsearch | Elasticsearch.&lt;database&gt;.&lt;table&gt;
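A sketch of a federated query as it could be submitted to Presto. The schema names, table names, and join columns below are hypothetical; only the catalog.schema.tableName convention comes from the text:

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build the catalog.schema.tableName reference Presto expects."""
    return f"{catalog}.{schema}.{table}"

# Hypothetical join of MySQL statistics with Elasticsearch call details.
sql = f"""
SELECT s.api_id, s.call_count, d.error_code
FROM {qualified('mysql', 'stats', 'dytj_2023')} AS s
JOIN {qualified('elasticsearch', 'default', 'svc_call')} AS d
  ON s.api_id = d.api_id
"""
```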

S3. Scheduled cleanup of data files

As shown in Figure 5, data-file cleanup includes the following processes:

(1) MongoDB data cleanup

A daily scheduled deletion task is configured; data is valid for 2 days, and each day the MongoDB data from 2 days earlier is deleted (the collection named dyrz_&lt;date of 2 days ago&gt;);

(2) Elasticsearch data cleanup

① When an index is created, a scheduled deletion task is configured at the same time; data is valid for 1 year and is deleted automatically on expiry;

② A detection shell script is written manually and deployed on the server to check the remaining disk space daily; when less than 10% of disk space remains, files are deleted according to importance and index-creation-time policies until the remaining disk space reaches 80%;
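The delete-until-threshold loop of the detection script can be modeled as follows. This is a Python sketch of the logic the shell script implements; the candidate list is assumed to be pre-sorted by the importance and creation-time policy described above:

```python
def plan_deletions(candidates, free_ratio, trigger=0.10, target=0.80):
    """Decide which files/indexes to delete, in priority order.

    candidates: list of (name, size_as_fraction_of_disk) sorted by the
    importance-then-creation-time deletion policy. Deletion starts only
    when free space is below `trigger` and stops once it reaches `target`.
    """
    if free_ratio >= trigger:
        return []
    doomed = []
    for name, size in candidates:
        if free_ratio >= target:
            break
        doomed.append(name)
        free_ratio += size  # deleting the file frees its share of the disk
    return doomed
```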

(3) MySQL data cleanup

① A yearly scheduled deletion task is configured; data is valid for 3 years, and statistics tables more than 3 years old are cleaned up automatically on expiry;

② A detection shell script is written manually and deployed on the server to check the remaining disk space daily; when less than 10% of disk space remains, files are deleted according to importance and data-storage-time policies until more than 10% of disk space is free;

③ A MySQL database event checks the single-table row count daily; if any table exceeds 100 million rows, a record is written to the alarm table and platform maintainers are notified by email and SMS to check and, if an anomaly is found, take corresponding measures.

S4. Intelligent early warning via shell script

As shown in Figure 6, intelligent early warning includes the following steps:

S41. Write the intelligent early-warning shell script and create the file xtyj.sh under /home/opt;

S42. Query the disk space; if disk usage reaches 85%, send a message to the Kafka topic XT_YJ_TOPIC whose content includes the server IP, CPU, and disk usage;

S43. Set a scheduled task so that the early-warning shell script runs at 1:00 every day:

0 1 * * * /home/opt/xtyj.sh;

S44. The back-end code listens on the Kafka topic XT_YJ_TOPIC and, on receiving a message, sends warning messages to the system administrator by email, SMS, etc.;

S45. On receiving the warning, the system administrator checks the server anomaly and handles it accordingly.
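The warning pipeline of S42-S44 can be sketched end to end. The JSON field names are assumptions; a real deployment would use a Kafka producer and consumer rather than the in-memory handoff shown here:

```python
import json

XT_YJ_TOPIC = "XT_YJ_TOPIC"

def build_alert(ip: str, cpu_pct: float, disk_pct: float, threshold: float = 85.0):
    """S42: produce the (topic, message) pair if disk usage reaches 85%."""
    if disk_pct < threshold:
        return None
    return XT_YJ_TOPIC, json.dumps({"ip": ip, "cpu": cpu_pct, "disk": disk_pct})

def handle(message: str) -> str:
    """S44: the back-end consumer turns the message into an operator notice."""
    body = json.loads(message)
    return f"warning: server {body['ip']} disk at {body['disk']}%"
```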

A massive data processing system based on heterogeneous data sources, as shown in Figure 7, includes:

a storage module for writing service call data to storage, the storage including MongoDB and Elasticsearch;

a query module for querying service call data;

a cleanup module for scheduled cleanup of data files;

an early-warning module for intelligent early warning via shell scripts.

The above descriptions are only embodiments of the present invention and are not intended to limit it. Various modifications and variations of the present invention may occur to those skilled in the art; any modification, substitution, or the like made within the spirit and principles of the present invention shall fall within the protection scope of the claims.

Claims (10)

1. A massive data processing method based on heterogeneous data sources, characterized in that the method comprises:
S1. writing service call data to storage, the storage including MongoDB and Elasticsearch;
S2. querying the service call data;
S3. cleaning up data files on a schedule;
S4. performing intelligent early warning via a shell script.

2. The method according to claim 1, characterized in that the service call data in step S1 includes the API service call time, apiId, application name, parameters, elapsed time, error message, and error-code JSON string.

3. The method according to claim 2, characterized in that writing the service call data to storage in step S1 comprises:
S11. asynchronously writing the service call data to MongoDB for statistical queries, and creating a new MongoDB collection every day;
S12. asynchronously writing the service call data to Elasticsearch for user queries, and rolling over to a new Elasticsearch index every month;
S13. every 10 seconds, aggregating the MongoDB data by year, month, day, and hour to compute the interface call volume, average elapsed time, failure count, and call count, and saving the statistics to the MySQL statistics tables;
S14. creating a new MySQL data table every year.

4. The method according to claim 3, characterized in that when database and/or server load spikes during operation and resource space is insufficient, capacity is expanded as follows:
(1) for MySQL, under the table-sharding strategy, machines are added and historical data and current data are stored in separate databases;
(2) for MongoDB, which stores only temporary data as intermediate storage middleware, only 1 day of data is kept and the remaining intermediate table data is deleted;
(3) for Elasticsearch, the additional server resources required are estimated, the number of machines to add is computed from the call volume, and the cluster is scaled out horizontally;
(4) for the application servers, the additional server resources required are estimated from the total concurrency and the concurrency supported by a single server configuration, and the tier is scaled out horizontally.

5. The method according to claim 3, characterized in that querying the service call data in step S2 comprises:
(1) single-table query:
① querying the MySQL statistics tables for API call statistics in the year, month, day, and hour dimensions;
② querying Elasticsearch for API call detail records, including the API service name, call time, application name, data content, parameters, elapsed time, and error code;
(2) cross-data-source query:
① configuring the Presto catalogs: locating the catalog directory under the Presto installation directory, adding the connector files mysql.properties, mongodb.properties, and Elasticsearch.properties, and configuring the connector information;
② writing federated SQL in which each table is referenced as catalog.schema.tableName, thereby completing the cross-data-source query.

6. The method according to claim 5, characterized in that querying API call detail records in the single-table query further comprises aggregation analysis of the queried data, the aggregation analysis including error-data distribution analysis, error-type distribution analysis, and analysis of the time periods with the most errors.

7. The method according to claim 3, characterized in that cleaning up data files on a schedule in step S3 comprises:
(1) MongoDB data cleanup: a daily scheduled deletion task is configured, data is valid for 2 days, and the MongoDB data from 2 days earlier is deleted every day;
(2) Elasticsearch data cleanup: ① when an index is created, a scheduled deletion task is configured at the same time, data is valid for 1 year, and data is deleted automatically on expiry; ② a detection shell script is deployed on the server to check the remaining disk space daily, and when less than 10% of disk space remains, files are deleted according to importance and index-creation-time policies until the remaining disk space reaches 80%;
(3) MySQL data cleanup: ① a yearly scheduled deletion task is configured, data is valid for 3 years, and statistics tables more than 3 years old are cleaned up automatically on expiry; ② a detection shell script is deployed on the server to check the remaining disk space daily, and when less than 10% of disk space remains, files are deleted according to importance and data-storage-time policies until more than 10% of disk space is free; ③ a MySQL database event checks the single-table row count daily, and if any table exceeds 100 million rows, a record is written to the alarm table and platform maintainers are notified to check and handle it.

8. The method according to claim 3, characterized in that performing intelligent early warning via a shell script in step S4 comprises:
S41. writing the intelligent early-warning shell script and creating the file xtyj.sh under /home/opt;
S42. querying the disk space and, if disk usage reaches 85%, sending a message to the Kafka topic XT_YJ_TOPIC whose content includes the server IP, CPU, and disk usage;
S43. setting a scheduled task so that the script runs at 1:00 every day;
S44. the back-end code listening on the Kafka topic XT_YJ_TOPIC and, on receiving a message, sending warning messages to the system administrator;
S45. the system administrator, on receiving the warning, checking the server anomaly and handling it accordingly.

9. A massive data processing system based on heterogeneous data sources, characterized in that the system comprises:
a storage module for writing service call data to storage, the storage including MongoDB and Elasticsearch;
a query module for querying the service call data;
a cleanup module for scheduled cleanup of data files;
an early-warning module for intelligent early warning via shell scripts.

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the massive data processing method based on heterogeneous data sources according to any one of claims 1-8.
CN202310350188.1A 2023-04-04 2023-04-04 A massive data processing method and system based on heterogeneous data sources Pending CN116414816A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310350188.1A CN116414816A (en) 2023-04-04 2023-04-04 A massive data processing method and system based on heterogeneous data sources


Publications (1)

Publication Number Publication Date
CN116414816A true CN116414816A (en) 2023-07-11

Family

ID=87059109

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310350188.1A Pending CN116414816A (en) 2023-04-04 2023-04-04 A massive data processing method and system based on heterogeneous data sources

Country Status (1)

Country Link
CN (1) CN116414816A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060004709A1 (en) * 2004-06-07 2006-01-05 Veritas Operating Corporation System and method for providing a programming-language-independent interface for querying file system content
CN111414382A (en) * 2019-01-07 2020-07-14 北京智融网络科技有限公司 Slow SQ L polymerization display method and system based on MongoDB
CN112083893A (en) * 2020-09-25 2020-12-15 酒泉钢铁(集团)有限责任公司 Disk space optimization method based on Linux
CN115017182A (en) * 2022-06-29 2022-09-06 京东方科技集团股份有限公司 A visual data analysis method and device


Similar Documents

Publication Publication Date Title
CN100487700C (en) Data processing method and system of data library
CN112445863B (en) Data real-time synchronization method and system
CN110647512B (en) Data storage and analysis method, device, equipment and readable medium
CN101645032B (en) Performance analysis method of application server and application server
CN107038162A (en) Real time data querying method and system based on database journal
CN111339073A (en) Real-time data processing method and device, electronic equipment and readable storage medium
CN110865997A (en) An online identification method for hidden dangers of power system equipment and its application platform
CN102521712A (en) Process instance data processing method and device
CN106649869B (en) Statistical method and device for big data of database
CN104915902A (en) Cloud platform based implementation method for take-out order online delivery
CN106156047A (en) A kind of SNAPSHOT INFO processing method and processing device
CN107911799A (en) A kind of method using Intelligent routing
CN117149873A (en) Data lake service platform construction method based on flow batch integration
CN115794929A (en) Data management system and data management method for data mart
CN115292414A (en) A method for synchronizing business data to data warehouse
CN103198146B (en) Real-time event filtering method and real-time event filtering system oriented to network stream data
CN116383207A (en) A data label management method, device, electronic equipment and storage medium
CN116701525A (en) Early warning method and system based on real-time data analysis and electronic equipment
CN114489501A (en) Real-time big data processing system and method
CN116414816A (en) A massive data processing method and system based on heterogeneous data sources
CN114721495A (en) A power monitoring method, device, storage medium and computer equipment
CN116795816A (en) A data warehouse construction method and system based on streaming processing
CN114090686B (en) A method and device for accelerating payment
CN117076426A (en) Traffic intelligent engine system construction method and device based on flow batch integration
CN115718690A (en) Data accuracy monitoring system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination