tags: Big Data, Programming Languages, Databases, Artificial Intelligence, Data Analysis
BIGO's global audio and video business has a growing need for real-time data capabilities. Data analysts want to see business data such as new users and active users in real time, across multiple dimensions, so they can grasp market trends as early as possible. Machine learning engineers want to receive users' view and click data in real time and quickly feed user preferences into models through online learning, so as to recommend the content each user finds most interesting. APP development engineers want to monitor the success rate and crash rate of APP launches in real time.
All of these real-time data capabilities depend on a real-time computing platform, and across the industry the shift toward real-time processing is accelerating. This article introduces the experience and results of building BIGO's Flink-based real-time computing platform.
Platform Introduction
The development of BIGO's real-time computing can be roughly divided into two stages. Before 2018, real-time scenarios were still few and there were not many real-time jobs; Spark Streaming was used to support them. Starting in 2018, after comprehensively weighing the advantages of Flink over Spark Streaming, we decided to move the real-time computing platform onto a Flink-based technical route. After nearly two years of development, the BIGO real-time computing platform has matured and now supports essentially all of the company's mainstream real-time computing scenarios. The following figure shows the architecture of the BIGO real-time computing platform:
The data sources for real-time computing fall into two categories. One is user behavior logs, such as views and clicks in the APP or browser, which are collected through Kafka and fed into real-time computing; the other is changes to relational databases caused by user behavior, whose binlogs are extracted by BDP and fed into real-time computing.
As the figure shows, the bottom layer of the BIGO real-time computing platform uses Yarn for cluster resource management, leveraging Yarn's distributed scheduling to schedule jobs across large clusters. The platform's compute engine is based on open-source Flink, customized and extended for BIGO's scenarios. On top sits BigoFlow, BIGO's self-developed one-stop development platform, where users can conveniently develop, debug, and monitor jobs. BigoFlow provides comprehensive SQL development capabilities, automatic monitoring configuration, and automatic log collection and query, allowing users to complete a business job with a single SQL statement. It offers the following functions:
It provides a powerful SQL editor with syntax checking and auto-completion.
It connects to all of the company's data sources and storage systems, sparing business teams from writing custom integration code.
Logs are automatically collected into Elasticsearch (ES), where users can easily search them and quickly locate errors.
Key job metrics are automatically wired into the company's monitoring and alerting platform, with no configuration needed from users.
It collects and automatically analyzes the resource usage of all jobs, helping to identify and manage jobs with unreasonable resource allocation.
The results of real-time computation are stored in different systems according to business needs: ETL job results usually go to Hive; data that needs ad-hoc queries usually goes to ClickHouse; and monitoring and alerting jobs can write their results directly into the alerting platform's Prometheus database for the platform to use.
Business Applications
As the real-time computing platform has developed, more and more scenarios have been moved onto the BigoFlow platform, and real-time computing has brought them many benefits. Below, a few typical scenarios illustrate the capability and performance gains that real-time computing delivers.
Data ETL
Data extraction and transformation is a typical real-time scenario. User behavior logs from the APP and browser are generated continuously in real time; they must be collected, extracted, transformed, and finally loaded into the warehouse in real time. Before, BIGO's ETL data path was typically Kafka -> Flume -> Hive. This Flume-based ingestion path had the following problems:
Flume's fault tolerance is poor; when failures occur, it may lose or duplicate data.
Flume's ability to scale dynamically is poor; it is hard to scale out immediately when traffic surges.
Once data fields or formats change, Flume is difficult to adjust flexibly.
Flink provides powerful state-based fault tolerance and end-to-end exactly-once guarantees, its parallelism can be adjusted flexibly, and Flink SQL makes logic changes easy. Most ETL scenarios have therefore been migrated to the Flink architecture.
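As an illustration only (the table names, fields, topic, and catalog here are hypothetical, not BIGO's actual jobs), a Kafka-to-Hive ETL job in Flink SQL might look like the following sketch, relying on Flink's checkpoint-based fault tolerance and the Hive streaming sink available around Flink 1.11:

```sql
-- Hypothetical Kafka source table for user behavior logs.
CREATE TABLE user_events (
  uid   BIGINT,
  event STRING,
  ts    TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND  -- event time, 5s lateness tolerance
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_events',                       -- hypothetical topic
  'properties.bootstrap.servers' = 'kafka:9092', -- hypothetical brokers
  'scan.startup.mode' = 'group-offsets',
  'format' = 'json'
);

-- Continuously write into a partitioned Hive table registered in a HiveCatalog;
-- partitions are committed according to the Hive table's partition-commit policy.
INSERT INTO hive_catalog.ods.user_events_hourly   -- hypothetical Hive table
SELECT
  uid,
  event,
  DATE_FORMAT(ts, 'yyyy-MM-dd') AS dt,
  DATE_FORMAT(ts, 'HH')         AS hr
FROM user_events;
```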
Real-time statistics
As a company with multiple APP products, BIGO needs a large number of statistical indicators to reflect each product's daily activity, revenue, and other metrics. Traditionally, these indicators were computed daily or hourly by offline Spark jobs. Offline computation struggles to guarantee timely data delivery, and important indicators were frequently delayed.
We have therefore been gradually moving important indicators to real-time computation, which greatly improves the timeliness of data delivery. Most notably, one important indicator used to be delayed so often that its downstream outputs were not ready until the afternoon, causing data analysts a great deal of trouble. After switching to a real-time pipeline, the final indicator is ready by 7 a.m., so analysts can use it as soon as they start work.
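As a hedged illustration (reusing the hypothetical user_events table from the ETL sketch above, not an actual BIGO indicator), such a real-time statistic can be expressed as a windowed aggregation in Flink SQL:

```sql
-- Hourly active users over the hypothetical user_events stream.
-- TUMBLE defines non-overlapping one-hour event-time windows, so each
-- window's result is emitted as soon as the watermark passes its end.
SELECT
  TUMBLE_START(ts, INTERVAL '1' HOUR) AS window_start,
  COUNT(DISTINCT uid)                 AS hourly_active_users
FROM user_events
GROUP BY TUMBLE(ts, INTERVAL '1' HOUR);
```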
Machine learning
With the explosion of information, users' interests shift faster and faster, which requires machine learning to recommend videos of interest based on users' behavior at that very moment. Traditional machine learning is based on batch processing, and updating a model typically takes hours at best. With sample training based on real-time computation, samples are continuously trained into a model that is applied online, achieving true online learning: recommendations reflect user behavior within minutes. Machine learning tasks currently account for more than 50% of the real-time computing cluster.
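A common building block of such pipelines is joining impressions with subsequent clicks to produce labeled training samples in real time. The following is only a sketch under assumed schemas (the impressions and clicks tables, their fields, and topics are hypothetical), not BIGO's actual sample job:

```sql
-- Hypothetical Kafka-backed streams; 'ts' is an event-time attribute.
CREATE TABLE impressions (
  uid BIGINT, item_id BIGINT, ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'impressions',
        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json');

CREATE TABLE clicks (
  uid BIGINT, item_id BIGINT, ts TIMESTAMP(3),
  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
) WITH ('connector' = 'kafka', 'topic' = 'clicks',
        'properties.bootstrap.servers' = 'kafka:9092', 'format' = 'json');

-- Label an impression 1 if the same user clicks the same item within
-- 10 minutes, else 0 (event-time interval join; unmatched impressions
-- are emitted once the watermark closes their join window).
SELECT
  i.uid,
  i.item_id,
  CASE WHEN c.uid IS NOT NULL THEN 1 ELSE 0 END AS label
FROM impressions i
LEFT JOIN clicks c
  ON  i.uid = c.uid
  AND i.item_id = c.item_id
  AND c.ts BETWEEN i.ts AND i.ts + INTERVAL '10' MINUTE;
```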
Real-time monitoring
Real-time monitoring is another very important real-time scenario. APP developers need to monitor indicators such as the APP launch success rate in real time and be notified promptly of any anomaly. The previous approach was to store the raw data in Hive or ClickHouse, configure rules on the Grafana-based monitoring platform, run a Presto or ClickHouse query at fixed intervals, and decide from the computed result whether to raise an alert. This approach has several problems:
Although Presto and ClickHouse are OLAP engines with good performance, they do not by themselves guarantee cluster high availability or real-time results, while monitoring places relatively high demands on both.
Every time an indicator is computed this way, all of the day's data must be scanned again, which wastes a great deal of computation.
Moving the monitoring program to real-time computation allows indicators to be computed incrementally and written directly into Grafana's backing database, which not only guarantees real-time results but also reduces the amount of data processed per computation by a factor of thousands.
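A minimal sketch of such a monitoring job follows. All names are hypothetical, and the 'prometheus' sink connector stands in for BIGO's in-house sink to the alerting platform's Prometheus; it is not an open-source Flink connector:

```sql
-- Hypothetical Kafka source of APP launch events.
CREATE TABLE app_open_events (
  event_time TIMESTAMP(3),
  result     STRING,   -- e.g. 'success' or 'fail'
  WATERMARK FOR event_time AS event_time - INTERVAL '10' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'app_open',                          -- hypothetical topic
  'properties.bootstrap.servers' = 'kafka:9092', -- hypothetical brokers
  'format' = 'json'
);

-- Hypothetical sink backed by the alerting platform's Prometheus.
CREATE TABLE open_success_rate (
  window_start TIMESTAMP(3),
  success_rate DOUBLE
) WITH (
  'connector' = 'prometheus'   -- placeholder for an in-house connector
);

-- Per-minute launch success rate, computed incrementally per window
-- instead of rescanning the whole day's data on every evaluation.
INSERT INTO open_success_rate
SELECT
  TUMBLE_START(event_time, INTERVAL '1' MINUTE),
  CAST(SUM(CASE WHEN result = 'success' THEN 1 ELSE 0 END) AS DOUBLE) / COUNT(*)
FROM app_open_events
GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE);
```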
BIGO real-time platform features
As the BIGO real-time computing platform developed, it gradually formed its own characteristics and strengths around how BIGO's internal businesses use it, mainly reflected in the following aspects.
Metadata access
A common situation is that the producers and consumers of data are not the same group of people: one team reports the data into Kafka or Hive, while data analysts use that data for computation. The analysts do not know the Kafka details; they only know the name of the Hive table they want to use.
To reduce this friction, BigoFlow integrates its metadata with Kafka, Hive, ClickHouse, and other storage systems. Users can reference Hive and ClickHouse tables directly in their jobs without writing DDL: BigoFlow automatically resolves the metadata and converts it into Flink DDL statements, greatly reducing development work. This is made possible by BIGO's unified planning of its computing platforms, something many companies with separate offline and real-time systems cannot achieve.
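Conceptually this resembles Flink's catalog mechanism. As a hedged sketch (the catalog, database, table names, and config path are hypothetical, and BigoFlow's own integration may well differ), registering a HiveCatalog in open-source Flink 1.10+ already lets SQL reference existing Hive tables with no DDL:

```sql
-- Register a Hive catalog pointing at the Hive Metastore.
CREATE CATALOG hive_catalog WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive/conf'   -- hypothetical path to hive-site.xml
);

USE CATALOG hive_catalog;

-- Existing Hive tables are now directly usable; no CREATE TABLE needed.
INSERT INTO dw.daily_active_users
SELECT dt, COUNT(DISTINCT uid)
FROM ods.user_events_hourly
GROUP BY dt;
```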
End-to-end productization solution
BigoFlow is not merely a platform for real-time computing. To make adoption and migration easy, it also provides end-to-end solutions tailored to business scenarios. Take the monitoring scenario introduced above: users had many monitoring jobs to migrate, so BigoFlow provides a dedicated solution that minimizes the work. Users only need to port the SQL that computes the monitoring indicators to Flink SQL; everything else, including the Flink job's DDL and the data sink to the monitoring platform, is handled automatically by BigoFlow, and the users' original alert rules remain unchanged. This lets users complete the migration with the least possible effort.
In addition, as mentioned earlier, BigoFlow automatically adds alerts on key metrics of user jobs, which meets the needs of most users and lets them concentrate on business logic rather than peripheral concerns. Users' logs are also automatically collected into ES for easy viewing, and common troubleshooting queries are curated as saved searches in ES, so users can run the matching query with a single click based on the symptom they observe.
Powerful Hive capabilities
Since most of BIGO's data is stored in Hive, real-time jobs often need to write their results into Hive, and many scenarios also need to read data from Hive. BigoFlow's Hive integration has therefore stayed at the forefront of the industry. Before community Flink 1.11 was officially released, we had already implemented writing data to Hive with dynamic metadata updates. On top of 1.11, we built streaming reads of Hive tables with EventTime support, dynamic partition filtering, and TXT-format compression, all ahead of the open-source community at the time.
Another example is the unified batch-stream scenario we implemented for ABTest with Flink. Normally, Flink consumes real-time data from Kafka and writes the computed results into Hive. However, the business frequently adjusts its logic and needs to reprocess historical data to reconcile numbers. Because the data volume is large, consuming it again from Kafka would put heavy pressure on Kafka and affect online stability. Since the data is also stored in Hive, we read from Hive when backfilling: the same code serves both the offline and online paths, minimizing the impact of backfills on production.
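Open-source Flink 1.11+ exposes a similar switch through dynamic table options. As a sketch only (the table and start partition are hypothetical, and BIGO's self-developed streaming read predates and may differ from these community options), the same query can read a Hive table either as a bounded batch source or as a stream catching up from a given partition:

```sql
-- Backfill / catch-up read of a Hive table as a stream.
-- Without the hint, the identical query reads the table as a bounded batch
-- source. Requires table.dynamic-table-options.enabled = true in Flink 1.11.
SELECT uid, event, dt, hr
FROM hive_catalog.ods.user_events_hourly
/*+ OPTIONS(
      'streaming-source.enable' = 'true',
      'streaming-source.consume-start-offset' = '2020-08-01'  -- hypothetical start partition
    ) */;
```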
Automated ETL job generation
Flink currently handles most ETL scenarios. The logic of ETL jobs is generally simple, but there are many of them, and the format of the reported data changes frequently, with fields added or removed. To reduce the cost of developing and maintaining ETL jobs, we built automatic ETL job generation: users only need to provide the Topic and format of the reported data, and an ETL job that writes the results into Hive is generated automatically. When the reported data's format or fields change, the job is updated automatically as well. Multiple data formats such as JSON and Protobuf (pb) are currently supported.
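As a sketch of what a generated job's source definition might look like (all names are hypothetical; BIGO's generator and its output are not public), the generator essentially derives a Kafka source table from the topic's reported schema, with format options chosen to tolerate malformed records:

```sql
-- A source table as the ETL generator might emit it for a JSON topic.
-- When the reported schema gains or loses fields, the generator re-emits
-- this DDL and updates the job; unparsable records are skipped, not fatal.
CREATE TABLE generated_src_user_events (
  uid   BIGINT,
  event STRING,
  ts    TIMESTAMP(3)
) WITH (
  'connector' = 'kafka',
  'topic' = 'user_events',                        -- provided by the user
  'properties.bootstrap.servers' = 'kafka:9092',  -- hypothetical brokers
  'format' = 'json',                              -- pb topics would use an (in-house) protobuf format
  'json.ignore-parse-errors' = 'true'             -- tolerate malformed records
);
```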
Outlook
With the rapid growth of BIGO's business, the BigoFlow real-time computing platform keeps growing and improving, but there is still much to refine. Going forward, we will focus on two directions: platform improvement and business support.
Improving the platform: focus on raising the platform's level of productization, mainly in the following aspects:
Develop automatic resource configuration and automatic tuning, so that the resources a job needs are configured according to its real-time data volume, scaling out automatically at traffic peaks and scaling in when traffic is low.
Support table lineage display, making it easy for users to analyze the dependencies between tasks.
Support multiple clusters across regions. Flink now carries many key businesses that demand extremely high SLA guarantees, and we will ensure their reliability through multiple data centers in different regions.
Explore scenarios such as batch-stream unification and the data lake.
Supporting more business scenarios: open up more machine learning and real-time data warehouse scenarios, and further promote the use of Flink SQL.
About the team:
The BIGO big data team focuses on rapid iteration over PB-scale data and on empowering upper-level businesses with big data analysis technology. It is responsible for building, for all of the company's businesses, EB-scale distributed file storage, message queues handling trillions of messages per day, and 50 PB of big data computing, spanning batch, stream, and MPP computing architectures and covering the full technology stack from data definition, pipelines, storage, and computing to data warehousing and BI. The team has a strong technical culture with many open-source contributors, and we look forward to excellent talent joining us!