
CN113312376B - Method and terminal for real-time processing and analysis of Nginx logs


Info

Publication number
CN113312376B
CN113312376B CN202110559722.0A CN202110559722A CN113312376B CN 113312376 B CN113312376 B CN 113312376B CN 202110559722 A CN202110559722 A CN 202110559722A CN 113312376 B CN113312376 B CN 113312376B
Authority
CN
China
Prior art keywords
query
data
interface
nginx
time
Prior art date
Legal status: Active
Application number
CN202110559722.0A
Other languages
Chinese (zh)
Other versions
CN113312376A (en)
Inventor
刘德建
王张浩
陈宏
Current Assignee
Fujian Tianquan Educational Technology Ltd
Original Assignee
Fujian Tianquan Educational Technology Ltd
Priority date
Filing date
Publication date
Application filed by Fujian Tianquan Educational Technology Ltd filed Critical Fujian Tianquan Educational Technology Ltd
Priority to CN202110559722.0A priority Critical patent/CN113312376B/en
Publication of CN113312376A publication Critical patent/CN113312376A/en
Application granted granted Critical
Publication of CN113312376B publication Critical patent/CN113312376B/en

Classifications

    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/2433 Query languages
    • G06F16/244 Grouping and aggregation
    • G06F16/24556 Aggregation; Duplicate elimination
    • G06F16/2462 Approximate or statistical queries
    • G06F16/284 Relational databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Fuzzy Systems (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a terminal for real-time processing and analysis of Nginx logs. Nginx logs are collected and stored in real time; the original text data in the Nginx logs that meets preset parameters is aggregated and compressed, and the resulting aggregated query data is stored in a query database; an analysis query request is received, and the corresponding query result is obtained from the query database and returned. Because the original text data meeting the preset parameters is aggregated and compressed before storage, the data subsequently analyzed and queried is the aggregated query data rather than the raw logs, which greatly reduces the data volume and enables fast queries.

Description

Method and terminal for real-time processing and analysis of Nginx logs
Technical Field
The invention relates to the technical field of data processing, in particular to a method and a terminal for real-time processing and analysis of Nginx logs.
Background
The web server Nginx is currently in widespread use as an HTTP (Hypertext Transfer Protocol) server and reverse proxy for enterprise-level web services because of its light weight and high performance. Serving as the unified HTTP request entry for all of an enterprise's network services, Nginx generates logs during operation that record the details of every HTTP request. By analyzing these logs, the behavior of external users and the running condition of internal services can be characterized, and the analysis results can generate significant value for an enterprise's product operation and service maintenance.
Currently, the industry commonly uses the open-source ELK stack to collect, process, and analyze web service logs such as those of Nginx. ELK is an acronym for three log processing tools (Elasticsearch, Logstash, Kibana) with a clear division of labor:
Elasticsearch is a distributed search engine responsible for storing and retrieving the logs.
Logstash is mainly responsible for collecting and filtering the data.
Kibana provides a friendly Web interface for visual analysis and summarization of the logs.
The three tools complement each other and are often used together as an integrated solution for uniformly collecting, managing, and analyzing web logs.
The existing approach of collecting and analyzing logs in real time with the ELK stack has a drawback: real-time queries over large-scale data respond slowly. Elasticsearch, acting as a full-text search engine for analyzing and processing the logs, stores all of the collected log data. When the volume of Nginx logs is very large (taking the applicant's company as an example, about a billion Nginx log entries are produced in one day, roughly 1 TB of data), searching across a full day of data at that scale is very time-consuming.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method and a terminal for real-time processing and analysis of Nginx logs that achieve fast queries.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
a method for real-time processing analysis of a Nginx log, comprising the steps of:
s1, acquiring and storing Nginx logs in real time;
s2, aggregating and compressing the original text data meeting preset parameters in the Nginx log to obtain and store aggregated query data to a query database;
and S3, receiving an analysis query request, and acquiring and returning a query result corresponding to the analysis query request from the query database.
In order to solve the technical problem, the invention adopts another technical scheme as follows:
a terminal for real-time processing and analysis of Nginx logs, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above-mentioned method for real-time processing and analysis of the Nginx logs when executing the computer program.
The invention has the beneficial effects that: the original text data in the Nginx logs that meets preset parameters is aggregated and compressed, and the resulting aggregated query data is stored in a query database, so that the data subsequently analyzed and queried is the aggregated query data rather than the raw logs.
Drawings
FIG. 1 is a schematic flow chart diagram of a method for real-time Nginx log processing analysis in accordance with an embodiment of the present invention;
FIG. 2 is a schematic flow chart illustrating a method for real-time Nginx log processing and analysis and a corresponding tool used in the method according to an embodiment of the present invention;
fig. 3 and fig. 4 are schematic diagrams illustrating results of a method for real-time processing and analyzing a Nginx log in an actual application process according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a terminal for real-time processing and analyzing the Nginx log according to an embodiment of the present invention;
fig. 6 is a schematic block diagram of a terminal for real-time processing and analyzing the Nginx log according to an embodiment of the present invention.
Description of reference numerals:
1. a terminal for real-time processing and analysis of Nginx logs; 2. a processor; 3. a memory.
Detailed Description
In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.
Referring to fig. 1 to 4, a method for real-time processing and analyzing a Nginx log includes the steps of:
s1, acquiring and storing Nginx logs in real time;
s2, aggregating and compressing the original text data meeting preset parameters in the Nginx log to obtain and store aggregated query data to a query database;
and S3, receiving an analysis query request, and acquiring and returning a query result corresponding to the analysis query request from the query database.
As can be seen from the above description, the beneficial effects of the invention are: the original text data meeting the preset parameters in the Nginx logs is aggregated and compressed, and the aggregated query data is stored in a query database, so that the data subsequently analyzed and queried is the aggregated query data, greatly reducing the query data volume.
Further, the step S1 specifically includes the following steps:
the method comprises the steps that Nginx logs on each server are collected in a unified and real-time mode through a log collection tool and stored in a service cluster according to a preset format, and original text data of each Nginx log record information of one HTTP request from the outside.
As can be seen from the above description, the information of each external HTTP request is collected and stored in the form of original text data by the log collection tool, which facilitates subsequent statistics.
Further, the preset parameters comprise dimension variables and index variables, the dimension variables comprise service dimensions, time dimensions, interface dimensions, product dimensions and user dimensions, and the index variables comprise total request numbers, error request numbers and slow response request numbers;
the step S2 specifically includes the following steps:
aggregating and compressing the original text data in the Nginx log that meets the same preset parameters into one record, counting the number of corresponding original text entries, and storing the resulting aggregated query data in a query database.
Further, the step S2 specifically includes the following steps:
acquiring the domain name information, original timestamp, interface field, product ID, encrypted Token character string, status code, and response time from the original text data of the Nginx log; converting the domain name information into a service name through a mapping relation; converting the original timestamp to the time granularity supported by query analysis to obtain time information; replacing the interface field with interface information in RestFul format; querying the product name corresponding to the product ID from an association mapping table in MySQL; decrypting the Token character string through an account center interface to obtain a user ID; and performing corresponding statistics on each index variable according to the status code and response time, obtaining intermediate process data comprising the service name, time information, interface information, product name, user ID, and the statistical result of each index variable;
grouping the intermediate process data according to the dimension variables and performing aggregation statistics on the index variables according to the groups to obtain query data;
and storing the query data to a query database.
From the above description it can be seen that, compared with the original ELK query tooling, the required dimension variables are obtained by transforming the original text data, and grouping by several dimension variables greatly reduces the data volume; that is, the invention can quickly answer queries across multiple dimensions, saving time and labor cost. Moreover, based on the correspondence between index variables and dimension variables, when a service returns errors or responds too slowly, the dimension variables responsible can be determined quickly, shortening fault handling time and reducing the enterprise's economic loss.
Further, the step S2 of replacing the interface field with the RestFul format interface information specifically includes the following steps:
judging whether the interface field is registered in an interface management tool; if so, acquiring the RestFul-format interface registration information from the interface management tool, converting the registration information into a regular expression, and matching the interface field against the regular expression; if the match succeeds, using the interface registration information as the interface information;
and if the interface field is not registered in the interface management tool, replacing character strings in the interface path with corresponding placeholders according to preset conversion rules.
As can be seen from the above description, when the interface is registered in the interface management tool, regular-expression matching can be used, so that after a successful match the registration information replaces the original interface field, yielding interface information in RestFul format. For unregistered interfaces, preset conversion rules replace parameter-like character strings with placeholders, so that equivalent parameters map to the same placeholder and aggregation statistics can be performed on each interface under a given service name.
Further, the step S2 of replacing the character string in the interface path with the corresponding placeholder according to a preset conversion rule specifically includes the following steps:
and replacing, in the interface path, character strings that are all digits, in UUID format, Token strings longer than 64 characters, or trailing file-name suffixes with the corresponding placeholders.
From the above description, it can be seen that the parameters of the same interface are unified, enabling aggregation statistics in the interface dimension.
Further, the step S2 specifically includes the following steps:
aggregating and compressing the original text data which meet preset parameters in the Nginx log to obtain aggregated query data;
storing the aggregated query data in a Hive database, wherein a partition field corresponding to each partition in each Hive table in the Hive database comprises a date and an hour.
As can be seen from the above description, using the two partition fields date and hour means that when a user queries only the access details of a service within a past hour, only the data in that hour's partition needs to be read and computed, since the data is partitioned by hour. This is more efficient than partitioning by date alone, which would require reading the whole day's data and filtering out the hour in question for the same calculation.
Further, the step S2 of storing the aggregated query data in a Hive database specifically includes the following steps:
every time a new hour begins, reading all the original data stored in the previous hour's partition through a merging program, merging it into a new data file, and writing the new data file back into that partition, overwriting the original contents.
From the above description, it can be seen that consolidating all the small files within one hour into one large file per hour partition avoids the problem that too many small files reduce query efficiency.
Further, the step S1 specifically includes the following steps:
acquiring and storing Nginx logs into a Kafka cluster in real time by using a FileBeat log acquisition tool;
the step S2 specifically includes the steps of:
aggregating and compressing the textual data which is stored in the Kafka cluster and meets preset parameters in the Nginx logs by using Spark Streaming to obtain and store aggregated query data to a Hive database;
the step S3 specifically includes the following steps:
and receiving an analysis query request, and acquiring and returning a query result corresponding to the analysis query request from the Hive database based on a Presto data query engine.
From the above description, Presto is used as the data query engine; all of its computation is performed in memory, so its query efficiency is far better than querying the Hive database directly, improving query speed.
Referring to fig. 5 and 6, a terminal for real-time processing and analyzing of a Nginx log includes a memory, a processor, and a computer program stored in the memory and running on the processor, wherein the processor executes the computer program to implement the method for real-time processing and analyzing of the Nginx log.
From the above description, the beneficial effects of the invention are: the original text data meeting the preset parameters in the Nginx log is aggregated and compressed, and the aggregated query data is stored in a query database, so that the data subsequently analyzed and queried is the aggregated query data.
Referring to fig. 1 to 4, a first embodiment of the present invention is:
a method for real-time processing analysis of a Nginx log, comprising the steps of:
s1, acquiring and storing Nginx logs in real time;
the step S1 specifically includes the following steps:
the method comprises the steps that Nginx logs on each server are collected in a unified and real-time mode through a log collection tool and stored in a service cluster according to a preset format, and original text data of each Nginx log record information of one HTTP request from the outside.
In this embodiment, FileBeat is used as the log collection tool.
S2, aggregating and compressing the original text data meeting preset parameters in the Nginx log to obtain and store aggregated query data to a query database;
in the daily operation and maintenance analysis process, developers of web services generally pay attention to only a few special operation and maintenance indexes of current services, such as the total number of requests, the 5xx number of requests (HTTP requests with state codes beginning with 5 generally indicate internal errors of a server), the 4xx number of requests (HTTP requests with state codes beginning with 4 generally indicate errors caused by clients), and slow response requests (HTTP requests with response time longer than 1 second can be regarded as slow requests). Wherein the 5xx request number and the 4xx request number can be classified as error requests.
Besides global statistics, developers sometimes need to count the operation and maintenance indexes along different dimensions such as interface, product, and user. One case the conventional ELK tooling cannot handle: a developer wants to check the access volume of each interface of the current service, but because ELK records the raw interface information of the original request, the path contains varying parameter values in the middle and at the end, so requests to the same type of interface cannot be aggregated.
Based on these scenarios, the dimension variables and index variables commonly used in the daily operation and maintenance statistical analysis of network services are sorted out and summarized. That is, the preset parameters comprise dimension variables and index variables; the dimension variables comprise the service, time, interface, product, and user dimensions, and the index variables comprise the total request count, error request count, and slow response request count.
In this embodiment, Spark Streaming is used to aggregate and compress the original text data meeting the preset parameters in the Nginx logs stored in the Kafka cluster, and the aggregated query data is stored in the Hive database. Spark Streaming is a framework extending the Spark core API that achieves high throughput and provides a fault-tolerance mechanism for real-time stream processing. Kafka is a distributed message streaming platform.
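As a minimal ingestion sketch: the patent's pipeline uses Spark Streaming, while the code below uses Spark's Structured Streaming API as a stand-in; the bootstrap servers and topic name are assumptions.

```python
from pyspark.sql import SparkSession

# Consume raw Nginx log lines from Kafka; servers and topic are assumptions.
spark = (SparkSession.builder
         .appName("nginx-log-agg")
         .enableHiveSupport()
         .getOrCreate())

raw_lines = (spark.readStream
             .format("kafka")
             .option("kafka.bootstrap.servers", "kafka1:9092,kafka2:9092")
             .option("subscribe", "nginx-access-logs")      # assumed topic name
             .load()
             .selectExpr("CAST(value AS STRING) AS line"))  # one log line per message
```

A one-minute processing trigger on the downstream sink (trigger(processingTime='1 minute')) would reproduce the minute-level batching described below.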
Wherein, step S2 specifically includes the following steps:
s21, obtaining domain name information, an original timestamp, an interface field, a product ID (Identity), an encrypted Token character string, a state code and response time in the original text data of the Nginx log;
wherein Token is a temporary Token in computer identity authentication.
S22, converting the domain name information into a service name through a mapping relation;
the log text only stores the requested domain name information, and there may be multiple domain names under one service name, so we need to convert the domain name information into corresponding service information. In the conventional ELK tool, since the log original text is analyzed by the preset configuration file, the domain name to service mapping cannot be realized unless the log original text redundantly stores the service information.
S23, converting the original timestamp to the time granularity supported by query analysis to obtain time information;
since Spark Streaming is to segment data at intervals of arrival time of the data and then process the data in batches, it is assumed that this embodiment supports a query whose time level is a support minute level, and thus a batch of data is processed every 1 minute here, strictly speaking, it is a quasi-real-time process here.
Since the final query analysis is at minute granularity, the request occurrence time in the log is converted into a timestamp in seconds, then divided by 60 and rounded down to give the occurrence time of the request. Requests with the same dimension combination within the same minute can then be aggregated, reducing the amount of data finally retained in the result table.
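A minimal sketch of this bucketing (the function name is illustrative):

```python
# Floor a second-resolution Unix timestamp to the minute it falls in.
def minute_bucket(ts_seconds: int) -> int:
    return ts_seconds // 60  # all requests within the same minute share this value

# Two requests 40 seconds apart inside one minute map to the same bucket:
assert minute_bucket(1619490610) == minute_bucket(1619490650)
```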
Since queries are provided at minute-level statistical granularity, multiple requests within the same minute whose other dimension values are identical are aggregated into one record. For example, when the other dimension variables (service, interface, product, user) are the same, two requests occurring within the same minute on 2021-04-27 are aggregated, so the 2 original records are stored as 1 record.
In addition, since individual logs may arrive late, three more request records from the same minute may appear in the next batch of data; for the same reason they are likewise merged into one record. Finally, a Presto query aggregating by "time (minute) + service + interface + product + user" adds up the access counts, giving an access volume of 5 for that service + interface + product + user in that minute. See the worked example that follows for details.
S24, replacing the interface field with interface information in a RestFul format;
since the interface field in the log source contains parameter information, the parameter information may appear in the middle of the interface path, and may also be represented by? "appears as a separator after the interface path. Therefore, the same interface cannot realize aggregate statistics due to different parameters, and currently, the ELK cannot count daily access data of each interface under a certain service name.
Therefore, in this embodiment, the step S24 of replacing the interface field with the interface information in the RestFul format specifically includes the following steps:
s241, judging whether the interface field is registered in the interface management tool, if so, acquiring interface registration information in a RestFul format from the interface management tool, replacing the interface registration information with a regular expression, matching the interface field through the regular expression, and if matching is successful, using the interface registration information as the interface information;
the restul is a design style and development mode of a network application, and may be defined in an XML format or a JSON format based on HTTP.
An example of replacing an interface field with the RestFul format is as follows:
GET /v0.1/users/123456?time=1618392005 => GET /{version}/id/{user_id}
in this embodiment, swagger, as an open-source interface management tool, registers interface information of a service to be monitored on Swagger, so as to obtain the interface registration information of the service from Swagger, and replace the content in the { } placeholder with a regular expression, as follows:
GET /{version}/id/{user_id} => GET /[^/]+?/id/[^/]+?
The regular expression is then matched against the interface field in the original text data of the Nginx log, i.e., the path with the parameters after "?" stripped; if the match succeeds, the current interface information is replaced with the RestFul-format interface registered in Swagger.
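A minimal sketch of this matching step (the helper name is an assumption):

```python
import re

# Convert a registered RestFul template such as "GET /{version}/id/{user_id}"
# into a regular expression, then match it against the raw interface field
# with the parameters after '?' stripped.
def template_to_regex(template: str) -> "re.Pattern[str]":
    pattern = re.sub(r"\{[^/}]+\}", "[^/]+?", template)  # {x} -> [^/]+?
    return re.compile("^" + pattern + "$")

rx = template_to_regex("GET /{version}/id/{user_id}")
raw = "GET /v0.1/id/123456?time=1618392005".split("?")[0]
if rx.match(raw):
    interface = "GET /{version}/id/{user_id}"  # replace with the registered form
```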
S242, if the interface field is not registered in the interface management tool, replacing character strings in the interface path that are all digits, in UUID (Universally Unique Identifier) format, Token strings longer than 64 characters, or trailing file-name suffixes with the corresponding placeholders.
For services whose interfaces are not registered in Swagger, the following string patterns are replaced with the corresponding placeholders:
(1) All digits: /id/123456 is replaced with /id/{NUM}.
(2) UUID format: /id/4ff99fbd-1a56-4a95-8682-dab52782e823 is replaced with /id/{UUID}.
(3) Token character strings longer than 64 characters: /TOKEN/7F938B205F876FC3398A4D17A79F93C72BDBB410117BE7AEEE04B2A4988DE1B48151A0D1DEEF882D04F855650A6FB6B1 is replaced with /TOKEN/{TOKEN-STRING}.
(4) Trailing file-name suffix: /file/tmp.txt is replaced with /file/{FILENAME}.
(5) Character strings of 15 or more all-uppercase or all-lowercase letters: /app/iejxidheuygeojdlf is replaced with /app/{ID}.
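A minimal sketch of these conversion rules follows; the exact regular expressions and their ordering are assumptions reconstructed from the examples above.

```python
import re

# Placeholder rules for unregistered interfaces, applied most-specific first.
RULES = [
    (re.compile(r"/[0-9A-Fa-f]{64,}(?=/|$)"), "/{TOKEN-STRING}"),   # 64+ char token
    (re.compile(r"/[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}"
                r"-[0-9a-f]{4}-[0-9a-f]{12}(?=/|$)"), "/{UUID}"),
    (re.compile(r"/\d+(?=/|$)"), "/{NUM}"),                          # all-digit segment
    (re.compile(r"/[^/]+\.[A-Za-z0-9]+$"), "/{FILENAME}"),           # trailing file name
    (re.compile(r"/(?:[a-z]{15,}|[A-Z]{15,})(?=/|$)"), "/{ID}"),     # 15+ same-case letters
]

def normalize_path(path: str) -> str:
    for rx, placeholder in RULES:
        path = rx.sub(placeholder, path)
    return path

print(normalize_path("/id/4ff99fbd-1a56-4a95-8682-dab52782e823"))  # /id/{UUID}
print(normalize_path("/file/tmp.txt"))                             # /file/{FILENAME}
```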
S25, inquiring a product name corresponding to the product ID from an association mapping table in MySQL, and decrypting the Token character string through an account center interface to obtain the user ID;
wherein, all the requests sent to the server by the user through the product will carry the product ID in the header information of the request, and these information will be recorded in the Nginx log original text. The product ID of the log original text is extracted, and the product ID and the product name are in one-to-one correspondence, so that the product ID and the product name can be obtained by inquiring an association mapping table in MySQL (relational database management system), and similarly, the service ID and the service name in the table are also in one-to-one correspondence. And obtaining the product name corresponding to the product ID, storing the product name in a result table, and displaying the details of the current service accessed by each product according to the product name.
In the raw log, information such as the user's ID and password is not recorded in plain text, to protect users' personal information and for network security. The client encrypts the user's identity locally with a key provided by the server to obtain a Token character string, which the server uses for identity authentication, permission checks, and other operations. The Token string is renewed when it expires, so one user may correspond to multiple Token strings. That is why user-dimension information cannot be counted in the conventional ELK tooling.
Through step S25 the user ID is obtained, supporting subsequent per-user statistical analysis.
S26, counting each index variable according to the status code and response time to obtain intermediate process data comprising the service name, time information, interface information, product name, user ID, and the statistical result of each index variable;
after the dimension information in the Nginx log is specially processed, the Spark Streaming aggregates the data of the current batch by the dimension combination of service + time + interface + product + user by combining the information of request status code, request response time and the like existing in the Nginx, and calculates the indexes of total request number, 5xx request number, 4xx request number, slow request number more than 1 second, slow request number more than 3 seconds, slow request number more than 5 seconds, slow request number more than 10 seconds and the like. And writing the aggregated query data into a table of the Hive database.
Through this round of aggregation and compression, about 1 billion raw Nginx log entries per day become slightly more than 11 million aggregated records, roughly one percent of the original scale, laying the groundwork for fast interactive queries.
S27, grouping the intermediate process data according to the dimension variables and carrying out aggregation statistics on the index variables according to the groups to obtain query data;
and S28, storing the query data into a query database.
The aggregated data can be written into the Hive database; the table structure of the result table corresponding to the query data is shown in Table 1 below:
table 1: table structure information
The current dimension refers to the combination of all dimension variables in the table, in this example "service + time + interface + product + user"; within the query data produced from one batch of data, each dimension combination is unique.
And S3, receiving the analysis query request, and acquiring and returning a query result corresponding to the analysis query request from the query database.
Because Hive queries are computed with MapReduce on Hadoop, the computation framework caches intermediate data on disk, and since disk reads and writes are slow, Hive queries cannot meet the requirement of interactive querying. Therefore Presto is adopted as the terminal's data query engine; its biggest difference from Hive queries is that all of Presto's computation happens in memory, so its query efficiency is far better than Hive's.
In addition, Presto supports the JDBC protocol, so it can connect to Hive through JDBC. Its query syntax differs from Hive's but is SQL-like, so Presto can be used to query data held in the Hive database.
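A minimal query sketch using the presto-python-client (one possible DB-API client; the host, table, and column names follow the earlier sketches and are assumptions):

```python
import prestodb  # presto-python-client; the client choice is an assumption

conn = prestodb.dbapi.connect(host="presto-coordinator", port=8080,
                              user="analyst", catalog="hive", schema="logs")
cur = conn.cursor()
cur.execute("""
    SELECT interface,
           SUM(request_count) AS total,
           SUM(cnt_5xx)       AS errors_5xx,
           SUM(slow_1s)       AS slow
    FROM nginx_agg
    WHERE dt = '2021-04-27' AND service = 'example-service'
    GROUP BY interface
    ORDER BY total DESC
""")
for row in cur.fetchall():
    print(row)
```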
The following shows the effect of querying the current service's access details along different dimensions after the front-end interface is connected to Presto queries; fig. 3 shows the access-volume details of different interfaces of a service within a period of time. Developers can quickly and conveniently learn the access volume, 5xx error count, slow request count, and request response delay of each interface under their own service, making it easy to discover online quality problems, shorten fault handling time, and reduce the enterprise's economic loss. By switching the dimension in the dimension drop-down box, a user can also display request details from the product or user perspective; such free and flexible queries cannot be made with the ELK tooling. This makes it convenient for development and operations staff to understand the running state of the current service and its access details in different dimensions, saving time and labor cost. Meanwhile, because the data has been aggregated and compressed, queries through the Presto engine achieve second-level response, with results generally returned within one second. This is far faster than queries over the full logs on ELK, which often take ten or more seconds, or even tens of seconds.
Queries are not limited to a single dimension; multiple dimensions can be combined for multi-level queries, similar to OLAP (online analytical processing) over a relational database, as shown in fig. 4.
The distribution of source products under a certain interface of the current service is displayed, so a developer can see clearly which product accesses the interface most frequently. Similarly, one can switch to the user dimension to see which users access the current interface most often.
In addition, the terminal's dimensions and indexes are extensible. Because Hive tables support adding new columns, adding new statistical dimensions or indexes only requires adding the corresponding columns in the Hive table and adding the computation code for the new dimension or index in the Spark Streaming program. The Nginx log result table stored in the data warehouse can also be joined with other external data in the warehouse for multi-table queries.
By contrast, in the prior art, multiple users querying at the same time exhaust the memory resources of the Elasticsearch backend and can make the service unavailable. This embodiment offloads query pressure from the original ELK cluster; if the ELK cluster is used only for querying log details, it does not need excessive computing resources, reducing the enterprise's hardware cost.
To facilitate understanding of the invention, the steps of this embodiment are illustrated concretely as follows:
Assume the original text data in the Nginx log is as in Table 2 below (only the key fields relevant to this embodiment are listed):
TABLE 2 textual data of the current batch
Converting the original text data according to steps S21 to S27 yields the intermediate process data in Table 3 below (only some of the index variables are listed):
TABLE 3 intermediate Process data
The requests above all fall within the same minute.
The time is a 10-digit timestamp in seconds.
The converted intermediate process data is grouped by the dimension variables (GROUP BY, with every dimension participating), the index variables are aggregated per group (summed, or taken as maxima/minima), and the resulting query data is written into the HIVE table, as shown in Table 4 below:
TABLE 4 Query data
Through these steps, the 5 records of the current batch are compressed and aggregated into 3 records in the HIVE result table. The net effect is that around 1 billion raw log entries per day become slightly more than 11 million aggregated records. This aggregated query data is sufficient to serve the minute-granularity, multi-dimensional queries of daily operation and maintenance statistical analysis; there is no need to query the raw log text in ELK.
From this data, the average response time of an interface of a service over a period can be obtained as the sum of the interface's response times divided by the sum of its request counts over that period; if the maximum response time of the interface is required, only the max over the same rows is needed.
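Reusing the Presto cursor from the earlier sketch (names remain assumptions), the calculation might look as follows; because each row is pre-aggregated, the average is weighted by the request count rather than averaged over rows:

```python
cur.execute("""
    SELECT SUM(sum_resp_time) / SUM(request_count) AS avg_resp_time,
           MAX(max_resp_time)                      AS max_resp_time
    FROM nginx_agg
    WHERE dt = '2021-04-27'
      AND service = 'example-service'
      AND interface = 'GET /{version}/id/{user_id}'
""")
print(cur.fetchone())
```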
Referring to fig. 1 to 4, a second embodiment of the present invention is:
on the basis of the first embodiment, the step S2 specifically includes the following steps:
aggregating and compressing the original text data meeting preset parameters in the Nginx log to obtain aggregated query data;
and storing the aggregated query data in a Hive database, wherein the partition field corresponding to each partition in each Hive table in the Hive database comprises a date and an hour.
The step S2 of storing the aggregated query data in the Hive database specifically includes the following steps:
when a new hour begins, a merging program reads all the original data stored in the just-closed hour partition, merges it into a new data file, and writes the new data file back into that hour partition, overwriting the original contents.
That is, in this embodiment two optimizations are applied to the Hive storage to improve query efficiency.
(1) Hive table partition setting
The two partition fields date and hour are used to improve query efficiency. Hive, as a member of the Hadoop ecosystem, relies on the distributed file system HDFS for its underlying storage. Each Hive table corresponds to a directory in HDFS, and each partition of the table corresponds to a subdirectory under it. Using partition variables as filter conditions in Hive's SQL-like query statements limits the query to reading and computing over the required partitions only, avoiding full-table reads and improving efficiency. In this terminal, when a user queries only the access details of a service within a past hour, only that hour's partition needs to be read, which is more efficient than partitioning by date alone, reading the whole day's data, and filtering out the hour in question for the same calculation.
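A minimal sketch of such a partition layout (table name, columns, and storage format are assumptions):

```python
# Hour-partitioned result table; a query filtered on dt/hr reads only
# the matching partition directories in HDFS.
spark.sql("""
    CREATE TABLE IF NOT EXISTS nginx_agg (
        service STRING, minute BIGINT, interface STRING,
        product STRING, user_id STRING,
        request_count BIGINT, cnt_5xx BIGINT, cnt_4xx BIGINT,
        slow_1s BIGINT, sum_resp_time DOUBLE, max_resp_time DOUBLE
    )
    PARTITIONED BY (dt STRING, hr STRING)
    STORED AS PARQUET
""")

one_hour = spark.sql(
    "SELECT * FROM nginx_agg WHERE dt = '2021-04-27' AND hr = '10'")
```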
(2) Hive doclet merging
However, finer partition granularity is not always better. Minutes are not used as a partition field, because such fine partitioning would make the Hive table produce far too many small data files in the underlying HDFS. With minute partitions, at least one file per partition means at least 24 x 60 = 1440 small files per day. Too many small files reduce Hive query efficiency: the excess metadata consumes the master node's memory during queries, and Hive starts a JVM process for each small file at query time, so initializing, launching, and executing that many tasks consumes substantial compute resources. Therefore the hour is taken as the smallest partition granularity.
Since Spark Streaming writes data into Hive every minute and each batch forms a small file in HDFS, another lightweight Spark Streaming program is started to merge the small files periodically, keeping their number under control. The specific strategy: on entering a new hour, the lightweight program reads out the data of the hour partition that has just closed, merges it into one large file, and writes that file back into the hour partition, overwriting the original data.
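A minimal sketch of the merge job (warehouse paths are assumptions):

```python
from datetime import datetime, timedelta

# Merge the just-closed hour partition into one file. Spark cannot
# overwrite a path it is reading from, so the merged output is written
# to a temporary location first and would then be moved into place.
prev = datetime.now() - timedelta(hours=1)
dt, hr = prev.strftime("%Y-%m-%d"), prev.strftime("%H")

src = f"/warehouse/nginx_agg/dt={dt}/hr={hr}"
tmp = f"/warehouse/tmp/nginx_agg_merge/dt={dt}/hr={hr}"

(spark.read.parquet(src)
      .coalesce(1)                  # consolidate the hour's small files
      .write.mode("overwrite")
      .parquet(tmp))
# An HDFS rename of tmp over src completes the overwrite.
```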
Referring to fig. 5 and 6, a third embodiment of the present invention is:
a terminal 1 for real-time processing and analysis of Nginx logs, comprising a memory 3, a processor 2 and a computer program stored on the memory 3 and executable on the processor 2, wherein the steps of the first or second embodiment are implemented when the computer program is executed by the processor 2.
As shown in fig. 6, if a terminal 1 for real-time processing and analyzing a Nginx log is represented as a functional module connection, the terminal is composed of a data acquisition module, a data real-time processing module, a data storage module and a data query module which are connected in sequence.
In summary, the method and terminal for real-time processing and analysis of Nginx logs provided by the invention group the Nginx logs by multiple dimension variables, greatly reducing the data volume of subsequent queries; the aggregated query data is stored in a query database partitioned by the two fields date and hour, with all the small files of an hour consolidated into one large file per hour partition; and Presto is used as the data query engine, greatly improving query speed and thus achieving fast queries. Meanwhile, thanks to the multiple dimension variables, the invention answers multi-dimensional queries quickly, saving time and labor cost; and based on the correspondence between index variables and dimension variables, when a service errors out or responds too slowly, the dimension variables responsible can be determined quickly, shortening fault handling time and reducing the enterprise's economic loss.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims (8)

1. A method for real-time processing analysis of a Nginx log, comprising the steps of:
s1, acquiring and storing Nginx logs in real time;
s2, aggregating and compressing the original text data meeting preset parameters in the Nginx log to obtain and store aggregated query data to a query database;
s3, receiving an analysis query request, and acquiring and returning a query result corresponding to the analysis query request from the query database;
the preset parameters comprise dimension variables and index variables, the dimension variables comprise service dimensions, time dimensions, interface dimensions, product dimensions and user dimensions, and the index variables comprise total request numbers, error request numbers and slow response request numbers;
the step S2 specifically includes the following steps:
aggregating and compressing original text data meeting the same preset parameter in the Nginx log into one piece, counting the number of the corresponding original text data, and obtaining and storing aggregated query data to a query database;
acquiring domain name information, an original timestamp, an interface field, a product ID, an encrypted Token character string, a state code and response time in original text data of the Nginx log, converting the domain name information into a service name through a mapping relation, converting the original timestamp into a time level of supported query analysis to obtain time information, replacing the interface field with interface information in a RestFul format, querying the product name corresponding to the product ID from an association mapping table in MySQL, decrypting the Token character string through an account center interface to obtain a user ID, and performing corresponding statistics on each index variable according to the state code and the response time to obtain intermediate process data comprising the service name, the time information, the interface information, the product name, the user ID and a statistical result of each index variable;
grouping the intermediate process data according to the dimension variables and carrying out aggregation statistics on the index variables according to the groups to obtain query data;
and storing the query data to a query database.
2. The method for real-time processing and analyzing of the Nginx log according to claim 1, wherein the step S1 specifically comprises the following steps:
the method comprises the steps that Nginx logs on each server are collected in a unified and real-time mode through a log collection tool and stored in a service cluster according to a preset format, and original text data of each Nginx log record information of one HTTP request from the outside.
3. The method according to claim 1, wherein the step S2 of replacing the interface field with the RestFul-format interface information specifically comprises the following steps:
judging whether the interface field is registered in an interface management tool, if so, acquiring interface registration information in a RestFul format from the interface management tool, replacing the interface registration information with a regular expression, matching the interface field through the regular expression, and if matching is successful, using the interface registration information as interface information;
and if the character string is not registered in the interface management tool, replacing the character string in the interface path with the corresponding placeholder according to a preset conversion rule.
4. The method according to claim 3, wherein the step S2 of replacing the character strings in the interface paths with corresponding placeholders according to the preset conversion rules specifically comprises the following steps:
and respectively replacing character strings in the interface path that are all digits, in UUID format, or Token strings longer than 64 characters, and trailing file-name suffixes, with the corresponding placeholders.
5. The method for real-time processing and analyzing of the Nginx log according to claim 1, wherein the step S2 specifically comprises the following steps:
aggregating and compressing the original text data which meet preset parameters in the Nginx log to obtain aggregated query data;
storing the aggregated query data in a Hive database, wherein the partition field corresponding to each partition in each Hive table in the Hive database comprises a date and an hour.
6. The method as claimed in claim 5, wherein the step S2 of storing the aggregated query data in a Hive database specifically includes the following steps:
and reading all original data stored in the previous hour partition and merging the original data into a new data file through a merging program every time a new hour is entered, and writing the new data file into the previous hour partition in an overlaying mode.
7. The method for real-time processing and analyzing of a Nginx log according to any one of claims 1 to 6, wherein the step S1 specifically comprises the following steps:
acquiring and storing Nginx logs into a Kafka cluster in real time by using a FileBeat log acquisition tool;
the step S2 specifically includes the following steps:
aggregating and compressing original text data which are stored in the Kafka cluster and meet preset parameters in the Nginx logs by using Spark Streaming to obtain and store aggregated query data to a Hive database;
the step S3 specifically includes the following steps:
and receiving an analysis query request, and acquiring and returning a query result corresponding to the analysis query request from the Hive database based on a Presto data query engine.
8. A terminal for real-time processing analysis of Nginx logs, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a method for real-time processing analysis of Nginx logs according to any one of claims 1 to 7 when executing the computer program.
CN202110559722.0A 2021-05-21 2021-05-21 Method and terminal for real-time processing and analysis of Nginx logs Active CN113312376B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110559722.0A CN113312376B (en) 2021-05-21 2021-05-21 Method and terminal for real-time processing and analysis of Nginx logs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110559722.0A CN113312376B (en) 2021-05-21 2021-05-21 Method and terminal for real-time processing and analysis of Nginx logs

Publications (2)

Publication Number Publication Date
CN113312376A CN113312376A (en) 2021-08-27
CN113312376B true CN113312376B (en) 2022-10-21

Family

ID=77374194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110559722.0A Active CN113312376B (en) 2021-05-21 2021-05-21 Method and terminal for real-time processing and analysis of Nginx logs

Country Status (1)

Country Link
CN (1) CN113312376B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218299A (en) * 2021-12-16 2022-03-22 新奥数能科技有限公司 Method, device, equipment and storage medium for monitoring interface response condition
CN114329253B (en) * 2022-01-05 2022-08-30 北京安博通科技股份有限公司 Network operation data query method, device, equipment and storage medium
CN115361444A (en) * 2022-08-18 2022-11-18 中国工商银行股份有限公司 Flow data analysis method, device and system
CN115691229A (en) * 2022-10-13 2023-02-03 中国民航科学技术研究院 Method for calculating flight segment flow
CN116340274A (en) * 2023-01-18 2023-06-27 天翼数字生活科技有限公司 Nginx log compression analysis method, device and readable storage medium
CN116647474A (en) * 2023-05-19 2023-08-25 浙江数新网络有限公司 Method and system for accurately finding interface performance problem by analyzing Nginx log

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038222A (en) * 2017-03-24 2017-08-11 福建天泉教育科技有限公司 Database caches implementation method and its system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509297A (en) * 2018-03-21 2018-09-07 四川斐讯信息技术有限公司 A kind of data back up method and system
KR102028342B1 (en) * 2019-02-22 2019-10-04 주식회사 우리은행 System and method for supporting real-time financial business provision and decision making using data process solution
CN112115114A (en) * 2020-09-25 2020-12-22 北京百度网讯科技有限公司 Method, apparatus, device and storage medium for log processing
CN112256648A (en) * 2020-10-26 2021-01-22 广州九尾信息科技有限公司 A method and device for collecting and tracking log behavior based on Nginx
CN112486789A (en) * 2020-11-30 2021-03-12 建信金融科技有限责任公司 Log analysis system, method and device

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038222A (en) * 2017-03-24 2017-08-11 福建天泉教育科技有限公司 Database caches implementation method and its system

Also Published As

Publication number Publication date
CN113312376A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113312376B (en) Method and terminal for real-time processing and analysis of Nginx logs
US11816126B2 (en) Large scale unstructured database systems
US11226963B2 (en) Method and system for executing queries on indexed views
CN112000636A (en) Statistical analysis method of user behavior based on Flink streaming processing
US20130191523A1 (en) Real-time analytics for large data sets
AU2017243870B2 (en) "Methods and systems for database optimisation"
CN113360554A (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN111949633A (en) ICT system operation log analysis method based on parallel stream processing
CN108228322B (en) Distributed link tracking and analyzing method, server and global scheduler
CN106055678A (en) Hadoop-based panoramic big data distributed storage method
CN115168389A (en) Request processing method and device
CN116362212A (en) Report generation method, device, equipment and storage medium
CN112434060B (en) Data query method and system
CN108363761A (en) Hadoop awr automatic loads analyze information bank, analysis method and storage medium
CN116821139B (en) Mixed load method and system for partition table design based on distributed database
CN118170760A (en) Efficient data storage and retrieval method for SaaS system
Jiadi et al. Research on Data Center Operation and Maintenance Management Based on Big Data
Singh NoSQL: A new horizon in big data
EP3436988B1 (en) "methods and systems for database optimisation"
de Waal Literature Study: Timeseries Databases
US12298980B1 (en) Optimized storage of metadata separate from time series data
US12271376B1 (en) Generating metadata from a scan of a data object in an object store for performing subsequent queries to the data object
Shi et al. IPS: unified profile management for ubiquitous online recommendations
CN118550972A (en) A micro-batch data collection and processing method, device and readable storage medium
CN119847992A (en) Data query method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant