Disclosure of Invention
The embodiment of the invention provides a method and a system for realizing asynchronous file generation and downloading, which can solve the problems in the prior art.
In a first aspect of the embodiment of the present invention, a method for implementing asynchronous file generation and download is provided, including:
receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier;
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing timeout threshold for each subtask fragment;
Distributing the sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed;
Merging all the sub-files into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the global unique task identifier as completed;
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
Receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier, wherein the method comprises the following steps of:
receiving a file downloading request of a user, extracting a data source type identifier, a data source number, time stamp information and a digital signature from the file downloading request to obtain a data source code and create a file downloading task;
based on the data source code, calculating the priority score of the file downloading task by combining the user grade and the task queue length, writing the file downloading task into a double-layer task processing structure according to the priority score, and starting an asynchronous thread to write the file downloading task with the priority score larger than a priority threshold into a database in batches when the number of the tasks of the double-layer task processing structure reaches the upper limit;
and generating a global unique task identifier of the file downloading task according to the database writing time stamp, the priority score and the data source code, and setting the cache failure time.
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and processing timeout threshold for each subtask fragment, wherein the method comprises the following steps:
Extracting file data in the file downloading task, calculating entropy value, data distribution density and time sequence correlation of the file data, and constructing a file feature vector;
Calculating the association degree of the data blocks according to the file feature vectors, constructing a data block dependency relationship graph, calculating the ratio of the in-degree value to the out-degree value of each data block node, and taking the data block nodes with the ratio lower than the dependency threshold as initial segmentation nodes;
Calculating the data transmission cost between the initial segmentation nodes, selecting the initial segmentation nodes with the data transmission cost smaller than a cost threshold as optimal segmentation nodes, dynamically segmenting the file downloading task based on the optimal segmentation nodes, and generating a plurality of subtask fragments;
and performing similarity matching on the file feature vector and the historical execution task feature, setting an independent processing timeout threshold value for each subtask segment based on the similarity matching result, and setting an independent resource quota in combination with the data complexity of the subtask segment.
Distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed, wherein the method comprises the following steps:
Obtaining the total resource amount, the network bandwidth capacity and the current load state of each processing node, screening out candidate processing nodes meeting the subtask segment resource quota requirement, calculating the task affinity score of each candidate processing node, constructing a candidate processing node priority sequence according to the order of the task affinity score from high to low, selecting a plurality of candidate processing nodes as target processing nodes, and distributing the plurality of subtask segments to the target processing nodes;
Starting a file generation process on a target processing node, initializing a file writing handle and a file check code, writing the data of a sub-task segment into a file block by block according to the data block size, calculating the check code of each data block and comparing it with the file check code, re-writing a data block when a check code mismatch is detected, and updating the file metadata information after generation of the sub-file is completed;
And monitoring the processing progress of the sub-task fragments on the target processing nodes, and when detecting that the file metadata information of a sub-task fragment remains unchanged while its processing time exceeds the processing timeout threshold of that fragment, selecting the next processing node from the candidate processing node priority sequence for re-allocation of the task.
Merging all the subfiles into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the globally unique task identifier as completed, wherein the method comprises the following steps:
acquiring file metadata information of all subfiles, and establishing a subfile merging sequence according to the numbering sequence of the file metadata information;
reading data blocks of the sub-files according to the sub-file merging sequence, writing the data blocks into a file merging buffer area for data merging, controlling the data merging rate through a token bucket algorithm, dynamically adjusting the capacity of the token bucket according to the write bandwidth of the system disk, suspending the data merging operation when the number of tokens in the token bucket is insufficient, and reducing the read rate of the upstream sub-files to balance the data processing pipeline;
Performing real-time verification on the merged data based on a sliding window mechanism: calculating cyclic redundancy check codes of the data blocks within the sliding window, comparing the calculated cyclic redundancy check codes with the original check codes of the sub-files, and, when a check code mismatch is detected, rolling back the sliding window and performing data merging again;
And reading the data blocks which pass verification from the merging buffer area, writing the data blocks into a target download file, carrying out integrity check on the target download file to obtain the target download file which passes verification, and marking the globally unique task identifier as a finished state.
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, acquiring and returning the target download file from the file server cluster, and comprising the following steps:
uploading the target download file to a file server cluster, acquiring storage position information of the target download file in the file server cluster, and generating a corresponding file address;
establishing a file state monitoring table, recording the file state, the file address and the file access times of the target download file, marking the target download file as a hot spot file when the file access times exceed an access threshold value, and triggering a file preloading mechanism;
And receiving a file list query request sent by a user, obtaining a file state list based on the file state monitoring table, preferentially displaying the hot spot files, acquiring the target download file from the file server cluster, and returning the target download file to the user.
In a second aspect of the embodiment of the present invention, an asynchronous file generation and download implementation system is provided, including:
the first unit is used for receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database and generating a global unique task identifier;
The second unit is used for analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing an independent resource quota and a processing timeout threshold for each subtask fragment;
The third unit is used for distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed;
And the fourth unit is used for uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
In a third aspect of an embodiment of the present invention,
There is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of an embodiment of the present invention,
There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The beneficial effects of the application are as follows:
By implementing the method for implementing asynchronous file generation and downloading, the file downloading task can be dynamically divided into a plurality of subtask fragments for parallel processing, the processing efficiency of large file generation and downloading is effectively improved, the waiting time of a user is shortened, and the response speed of a system is improved.
The invention adopts an independent resource quota and timeout threshold control mechanism and combines a subtask failure automatic redistribution strategy, thereby remarkably enhancing the stability and fault tolerance of the system under a high concurrency scene, effectively avoiding the problem of integral download failure caused by single-point failure and ensuring the reliability of the file generation process.
In addition, the invention realizes the functions that users can inquire the file processing progress and acquire results at any time through the file server cluster storage and the file state monitoring table management, obviously improves the user experience, simultaneously is convenient for a system administrator to monitor and manage the file processing process, and improves the maintainability of the system.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flow chart of an implementation method for asynchronous file generation and downloading according to an embodiment of the present invention, as shown in fig. 1, the method includes:
receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier;
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing timeout threshold for each subtask fragment;
Distributing the sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed;
Merging all the sub-files into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the global unique task identifier as completed;
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
Fig. 2 is a flow chart of a method for processing a file download request. In an alternative embodiment, receiving a file download request from a user, extracting a data source code and creating a file download task, writing the file download task into a database and generating a globally unique task identifier, including:
receiving a file downloading request of a user, extracting a data source type identifier, a data source number, time stamp information and a digital signature from the file downloading request to obtain a data source code and create a file downloading task;
based on the data source code, calculating the priority score of the file downloading task by combining the user grade and the task queue length, writing the file downloading task into a double-layer task processing structure according to the priority score, and starting an asynchronous thread to write the file downloading task with the priority score larger than a priority threshold into a database in batches when the number of the tasks of the double-layer task processing structure reaches the upper limit;
and generating a global unique task identifier of the file downloading task according to the database writing time stamp, the priority score and the data source code, and setting the cache failure time.
In this embodiment, a method of receiving a user file download request, extracting a data source code, and creating a file download task is described in detail. By creating an efficient task processing mechanism, the method ensures reasonable allocation of system resources and the priority of task processing.
In this embodiment, a file download request sent by a user needs to be received. The request is usually transmitted over the network via the HTTP or HTTPS protocol and contains several pieces of key information. A data source type identifier is extracted from the request body; this identifier is usually a character string, such as "DB" for a database type source, "API" for an interface type source, and "FILE" for a file system type source. At the same time, a data source number is extracted; this number is usually a unique character string, in a format such as "SRC20240501001", used to precisely locate a specific data source. Timestamp information in the request is recorded in UTC format, such as "2024-05-01T12:30:45Z", and is used to prevent replay attacks. The request also contains a digital signature, an encrypted character string generated by using the SHA256+RSA algorithm, such as "a1b2c3d4e5f6…".
After the file downloading task is created, the priority score of the task is calculated so as to reasonably allocate system resources. Priority calculation considers three main factors: the data source code, the user grade, and the current task queue length. For the data source code, basic scores are assigned according to type, such as 10 points for a database type source, 8 points for an API type source, and 6 points for a file system type source. User grades are generally divided into VIP users, common users, and trial users, with weights of 1.5, 1.0, and 0.8 respectively. When the task queue length is less than 100, the score is not adjusted; when the length is between 100 and 500, the priority is reduced by 10%; and when the length is more than 500, the priority is reduced by 20%. The final priority score is calculated by combining these factors. For example, if a VIP user requests to download from a database type source and the current queue length is 80, the priority score is 10 × 1.5 × 1.0 = 15 points.
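The priority calculation above can be sketched as follows. This is an illustrative sketch only: the base scores, weights, and queue-length adjustments follow the figures given in the text, while the function and table names are assumptions.

```python
# Illustrative sketch of the priority-score calculation: base score by
# source type, weighted by user grade, adjusted by queue length.
SOURCE_BASE_SCORE = {"DB": 10, "API": 8, "FILE": 6}
USER_WEIGHT = {"vip": 1.5, "normal": 1.0, "trial": 0.8}

def queue_factor(queue_length: int) -> float:
    """No adjustment under 100 tasks, -10% up to 500, -20% beyond."""
    if queue_length < 100:
        return 1.0
    if queue_length <= 500:
        return 0.9
    return 0.8

def priority_score(source_type: str, user_level: str, queue_length: int) -> float:
    return (SOURCE_BASE_SCORE[source_type]
            * USER_WEIGHT[user_level]
            * queue_factor(queue_length))
```

For the worked example in the text, `priority_score("DB", "vip", 80)` yields 10 × 1.5 × 1.0 = 15 points.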
After the priority score is calculated, the file downloading task is written into a double-layer task processing structure. This structure comprises a high-priority task queue and a low-priority task queue in memory: when the priority score exceeds 12 points, the task enters the high-priority queue, and otherwise it enters the low-priority queue. Each queue is internally organized as a priority heap, ensuring that higher-priority tasks are processed first. The task count of the double-layer structure is continuously monitored; when the total task count reaches a preset upper limit (usually 1000), an asynchronous thread is started. This thread screens out tasks with priority scores greater than a priority threshold (usually 10 points), groups them into batches (at most 100 tasks per batch), and calls the database batch-write interface to persist them. The database write adopts a transaction mechanism to guarantee the atomicity of batch writing, i.e. all succeed or all fail.
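A minimal single-threaded sketch of the double-layer structure described above, with the thresholds (12, 10, 1000, 100) taken from the text; the class name, the `db_writer` callback, and the synchronous flush (in place of the asynchronous thread) are assumptions for illustration.

```python
import heapq

class DualLayerTaskQueue:
    HIGH_THRESHOLD = 12    # score above this enters the high-priority heap
    FLUSH_THRESHOLD = 10   # only tasks scoring above this are persisted
    CAPACITY = 1000        # total-count upper limit that triggers a flush
    BATCH_SIZE = 100       # at most 100 tasks per database batch

    def __init__(self, db_writer):
        self.high, self.low = [], []   # heaps of (-score, task)
        self.db_writer = db_writer     # callable persisting one batch atomically

    def put(self, score, task):
        heap = self.high if score > self.HIGH_THRESHOLD else self.low
        heapq.heappush(heap, (-score, task))
        if len(self.high) + len(self.low) >= self.CAPACITY:
            self.flush()

    def flush(self):
        # Screen out tasks above the flush threshold; keep the rest queued.
        batch = []
        for heap in (self.high, self.low):
            kept = []
            while heap:
                neg_score, task = heapq.heappop(heap)
                if -neg_score > self.FLUSH_THRESHOLD:
                    batch.append(task)
                else:
                    kept.append((neg_score, task))
            heap[:] = kept
            heapq.heapify(heap)
        for i in range(0, len(batch), self.BATCH_SIZE):
            self.db_writer(batch[i:i + self.BATCH_SIZE])
```

In a real deployment the flush would run on an asynchronous thread and the `db_writer` would wrap the batch in a database transaction, as the text describes.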
After the database write succeeds, a globally unique Task identifier (Task ID) needs to be generated for each task. The generation process comprehensively considers three elements: the database write timestamp, the task priority score, and the data source code. In a specific implementation, a timestamp accurate to milliseconds is obtained, such as 1588763142358; the priority score is converted into a two-digit fixed-length character string, e.g. priority 15.5 is converted into "16"; and the last six characters are extracted from the data source code (e.g. "501001" from "DB_SRC20240501001"). The three parts are joined with a specific separator, and a 4-character random alphanumeric string is appended as a suffix to form the final Task identifier. This combination guarantees the global uniqueness of the identifier while embedding key information about the task.
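The identifier construction above can be sketched as below. The text leaves the separator unspecified, so the "-" separator and the function name are assumptions; the rounding of the priority to two digits follows the 15.5 → "16" example.

```python
import random
import string

def make_task_id(write_ts_ms: int, priority: float, source_code: str) -> str:
    """Millisecond timestamp + two-digit priority + last six source chars
    + 4-character random alphanumeric suffix, joined by an assumed '-'."""
    prio_part = f"{round(priority):02d}"   # e.g. 15.5 -> "16"
    source_part = source_code[-6:]          # e.g. "501001"
    suffix = "".join(random.choices(string.ascii_uppercase + string.digits, k=4))
    return "-".join([str(write_ts_ms), prio_part, source_part, suffix])
```

The timestamp makes identifiers ordered in time, while the random suffix guards against collisions among tasks written in the same millisecond.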
After generating the task identifier, the task information is written into a cache system to improve subsequent query efficiency. The cache record contains key information such as the task identifier, data source code, priority score, and creation time. Different cache expiry times are set according to task priority: a high-priority task (score > 15) is given a cache period of 30 minutes, a medium-priority task (score 10-15) a cache period of 20 minutes, and a low-priority task (score < 10) a cache period of 10 minutes. After the cache expires, the task information is reloaded from the database to ensure data consistency.
And returning a response containing the global unique task identifier to the user, wherein the user can inquire the task state through the identifier, the response body adopts a JSON format and contains information such as the task identifier, the estimated processing time, the task state and the like, and meanwhile, a detailed operation log containing information such as the user ID, the operation time, the request IP, the task identifier and the like is recorded so as to facilitate subsequent audit and problem investigation.
In an optional implementation manner, the file downloading task is analyzed based on a preset file feature recognition rule, the file downloading task is dynamically divided into a plurality of subtask fragments according to an analysis result, and an independent resource quota and a processing timeout threshold are allocated to each subtask fragment, including:
Extracting file data in the file downloading task, calculating entropy value, data distribution density and time sequence correlation of the file data, and constructing a file feature vector;
Calculating the association degree of the data blocks according to the file feature vectors, constructing a data block dependency relationship graph, calculating the ratio of the in-degree value to the out-degree value of each data block node, and taking the data block nodes with the ratio lower than the dependency threshold as initial segmentation nodes;
Calculating the data transmission cost between the initial segmentation nodes, selecting the initial segmentation nodes with the data transmission cost smaller than a cost threshold as optimal segmentation nodes, dynamically segmenting the file downloading task based on the optimal segmentation nodes, and generating a plurality of subtask fragments;
and performing similarity matching on the file feature vector and the historical execution task feature, setting an independent processing timeout threshold value for each subtask segment based on the similarity matching result, and setting an independent resource quota in combination with the data complexity of the subtask segment.
In practice, a file download task is received, which contains basic information about the file, such as its URL and file size. A file data sample is extracted from the download task, typically the first 1MB of the file or a uniformly sampled piece of data. For the extracted file data, an entropy value reflecting the randomness and information density of the data is computed by counting the frequency of occurrence of each byte; a high entropy value (e.g. > 7.5) typically indicates that the data is compressed or encrypted. Meanwhile, the data distribution density, i.e. how the data values are distributed over their range, is calculated by constructing a histogram: the byte value range 0-255 is divided into 16 intervals and the number of data points in each interval is counted. The time-series correlation of the data, i.e. the degree of correlation between adjacent data blocks, is quantified by calculating the correlation coefficient of adjacent data blocks; a coefficient close to 1 indicates high correlation, while a coefficient close to 0 indicates that the blocks are almost unrelated. From these calculations, a file feature vector comprising the entropy value, the data distribution density features, and the time-series correlation index is constructed.
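The three feature computations above can be sketched as follows. The byte entropy, the 16-bin histogram, and the Pearson correlation of adjacent blocks follow the description in the text; function names, the 1KB block size, and the flat vector layout are assumptions.

```python
import math

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from byte frequencies."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def distribution_density(data: bytes, bins: int = 16) -> list:
    """Normalised histogram over 16 intervals of the 0-255 byte range."""
    hist = [0] * bins
    width = 256 // bins
    for b in data:
        hist[b // width] += 1
    return [h / len(data) for h in hist]

def adjacent_correlation(block_a: bytes, block_b: bytes) -> float:
    """Pearson correlation of byte values between two adjacent blocks."""
    n = min(len(block_a), len(block_b))
    xs, ys = block_a[:n], block_b[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def file_feature_vector(sample: bytes, block_size: int = 1024) -> list:
    """[entropy] + 16 density bins + [mean adjacent-block correlation]."""
    blocks = [sample[i:i + block_size] for i in range(0, len(sample), block_size)]
    corr = (sum(adjacent_correlation(a, b) for a, b in zip(blocks, blocks[1:]))
            / max(len(blocks) - 1, 1))
    return [byte_entropy(sample)] + distribution_density(sample) + [corr]
```

Uniform random data approaches the maximum entropy of 8 bits per byte, matching the "> 7.5 indicates compressed or encrypted" heuristic in the text.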
Based on the constructed file feature vector, the association degree between the data blocks in the file is calculated. The file is divided into basic data blocks of equal size, such as 64KB each, and the association degree between blocks is calculated based on the time-series correlation index in the feature vector. For a text file, the association degree of adjacent blocks is higher (e.g. 0.8); for different content areas in a multimedia file, the association degree is lower (e.g. 0.3). A data block dependency graph is then constructed from these association values: nodes in the graph represent data blocks, edges represent association relations between data blocks, and edge weights represent association strength.
After the dependency graph is constructed, the ratio of the in-degree value to the out-degree value of each data block node is calculated, where the in-degree is the number of edges pointing to the node and the out-degree is the number of edges starting from it. Nodes with a ratio lower than the preset dependency threshold (e.g. 0.5) are marked as initial segmentation nodes; these are usually positions where the data content changes significantly and are therefore suitable task segmentation points. For example, a 100MB video file may have two low-ratio nodes at 33MB and 67MB, with ratios of 0.3 and 0.4 respectively, corresponding to scene transitions or content type changes.
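The split-node selection above can be sketched as follows. The in-degree/out-degree ratio test and the 0.5 default threshold follow the text; the edge-list graph representation, the function name, and the decision to skip nodes with no outgoing edges (where the ratio is undefined) are assumptions.

```python
def initial_split_nodes(edges, num_blocks, dep_threshold=0.5):
    """edges: iterable of (src, dst) dependency edges between block indices.
    Returns the block indices whose in-degree/out-degree ratio is below
    the dependency threshold."""
    in_deg = [0] * num_blocks
    out_deg = [0] * num_blocks
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    split = []
    for node in range(num_blocks):
        if out_deg[node] == 0:
            continue  # ratio undefined without outgoing edges; skip
        if in_deg[node] / out_deg[node] < dep_threshold:
            split.append(node)
    return split
```

A low ratio marks a block that many later blocks depend on but which depends on little before it, i.e. a natural boundary in the content.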
The data transmission cost between the initial segmentation nodes, i.e. the resource consumption required to transmit data from one node to another, is then calculated; the transmission cost can be computed comprehensively from the data volume between the nodes, network conditions, and data characteristics. For example, for two initial segmentation nodes 20MB apart, if the entropy of the intermediate data is high, indicating high data complexity, the transmission cost may reach a high value such as 85, while for a low-entropy data region, even if the nodes are 30MB apart, the transmission cost may be only 50. Initial segmentation nodes whose data transmission cost is smaller than the preset cost threshold are selected as optimal segmentation nodes.
Based on the determined optimal segmentation node, the file downloading task is dynamically segmented into a plurality of subtask fragments, each subtask fragment corresponds to a continuous data area of the original file, for example, a 100MB file is divided into three subtasks, namely 0-33MB,33-67MB and 67-100MB, and the data internal structure is considered in the segmentation mode, so that the data relevance in each subtask is strong, and the dependence among the subtasks is weak.
The constructed file feature vector is matched against the feature vectors of historically executed tasks for similarity; the similarity can be calculated with methods such as cosine similarity or Euclidean distance, where a higher value indicates more similar task characteristics. For example, a similarity of 0.92 between the current file feature vector and a certain ZIP file in the history indicates that the two files have similar processing characteristics. An independent processing timeout threshold is then set for each subtask segment based on the similarity matching result. For subtasks highly similar to a historical task (e.g. similarity > 0.9), the average processing time of the historical task plus a 20% tolerance is used directly as the timeout threshold; for moderate similarity (0.6-0.9), 150% of the historical time can be used; and for low similarity (< 0.6), a looser default threshold is set, such as the subtask size divided by the lowest expected download rate.
Meanwhile, independent resource quota is set according to the data complexity of the subtask fragments, the data complexity can be measured through entropy values and distribution density in feature vectors, subtasks with high complexity (such as entropy value > 7.0) need more processing resources and allocate 50% more memory and CPU time, subtasks with low complexity (such as entropy value < 5.0) can use standard quota, for example, for a download task comprising three subtasks, the first subtask is allocated with 2 processing threads and 100MB memory, the timeout threshold is 30 seconds, the second subtask is allocated with 4 threads and 200MB memory, the timeout threshold is 60 seconds, the third subtask is allocated with 6 threads and 400MB memory, and the timeout threshold is 120 seconds.
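The complexity-based quota assignment can likewise be sketched. The 7.0 entropy cut-off and the 50% uplift for memory and CPU time follow the text; the standard-quota figures (2 threads, 100MB) are taken from the first-subtask example, and the dict layout is an assumption.

```python
STANDARD_QUOTA = {"threads": 2, "memory_mb": 100, "cpu_time_factor": 1.0}

def resource_quota(entropy: float) -> dict:
    """High-complexity segments (entropy > 7.0, e.g. compressed or
    encrypted regions) get 50% more memory and CPU time."""
    quota = dict(STANDARD_QUOTA)
    if entropy > 7.0:
        quota["memory_mb"] = int(quota["memory_mb"] * 1.5)
        quota["cpu_time_factor"] *= 1.5
    return quota
```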
Through the dynamic fragmentation and resource allocation mechanism, the downloading process can be optimized according to the file characteristics, the downloading success rate and efficiency are improved, and meanwhile, the system resources are reasonably utilized.
In an alternative embodiment, the plurality of subtask fragments are distributed to a plurality of target processing nodes, a parallel calling file generation method is adopted to generate sub-files, and when a sub-file generation failure is detected, the corresponding sub-file generation task is redistributed to another file processing node, including:
Obtaining the total resource amount, the network bandwidth capacity and the current load state of each processing node, screening out candidate processing nodes meeting the subtask segment resource quota requirement, calculating the task affinity score of each candidate processing node, constructing a candidate processing node priority sequence according to the order of the task affinity score from high to low, selecting a plurality of candidate processing nodes as target processing nodes, and distributing the plurality of subtask segments to the target processing nodes;
Starting a file generation process on a target processing node, initializing a file writing handle and a file check code, writing data of a sub-task segment into a file according to the size of a data block in a segmented mode, calculating the check code of each data block, comparing the check code with the file check code, and writing the data block again when the fact that the check codes of the data blocks are not matched is detected, and updating file metadata information after the generation of the sub-file is completed;
And monitoring the processing progress of the subtask fragments on the target processing nodes, and when the fact that the file metadata information of the subtask fragments is unchanged and the processing time exceeds the processing timeout threshold value of each subtask fragment is detected, selecting the next processing node from the candidate processing node priority sequence again to perform task allocation.
In this embodiment, resource status information of each processing node is obtained, including the number of CPU cores, memory capacity, storage space, and network bandwidth capacity, as well as load indexes such as the current CPU utilization, memory occupancy, and network traffic. Taking a server cluster of a data center as an example, processing node A has a 16-core CPU, 64GB of memory, and 10Gbps of network bandwidth, with a current CPU utilization of 30%, memory occupancy of 40%, and network bandwidth utilization of 25%; processing node B has a 32-core CPU, 128GB of memory, and 40Gbps of network bandwidth, with a current CPU utilization of 60%, memory occupancy of 70%, and network bandwidth utilization of 45%.
And according to the resource requirement of each subtask segment, for example, the subtask segment 1 needs a 4-core CPU, an 8GB memory and a 1Gbps network bandwidth, the subtask segment is matched with the available resources of each processing node, and candidate processing nodes meeting the resource requirement are screened out. For subtask segment 1, processing nodes A and B both meet their resource requirements, becoming candidate processing nodes.
Calculating task affinity scores of the candidate processing nodes, wherein the task affinity scores are based on a plurality of factors including the resource matching degree of a processing node, its historical task completion rate, and its average processing speed: the resource matching degree considers how well a processing node's available resources fit the demands of the subtask, the historical task completion rate reflects the stability of the processing node, and the average processing speed reflects its efficiency. For processing node A, the resource matching degree is 85% (sufficient resources but not excessive), the historical task completion rate is 98%, and the average processing speed is 120 MB/s, yielding a comprehensive affinity score of 0.89; for processing node B, the resource matching degree is 65% (excessive resources), the historical task completion rate is 95%, and the average processing speed is 180 MB/s, yielding a comprehensive affinity score of 0.81.
And constructing candidate processing node priority sequences according to the task affinity scores, wherein the priority sequences in the example are processing node A and processing node B. And selecting the processing node A from the priority sequence as a target processing node, distributing the subtask fragment 1 to the processing node A for execution, and distributing other subtask fragments by adopting a similar method.
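The affinity scoring and priority ordering above can be sketched as a weighted combination of the three factors. The weights below are hypothetical (the source does not give a formula); they are chosen only so that the resulting ordering matches the example, with node A ranked ahead of node B:

```python
def affinity_score(resource_match, completion_rate, speed_mbps, max_speed_mbps,
                   w_match=0.6, w_complete=0.2, w_speed=0.2):
    """Weighted task-affinity score in [0, 1]. Weights are illustrative only."""
    speed_norm = speed_mbps / max_speed_mbps  # normalize speed to [0, 1]
    return (w_match * resource_match
            + w_complete * completion_rate
            + w_speed * speed_norm)

# Node A: 85% resource match, 98% completion rate, 120 MB/s
score_a = affinity_score(0.85, 0.98, 120, 180)
# Node B: 65% resource match (over-provisioned), 95% completion, 180 MB/s
score_b = affinity_score(0.65, 0.95, 180, 180)

# Candidate priority sequence: nodes sorted by descending affinity
priority = sorted([("A", score_a), ("B", score_b)], key=lambda t: -t[1])
```

With these weights the resource matching degree dominates, which is why the over-provisioned node B ranks below node A despite its higher raw speed.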
When the target processing node receives the subtask segment, it starts a file generation process, creates a file writing handle, allocates file storage space, and initializes a file check code, for example using the MD5 or SHA-256 algorithm. Taking subtask segment 1 as an example, it is divided into data blocks of 4 MB each, 500 data blocks in total.
A writing operation is executed for each data block: after the content of data block 1 is written into the file, the check code of the data block is calculated, for example using the CRC32 algorithm, yielding a check value of 0x1A2B3C4D, which is compared with the expected check value. If the check values match, the next data block is processed; if they do not match (for example, 0x1A2B3C5E is obtained), the data block is re-written until the check codes match or the maximum number of retries is reached.
During the writing process, the file metadata information is updated periodically (e.g., after every 10 data blocks written), including the file size, the number of processed data blocks, and the last modification time. After all data blocks are written, the check code of the whole file is calculated, the final file metadata is updated, and the generation state of the subfile is marked as "complete".
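A minimal sketch of the write-and-verify loop, using Python's `zlib.crc32` as the per-block check code. The 4 MB block size and retry limit follow the example above, while the read-back verification step and the function names are illustrative assumptions:

```python
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per data block (per the example)
MAX_RETRIES = 3               # illustrative retry cap

def write_block_verified(f, block, expected_crc):
    """Write one data block, read it back, and verify its CRC32.
    Re-writes the block on mismatch, up to MAX_RETRIES times."""
    for _ in range(MAX_RETRIES):
        pos = f.tell()
        f.write(block)
        f.flush()
        f.seek(pos)
        written = f.read(len(block))
        if zlib.crc32(written) == expected_crc:
            return True          # check codes match, block accepted
        f.seek(pos)              # mismatch: rewind and re-write
    return False

def generate_subfile(path, blocks):
    """Write all blocks of a subtask segment, verifying each one."""
    with open(path, "w+b") as f:
        for i, block in enumerate(blocks):
            if not write_block_verified(f, block, zlib.crc32(block)):
                raise IOError(f"block {i} failed verification")
    return True
```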
The processing progress of all subtask fragments is continuously monitored. For each executing subtask fragment, the file metadata information is checked at regular intervals (for example, every 5 seconds); if the metadata information of subtask fragment 1 (for example, the last modification time and the number of processed data blocks) shows no change over 3 consecutive checks, and the processing time exceeds the preset processing timeout threshold (for example, 1.5 times the expected processing time calculated from the data quantity, 150 seconds in this example), the subtask fragment is determined to have failed.
When the failure of subtask segment 1 is detected, the next processing node B is selected from the candidate processing node priority sequence, and subtask segment 1 is reassigned to processing node B for execution. After receiving the task, processing node B restarts the file generation flow, initializes the file writing handle and the check code, and writes and checks the file block by block until generation of the subfile is successfully completed.
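The stall detection and failover logic above can be sketched as follows; the metadata tuple shape, the `SegmentMonitor` class, and the `next_node` helper are illustrative names, not part of the source:

```python
import time

STALL_CHECKS = 3     # consecutive checks with unchanged metadata (per the example)
CHECK_INTERVAL = 5   # seconds between metadata checks (illustrative)

class SegmentMonitor:
    """Declares a subtask segment failed when its metadata is unchanged
    for STALL_CHECKS consecutive checks AND the timeout has elapsed."""
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.start = time.monotonic()
        self.last_meta = None
        self.unchanged = 0

    def check(self, metadata):
        """metadata: e.g. (blocks_processed, last_modified).
        Returns True if the segment should be reassigned."""
        if metadata == self.last_meta:
            self.unchanged += 1
        else:
            self.unchanged = 0
            self.last_meta = metadata
        elapsed = time.monotonic() - self.start
        return self.unchanged >= STALL_CHECKS and elapsed > self.timeout

def next_node(priority_sequence, failed_node):
    """Pick the next candidate node from the priority sequence on failure."""
    remaining = [n for n in priority_sequence if n != failed_node]
    return remaining[0] if remaining else None
```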
By the technical means, the method and the device realize intelligent distribution of the subtask fragments, reliable writing of file data and automatic redistribution of failed tasks, and improve the parallel efficiency and reliability of file generation.
In an alternative embodiment, merging all the subfiles into a target download file according to a preset merging policy, detecting the file integrity of the target download file, and marking the globally unique task identifier as completed, including:
acquiring file metadata information of all subfiles, and establishing a subfile merging sequence according to the numbering sequence of the file metadata information;
reading data blocks of the subfiles according to the subfile merging sequence, writing the data blocks into a file merging buffer area for data merging, controlling the data merging rate through a token bucket algorithm, dynamically adjusting the capacity of the token bucket according to the writing bandwidth of the system disk, and, when the number of tokens in the token bucket is insufficient, suspending the data merging operation and reducing the reading rate of the upstream subfiles to balance the data processing pipeline;
Real-time checking is carried out on the merged data based on a sliding window mechanism: cyclic redundancy check codes of the data blocks are calculated within the sliding window and compared with the original check codes of the subfiles, and when a check code mismatch is detected, the sliding window is rolled back and data merging is performed again;
And reading the data blocks which pass verification from the merging buffer area, writing the data blocks into a target download file, carrying out integrity check on the target download file to obtain the target download file which passes verification, and marking the globally unique task identifier as a finished state.
In a method for implementing efficient file merging and integrity verification, ordered merging and verification of subfiles is implemented through multiple steps. The method retrieves the file metadata information of all subfiles from the storage system, including key attributes such as file name, creation time, file size, and file block sequence number, and constructs a subfile merging sequence based on the sequence numbers in the file metadata information, typically sequence numbers contained in the subfile names, such as "file_part_001", "file_part_002", and so on. For example, for a download task containing 5 subfiles, a merging queue is established in order from "file_part_001" to "file_part_005", ensuring that merging operations are performed in the correct file block sequence.
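Deriving the merging sequence from the numeric suffix in subfile names such as "file_part_001" might look like this; the regular expression is an assumption about the naming scheme:

```python
import re

def merge_order(subfile_names):
    """Order subfiles by the numeric suffix in names like 'file_part_001'.
    A plain lexicographic sort would break past 'file_part_999'."""
    def part_no(name):
        m = re.search(r"_part_(\d+)", name)
        if m is None:
            raise ValueError(f"no part number in {name!r}")
        return int(m.group(1))
    return sorted(subfile_names, key=part_no)
```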
The merging process adopts a token bucket algorithm to control the data processing rate. A token bucket is initialized with an initial capacity of 50 MB and is dynamically adjusted according to the current disk writing capacity of the system; a certain amount of tokens is added to the bucket every 100 milliseconds, the amount depending on the disk writing bandwidth monitored in real time. For example, when the system disk writing bandwidth is detected to be 120 MB/s, about 12 MB of tokens are added every 100 milliseconds. In the data merging process, subfile data blocks are read sequentially according to the subfile merging sequence, with a default block size of 4 MB, and tokens of the corresponding size are consumed for each data block read. When the number of available tokens in the bucket falls below 4 MB, the data merging operation is suspended and the reading rate of the upstream subfiles is reduced through a back-pressure mechanism to balance the whole data processing pipeline. In a specific implementation, when tokens are insufficient the reading thread enters a short sleep state, usually 50 milliseconds, preventing excessively fast reading operations from driving memory occupation too high.
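A minimal token bucket sketch keyed in bytes, with the 50 MB capacity, 4 MB block size, and roughly 50 ms back-pressure nap taken from the example; tying the refill to a fixed bandwidth figure is a simplification of the dynamic adjustment the text describes:

```python
import time

class TokenBucket:
    """Byte-denominated token bucket; defaults follow the example
    (50 MB bucket, refill matching a 120 MB/s disk) and are illustrative."""
    def __init__(self, capacity_bytes=50 * 2**20, refill_bytes_per_sec=120 * 2**20):
        self.capacity = capacity_bytes
        self.rate = refill_bytes_per_sec
        self.tokens = capacity_bytes
        self.last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, nbytes):
        """Consume tokens for one data block; False means the merger
        should pause and apply back-pressure to the upstream readers."""
        self.refill()
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

BLOCK = 4 * 2**20  # 4 MB default merge block

def merge_step(bucket, read_block, write_block, sleep_s=0.05):
    """One pipeline step: wait for tokens, then move one block."""
    while not bucket.try_consume(BLOCK):
        time.sleep(sleep_s)  # back-pressure: reader naps ~50 ms
    write_block(read_block())
```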
Data merging quality control adopts a sliding window mechanism for real-time verification. A sliding window with a default size of 16 MB is created; after merged data is written into the merging buffer, the sliding window moves through the buffer and calculates the CRC32 cyclic redundancy check code of the data in the current window. For example, when 32 MB of data has been written into the buffer, the sliding window sequentially calculates the check codes of the intervals 0-16 MB, 4 MB-20 MB, 8 MB-24 MB, and so on, and compares them with the original check codes of the subfiles. If the check code of an interval such as [12 MB-28 MB] does not match, for example the calculated check code is 0xA1B2C3D4 while the original check code is 0xA1B2C3D5, the sliding window is rolled back to the position of the last successful check and the data blocks of the corresponding subfiles are read and merged again. To improve verification efficiency, a block calculation mode is adopted: the 16 MB window is divided into four 4 MB sub-blocks whose check codes are calculated in parallel, and the results are then combined.
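The sliding-window verification with rollback can be sketched as below; window and step sizes are parameters so the 16 MB / 4 MB figures from the text can be scaled down for illustration, and the `expected_crcs` mapping is an assumed interface:

```python
import zlib

def verify_windows(buffer, window, step, expected_crcs):
    """Slide a window over the merge buffer in `step` increments and
    compare each window's CRC32 with its expected value. Returns the
    start offset of the last successfully verified window (the rollback
    point), or -1 if the very first window fails.
    `expected_crcs` maps window start offset -> original CRC32."""
    last_ok = -1
    for start in range(0, len(buffer) - window + 1, step):
        crc = zlib.crc32(buffer[start:start + window])
        if crc != expected_crcs[start]:
            return last_ok  # mismatch: caller rolls back and re-merges
        last_ok = start
    return last_ok
```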
After the data passes verification, the verified data blocks are read from the merging buffer and written into the target download file. Writing adopts an asynchronous IO mode: an IO request queue is maintained with a default queue length of 8 and a data size of 8 MB per IO request; the IO requests are processed in a background thread and the data is written to disk. After all subfiles are merged, an integrity check is performed on the generated target download file. The integrity check combines block verification with full-file verification: the target file is divided into a plurality of blocks of 128 MB (the last block may be smaller than 128 MB), the SHA-256 hash value of each block is calculated and compared with the original hash value obtained during initialization of the download task, and if all blocks pass the check, the MD5 value of the whole file is calculated for secondary confirmation. After verification passes, the system marks the globally unique task identifier, such as "download_task_12345678", in the task management system as completed and updates related information such as the task completion time and file size.
If a mismatch is found during the file integrity verification, the information of the failed blocks is recorded and the corresponding subfiles are re-merged; if verification still fails after 3 retries, a detailed error report is generated, including information such as the failed block range and the difference between the expected and actual check values, to facilitate subsequent analysis and processing.
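The two-stage integrity check described above (per-block SHA-256, then a full-file MD5 confirmation) might be sketched as follows; the function names and the `(ok, failed_blocks)` return shape are illustrative:

```python
import hashlib

CHUNK = 128 * 2**20  # 128 MB blocks per the text; scaled down in tests

def block_hashes(path, chunk=CHUNK):
    """SHA-256 per fixed-size block (the last block may be shorter)."""
    hashes = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def verify_target(path, expected_block_hashes, expected_md5, chunk=CHUNK):
    """Stage 1: per-block SHA-256 against the hashes recorded at task
    initialization. Stage 2: full-file MD5 as secondary confirmation.
    Returns (ok, failed_block_indices) for the error report."""
    actual = block_hashes(path, chunk)
    failed = [i for i, (a, e) in enumerate(zip(actual, expected_block_hashes))
              if a != e]
    if len(actual) != len(expected_block_hashes) or failed:
        return False, failed
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            md5.update(block)
    return md5.hexdigest() == expected_md5, []
```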
In an alternative embodiment, uploading the target download file to a file server cluster, and establishing a file status monitoring table, responding to a file list query request of a user, acquiring and returning the target download file from the file server cluster, including:
uploading the target download file to a file server cluster, acquiring storage position information of the target download file in the file server cluster, and generating a corresponding file address;
establishing a file state monitoring table, recording the file state, the file address and the file access times of the target download file, marking the target download file as a hot spot file when the file access times exceed an access threshold value, and triggering a file preloading mechanism;
And receiving a file list query request sent by a user, obtaining a file state list based on the file state monitoring table, preferentially displaying the hot spot files, acquiring the target download file from the file server cluster, and returning the target download file to the user.
In this embodiment, a target download file uploaded by a user is received and transmitted to a preset file server cluster through a secure transmission protocol (such as HTTPS or SFTP). The file server cluster includes a plurality of distributed file storage servers, and distributed storage of the file is implemented through a load balancing policy. A slicing process is performed on the uploaded file; for example, a 2 GB video file is divided into 20 data blocks of 100 MB each, and the hash value of each data block is calculated for verification to ensure file integrity. After uploading is completed, the storage location information of the target download file is obtained from the file server cluster, including the server IP address, storage path, and file slice locations. Based on this information, a unique file access address in URI format is generated, such as "/cluster/file/repository/2023/07/15/file_id_12345.mp4", for quick locating and accessing of the file subsequently.
After uploading is completed, a file state monitoring table is established for recording and managing the state information of the target download file. The monitoring table is stored in a key-value data structure and includes the following fields: file ID, file name, file type, file size, upload time, file state (such as available, deleted, or damaged), file address, file access count, last access time, and file tags. The file state is checked periodically (e.g., every 5 minutes) to ensure file accessibility. When it is detected that the file access count exceeds a preset access threshold, the target download file is marked as a hot spot file; for example, when a certain video file is accessed more than 1000 times within 1 hour, the system updates its state field to "hot spot file" and adds a corresponding mark in the database.
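One possible shape for a monitoring-table record and the hot-spot marking step, using a plain dictionary as the key-value structure; the field names follow the list above, while the helper names and threshold default are assumptions:

```python
import time

ACCESS_THRESHOLD = 1000  # accesses before hot-spot marking (per the example)

def new_record(file_id, name, ftype, size, address):
    """One key-value record in the file state monitoring table."""
    return {
        "file_id": file_id, "file_name": name, "file_type": ftype,
        "file_size": size, "upload_time": time.time(),
        "status": "available", "file_address": address,
        "access_count": 0, "last_access": None, "tags": [],
    }

def record_access(table, file_id, threshold=ACCESS_THRESHOLD):
    """Update counters on each download; mark as hot spot past the
    threshold. Returns True exactly once, when preloading to edge
    cache nodes should be triggered."""
    rec = table[file_id]
    rec["access_count"] += 1
    rec["last_access"] = time.time()
    if rec["access_count"] > threshold and "hotspot" not in rec["tags"]:
        rec["tags"].append("hotspot")
        return True
    return False
```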
A file preloading mechanism is triggered for a target download file marked as a hot spot file, implemented by copying the hot spot file from its original storage location to cache servers, such as edge node servers distributed at different geographic locations. These edge node servers are configured with high-speed SSD storage and sufficient bandwidth resources to respond to user requests more quickly. A multi-copy strategy is applied to the hot spot file, storing identical file copies on a plurality of service nodes to improve the concurrency performance of file access. For oversized files, versions of different quality and format are generated in advance to suit user needs under different network conditions; for example, 1080p, 720p, and 480p versions are generated for a 4K-resolution video file, and a suitable version is automatically selected according to the user's network bandwidth.
When a file list query request sent by a user is received, related file information is retrieved from the file state monitoring table according to request parameters such as file type and upload time range, and a file state list is generated based on the records in the monitoring table. The list contains basic information such as file ID, file name, file type, file size, upload time, and file state. When the list is generated, files marked as hot spot files are displayed preferentially, arranged at the front of the list. The list is displayed in pages, with 20 records per page by default, and the user can adjust the number displayed per page through a request parameter.
For hot spot files, the file content is preferentially obtained from the nearest edge node server. A resumable download (breakpoint resume) function is supported: by recording the download progress, a user is allowed to continue downloading from the last interruption position after a download is interrupted, avoiding repeated transmission of the already downloaded portion of the file.
During file transmission, the download operation is recorded, the file access count field in the file state monitoring table is updated along with the last access time, the file access frequency is monitored in real time, and the hot spot file determination threshold is adjusted dynamically. For example, during peak network hours (e.g., 18:00-22:00 each day), the hot spot file determination threshold is raised to 2000 accesses per hour, and during off-peak hours (e.g., 2:00-6:00 a.m.) it is lowered to 500 accesses per hour to optimize system resource allocation.
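The time-of-day threshold adjustment from the example can be sketched as a simple lookup; the base value of 1000 outside the named windows is an assumption:

```python
def hotspot_threshold(hour, base=1000, peak=2000, off_peak=500):
    """Hot-spot determination threshold (accesses per hour) by time of
    day, per the example: raised during the evening peak (18:00-22:00),
    lowered in the early morning (2:00-6:00), otherwise a base value."""
    if 18 <= hour < 22:
        return peak
    if 2 <= hour < 6:
        return off_peak
    return base
```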
By the method, efficient uploading, state monitoring and intelligent distribution of the target download file are achieved, particularly preloading processing of the hot spot file is achieved, file access performance and user experience are remarkably improved, meanwhile load pressure of the server cluster is effectively balanced, and stability and reliability of the whole system are improved.
The asynchronous file generation and downloading realization system of the embodiment of the invention comprises:
the first unit is used for receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database and generating a global unique task identifier;
The second unit is used for analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing an independent resource quota and a processing timeout threshold for each subtask fragment;
The third unit is used for distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are failed to be generated;
And the fourth unit is used for uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
In a third aspect of an embodiment of the present invention, there is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.