Disclosure of Invention
The embodiment of the invention provides a method and a system for realizing asynchronous file generation and downloading, which can solve the problems in the prior art.
In a first aspect of the embodiment of the present invention, a method for implementing asynchronous file generation and download is provided, including:
receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier;
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing timeout threshold for each subtask fragment;
Distributing the sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed;
Merging all the sub-files into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the global unique task identifier as completed;
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
Receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier, wherein the method comprises the following steps of:
receiving a file downloading request of a user, extracting a data source type identifier, a data source number, time stamp information and a digital signature from the file downloading request to obtain a data source code and create a file downloading task;
based on the data source code, calculating the priority score of the file downloading task by combining the user grade and the task queue length, writing the file downloading task into a double-layer task processing structure according to the priority score, and starting an asynchronous thread to write the file downloading task with the priority score larger than a priority threshold into a database in batches when the number of the tasks of the double-layer task processing structure reaches the upper limit;
and generating a global unique task identifier of the file downloading task according to the database writing time stamp, the priority score and the data source code, and setting the cache failure time.
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and processing timeout threshold for each subtask fragment, wherein the method comprises the following steps:
Extracting file data in the file downloading task, calculating entropy value, data distribution density and time sequence correlation of the file data, and constructing a file feature vector;
Calculating the association degree of the data blocks according to the file feature vectors, constructing a data block dependency relationship graph, calculating the ratio of the in-degree value to the out-degree value of each data block node, and taking the data block nodes with the ratio lower than the dependency threshold as initial segmentation nodes;
Calculating the data transmission cost between the initial segmentation nodes, selecting the initial segmentation nodes with the data transmission cost smaller than a cost threshold as optimal segmentation nodes, dynamically segmenting the file downloading task based on the optimal segmentation nodes, and generating a plurality of subtask fragments;
and performing similarity matching on the file feature vector and the historical execution task feature, setting an independent processing timeout threshold value for each subtask segment based on the similarity matching result, and setting an independent resource quota in combination with the data complexity of the subtask segment.
Distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed, wherein the method comprises the following steps:
Obtaining the total resource amount, the network bandwidth capacity and the current load state of each processing node, screening out candidate processing nodes meeting the subtask segment resource quota requirement, calculating the task affinity score of each candidate processing node, constructing a candidate processing node priority sequence according to the order of the task affinity score from high to low, selecting a plurality of candidate processing nodes as target processing nodes, and distributing the plurality of subtask segments to the target processing nodes;
Starting a file generation process on a target processing node, initializing a file writing handle and a file check code, writing the data of a sub-task segment into a file block by block according to the data block size, calculating the check code of each data block and comparing it with the file check code, re-writing a data block when a check code mismatch is detected, and updating the file metadata information after generation of the sub-file is completed;
And monitoring the processing progress of the sub-task fragments on the target processing nodes, and when detecting that the file metadata information of a sub-task fragment remains unchanged while its processing time exceeds the processing timeout threshold of that fragment, selecting the next processing node from the candidate processing node priority sequence for re-allocation of the task.
Merging all the subfiles into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the globally unique task identifier as completed, wherein the method comprises the following steps:
acquiring file metadata information of all subfiles, and establishing a subfile merging sequence according to the numbering sequence of the file metadata information;
reading data blocks of the sub-files according to the sub-file merging sequence, writing the data blocks into a file merging buffer area for data merging, controlling the data merging rate through a token bucket algorithm, dynamically adjusting the capacity of the token bucket according to the write bandwidth of the system disk, suspending the data merging operation when the number of tokens in the token bucket is insufficient, and reducing the read rate of the upstream sub-files to balance the data processing pipeline;
Performing real-time verification on the merged data based on a sliding window mechanism: calculating cyclic redundancy check codes of the data blocks within the sliding window, comparing the calculated cyclic redundancy check codes with the original check codes of the sub-files, and, when a check code mismatch is detected, rolling back the sliding window and performing data merging again;
And reading the data blocks which pass verification from the merging buffer area, writing the data blocks into a target download file, carrying out integrity check on the target download file to obtain the target download file which passes verification, and marking the globally unique task identifier as a finished state.
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, acquiring and returning the target download file from the file server cluster, and comprising the following steps:
uploading the target download file to a file server cluster, acquiring storage position information of the target download file in the file server cluster, and generating a corresponding file address;
establishing a file state monitoring table, recording the file state, the file address and the file access times of the target download file, marking the target download file as a hot spot file when the file access times exceed an access threshold value, and triggering a file preloading mechanism;
And receiving a file list query request sent by a user, obtaining a file state list based on the file state monitoring table, preferentially displaying the hot spot files, acquiring the target download file from the file server cluster, and returning the target download file to the user.
In a second aspect of the embodiment of the present invention, an asynchronous file generation and download implementation system is provided, including:
the first unit is used for receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database and generating a global unique task identifier;
The second unit is used for analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing an independent resource quota and a processing timeout threshold for each subtask fragment;
The third unit is used for distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed;
And the fourth unit is used for uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
In a third aspect of an embodiment of the present invention,
There is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of an embodiment of the present invention,
There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The beneficial effects of the application are as follows:
By implementing the method for implementing asynchronous file generation and downloading, the file downloading task can be dynamically divided into a plurality of subtask fragments for parallel processing, the processing efficiency of large file generation and downloading is effectively improved, the waiting time of a user is shortened, and the response speed of a system is improved.
The invention adopts an independent resource quota and timeout threshold control mechanism and combines a subtask failure automatic redistribution strategy, thereby remarkably enhancing the stability and fault tolerance of the system under a high concurrency scene, effectively avoiding the problem of integral download failure caused by single-point failure and ensuring the reliability of the file generation process.
In addition, the invention realizes the functions that users can inquire the file processing progress and acquire results at any time through the file server cluster storage and the file state monitoring table management, obviously improves the user experience, simultaneously is convenient for a system administrator to monitor and manage the file processing process, and improves the maintainability of the system.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flow chart of an implementation method for asynchronous file generation and downloading according to an embodiment of the present invention, as shown in fig. 1, the method includes:
receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier;
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing timeout threshold for each subtask fragment;
Distributing the sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing a corresponding sub-file generation task to another file processing node when detecting that generation of a sub-file has failed;
Merging all the sub-files into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the global unique task identifier as completed;
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
Fig. 2 is a flow chart of a method for processing a file download request. In an alternative embodiment, receiving a file download request from a user, extracting a data source code and creating a file download task, writing the file download task into a database and generating a globally unique task identifier, including:
receiving a file downloading request of a user, extracting a data source type identifier, a data source number, time stamp information and a digital signature from the file downloading request to obtain a data source code and create a file downloading task;
based on the data source code, calculating the priority score of the file downloading task by combining the user grade and the task queue length, writing the file downloading task into a double-layer task processing structure according to the priority score, and starting an asynchronous thread to write the file downloading task with the priority score larger than a priority threshold into a database in batches when the number of the tasks of the double-layer task processing structure reaches the upper limit;
and generating a global unique task identifier of the file downloading task according to the database writing time stamp, the priority score and the data source code, and setting the cache failure time.
In this embodiment, a method of receiving a user file download request, extracting a data source code, and creating a file download task is described in detail. By creating an efficient task processing mechanism, the method ensures reasonable allocation of system resources and the priority of task processing.
In this embodiment, a file download request sent by a user needs to be received. The request is usually transmitted over the network via the HTTP or HTTPS protocol and contains several pieces of key information. A data source type identifier is extracted from the request body; this identifier is usually a character string, such as "DB" for a database type source, "API" for an interface type source, and "FILE" for a file system type source. At the same time, a data source number is extracted; this number is usually a unique character string, in a format such as "SRC20240501001", used to precisely locate a specific data source. Timestamp information in the request is recorded in UTC format, such as "2024-05-01T12:30:45Z", and is used to prevent replay attacks. The request also contains a digital signature, an encrypted character string generated by using the SHA256+RSA algorithm, such as "a1b2c3d4e5f6…".
After the file downloading task is created, the priority score of the task is calculated so as to reasonably allocate system resources. Priority calculation considers three main factors: the data source code, the user grade, and the current task queue length. For the data source code, basic scores are assigned according to type, such as 10 points for a database type source, 8 points for an API type source, and 6 points for a file system type source. User grades are generally divided into VIP users, common users, and trial users, with weights of 1.5, 1.0, and 0.8 respectively. When the task queue length is less than 100, the score is not adjusted; when the length is between 100 and 500, the priority is reduced by 10%; and when the length is more than 500, the priority is reduced by 20%. The final priority score is calculated by combining these factors. For example, if a VIP user requests to download from a database type source and the current queue length is 80, the priority score is 10 × 1.5 × 1.0 = 15 points.
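The priority calculation above can be sketched as follows. This is an illustrative sketch only: the base scores, weights, and queue-length adjustments follow the figures given in the text, while the function and table names are assumptions.

```python
# Illustrative sketch of the priority-score calculation: base score by
# source type, weighted by user grade, adjusted by queue length.
SOURCE_BASE_SCORE = {"DB": 10, "API": 8, "FILE": 6}
USER_WEIGHT = {"vip": 1.5, "normal": 1.0, "trial": 0.8}

def queue_factor(queue_length: int) -> float:
    """No adjustment under 100 tasks, -10% up to 500, -20% beyond."""
    if queue_length < 100:
        return 1.0
    if queue_length <= 500:
        return 0.9
    return 0.8

def priority_score(source_type: str, user_level: str, queue_length: int) -> float:
    return (SOURCE_BASE_SCORE[source_type]
            * USER_WEIGHT[user_level]
            * queue_factor(queue_length))
```

For the worked example in the text, `priority_score("DB", "vip", 80)` yields 10 × 1.5 × 1.0 = 15 points.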
After the priority score is calculated, the file downloading task is written into a double-layer task processing structure. This structure comprises a high-priority task queue and a low-priority task queue in memory: when the priority score exceeds 12 points, the task enters the high-priority queue, and otherwise it enters the low-priority queue. Each queue is internally organized as a priority heap, ensuring that higher-priority tasks are processed first. The task count of the double-layer structure is continuously monitored; when the total task count reaches a preset upper limit (usually 1000), an asynchronous thread is started. This thread screens out tasks with priority scores greater than a priority threshold (usually 10 points), groups them into batches (at most 100 tasks per batch), and calls the database batch-write interface to persist them. The database write adopts a transaction mechanism to guarantee the atomicity of batch writing, i.e. all succeed or all fail.
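A minimal single-threaded sketch of the double-layer structure described above, with the thresholds (12, 10, 1000, 100) taken from the text; the class name, the `db_writer` callback, and the synchronous flush (in place of the asynchronous thread) are assumptions for illustration.

```python
import heapq

class DualLayerTaskQueue:
    HIGH_THRESHOLD = 12    # score above this enters the high-priority heap
    FLUSH_THRESHOLD = 10   # only tasks scoring above this are persisted
    CAPACITY = 1000        # total-count upper limit that triggers a flush
    BATCH_SIZE = 100       # at most 100 tasks per database batch

    def __init__(self, db_writer):
        self.high, self.low = [], []   # heaps of (-score, task)
        self.db_writer = db_writer     # callable persisting one batch atomically

    def put(self, score, task):
        heap = self.high if score > self.HIGH_THRESHOLD else self.low
        heapq.heappush(heap, (-score, task))
        if len(self.high) + len(self.low) >= self.CAPACITY:
            self.flush()

    def flush(self):
        # Screen out tasks above the flush threshold; keep the rest queued.
        batch = []
        for heap in (self.high, self.low):
            kept = []
            while heap:
                neg_score, task = heapq.heappop(heap)
                if -neg_score > self.FLUSH_THRESHOLD:
                    batch.append(task)
                else:
                    kept.append((neg_score, task))
            heap[:] = kept
            heapq.heapify(heap)
        for i in range(0, len(batch), self.BATCH_SIZE):
            self.db_writer(batch[i:i + self.BATCH_SIZE])
```

In a real deployment the flush would run on an asynchronous thread and the `db_writer` would wrap the batch in a database transaction, as the text describes.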
After the database write succeeds, a globally unique Task identifier (Task ID) needs to be generated for each task. The generation process comprehensively considers three elements: the database write timestamp, the task priority score, and the data source code. In a specific implementation, a timestamp accurate to milliseconds is obtained, such as 1588763142358; the priority score is converted into a two-digit fixed-length character string, e.g. priority 15.5 is converted into "16"; and the last six characters are extracted from the data source code (e.g. "501001" from "DB_SRC20240501001"). The three parts are joined with a specific separator, and a 4-character random alphanumeric string is appended as a suffix to form the final Task identifier. This combination guarantees the global uniqueness of the identifier while embedding key information about the task.
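The identifier construction above can be sketched as below. The text leaves the separator unspecified, so the "-" separator and the function name are assumptions; the rounding of the priority to two digits follows the 15.5 → "16" example.

```python
import random
import string

def make_task_id(write_ts_ms: int, priority: float, source_code: str) -> str:
    """Millisecond timestamp + two-digit priority + last six source chars
    + 4-character random alphanumeric suffix, joined by an assumed '-'."""
    prio_part = f"{round(priority):02d}"   # e.g. 15.5 -> "16"
    source_part = source_code[-6:]          # e.g. "501001"
    suffix = "".join(random.choices(string.ascii_uppercase + string.digits, k=4))
    return "-".join([str(write_ts_ms), prio_part, source_part, suffix])
```

The timestamp makes identifiers ordered in time, while the random suffix guards against collisions among tasks written in the same millisecond.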
After generating the task identifier, the task information is written into a cache system to improve subsequent query efficiency. The cache record contains key information such as the task identifier, data source code, priority score, and creation time. Different cache expiry times are set according to task priority: a high-priority task (score > 15) is given a cache period of 30 minutes, a medium-priority task (score 10-15) a cache period of 20 minutes, and a low-priority task (score < 10) a cache period of 10 minutes. After the cache expires, the task information is reloaded from the database to ensure data consistency.
And returning a response containing the global unique task identifier to the user, wherein the user can inquire the task state through the identifier, the response body adopts a JSON format and contains information such as the task identifier, the estimated processing time, the task state and the like, and meanwhile, a detailed operation log containing information such as the user ID, the operation time, the request IP, the task identifier and the like is recorded so as to facilitate subsequent audit and problem investigation.
In an optional implementation manner, the file downloading task is analyzed based on a preset file feature recognition rule, the file downloading task is dynamically divided into a plurality of subtask fragments according to an analysis result, and an independent resource quota and a processing timeout threshold are allocated to each subtask fragment, including:
Extracting file data in the file downloading task, calculating entropy value, data distribution density and time sequence correlation of the file data, and constructing a file feature vector;
Calculating the association degree of the data blocks according to the file feature vectors, constructing a data block dependency relationship graph, calculating the ratio of the in-degree value to the out-degree value of each data block node, and taking the data block nodes with the ratio lower than the dependency threshold as initial segmentation nodes;
Calculating the data transmission cost between the initial segmentation nodes, selecting the initial segmentation nodes with the data transmission cost smaller than a cost threshold as optimal segmentation nodes, dynamically segmenting the file downloading task based on the optimal segmentation nodes, and generating a plurality of subtask fragments;
and performing similarity matching on the file feature vector and the historical execution task feature, setting an independent processing timeout threshold value for each subtask segment based on the similarity matching result, and setting an independent resource quota in combination with the data complexity of the subtask segment.
In practice, a file download task is received, which contains basic information about the file, such as its URL and file size. A file data sample is extracted from the download task, typically the first 1MB of the file or a uniformly sampled piece of data. For the extracted file data, an entropy value reflecting the randomness and information density of the data is computed by counting the frequency of occurrence of each byte; a high entropy value (e.g. > 7.5) typically indicates that the data is compressed or encrypted. Meanwhile, the data distribution density, i.e. how the data values are distributed over their range, is calculated by constructing a histogram: the byte value range 0-255 is divided into 16 intervals and the number of data points in each interval is counted. The time-series correlation of the data, i.e. the degree of correlation between adjacent data blocks, is quantified by calculating the correlation coefficient of adjacent data blocks; a coefficient close to 1 indicates high correlation, while a coefficient close to 0 indicates that the blocks are almost unrelated. From these calculations, a file feature vector comprising the entropy value, the data distribution density features, and the time-series correlation index is constructed.
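The three feature computations above can be sketched as follows. The byte entropy, the 16-bin histogram, and the Pearson correlation of adjacent blocks follow the description in the text; function names, the 1KB block size, and the flat vector layout are assumptions.

```python
import math

def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from byte frequencies."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)

def distribution_density(data: bytes, bins: int = 16) -> list:
    """Normalised histogram over 16 intervals of the 0-255 byte range."""
    hist = [0] * bins
    width = 256 // bins
    for b in data:
        hist[b // width] += 1
    return [h / len(data) for h in hist]

def adjacent_correlation(block_a: bytes, block_b: bytes) -> float:
    """Pearson correlation of byte values between two adjacent blocks."""
    n = min(len(block_a), len(block_b))
    xs, ys = block_a[:n], block_b[:n]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def file_feature_vector(sample: bytes, block_size: int = 1024) -> list:
    """[entropy] + 16 density bins + [mean adjacent-block correlation]."""
    blocks = [sample[i:i + block_size] for i in range(0, len(sample), block_size)]
    corr = (sum(adjacent_correlation(a, b) for a, b in zip(blocks, blocks[1:]))
            / max(len(blocks) - 1, 1))
    return [byte_entropy(sample)] + distribution_density(sample) + [corr]
```

Uniform random data approaches the maximum entropy of 8 bits per byte, matching the "> 7.5 indicates compressed or encrypted" heuristic in the text.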
Based on the constructed file feature vector, the association degree between the data blocks in the file is calculated. The file is divided into basic data blocks of equal size, such as 64KB each, and the association degree between blocks is calculated based on the time-series correlation index in the feature vector. For a text file, the association degree of adjacent blocks is higher (e.g. 0.8); for different content areas in a multimedia file, the association degree is lower (e.g. 0.3). A data block dependency graph is then constructed from these association values: nodes in the graph represent data blocks, edges represent association relations between data blocks, and edge weights represent association strength.
After the dependency graph is constructed, the ratio of the in-degree value to the out-degree value of each data block node is calculated, where the in-degree is the number of edges pointing to the node and the out-degree is the number of edges starting from it. Nodes with a ratio lower than the preset dependency threshold (e.g. 0.5) are marked as initial segmentation nodes; these are usually positions where the data content changes significantly and are therefore suitable task segmentation points. For example, a 100MB video file may have two low-ratio nodes at 33MB and 67MB, with ratios of 0.3 and 0.4 respectively, corresponding to scene transitions or content type changes.
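The split-node selection above can be sketched as follows. The in-degree/out-degree ratio test and the 0.5 default threshold follow the text; the edge-list graph representation, the function name, and the decision to skip nodes with no outgoing edges (where the ratio is undefined) are assumptions.

```python
def initial_split_nodes(edges, num_blocks, dep_threshold=0.5):
    """edges: iterable of (src, dst) dependency edges between block indices.
    Returns the block indices whose in-degree/out-degree ratio is below
    the dependency threshold."""
    in_deg = [0] * num_blocks
    out_deg = [0] * num_blocks
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    split = []
    for node in range(num_blocks):
        if out_deg[node] == 0:
            continue  # ratio undefined without outgoing edges; skip
        if in_deg[node] / out_deg[node] < dep_threshold:
            split.append(node)
    return split
```

A low ratio marks a block that many later blocks depend on but which depends on little before it, i.e. a natural boundary in the content.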
The data transmission cost between the initial segmentation nodes, i.e. the resource consumption required to transmit data from one node to another, is then calculated; the transmission cost can be computed comprehensively from the data volume between the nodes, network conditions, and data characteristics. For example, for two initial segmentation nodes 20MB apart, if the entropy of the intermediate data is high, indicating high data complexity, the transmission cost may reach a high value such as 85, while for a low-entropy data region, even if the nodes are 30MB apart, the transmission cost may be only 50. Initial segmentation nodes whose data transmission cost is smaller than the preset cost threshold are selected as optimal segmentation nodes.
Based on the determined optimal segmentation node, the file downloading task is dynamically segmented into a plurality of subtask fragments, each subtask fragment corresponds to a continuous data area of the original file, for example, a 100MB file is divided into three subtasks, namely 0-33MB,33-67MB and 67-100MB, and the data internal structure is considered in the segmentation mode, so that the data relevance in each subtask is strong, and the dependence among the subtasks is weak.
The constructed file feature vector is matched against the feature vectors of historically executed tasks for similarity; the similarity can be calculated with methods such as cosine similarity or Euclidean distance, where a higher value indicates more similar task characteristics. For example, a similarity of 0.92 between the current file feature vector and a certain ZIP file in the history indicates that the two files have similar processing characteristics. An independent processing timeout threshold is then set for each subtask segment based on the similarity matching result. For subtasks highly similar to a historical task (e.g. similarity > 0.9), the average processing time of the historical task plus a 20% tolerance is used directly as the timeout threshold; for moderate similarity (0.6-0.9), 150% of the historical time can be used; and for low similarity (< 0.6), a looser default threshold is set, such as the subtask size divided by the lowest expected download rate.
Meanwhile, independent resource quota is set according to the data complexity of the subtask fragments, the data complexity can be measured through entropy values and distribution density in feature vectors, subtasks with high complexity (such as entropy value > 7.0) need more processing resources and allocate 50% more memory and CPU time, subtasks with low complexity (such as entropy value < 5.0) can use standard quota, for example, for a download task comprising three subtasks, the first subtask is allocated with 2 processing threads and 100MB memory, the timeout threshold is 30 seconds, the second subtask is allocated with 4 threads and 200MB memory, the timeout threshold is 60 seconds, the third subtask is allocated with 6 threads and 400MB memory, and the timeout threshold is 120 seconds.
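The complexity-based quota assignment can likewise be sketched. The 7.0 entropy cut-off and the 50% uplift for memory and CPU time follow the text; the standard-quota figures (2 threads, 100MB) are taken from the first-subtask example, and the dict layout is an assumption.

```python
STANDARD_QUOTA = {"threads": 2, "memory_mb": 100, "cpu_time_factor": 1.0}

def resource_quota(entropy: float) -> dict:
    """High-complexity segments (entropy > 7.0, e.g. compressed or
    encrypted regions) get 50% more memory and CPU time."""
    quota = dict(STANDARD_QUOTA)
    if entropy > 7.0:
        quota["memory_mb"] = int(quota["memory_mb"] * 1.5)
        quota["cpu_time_factor"] *= 1.5
    return quota
```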
Through the dynamic fragmentation and resource allocation mechanism, the downloading process can be optimized according to the file characteristics, the downloading success rate and efficiency are improved, and meanwhile, the system resources are reasonably utilized.
In an alternative embodiment, the plurality of subtask fragments are distributed to a plurality of target processing nodes, a parallel calling file generation method is adopted to generate sub-files, and when a sub-file generation failure is detected, the corresponding sub-file generation task is redistributed to another file processing node, including:
Obtaining the total resource amount, the network bandwidth capacity and the current load state of each processing node, screening out candidate processing nodes meeting the subtask segment resource quota requirement, calculating the task affinity score of each candidate processing node, constructing a candidate processing node priority sequence according to the order of the task affinity score from high to low, selecting a plurality of candidate processing nodes as target processing nodes, and distributing the plurality of subtask segments to the target processing nodes;
Starting a file generation process on a target processing node, initializing a file writing handle and a file check code, writing data of a sub-task segment into a file according to the size of a data block in a segmented mode, calculating the check code of each data block, comparing the check code with the file check code, and writing the data block again when the fact that the check codes of the data blocks are not matched is detected, and updating file metadata information after the generation of the sub-file is completed;
And monitoring the processing progress of the subtask fragments on the target processing nodes, and when the fact that the file metadata information of the subtask fragments is unchanged and the processing time exceeds the processing timeout threshold value of each subtask fragment is detected, selecting the next processing node from the candidate processing node priority sequence again to perform task allocation.
In this embodiment, resource status information of each processing node is obtained, including the number of CPU cores, memory capacity, storage space, and network bandwidth capacity, as well as load indexes such as the current CPU utilization, memory occupancy, and network traffic. Taking a server cluster of a data center as an example, processing node A has a 16-core CPU, 64GB of memory, and 10Gbps of network bandwidth, with a current CPU utilization of 30%, memory occupancy of 40%, and network bandwidth utilization of 25%; processing node B has a 32-core CPU, 128GB of memory, and 40Gbps of network bandwidth, with a current CPU utilization of 60%, memory occupancy of 70%, and network bandwidth utilization of 45%.
And according to the resource requirement of each subtask segment, for example, the subtask segment 1 needs a 4-core CPU, an 8GB memory and a 1Gbps network bandwidth, the subtask segment is matched with the available resources of each processing node, and candidate processing nodes meeting the resource requirement are screened out. For subtask segment 1, processing nodes A and B both meet their resource requirements, becoming candidate processing nodes.
Calculating task affinity scores of the candidate processing nodes, wherein the task affinity scores are based on a plurality of factors including the resource matching degree of a processing node, its historical task completion rate, and its average processing speed: the resource matching degree considers how well a processing node's available resources fit the demands of the subtask, the historical task completion rate reflects the stability of the processing node, and the average processing speed reflects its efficiency. For processing node A, the resource matching degree is 85% (sufficient resources but not excessive), the historical task completion rate is 98%, and the average processing speed is 120 MB/s, yielding a comprehensive affinity score of 0.89; for processing node B, the resource matching degree is 65% (excessive resources), the historical task completion rate is 95%, and the average processing speed is 180 MB/s, yielding a comprehensive affinity score of 0.81.
And constructing candidate processing node priority sequences according to the task affinity scores, wherein the priority sequences in the example are processing node A and processing node B. And selecting the processing node A from the priority sequence as a target processing node, distributing the subtask fragment 1 to the processing node A for execution, and distributing other subtask fragments by adopting a similar method.
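The affinity scoring and priority ordering above can be sketched as a weighted combination of the three factors. The weights below are hypothetical (the source does not give a formula); they are chosen only so that the resulting ordering matches the example, with node A ranked ahead of node B:

```python
def affinity_score(resource_match, completion_rate, speed_mbps, max_speed_mbps,
                   w_match=0.6, w_complete=0.2, w_speed=0.2):
    """Weighted task-affinity score in [0, 1]. Weights are illustrative only."""
    speed_norm = speed_mbps / max_speed_mbps  # normalize speed to [0, 1]
    return (w_match * resource_match
            + w_complete * completion_rate
            + w_speed * speed_norm)

# Node A: 85% resource match, 98% completion rate, 120 MB/s
score_a = affinity_score(0.85, 0.98, 120, 180)
# Node B: 65% resource match (over-provisioned), 95% completion, 180 MB/s
score_b = affinity_score(0.65, 0.95, 180, 180)

# Candidate priority sequence: nodes sorted by descending affinity
priority = sorted([("A", score_a), ("B", score_b)], key=lambda t: -t[1])
```

With these weights the resource matching degree dominates, which is why the over-provisioned node B ranks below node A despite its higher raw speed.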
When the target processing node receives the subtask segment, it starts a file generation process, creates a file writing handle, allocates file storage space, and initializes a file check code, for example using the MD5 or SHA-256 algorithm. Taking subtask segment 1 as an example, it is divided into data blocks of 4 MB each, 500 data blocks in total.
A writing operation is executed for each data block: after the content of data block 1 is written into the file, the check code of the data block is calculated, for example using the CRC32 algorithm, yielding a check value of 0x1A2B3C4D, which is compared with the expected check value. If the check values match, the next data block is processed; if they do not match (for example, 0x1A2B3C5E is obtained), the data block is re-written until the check codes match or the maximum number of retries is reached.
During the writing process, the file metadata information is updated periodically (e.g., after every 10 data blocks written), including the file size, the number of processed data blocks, and the last modification time. After all data blocks are written, the check code of the whole file is calculated, the final file metadata is updated, and the generation state of the subfile is marked as "complete".
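A minimal sketch of the write-and-verify loop, using Python's `zlib.crc32` as the per-block check code. The 4 MB block size and retry limit follow the example above, while the read-back verification step and the function names are illustrative assumptions:

```python
import zlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB per data block (per the example)
MAX_RETRIES = 3               # illustrative retry cap

def write_block_verified(f, block, expected_crc):
    """Write one data block, read it back, and verify its CRC32.
    Re-writes the block on mismatch, up to MAX_RETRIES times."""
    for _ in range(MAX_RETRIES):
        pos = f.tell()
        f.write(block)
        f.flush()
        f.seek(pos)
        written = f.read(len(block))
        if zlib.crc32(written) == expected_crc:
            return True          # check codes match, block accepted
        f.seek(pos)              # mismatch: rewind and re-write
    return False

def generate_subfile(path, blocks):
    """Write all blocks of a subtask segment, verifying each one."""
    with open(path, "w+b") as f:
        for i, block in enumerate(blocks):
            if not write_block_verified(f, block, zlib.crc32(block)):
                raise IOError(f"block {i} failed verification")
    return True
```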
The processing progress of all subtask fragments is continuously monitored. For each executing subtask fragment, the file metadata information is checked at regular intervals (for example, every 5 seconds); if the metadata information of subtask fragment 1 (for example, the last modification time and the number of processed data blocks) shows no change over 3 consecutive checks, and the processing time exceeds the preset processing timeout threshold (for example, 1.5 times the expected processing time calculated from the data quantity, 150 seconds in this example), the subtask fragment is determined to have failed.
When the failure of subtask segment 1 is detected, the next processing node B is selected from the candidate processing node priority sequence, and subtask segment 1 is reassigned to processing node B for execution. After receiving the task, processing node B restarts the file generation flow, initializes the file writing handle and the check code, and writes and checks the file block by block until generation of the subfile is successfully completed.
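The stall detection and failover logic above can be sketched as follows; the metadata tuple shape, the `SegmentMonitor` class, and the `next_node` helper are illustrative names, not part of the source:

```python
import time

STALL_CHECKS = 3     # consecutive checks with unchanged metadata (per the example)
CHECK_INTERVAL = 5   # seconds between metadata checks (illustrative)

class SegmentMonitor:
    """Declares a subtask segment failed when its metadata is unchanged
    for STALL_CHECKS consecutive checks AND the timeout has elapsed."""
    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.start = time.monotonic()
        self.last_meta = None
        self.unchanged = 0

    def check(self, metadata):
        """metadata: e.g. (blocks_processed, last_modified).
        Returns True if the segment should be reassigned."""
        if metadata == self.last_meta:
            self.unchanged += 1
        else:
            self.unchanged = 0
            self.last_meta = metadata
        elapsed = time.monotonic() - self.start
        return self.unchanged >= STALL_CHECKS and elapsed > self.timeout

def next_node(priority_sequence, failed_node):
    """Pick the next candidate node from the priority sequence on failure."""
    remaining = [n for n in priority_sequence if n != failed_node]
    return remaining[0] if remaining else None
```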
By the technical means, the method and the device realize intelligent distribution of the subtask fragments, reliable writing of file data and automatic redistribution of failed tasks, and improve the parallel efficiency and reliability of file generation.
In an alternative embodiment, merging all the subfiles into a target download file according to a preset merging policy, detecting the file integrity of the target download file, and marking the globally unique task identifier as completed, including:
acquiring file metadata information of all subfiles, and establishing a subfile merging sequence according to the numbering sequence of the file metadata information;
reading data blocks of the subfiles according to the subfile merging sequence, writing the data blocks into a file merging buffer area for data merging, controlling the data merging rate through a token bucket algorithm, dynamically adjusting the capacity of the token bucket according to the writing bandwidth of the system disk, and, when the number of tokens in the token bucket is insufficient, suspending the data merging operation and reducing the reading rate of the upstream subfiles to balance the data processing pipeline;
Real-time checking is carried out on the merged data based on a sliding window mechanism: cyclic redundancy check codes of the data blocks are calculated within the sliding window and compared with the original check codes of the subfiles, and when a check code mismatch is detected, the sliding window is rolled back and data merging is performed again;
And reading the data blocks which pass verification from the merging buffer area, writing the data blocks into a target download file, carrying out integrity check on the target download file to obtain the target download file which passes verification, and marking the globally unique task identifier as a finished state.
In a method for implementing efficient file merging and integrity verification, ordered merging and verification of subfiles is implemented through multiple steps. The method retrieves the file metadata information of all subfiles from the storage system, including key attributes such as file name, creation time, file size, and file block sequence number, and constructs a subfile merging sequence based on the sequence numbers in the file metadata information, typically sequence numbers contained in the subfile names, such as "file_part_001", "file_part_002", and so on. For example, for a download task containing 5 subfiles, a merging queue is established in order from "file_part_001" to "file_part_005", ensuring that merging operations are performed in the correct file block sequence.
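Deriving the merging sequence from the numeric suffix in subfile names such as "file_part_001" might look like this; the regular expression is an assumption about the naming scheme:

```python
import re

def merge_order(subfile_names):
    """Order subfiles by the numeric suffix in names like 'file_part_001'.
    A plain lexicographic sort would break past 'file_part_999'."""
    def part_no(name):
        m = re.search(r"_part_(\d+)", name)
        if m is None:
            raise ValueError(f"no part number in {name!r}")
        return int(m.group(1))
    return sorted(subfile_names, key=part_no)
```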
The merging process adopts a token bucket algorithm to control the data processing rate. A token bucket is initialized with an initial capacity of 50 MB and is dynamically adjusted according to the current disk writing capacity of the system; a certain amount of tokens is added to the bucket every 100 milliseconds, the amount depending on the disk writing bandwidth monitored in real time. For example, when the system disk writing bandwidth is detected to be 120 MB/s, about 12 MB of tokens are added every 100 milliseconds. In the data merging process, subfile data blocks are read sequentially according to the subfile merging sequence, with a default block size of 4 MB, and tokens of the corresponding size are consumed for each data block read. When the number of available tokens in the bucket falls below 4 MB, the data merging operation is suspended and the reading rate of the upstream subfiles is reduced through a back-pressure mechanism to balance the whole data processing pipeline. In a specific implementation, when tokens are insufficient the reading thread enters a short sleep state, usually 50 milliseconds, preventing excessively fast reading operations from driving memory occupation too high.
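A minimal token bucket sketch keyed in bytes, with the 50 MB capacity, 4 MB block size, and roughly 50 ms back-pressure nap taken from the example; tying the refill to a fixed bandwidth figure is a simplification of the dynamic adjustment the text describes:

```python
import time

class TokenBucket:
    """Byte-denominated token bucket; defaults follow the example
    (50 MB bucket, refill matching a 120 MB/s disk) and are illustrative."""
    def __init__(self, capacity_bytes=50 * 2**20, refill_bytes_per_sec=120 * 2**20):
        self.capacity = capacity_bytes
        self.rate = refill_bytes_per_sec
        self.tokens = capacity_bytes
        self.last = time.monotonic()

    def refill(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now

    def try_consume(self, nbytes):
        """Consume tokens for one data block; False means the merger
        should pause and apply back-pressure to the upstream readers."""
        self.refill()
        if self.tokens >= nbytes:
            self.tokens -= nbytes
            return True
        return False

BLOCK = 4 * 2**20  # 4 MB default merge block

def merge_step(bucket, read_block, write_block, sleep_s=0.05):
    """One pipeline step: wait for tokens, then move one block."""
    while not bucket.try_consume(BLOCK):
        time.sleep(sleep_s)  # back-pressure: reader naps ~50 ms
    write_block(read_block())
```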
Data merging quality control adopts a sliding window mechanism for real-time verification. A sliding window with a default size of 16 MB is created; after merged data is written into the merging buffer, the sliding window moves through the buffer and calculates the CRC32 cyclic redundancy check code of the data in the current window. For example, when 32 MB of data has been written into the buffer, the sliding window sequentially calculates the check codes of the intervals 0-16 MB, 4 MB-20 MB, 8 MB-24 MB, and so on, and compares them with the original check codes of the subfiles. If the check code of an interval such as [12 MB-28 MB] does not match, for example the calculated check code is 0xA1B2C3D4 while the original check code is 0xA1B2C3D5, the sliding window is rolled back to the position of the last successful check and the data blocks of the corresponding subfiles are read and merged again. To improve verification efficiency, a block calculation mode is adopted: the 16 MB window is divided into four 4 MB sub-blocks whose check codes are calculated in parallel, and the results are then combined.
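The sliding-window verification with rollback can be sketched as below; window and step sizes are parameters so the 16 MB / 4 MB figures from the text can be scaled down for illustration, and the `expected_crcs` mapping is an assumed interface:

```python
import zlib

def verify_windows(buffer, window, step, expected_crcs):
    """Slide a window over the merge buffer in `step` increments and
    compare each window's CRC32 with its expected value. Returns the
    start offset of the last successfully verified window (the rollback
    point), or -1 if the very first window fails.
    `expected_crcs` maps window start offset -> original CRC32."""
    last_ok = -1
    for start in range(0, len(buffer) - window + 1, step):
        crc = zlib.crc32(buffer[start:start + window])
        if crc != expected_crcs[start]:
            return last_ok  # mismatch: caller rolls back and re-merges
        last_ok = start
    return last_ok
```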
After the data passes verification, the verified data blocks are read from the merging buffer and written into the target download file. Writing adopts an asynchronous IO mode: an IO request queue is maintained with a default queue length of 8 and a data size of 8 MB per IO request; the IO requests are processed in a background thread and the data is written to disk. After all subfiles are merged, an integrity check is performed on the generated target download file. The integrity check combines block verification with full-file verification: the target file is divided into a plurality of blocks of 128 MB (the last block may be smaller than 128 MB), the SHA-256 hash value of each block is calculated and compared with the original hash value obtained during initialization of the download task, and if all blocks pass the check, the MD5 value of the whole file is calculated for secondary confirmation. After verification passes, the system marks the globally unique task identifier, such as "download_task_12345678", in the task management system as completed and updates related information such as the task completion time and file size.
If a mismatch is found during the file integrity verification, the information of the failed blocks is recorded and the corresponding subfiles are re-merged; if verification still fails after 3 retries, a detailed error report is generated, including information such as the failed block range and the difference between the expected and actual check values, to facilitate subsequent analysis and processing.
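The two-stage integrity check described above (per-block SHA-256, then a full-file MD5 confirmation) might be sketched as follows; the function names and the `(ok, failed_blocks)` return shape are illustrative:

```python
import hashlib

CHUNK = 128 * 2**20  # 128 MB blocks per the text; scaled down in tests

def block_hashes(path, chunk=CHUNK):
    """SHA-256 per fixed-size block (the last block may be shorter)."""
    hashes = []
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            hashes.append(hashlib.sha256(block).hexdigest())
    return hashes

def verify_target(path, expected_block_hashes, expected_md5, chunk=CHUNK):
    """Stage 1: per-block SHA-256 against the hashes recorded at task
    initialization. Stage 2: full-file MD5 as secondary confirmation.
    Returns (ok, failed_block_indices) for the error report."""
    actual = block_hashes(path, chunk)
    failed = [i for i, (a, e) in enumerate(zip(actual, expected_block_hashes))
              if a != e]
    if len(actual) != len(expected_block_hashes) or failed:
        return False, failed
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            md5.update(block)
    return md5.hexdigest() == expected_md5, []
```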
In an alternative embodiment, uploading the target download file to a file server cluster, and establishing a file status monitoring table, responding to a file list query request of a user, acquiring and returning the target download file from the file server cluster, including:
uploading the target download file to a file server cluster, acquiring storage position information of the target download file in the file server cluster, and generating a corresponding file address;
establishing a file state monitoring table, recording the file state, the file address and the file access times of the target download file, marking the target download file as a hot spot file when the file access times exceed an access threshold value, and triggering a file preloading mechanism;
And receiving a file list query request sent by a user, obtaining a file state list based on the file state monitoring table, preferentially displaying the hot spot files, acquiring the target download file from the file server cluster, and returning the target download file to the user.
In this embodiment, a target download file uploaded by a user is received and transmitted to a preset file server cluster through a secure transmission protocol (such as HTTPS or SFTP). The file server cluster includes a plurality of distributed file storage servers, and distributed storage of the file is implemented through a load balancing policy. A slicing process is performed on the uploaded file; for example, a 2 GB video file is divided into 20 data blocks of 100 MB each, and the hash value of each data block is calculated for verification to ensure file integrity. After uploading is completed, the storage location information of the target download file is obtained from the file server cluster, including the server IP address, storage path, and file slice locations. Based on this information, a unique file access address in URI format is generated, such as "/cluster/file/repository/2023/07/15/file_id_12345.mp4", for quick locating and accessing of the file subsequently.
After uploading is completed, a file state monitoring table is established for recording and managing the state information of the target download file. The monitoring table is stored in a key-value data structure and includes the following fields: file ID, file name, file type, file size, upload time, file state (such as available, deleted, or damaged), file address, file access count, last access time, and file tags. The file state is checked periodically (e.g., every 5 minutes) to ensure file accessibility. When it is detected that the file access count exceeds a preset access threshold, the target download file is marked as a hot spot file; for example, when a certain video file is accessed more than 1000 times within 1 hour, the system updates its state field to "hot spot file" and adds a corresponding mark in the database.
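One possible shape for a monitoring-table record and the hot-spot marking step, using a plain dictionary as the key-value structure; the field names follow the list above, while the helper names and threshold default are assumptions:

```python
import time

ACCESS_THRESHOLD = 1000  # accesses before hot-spot marking (per the example)

def new_record(file_id, name, ftype, size, address):
    """One key-value record in the file state monitoring table."""
    return {
        "file_id": file_id, "file_name": name, "file_type": ftype,
        "file_size": size, "upload_time": time.time(),
        "status": "available", "file_address": address,
        "access_count": 0, "last_access": None, "tags": [],
    }

def record_access(table, file_id, threshold=ACCESS_THRESHOLD):
    """Update counters on each download; mark as hot spot past the
    threshold. Returns True exactly once, when preloading to edge
    cache nodes should be triggered."""
    rec = table[file_id]
    rec["access_count"] += 1
    rec["last_access"] = time.time()
    if rec["access_count"] > threshold and "hotspot" not in rec["tags"]:
        rec["tags"].append("hotspot")
        return True
    return False
```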
A file preloading mechanism is triggered for a target download file marked as a hot spot file, implemented by copying the hot spot file from its original storage location to cache servers, such as edge node servers distributed at different geographic locations. These edge node servers are configured with high-speed SSD storage and sufficient bandwidth resources to respond to user requests more quickly. A multi-copy strategy is applied to the hot spot file, storing identical file copies on a plurality of service nodes to improve the concurrency performance of file access. For oversized files, versions of different quality and format are generated in advance to suit user needs under different network conditions; for example, 1080p, 720p, and 480p versions are generated for a 4K-resolution video file, and a suitable version is automatically selected according to the user's network bandwidth.
When a file list query request sent by a user is received, related file information is retrieved from the file state monitoring table according to request parameters such as file type and upload time range, and a file state list is generated based on the records in the monitoring table. The list contains basic information such as file ID, file name, file type, file size, upload time, and file state. When the list is generated, files marked as hot spot files are displayed preferentially, arranged at the front of the list. The list is displayed in pages, with 20 records per page by default, and the user can adjust the number displayed per page through a request parameter.
For hot spot files, the file content is preferentially obtained from the nearest edge node server. A resumable download (breakpoint resume) function is supported: by recording the download progress, a user is allowed to continue downloading from the last interruption position after a download is interrupted, avoiding repeated transmission of the already downloaded portion of the file.
During file transmission, the download operation is recorded, the file access count field in the file state monitoring table is updated along with the last access time, the file access frequency is monitored in real time, and the hot spot file determination threshold is adjusted dynamically. For example, during peak network hours (e.g., 18:00-22:00 each day), the hot spot file determination threshold is raised to 2000 accesses per hour, and during off-peak hours (e.g., 2:00-6:00 a.m.) it is lowered to 500 accesses per hour to optimize system resource allocation.
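The time-of-day threshold adjustment from the example can be sketched as a simple lookup; the base value of 1000 outside the named windows is an assumption:

```python
def hotspot_threshold(hour, base=1000, peak=2000, off_peak=500):
    """Hot-spot determination threshold (accesses per hour) by time of
    day, per the example: raised during the evening peak (18:00-22:00),
    lowered in the early morning (2:00-6:00), otherwise a base value."""
    if 18 <= hour < 22:
        return peak
    if 2 <= hour < 6:
        return off_peak
    return base
```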
By the method, efficient uploading, state monitoring and intelligent distribution of the target download file are achieved, particularly preloading processing of the hot spot file is achieved, file access performance and user experience are remarkably improved, meanwhile load pressure of the server cluster is effectively balanced, and stability and reliability of the whole system are improved.
The asynchronous file generation and downloading realization system of the embodiment of the invention comprises:
the first unit is used for receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database and generating a global unique task identifier;
The second unit is used for analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing an independent resource quota and a processing timeout threshold for each subtask fragment;
The third unit is used for distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are failed to be generated;
And the fourth unit is used for uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
In a third aspect of an embodiment of the present invention, there is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.