[go: up one dir, main page]

CN120935167A - Asynchronous file generation and downloading realization method and system - Google Patents

Asynchronous file generation and downloading realization method and system

Info

Publication number
CN120935167A
CN120935167A CN202511437820.1A CN202511437820A CN120935167A CN 120935167 A CN120935167 A CN 120935167A CN 202511437820 A CN202511437820 A CN 202511437820A CN 120935167 A CN120935167 A CN 120935167A
Authority
CN
China
Prior art keywords
file
task
download
data
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511437820.1A
Other languages
Chinese (zh)
Other versions
CN120935167B (en
Inventor
朱晶晶
吴浩然
黄永洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guancheng Information Technology Suzhou Co ltd
Original Assignee
Guancheng Information Technology Suzhou Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guancheng Information Technology Suzhou Co ltd filed Critical Guancheng Information Technology Suzhou Co ltd
Priority to CN202511437820.1A priority Critical patent/CN120935167B/en
Publication of CN120935167A publication Critical patent/CN120935167A/en
Application granted granted Critical
Publication of CN120935167B publication Critical patent/CN120935167B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/06Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/12Applying verification of the received information
    • H04L63/123Applying verification of the received information received data contents, e.g. message integrity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3236Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions
    • H04L9/3239Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials using cryptographic hash functions involving non-keyed hash functions, e.g. modification detection codes [MDCs], MD5, SHA or RIPEMD
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3247Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving digital signatures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/32Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials
    • H04L9/3297Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols including means for verifying the identity or authority of a user of the system or for message authentication, e.g. authorization, entity authentication, data integrity or data verification, non-repudiation, key authentication or verification of credentials involving time stamps, e.g. generation of time stamps
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L9/00Cryptographic mechanisms or cryptographic arrangements for secret or secure communications; Network security protocols
    • H04L9/40Network security protocols
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2463/00Additional details relating to network architectures or network communication protocols for network security covered by H04L63/00
    • H04L2463/121Timestamp

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明提供异步文件生成下载实现方法及系统,涉及计算机文件处理技术领域,包括接收用户文件下载请求后创建任务并生成唯一标识;基于文件特征识别规则将任务划分为多个子任务片段并分配资源;将子任务分配至多处理节点并行生成子文件;合并子文件为目标下载文件并上传至服务器集群;响应用户查询返回文件。本发明提高了大文件下载效率,增强了系统稳定性,优化了用户体验。

This invention provides a method and system for asynchronous file generation and downloading, relating to the field of computer file processing technology. The method includes: receiving a user's file download request, creating a task and generating a unique identifier; dividing the task into multiple sub-task segments based on file feature recognition rules and allocating resources accordingly; distributing the sub-tasks to multiple processing nodes to generate sub-files in parallel; merging the sub-files into the target download file and uploading it to a server cluster; and responding to user queries by returning the file. This invention improves the efficiency of large file downloads, enhances system stability, and optimizes the user experience.

Description

Asynchronous file generation and downloading realization method and system
Technical Field
The invention relates to the technical field of computer file processing, in particular to a method and a system for realizing asynchronous file generation and downloading.
Background
In an enterprise information system, a file downloading function is a common basic requirement, and particularly in application scenes such as data analysis, report export and the like, a user often needs to export a large amount of data as a file for downloading. The traditional file downloading implementation usually adopts a synchronous processing mode, namely, when a user initiates a downloading request, the system immediately executes data query, file generation and transmission operations, and the user can acquire the file after waiting for the whole process to be completed.
The prior art lacks an effective file generation state tracking and task recovery mechanism, and once abnormality or failure occurs in the file generation process, the whole task needs to be re-executed, thereby wasting system resources and prolonging the waiting time of users.
In recent years, although part of systems start to adopt an asynchronous processing mode to improve the file downloading function, the problems of insufficient flexibility in task scheduling, unreasonable resource allocation, imperfect subtask failure processing mechanism and the like still exist. In complex business scenarios, especially when processing ultra-large data volumes or file generation tasks involving multi-data source integration is required, the requirements of high efficiency and high reliability are difficult to meet in the prior art.
Disclosure of Invention
The embodiment of the invention provides a method and a system for realizing asynchronous file generation and downloading, which can solve the problems in the prior art.
In a first aspect of the embodiment of the present invention, a method for implementing asynchronous file generation and download is provided, including:
receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier;
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing timeout threshold for each subtask fragment;
Distributing the sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are failed to be generated;
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
Receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier, wherein the method comprises the following steps of:
receiving a file downloading request of a user, extracting a data source type identifier, a data source number, time stamp information and a digital signature from the file downloading request to obtain a data source code and create a file downloading task;
based on the data source code, calculating the priority score of the file downloading task by combining the user grade and the task queue length, writing the file downloading task into a double-layer task processing structure according to the priority score, and starting an asynchronous thread to write the file downloading task with the priority score larger than a priority threshold into a database in batches when the number of the tasks of the double-layer task processing structure reaches the upper limit;
and generating a global unique task identifier of the file downloading task according to the database writing time stamp, the priority score and the data source code, and setting the cache failure time.
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and processing timeout threshold for each subtask fragment, wherein the method comprises the following steps:
Extracting file data in the file downloading task, calculating entropy value, data distribution density and time sequence correlation of the file data, and constructing a file feature vector;
Calculating the association degree of the data blocks according to the file feature vectors, constructing a data block dependency relationship graph, calculating the ratio of the input degree value to the output degree value of each data block node, and taking the data block node with the ratio lower than the dependency threshold value as an initial segmentation node;
Calculating the data transmission cost between the initial segmentation nodes, selecting the initial segmentation nodes with the data transmission cost smaller than a cost threshold as optimal segmentation nodes, dynamically segmenting the file downloading task based on the optimal segmentation nodes, and generating a plurality of subtask fragments;
and performing similarity matching on the file feature vector and the historical execution task feature, setting an independent processing timeout threshold value for each subtask segment based on the similarity matching result, and setting an independent resource quota in combination with the data complexity of the subtask segment.
Distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are generated in failure, wherein the method comprises the following steps of:
Obtaining the total resource amount, the network bandwidth capacity and the current load state of each processing node, screening out candidate processing nodes meeting the subtask segment resource quota requirement, calculating the task affinity score of each candidate processing node, constructing a candidate processing node priority sequence according to the order of the task affinity score from high to low, selecting a plurality of candidate processing nodes as target processing nodes, and distributing the plurality of subtask segments to the target processing nodes;
Starting a file generation process on a target processing node, initializing a file writing handle and a file check code, writing data of a sub-task segment into a file according to the size of a data block in a segmented mode, calculating the check code of each data block, comparing the check code with the file check code, and writing the data block again when the fact that the check codes of the data blocks are not matched is detected, and updating file metadata information after the generation of the sub-file is completed;
And monitoring the processing progress of the subtask fragments on the target processing nodes, and when the fact that the file metadata information of the subtask fragments is unchanged and the processing time exceeds the processing timeout threshold value of each subtask fragment is detected, selecting the next processing node from the candidate processing node priority sequence again to perform task allocation.
Merging all the subfiles into a target download file according to a preset merging strategy, detecting the file integrity of the target download file, and marking the globally unique task identifier as completed, wherein the method comprises the following steps:
acquiring file metadata information of all subfiles, and establishing a subfile merging sequence according to the numbering sequence of the file metadata information;
reading data blocks of the subfiles according to the subfiles merging sequence, writing the data blocks into a file merging buffer area for data merging, controlling the data merging rate through a token bucket algorithm, dynamically adjusting the capacity of the token bucket according to the writing bandwidth of a system disk, suspending the data merging operation when the number of tokens in the token bucket is insufficient, and reducing the reading rate balance data processing pipelines of the upstream subfiles;
Real-time checking is carried out on the combined data based on a sliding window mechanism, cyclic redundancy check codes of the data blocks are calculated in the sliding window, the calculated cyclic redundancy check codes are compared with original check codes of the subfiles, and when mismatching of the check codes is detected, the sliding window is retracted and data combination is carried out again;
And reading the data blocks which pass verification from the merging buffer area, writing the data blocks into a target download file, carrying out integrity check on the target download file to obtain the target download file which passes verification, and marking the globally unique task identifier as a finished state.
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, acquiring and returning the target download file from the file server cluster, and comprising the following steps:
uploading the target download file to a file server cluster, acquiring storage position information of the target download file in the file server cluster, and generating a corresponding file address;
establishing a file state monitoring table, recording the file state, the file address and the file access times of the target download file, marking the target download file as a hot spot file when the file access times exceed an access threshold value, and triggering a file preloading mechanism;
And receiving a file list query request sent by a user, obtaining a file state list based on the file state monitoring table, preferentially displaying the hot files, acquiring a target download file from the server cluster, and returning the target download file to the user.
In a second aspect of the embodiment of the present invention, an asynchronous file generation and download implementation system is provided, including:
the first unit is used for receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database and generating a global unique task identifier;
The second unit is used for analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing overtime threshold value for each subtask fragment;
The third unit is used for distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are failed to be generated;
And the fourth unit is used for uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
In a third aspect of an embodiment of the present invention,
There is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of an embodiment of the present invention,
There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The beneficial effects of the application are as follows:
By implementing the method for implementing asynchronous file generation and downloading, the file downloading task can be dynamically divided into a plurality of subtask fragments for parallel processing, the processing efficiency of large file generation and downloading is effectively improved, the waiting time of a user is shortened, and the response speed of a system is improved.
The invention adopts an independent resource quota and timeout threshold control mechanism and combines a subtask failure automatic redistribution strategy, thereby remarkably enhancing the stability and fault tolerance of the system under a high concurrency scene, effectively avoiding the problem of integral download failure caused by single-point failure and ensuring the reliability of the file generation process.
In addition, the invention realizes the functions that users can inquire the file processing progress and acquire results at any time through the file server cluster storage and the file state monitoring table management, obviously improves the user experience, simultaneously is convenient for a system administrator to monitor and manage the file processing process, and improves the maintainability of the system.
Drawings
FIG. 1 is a flow chart of an implementation method for asynchronous file generation and downloading according to an embodiment of the present invention;
fig. 2 is a flow chart of a method for processing a file download request.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 1 is a flow chart of an implementation method for asynchronous file generation and downloading according to an embodiment of the present invention, as shown in fig. 1, the method includes:
receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database, and generating a global unique task identifier;
Analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing timeout threshold for each subtask fragment;
Distributing the sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are failed to be generated;
Uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
Fig. 2 is a flow chart of a method for processing a file download request. In an alternative embodiment, receiving a file download request from a user, extracting a data source code and creating a file download task, writing the file download task into a database and generating a globally unique task identifier, including:
receiving a file downloading request of a user, extracting a data source type identifier, a data source number, time stamp information and a digital signature from the file downloading request to obtain a data source code and create a file downloading task;
based on the data source code, calculating the priority score of the file downloading task by combining the user grade and the task queue length, writing the file downloading task into a double-layer task processing structure according to the priority score, and starting an asynchronous thread to write the file downloading task with the priority score larger than a priority threshold into a database in batches when the number of the tasks of the double-layer task processing structure reaches the upper limit;
and generating a global unique task identifier of the file downloading task according to the database writing time stamp, the priority score and the data source code, and setting the cache failure time.
In this embodiment, a method of receiving a user file download request, extracting a data source code, and creating a file download task is described in detail. By creating an efficient task processing mechanism, the method ensures reasonable allocation of system resources and the priority of task processing.
In this embodiment, a FILE download request sent by a user needs to be received, where the request is usually transmitted in the form of HTTP or HTTPs protocol through a network, and includes a plurality of key information, a data source type identifier is extracted from a request body, where the identifier is usually in the form of a character string, such as "DB" indicates a database type source, "API" indicates an interface type source, "FILE" indicates a FILE system type source, and at the same time, a data source number is extracted, where the number is usually a unique character string, and a format such as "SRC20240501001" is used to precisely locate a specific data source, timestamp information in the request is recorded in UTC format, such as "2024-05-01t12:30:45z", and used to prevent replay attacks, and the request also includes a digital signature, and an encrypted character string generated by using the sha256+rsa algorithm, such as "a1b2c3d4e5f6.
After the file downloading task is created, the priority score of the task is calculated so as to reasonably allocate system resources, three main factors are considered in priority calculation, namely data source coding, user grade and current task queue length, for the data source coding, basic scores are allocated according to different types, such as 10 points for database type, 8 points for API type and 6 points for file system type, the user grade is generally divided into VIP users, common users and trial users, and the weights are respectively set to 1.5, 1.0 and 0.8. When the length of the task queue is less than 100, the score is not adjusted, when the length is between 100 and 500, the priority is reduced by 10%, when the length is more than 500, the priority is reduced by 20%, and the final priority score is calculated by combining the factors. For example, if the VIP user requests to download a database type source, and the current queue length is 80, the priority score is 10×1.5×1.0=15 points.
After the priority score is calculated, the file downloading task is written into a double-layer task processing structure, the structure comprises a high-priority task queue and a low-priority task queue in a memory, when the priority score exceeds 12 times, the tasks enter the high-priority queue, otherwise, enter the low-priority queue, the inside of each queue is organized by adopting a priority stack structure, the priority processing of the high-priority tasks is ensured, the task number of the double-layer task processing structure is continuously monitored, when the total task number reaches a preset upper limit (usually 1000), asynchronous thread processing is started, the threads screen out tasks with priority scores larger than a priority threshold (usually 10 times), the tasks form batches (each batch is at most 100 tasks), a database batch writing interface is called, the tasks are stored in a lasting mode, and the database writing adopts a transaction mechanism to ensure the atomicity of batch writing, namely all success or all failure.
After the database is written successfully, a globally unique Task identifier (Task ID) needs to be generated for each Task, three elements including a database writing time stamp, a Task priority score and a data source code are comprehensively considered in the generation process, in specific implementation, the time stamp accurate to milliseconds needs to be acquired, such as 1588763142358, the priority score is converted into a two-bit fixed-length character string, such as priority 15.5 is converted into 16, the last 6-bit character (such as 01001 is extracted from DB_SRC 20240501001) is extracted from the data source code, the three parts are connected by using a specific separator, and a 4-bit random alphanumeric character string is added as a suffix to form a final Task identifier, for example, the combination mode guarantees the global uniqueness of the identifier, and meanwhile key information of the Task is contained.
After generating task identifiers, writing task information into a cache system to improve the subsequent query efficiency, wherein the cache records contain key information such as task identifiers, data source codes, priority scores, creation time and the like, different cache failure times are set according to task priorities, namely, a high-priority task (score > 15) is set for 30 minutes of a cache period, a medium-priority task (10-15) is set for 20 minutes of a cache period, a low-priority task (10) is set for 10 minutes of a cache period, and after the cache fails, the task information is reloaded from a database to ensure data consistency.
And returning a response containing the global unique task identifier to the user, wherein the user can inquire the task state through the identifier, the response body adopts a JSON format and contains information such as the task identifier, the estimated processing time, the task state and the like, and meanwhile, a detailed operation log containing information such as the user ID, the operation time, the request IP, the task identifier and the like is recorded so as to facilitate subsequent audit and problem investigation.
In an optional implementation manner, the file downloading task is analyzed based on a preset file feature recognition rule, the file downloading task is dynamically divided into a plurality of subtask fragments according to an analysis result, and an independent resource quota and a processing timeout threshold are allocated to each subtask fragment, including:
Extracting file data in the file downloading task, calculating entropy value, data distribution density and time sequence correlation of the file data, and constructing a file feature vector;
Calculating the association degree of the data blocks according to the file feature vectors, constructing a data block dependency relationship graph, calculating the ratio of the input degree value to the output degree value of each data block node, and taking the data block node with the ratio lower than the dependency threshold value as an initial segmentation node;
Calculating the data transmission cost between the initial segmentation nodes, selecting the initial segmentation nodes with the data transmission cost smaller than a cost threshold as optimal segmentation nodes, dynamically segmenting the file downloading task based on the optimal segmentation nodes, and generating a plurality of subtask fragments;
and performing similarity matching on the file feature vector and the historical execution task feature, setting an independent processing timeout threshold value for each subtask segment based on the similarity matching result, and setting an independent resource quota in combination with the data complexity of the subtask segment.
In practice, a file download task is received, which contains basic information of the file, such as URL, file size, etc., a file data sample is extracted from the download task, typically the first 1MB of the file or a uniformly sampled piece of data, for the extracted file data, an entropy value is calculated reflecting the randomness and information density of the data, the entropy value calculation is calculated by counting the frequency of occurrence of each byte in the data, and a high entropy value (e.g., > 7.5) typically indicates that the data is compressed or encrypted. Meanwhile, calculating the data distribution density, namely the distribution condition of the data values in the range, which can be realized by constructing a histogram, dividing the byte value range of 0-255 into 16 intervals, counting the number of data points of each interval, analyzing the time sequence correlation of data, namely the degree of correlation between adjacent data blocks, quantifying by calculating the correlation coefficient of the adjacent data blocks, wherein the correlation coefficient is close to 1 and shows high correlation, and the correlation coefficient is close to 0 and shows almost irrelevance, and constructing a file feature vector comprising entropy values, data distribution density feature vectors and time sequence correlation indexes by the calculation.
Based on the constructed file feature vector, the association degree between each data block in the file is calculated, the file is divided into basic data blocks with the same size, such as 64KB, each block is divided into basic data blocks, based on time sequence correlation indexes in the feature vector, the association degree between the blocks is calculated, for the text file, the association degree of adjacent blocks is higher (such as 0.8), for different content areas in the multimedia file, the association degree is lower (such as 0.3), a data block dependency graph is constructed by using the association degree values, nodes in the graph represent the data blocks, edges represent the association relation between the data blocks, and weights of the edges represent the association strength.
After the dependency graph is constructed, calculating the ratio of the ingress value to the egress value of each data block node, wherein the ingress value represents the edge number pointing to the node, the egress value represents the edge number starting from the node, and the nodes with the ratio lower than the preset dependency threshold (such as 0.5) are marked as initial segmentation nodes, and are usually positions where the data content is significantly changed and are suitable as task segmentation nodes. For example, a 100MB video file has two low ratio nodes at 33MB and 67MB, with ratios of 0.3 and 0.4, respectively, which correspond to scene transitions or content type changes.
Further calculating the data transmission cost between the initial split nodes, i.e. the resource consumption required for transmitting data from one node to another, the transmission cost can be calculated comprehensively based on the data volume, network conditions and data characteristics between the nodes. For example, for two initial segmentation nodes 20MB apart, if the entropy of the intermediate data is high, which indicates that the data complexity is high, the transmission cost reaches a high value such as 85, while for a data region with a low entropy, even if the entropy is 30MB apart, the transmission cost is only 50, and the initial segmentation node with the data transmission cost smaller than the preset cost threshold is selected as the optimal segmentation node.
Based on the determined optimal segmentation node, the file downloading task is dynamically segmented into a plurality of subtask fragments, each subtask fragment corresponds to a continuous data area of the original file, for example, a 100MB file is divided into three subtasks, namely 0-33MB,33-67MB and 67-100MB, and the data internal structure is considered in the segmentation mode, so that the data relevance in each subtask is strong, and the dependence among the subtasks is weak.
And performing similarity matching on the constructed file feature vector and the feature vector of the historical execution task, wherein the similarity calculation can adopt methods such as cosine similarity or Euclidean distance, the higher the value is, the more similar the task characteristics are, for example, the similarity between the current file feature vector and a certain ZIP file in the historical record is 0.92, the two files are provided with similar processing characteristics, and an independent processing timeout threshold is set for each subtask segment based on a similarity matching result. For subtasks that are highly similar to the historical task (e.g., similarity > 0.9), the average processing time of the historical task plus 20% tolerance is directly used as the timeout threshold, for the case of moderate similarity (0.6-0.9), 150% of the historical time can be used as the threshold, and for the case of low similarity (< 0.6), a looser default threshold is set, such as subtask size divided by the lowest desired download rate.
Meanwhile, independent resource quota is set according to the data complexity of the subtask fragments, the data complexity can be measured through entropy values and distribution density in feature vectors, subtasks with high complexity (such as entropy value > 7.0) need more processing resources and allocate 50% more memory and CPU time, subtasks with low complexity (such as entropy value < 5.0) can use standard quota, for example, for a download task comprising three subtasks, the first subtask is allocated with 2 processing threads and 100MB memory, the timeout threshold is 30 seconds, the second subtask is allocated with 4 threads and 200MB memory, the timeout threshold is 60 seconds, the third subtask is allocated with 6 threads and 400MB memory, and the timeout threshold is 120 seconds.
Through the dynamic fragmentation and resource allocation mechanism, the downloading process can be optimized according to the file characteristics, the downloading success rate and efficiency are improved, and meanwhile, the system resources are reasonably utilized.
In an alternative embodiment, the multiple subtask fragments are distributed to multiple target processing nodes, a parallel calling file generation method is adopted to generate a subfile, when the subfile generation failure is detected, the corresponding subfile generation task is redistributed to other file processing nodes, including:
Obtaining the total resource amount, the network bandwidth capacity and the current load state of each processing node, screening out candidate processing nodes meeting the subtask segment resource quota requirement, calculating the task affinity score of each candidate processing node, constructing a candidate processing node priority sequence according to the order of the task affinity score from high to low, selecting a plurality of candidate processing nodes as target processing nodes, and distributing the plurality of subtask segments to the target processing nodes;
Starting a file generation process on a target processing node, initializing a file writing handle and a file check code, writing data of a sub-task segment into a file according to the size of a data block in a segmented mode, calculating the check code of each data block, comparing the check code with the file check code, and writing the data block again when the fact that the check codes of the data blocks are not matched is detected, and updating file metadata information after the generation of the sub-file is completed;
And monitoring the processing progress of the subtask fragments on the target processing nodes, and when the fact that the file metadata information of the subtask fragments is unchanged and the processing time exceeds the processing timeout threshold value of each subtask fragment is detected, selecting the next processing node from the candidate processing node priority sequence again to perform task allocation.
In this embodiment, resource status information of each processing node is obtained, including load indexes such as the number of CPU cores, the memory capacity, the storage space, the network bandwidth capacity, the current CPU utilization, the memory occupancy rate, the network traffic, and the like of each processing node, taking a server cluster of a certain data center as an example, the processing node a has a 16-core CPU, a 64-GB memory, a 10Gbps network bandwidth, the current CPU utilization rate is 30%, the memory occupancy rate is 40%, the network bandwidth utilization rate is 25%, the processing node B has a 32-core CPU, a 128-GB memory, a 40Gbps network bandwidth, the current CPU utilization rate is 60%, the memory occupancy rate is 70%, and the network bandwidth utilization rate is 45%.
And according to the resource requirement of each subtask segment, for example, the subtask segment 1 needs a 4-core CPU, an 8GB memory and a 1Gbps network bandwidth, the subtask segment is matched with the available resources of each processing node, and candidate processing nodes meeting the resource requirement are screened out. For subtask segment 1, processing nodes A and B both meet their resource requirements, becoming candidate processing nodes.
Calculating task affinity scores of candidate processing nodes, wherein the task affinity scores are based on a plurality of factors including resource matching degree of the processing nodes, historical task completion rate, average processing speed and the like, the resource matching degree considers the matching degree of demands of subtasks on resources and the processing nodes which can provide the resources, the historical task completion rate reflects the stability of the processing nodes, and the average processing speed reflects the efficiency of the processing nodes. For processing node A, the resource matching degree is 85% (sufficient resources but not excessive), the historical task completion rate is 98%, the average processing speed is 120MB/s, the comprehensive calculation results in an affinity score of 0.89, for processing node B, the resource matching degree is 65% (excessive resources), the historical task completion rate is 95%, the average processing speed is 180MB/s, and the comprehensive calculation results in an affinity score of 0.81.
And constructing candidate processing node priority sequences according to the task affinity scores, wherein the priority sequences in the example are processing node A and processing node B. And selecting the processing node A from the priority sequence as a target processing node, distributing the subtask fragment 1 to the processing node A for execution, and distributing other subtask fragments by adopting a similar method.
When the target processing node receives the subtask segment, the target processing node starts a file generation process, creates a file writing handle, allocates a file storage space, initializes a file check code, for example, using an MD5 or SHA256 algorithm, takes the subtask segment 1 as an example, and divides the subtask segment into data blocks of 4MB each, and 500 data blocks in total.
And executing writing operation on each data block, after writing the content of the data block 1 into a file, calculating a check code of the data block, for example, using a CRC32 algorithm to obtain a check value 0x1A2B3C4D, comparing the check value with an expected check value, if the check values are matched, continuing to process the next data block, and if the check values are not matched (for example, obtaining 0x1A2B3C 5E), re-writing the data block until the check codes are matched or the maximum retry number is reached.
During the writing process, the file metadata information is updated periodically (e.g., every 10 data blocks written), including file size, number of processed data blocks, last modification time, etc. After writing all the data blocks, calculating the check code of the whole file, updating the final file metadata, and marking the generation state of the subfiles as 'complete'.
The processing progress of all the subtask fragments is continuously monitored, for each executing subtask fragment, file metadata information is checked at regular intervals (for example, 5 seconds), and if the metadata information (for example, last modification time and number of processed data blocks) of the subtask fragment 1 is found to have no change in 3 continuous checks, and the processing time exceeds a preset processing timeout threshold (for example, 1.5 times of expected processing time calculated according to the data quantity, 150 seconds in this example), the subtask fragment processing failure is determined.
When the failure of the subtask segment 1 processing is detected, selecting the next processing node B from the candidate processing node priority sequence, and reassigning the subtask segment 1 to the processing node B for execution. And after receiving the task, the processing node B restarts the file generation flow, initializes the file writing handle and the check code, writes the file according to the data block and checks the file until the generation of the subfiles is successfully completed.
By the technical means, the method and the device realize intelligent distribution of the subtask fragments, reliable writing of file data and automatic redistribution of failed tasks, and improve the parallel efficiency and reliability of file generation.
In an alternative embodiment, merging all the subfiles into a target download file according to a preset merging policy, detecting the file integrity of the target download file, and marking the globally unique task identifier as completed, including:
acquiring file metadata information of all subfiles, and establishing a subfile merging sequence according to the numbering sequence of the file metadata information;
reading data blocks of the subfiles according to the subfiles merging sequence, writing the data blocks into a file merging buffer area for data merging, controlling the data merging rate through a token bucket algorithm, dynamically adjusting the capacity of the token bucket according to the writing bandwidth of a system disk, suspending the data merging operation when the number of tokens in the token bucket is insufficient, and reducing the reading rate balance data processing pipelines of the upstream subfiles;
Real-time checking is carried out on the combined data based on a sliding window mechanism, cyclic redundancy check codes of the data blocks are calculated in the sliding window, the calculated cyclic redundancy check codes are compared with original check codes of the subfiles, and when mismatching of the check codes is detected, the sliding window is retracted and data combination is carried out again;
And reading the data blocks which pass verification from the merging buffer area, writing the data blocks into a target download file, carrying out integrity check on the target download file to obtain the target download file which passes verification, and marking the globally unique task identifier as a finished state.
In a method for implementing efficient file merging and integrity verification, ordered merging and verification of subfiles is implemented through multiple steps, the method retrieves file metadata information of all subfiles from a storage system, including key attributes such as file name, creation time, file size, file block sequence number, and the like, and constructs a subfile merging sequence based on the sequence of numbers in the file metadata information, typically sequence numbers contained in the subfile names, such as "file_part_001", "file_part_002", and the like, for example, for a download task containing 5 subfiles, a merging queue is established according to the sequence from "file_part_001" to "file_part_005", so as to ensure that merging operations are performed according to the correct file block sequence.
The merging process adopts a token bucket algorithm to control the data processing rate, initializes a token bucket, the initial capacity of the token bucket is set to be 50MB, and the token bucket is dynamically adjusted according to the current system disk writing capacity, a certain amount of tokens are added to the token bucket every 100 milliseconds, the adding amount depends on the disk writing bandwidth monitored in real time, for example, when the system disk writing bandwidth is detected to be 120MB/s, about 12MB of tokens are added every 100 milliseconds. In the data merging process, sub-file data blocks are sequentially read according to a sub-file merging sequence, the default block size is 4MB, tokens with corresponding sizes are consumed for each data block to be read, when the number of available tokens in a token bucket is lower than 4MB, data merging operation is suspended, meanwhile, the reading rate of an upstream sub-file is reduced through a back pressure mechanism, so that the whole data processing pipeline is balanced, in a specific implementation, a reading thread can be enabled to enter a short sleep state, usually 50 milliseconds, when the tokens are insufficient, and the memory occupation is prevented from being too high due to too fast reading operation.
The data merging quality control adopts a sliding window mechanism to carry out real-time verification, a sliding window with the default size of 16MB is created, after merging data is written into a merging buffer area, the sliding window moves in the buffer area and calculates CRC32 cyclic redundancy check codes of the data in the current window, for example, when 32MB data is written into the buffer area, the sliding window sequentially calculates check codes of intervals of 0-16MB, 4MB-20MB, 8MB-24MB and the like and compares the check codes with original check codes of subfiles, if the check codes of intervals [12MB-28MB ] are not matched, if the calculated check codes are 0xA1B2C3D4 and the original check codes are 0xA1B2C3D5, the sliding window is retracted to the position of the last successful check, for example, the data blocks of the corresponding subfiles are read again and combined, and in order to improve the verification efficiency, the block calculation mode is adopted, the window of 16MB is divided into 4 sub-blocks of 4MB to calculate the check codes in parallel, and the merging result is obtained.
And after the data passes the verification, reading the data blocks which pass the verification from the merging buffer area and writing the data blocks into the target downloading file, wherein the writing adopts an asynchronous IO mode, an IO request queue is maintained, the default queue length is 8, the data size of each IO request is 8MB, the IO requests are processed in a background thread, the data are written into a disk, and after all the subfiles are merged, the integrity verification is carried out on the generated target downloading file. The integrity check adopts a mode of combining block check and full file check, the target file is divided into a plurality of blocks with the size of 128MB, the last block is smaller than 128MB, the SHA-256 hash value of each block is calculated and compared with the original hash value obtained during the initialization of the downloading task, if all the blocks pass the check, the MD5 value of the whole file is calculated and is subjected to secondary confirmation, and after the verification passes, the system marks a globally unique task identifier, such as 'download_task_12345678', in the task management system as a completed state and updates related information such as the task completion time, the file size and the like.
If the mismatch condition is found in the verification process of the file integrity, the information of the failed block is recorded, and the corresponding subfiles are tried to be recombined, if the verification is still failed after the retry is performed for 3 times, a detailed error report is generated, and the detailed error report comprises the information of the failed block range, the difference between the expected check value and the actual check value and the like, so that the subsequent analysis and the subsequent processing are facilitated.
In an alternative embodiment, uploading the target download file to a file server cluster, and establishing a file status monitoring table, responding to a file list query request of a user, acquiring and returning the target download file from the file server cluster, including:
uploading the target download file to a file server cluster, acquiring storage position information of the target download file in the file server cluster, and generating a corresponding file address;
establishing a file state monitoring table, recording the file state, the file address and the file access times of the target download file, marking the target download file as a hot spot file when the file access times exceed an access threshold value, and triggering a file preloading mechanism;
And receiving a file list query request sent by a user, obtaining a file state list based on the file state monitoring table, preferentially displaying the hot files, acquiring a target download file from the server cluster, and returning the target download file to the user.
In this embodiment, a target download file uploaded by a user is received, the file is transmitted to a preset file server cluster through a secure transmission protocol (such as HTTPS, SFTP, etc.), the file server cluster includes a plurality of distributed file storage servers, distributed storage of the file is implemented through a load balancing policy, a slicing process is performed on the uploaded file, for example, a 2GB video file is divided into 20 100MB data blocks, and a hash value of each data block is calculated for verification, so as to ensure file integrity, and after the uploading is completed, storage location information of the target download file is obtained from the file server cluster, including a server IP address, a storage path, a file slicing location, etc. Based on the above information, a unique file access address is generated, which is in URI format, such as "/cluster/file/repository/2023/07/15/file_id_12345.mp4", for quick locating and accessing of subsequent files.
After the uploading is completed, a file state monitoring table is established and used for recording and managing the state information of the target downloaded file, and the monitoring table is stored in a data structure by adopting key values and comprises the following fields of a file ID, a file name, a file type, a file size, uploading time, file states (such as available, deleted, damaged and the like), file addresses, file access times, final access time, file labels and the like. The file status is checked periodically (e.g., every 5 minutes), file accessibility is ensured, when it is detected that the number of file accesses exceeds a preset access threshold, the target downloaded file is marked as a hot file, e.g., when a certain video file is accessed more than 1000 times within 1 hour, the system updates its status field as a "hot file" and adds a corresponding mark in the database.
The file preloading mechanism is triggered for a target download file marked as a hot spot file, and is implemented by copying the hot spot file from an original storage location to a cache server, such as an edge node server distributed at a different geographic location. These edge node servers configure high speed SSD storage and sufficient bandwidth resources to more quickly respond to user requests. And implementing a multi-copy strategy on the hot spot file, storing the same file copy on a plurality of service nodes, improving the concurrency performance of file access, and generating versions with different quality and formats in advance for the oversized file so as to adapt to the user requirements under different network conditions, for example, generating 1080p, 720p, 480p and other versions for a video file with 4K resolution, wherein the user can automatically select the proper version according to the network bandwidth.
When a file list query request sent by a user is received, related file information is retrieved from a file state monitoring table according to request parameters such as file types, uploading time ranges and the like, a file state list is generated based on records in the monitoring table, the list contains basic information such as file IDs, file names, file types, file sizes, uploading time, file states and the like, when the list is generated, contents marked as hot files are preferentially displayed, the hot files are arranged at the front position of the list, the list is displayed in a paging mode, 20 records are displayed per page by default, and the user can adjust the display quantity of each page through the request parameters.
And for the hot files, the file content is preferentially obtained from the nearest edge node server, the breakpoint continuous transmission function is supported, and the user is allowed to continue to download from the last interruption position after the downloading is interrupted by recording the downloading progress, so that the repeated transmission of the downloaded file part is avoided.
In the file transmission process, the downloading operation is recorded, the file access times field in the file state monitoring table is updated, the last access time is updated, the file access frequency is monitored in real time, and the hot file judgment threshold is dynamically adjusted. For example, during peak network hours (e.g., 18:00-22:00 a day), the hotspot file determination threshold is raised to 2000 accesses per hour, and during low peak hours (e.g., 2:00 a-6:00 a.m.) to 500 accesses per hour to optimize system resource allocation.
By the method, efficient uploading, state monitoring and intelligent distribution of the target download file are achieved, particularly preloading processing of the hot spot file is achieved, file access performance and user experience are remarkably improved, meanwhile load pressure of the server cluster is effectively balanced, and stability and reliability of the whole system are improved.
The asynchronous file generation and downloading realization system of the embodiment of the invention comprises the following steps:
the first unit is used for receiving a file downloading request of a user, extracting a data source code, creating a file downloading task, writing the file downloading task into a database and generating a global unique task identifier;
The second unit is used for analyzing the file downloading task based on a preset file characteristic identification rule, dynamically dividing the file downloading task into a plurality of subtask fragments according to an analysis result, and distributing independent resource quota and a processing overtime threshold value for each subtask fragment;
The third unit is used for distributing the plurality of sub-task fragments to a plurality of target processing nodes, generating sub-files by adopting a parallel calling file generation method, and redistributing corresponding sub-file generation tasks to other file processing nodes when detecting that the sub-files are failed to be generated;
And the fourth unit is used for uploading the target download file to a file server cluster, establishing a file state monitoring table, responding to a file list query request of a user, and acquiring and returning the target download file from the file server cluster.
In a third aspect of an embodiment of the present invention, there is provided an electronic device including:
A processor;
a memory for storing processor-executable instructions;
Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.
In a fourth aspect of embodiments of the present invention, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.
The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.
It should be noted that the above embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that the technical solution described in the above embodiments may be modified or some or all of the technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the scope of the technical solution of the embodiments of the present invention.

Claims (9)

1.异步文件生成下载实现方法,其特征在于,包括:1. An asynchronous file generation and download implementation method, characterized by comprising: 接收用户的文件下载请求,提取数据源标识并创建文件下载任务,将所述文件下载任务写入数据库并生成全局唯一任务标识;Receive a user's file download request, extract the data source identifier and create a file download task, write the file download task into the database and generate a globally unique task identifier; 基于预设的文件特征识别规则对所述文件下载任务进行分析,根据分析结果将所述文件下载任务动态划分为多个子任务片段,并为每个子任务片段分配独立的资源配额及处理超时阈值;The file download task is analyzed based on preset file feature recognition rules. According to the analysis results, the file download task is dynamically divided into multiple sub-task segments, and each sub-task segment is allocated an independent resource quota and processing timeout threshold. 将所述多个子任务片段分配至多个分布式处理节点,采样并行调用文件生成方法生成子文件,当检测到所述子文件生成失败时,将对应的子文件生成任务重新分发至其他分布式处理节点;将全部子文件按照预设合并策略合并为目标下载文件,并检测其文件完整性,将所述全局唯一任务标识标记为已完成;The multiple subtask fragments are distributed to multiple distributed processing nodes, and the file generation method is called in parallel to generate sub-files. When the generation of a sub-file fails, the corresponding sub-file generation task is redistributed to other distributed processing nodes. All sub-files are merged into the target download file according to a preset merging strategy, and the file integrity is checked. The globally unique task identifier is marked as completed. 将所述目标下载文件上传至分布式文件服务器集群,响应用户的文件列表查询请求,从所述分布式文件服务器集群获取并返回所述目标下载文件。The target download file is uploaded to the distributed file server cluster. In response to the user's file list query request, the target download file is retrieved from the distributed file server cluster and returned. 2.根据权利要求1所述的方法,其特征在于,接收用户的文件下载请求,提取数据源编码并创建文件下载任务,将所述文件下载任务写入数据库并生成全局唯一任务标识,包括:2. The method according to claim 1, characterized in that receiving a user's file download request, extracting the data source code and creating a file download task, writing the file download task into the database and generating a globally unique task identifier, includes: 接收用户的文件下载请求,并从中提取数据源类型标识、数据源编号、时间戳信息及数字签名,得到数据源编码并创建文件下载任务;Receive the user's file download request, extract the data source type identifier, data source number, timestamp information and digital signature from it, obtain the data source code and create a file download task; 基于所述数据源编码,结合用户等级和任务队列长度,计算文件下载任务的优先级得分,根据所述优先级得分将所述文件下载任务写入双层任务处理结构,当所述双层任务处理结构的任务数量达到上限时,启动异步线程将所述优先级得分大于优先阈值的文件下载任务批量写入数据库;Based on the data source encoding, combined with the user level and task queue length, the priority score of the file download task is calculated. The file download task is written into the two-layer task processing structure according to the priority score. When the number of tasks in the two-layer task processing structure reaches the upper limit, an asynchronous thread is started to write the file download tasks with priority scores greater than the priority threshold into the database in batches. 根据数据库写入时间戳、所述优先级得分及数据源编码,生成所述文件下载任务的全局唯一任务标识,并设置缓存失效时间。Based on the database write timestamp, the priority score, and the data source code, a globally unique task identifier is generated for the file download task, and a cache expiration time is set. 3.根据权利要求1所述的方法,其特征在于,基于预设的文件特征识别规则对所述文件下载任务进行分析,根据分析结果将所述文件下载任务动态划分为多个子任务片段,并为每个子任务片段分配独立的资源配额及处理超时阈值,包括:3. The method according to claim 1, characterized in that, the file download task is analyzed based on preset file feature recognition rules, and the file download task is dynamically divided into multiple sub-task segments according to the analysis results, and each sub-task segment is allocated an independent resource quota and processing timeout threshold, including: 提取所述文件下载任务中的文件数据,计算所述文件数据的熵值、数据分布密度及时序相关性,构建文件特征向量;Extract file data from the file download task, calculate the entropy value, data distribution density, and time series correlation of the file data, and construct a file feature vector; 根据所述文件特征向量计算数据块关联度,构建数据块依赖关系图,并计算每个数据块节点的入度值与出度值的比值,将所述比值低于依赖阈值的数据块节点,作为初始分割节点;The correlation degree of data blocks is calculated based on the file feature vector, a data block dependency graph is constructed, and the ratio of the in-degree value to the out-degree value of each data block node is calculated. Data block nodes with a ratio lower than the dependency threshold are used as initial splitting nodes. 计算所述初始分割节点之间的数据传输代价,选取所述数据传输代价小于代价阈值的初始分割节点,作为最优分割节点;基于所述最优分割节点对所述文件下载任务进行动态分片,生成多个子任务片段;Calculate the data transmission cost between the initial segmentation nodes, select the initial segmentation node whose data transmission cost is less than the cost threshold as the optimal segmentation node, and dynamically segment the file download task based on the optimal segmentation node to generate multiple subtask segments. 将所述文件特征向量与历史执行任务特征进行相似度匹配,基于所述相似度匹配结果对每个所述子任务片段设置独立的处理超时阈值,结合其数据复杂度设置独立的资源配额。The file feature vector is matched with the features of historical execution tasks. Based on the similarity matching results, an independent processing timeout threshold is set for each subtask segment, and an independent resource quota is set in combination with its data complexity. 4.根据权利要求1所述的方法,其特征在于,将所述多个子任务片段分配至多个目标处理节点,采用并行调用文件生成方法生成子文件,当检测到所述子文件生成失败时,将对应的子文件生成任务重新分发至其他文件处理节点,包括:4. The method according to claim 1, characterized in that, the plurality of subtask fragments are allocated to a plurality of target processing nodes, a parallel file generation method is used to generate subfiles, and when the generation of a subfile fails, the corresponding subfile generation task is redistributed to other file processing nodes, including: 获取各处理节点的资源总量、网络带宽容量与当前负载状态,筛选出满足子任务片段资源配额要求的候选处理节点,并计算各候选处理节点的任务亲和度得分,按照任务亲和度得分由高到低的顺序构建候选处理节点优先级序列,并选取多个候选处理节点作为目标处理节点,将所述多个子任务片段分配至所述目标处理节点;Obtain the total resources, network bandwidth capacity and current load status of each processing node, filter out candidate processing nodes that meet the resource quota requirements of subtask segments, calculate the task affinity score of each candidate processing node, construct a priority sequence of candidate processing nodes in descending order of task affinity score, select multiple candidate processing nodes as target processing nodes, and allocate the multiple subtask segments to the target processing nodes. 在目标处理节点上启动文件生成进程,初始化文件写入句柄与文件校验码,将子任务片段的数据按照数据块大小分段写入文件,计算每个数据块的校验码并与文件校验码进行比对,当检测到数据块校验码不匹配时重新写入该数据块,完成子文件生成后更新文件元数据信息;Start the file generation process on the target processing node, initialize the file write handle and file checksum, write the data of the subtask fragments into the file according to the data block size, calculate the checksum of each data block and compare it with the file checksum, and rewrite the data block when a data block checksum mismatch is detected. After the sub-file generation is completed, update the file metadata information. 监控子任务片段在目标处理节点上的处理进度,当检测到子任务片段的文件元数据信息未发生变化,且处理时间已超过各子任务片段的处理超时阈值时,重新从候选处理节点优先级序列中选择下一各处理节点进行任务分配。Monitor the processing progress of subtask segments on the target processing node. When it is detected that the file metadata information of the subtask segment has not changed and the processing time has exceeded the processing timeout threshold of each subtask segment, reselect the next processing node from the priority sequence of candidate processing nodes for task allocation. 5.根据权利要求1所述的方法,其特征在于,将全部子文件按照预设合并策略合并为目标下载文件,并检测其文件完整性,将所述全局唯一任务标识标记为已完成,包括:5. The method according to claim 1, characterized in that, merging all sub-files into a target download file according to a preset merging strategy, detecting its file integrity, and marking the globally unique task identifier as completed, includes: 获取全部子文件的文件元数据信息,根据所述文件元数据信息的编号顺序建立子文件合并序列;Obtain the file metadata information of all sub-files, and establish a sub-file merging sequence according to the numbering order of the file metadata information; 根据所述子文件合并序列读取子文件的数据块,并将其写入文件合并缓冲区进行数据合并,通过令牌桶算法控制数据合并速率,所述令牌桶的容量根据系统磁盘写入带宽动态调整,当令牌桶中的令牌数量不足时暂停数据合并操作,并降低上游子文件的读取速率平衡数据处理管道;The data blocks of the sub-files are read according to the sub-file merging sequence and written into the file merging buffer for data merging. The data merging rate is controlled by the token bucket algorithm. The capacity of the token bucket is dynamically adjusted according to the system disk write bandwidth. When the number of tokens in the token bucket is insufficient, the data merging operation is paused and the read rate of the upstream sub-files is reduced to balance the data processing pipeline. 基于滑动窗口机制对合并后的数据进行实时校验,在滑动窗口内计算数据块的循环冗余校验码,将计算得到的循环冗余校验码与子文件的原始校验码进行比对,当检测到校验码不匹配时回退滑动窗口并重新进行数据合并;The merged data is verified in real time using a sliding window mechanism. The cyclic redundancy check code of the data block is calculated within the sliding window. The calculated cyclic redundancy check code is compared with the original check code of the sub-file. When a check code mismatch is detected, the sliding window is rolled back and the data is merged again. 从所述合并缓冲区读取校验通过的数据块写入目标下载文件,对所述目标下载文件进行完整性检验,得到验证通过的目标下载文件,并将所述全局唯一任务标识标记为已完成状态。The data blocks that have passed verification are read from the merge buffer and written into the target download file. The integrity of the target download file is checked to obtain the verified target download file, and the globally unique task identifier is marked as completed. 6.根据权利要求1所述的方法,其特征在于,将所述目标下载文件上传至文件服务器集群,并建立文件状态监控表,响应于用户的文件列表查询请求,从所述文件服务器集群获取并返回所述目标下载文件,包括:6. The method according to claim 1, characterized in that, uploading the target download file to a file server cluster and establishing a file status monitoring table, and in response to a user's file list query request, retrieving and returning the target download file from the file server cluster, includes: 将所述目标下载文件上传至文件服务器集群,获取所述目标下载文件在所述文件服务器集群中的存储位置信息,生成对应的文件地址;The target download file is uploaded to the file server cluster, the storage location information of the target download file in the file server cluster is obtained, and the corresponding file address is generated. 建立文件状态监控表,记录所述目标下载文件的文件状态、所述文件地址及文件访问次数,当所述文件访问次数超过访问阈值时,将所述目标下载文件标记为热点文件,触发文件预加载机制;Establish a file status monitoring table to record the file status, file address, and number of file accesses of the target download file. When the number of file accesses exceeds the access threshold, mark the target download file as a hot file and trigger the file preloading mechanism. 接收用户发送的文件列表查询请求,基于所述文件状态监控表得到文件状态列表,并优先展示热点文件,从所述服务器集群中获取目标下载文件并返回给用户。The system receives a file list query request from a user, obtains a file status list based on the file status monitoring table, prioritizes displaying popular files, retrieves the target download file from the server cluster, and returns it to the user. 7.异步文件生成下载实现系统,用于实现如权利要求1-6中任一项所述的方法,其特征在于,包括:7. An asynchronous file generation and download implementation system, used to implement the method as described in any one of claims 1-6, characterized in that it comprises: 第一单元,用于接收用户的文件下载请求,提取数据源编码并创建文件下载任务,将所述文件下载任务写入数据库并生成全局唯一任务标识;The first unit is used to receive the user's file download request, extract the data source code and create a file download task, write the file download task into the database and generate a globally unique task identifier; 第二单元,用于基于预设的文件特征识别规则对所述文件下载任务进行分析,根据分析结果将所述文件下载任务动态划分为多个子任务片段,并为每个子任务片段分配独立的资源配额及处理超时阈值;The second unit is used to analyze the file download task based on preset file feature recognition rules, dynamically divide the file download task into multiple sub-task segments according to the analysis results, and allocate independent resource quotas and processing timeout thresholds to each sub-task segment. 第三单元,用于将所述多个子任务片段分配至多个目标处理节点,采用并行调用文件生成方法生成子文件,当检测到所述子文件生成失败时,将对应的子文件生成任务重新分发至其他文件处理节点;将全部子文件按照预设合并策略合并为目标下载文件,并检测其文件完整性,将所述全局唯一任务标识标记为已完成;The third unit is used to distribute the multiple sub-task fragments to multiple target processing nodes, generate sub-files using a parallel file generation method, and redistribute the corresponding sub-file generation task to other file processing nodes when the sub-file generation fails. All sub-files are merged into a target download file according to a preset merging strategy, and the file integrity is checked. The globally unique task identifier is marked as completed. 第四单元,用于将所述目标下载文件上传至文件服务器集群,并建立文件状态监控表,响应于用户的文件列表查询请求,从所述文件服务器集群获取并返回所述目标下载文件。The fourth unit is used to upload the target download file to the file server cluster, establish a file status monitoring table, and in response to the user's file list query request, retrieve and return the target download file from the file server cluster. 8.一种电子设备,其特征在于,包括:8. An electronic device, characterized in that it comprises: 处理器;processor; 用于存储处理器可执行指令的存储器;Memory used to store processor-executable instructions; 其中,所述处理器被配置为调用所述存储器存储的指令,以执行权利要求1至6中任意一项所述的方法。The processor is configured to invoke instructions stored in the memory to execute the method according to any one of claims 1 to 6. 9.一种计算机可读存储介质,其上存储有计算机程序指令,其特征在于,所述计算机程序指令被处理器执行时实现权利要求1至6中任意一项所述的方法。9. A computer-readable storage medium having stored thereon computer program instructions, characterized in that, when executed by a processor, the computer program instructions implement the method described in any one of claims 1 to 6.
CN202511437820.1A 2025-10-10 2025-10-10 Asynchronous file generation and download implementation methods and systems Active CN120935167B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511437820.1A CN120935167B (en) 2025-10-10 2025-10-10 Asynchronous file generation and download implementation methods and systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511437820.1A CN120935167B (en) 2025-10-10 2025-10-10 Asynchronous file generation and download implementation methods and systems

Publications (2)

Publication Number Publication Date
CN120935167A true CN120935167A (en) 2025-11-11
CN120935167B CN120935167B (en) 2026-02-13

Family

ID=97585158

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511437820.1A Active CN120935167B (en) 2025-10-10 2025-10-10 Asynchronous file generation and download implementation methods and systems

Country Status (1)

Country Link
CN (1) CN120935167B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107743099A (en) * 2017-08-31 2018-02-27 华为技术有限公司 Data flow processing method, device and storage medium
CN111800459A (en) * 2020-05-26 2020-10-20 苏宁云计算有限公司 Asynchronous processing method, device and system for download task and storage medium
CN118590484A (en) * 2024-06-27 2024-09-03 中国建设银行股份有限公司 Data download method, device, equipment, storage medium and program product
CN119739537A (en) * 2025-03-03 2025-04-01 北京科杰科技有限公司 File fuzzy copy method and system based on big data file cluster
CN119996400A (en) * 2024-12-30 2025-05-13 深圳震有科技股份有限公司 A file uploading method and a file downloading method between a satellite and a ground core network
CN119996406A (en) * 2025-03-28 2025-05-13 山东浪潮科学研究院有限公司 A browser large file download method and device
CN120296021A (en) * 2025-03-21 2025-07-11 中国市政工程华北设计研究总院有限公司 Data association maintenance processing method and system based on file reference
CN120408687A (en) * 2025-07-02 2025-08-01 北京科杰科技有限公司 A persistent Flink job file loading and display method and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107743099A (en) * 2017-08-31 2018-02-27 华为技术有限公司 Data flow processing method, device and storage medium
CN111800459A (en) * 2020-05-26 2020-10-20 苏宁云计算有限公司 Asynchronous processing method, device and system for download task and storage medium
CN118590484A (en) * 2024-06-27 2024-09-03 中国建设银行股份有限公司 Data download method, device, equipment, storage medium and program product
CN119996400A (en) * 2024-12-30 2025-05-13 深圳震有科技股份有限公司 A file uploading method and a file downloading method between a satellite and a ground core network
CN119739537A (en) * 2025-03-03 2025-04-01 北京科杰科技有限公司 File fuzzy copy method and system based on big data file cluster
CN120296021A (en) * 2025-03-21 2025-07-11 中国市政工程华北设计研究总院有限公司 Data association maintenance processing method and system based on file reference
CN119996406A (en) * 2025-03-28 2025-05-13 山东浪潮科学研究院有限公司 A browser large file download method and device
CN120408687A (en) * 2025-07-02 2025-08-01 北京科杰科技有限公司 A persistent Flink job file loading and display method and system

Also Published As

Publication number Publication date
CN120935167B (en) 2026-02-13

Similar Documents

Publication Publication Date Title
US10412170B2 (en) Retention-based data management in a network-based data store
US10467105B2 (en) Chained replication techniques for large-scale data streams
US9471585B1 (en) Decentralized de-duplication techniques for largescale data streams
US9720989B2 (en) Dynamic partitioning techniques for data streams
US11296940B2 (en) Centralized configuration data in a distributed file system
US11494403B2 (en) Method and apparatus for storing off-chain data
US9609044B2 (en) Methods, systems, and media for stored content distribution and access
US9984139B1 (en) Publish session framework for datastore operation records
US10191663B1 (en) Using data store accelerator intermediary nodes and write control settings to identify write propagation nodes
US20210271648A1 (en) Data migration methods and system
US10592140B2 (en) Method and system for automated storage provisioning
CN114691775B (en) A hierarchical storage method and terminal based on blockchain
CN111488333B (en) Data processing method and device, storage medium and electronic equipment
CN120935167B (en) Asynchronous file generation and download implementation methods and systems
CN110866036B (en) Data processing method, system, device, terminal and readable storage medium
CN106873906A (en) Method and apparatus for managing metamessage
US8037242B2 (en) Contents delivery system using cache and data replication
CN120821442B (en) Control methods, devices, and storage media for distributed file systems
US10481813B1 (en) Device and method for extending cache operational lifetime
CN111343256A (en) Network disk file uploading method
CN116991815B (en) Log collection method, device, equipment and medium of distributed storage system
US12487855B2 (en) Replacing stale clusters in a cluster pool
US20240323038A1 (en) Method of uploading and managing single data set larger than maximum block size on blockchain
CN115016724B (en) Data processing method, device, data processing equipment and storage medium
CN120804160A (en) Cache data query method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant