Detailed Description
The embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort fall within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
According to an aspect of the embodiments of the present application, a method for configuring a hard disk array is provided. As an alternative implementation, the method may be applied to, but is not limited to, a configuration system of a hard disk array in a hardware environment as shown in FIG. 1. The configuration system of the hard disk array may include, but is not limited to, the terminal device 102, the network 110, and the hard disk array 112. A target client runs on the terminal device 102 (in FIG. 1, the target client is, by way of example, a client capable of performing hard disk array configuration). The terminal device 102 includes a display 108, a processor 106, and a memory 104. The display 108 may be used to display performance data of the hard disk array and the like, and also provides a human-machine interaction interface for receiving operations on the interface, including touch operations on different controls. The processor 106 responds to the human-machine interaction operation by generating an interaction instruction and sending the interaction instruction to the server. The memory 104 is used for storing the performance data.
Assuming that a client for hard disk array configuration is running on the terminal device 102 in FIG. 1, the specific procedure of this embodiment is as follows. In step S102, the terminal device 102 receives performance data from the hard disk array 112 through the network 110. The terminal device then executes steps S104-S108: it obtains reference performance data of the hard disk array, where the reference performance data indicates the performance data of the hard disk array over a time period; it determines, among at least one candidate decision condition, the candidate decision condition satisfied by the reference performance data as the target decision condition, where a candidate decision condition indicates a range of performance data of the hard disk array and each of the at least one candidate decision condition corresponds to a candidate configuration of the hard disk array; it determines the candidate configuration corresponding to the target decision condition as the target configuration; and, when the current configuration of the hard disk array is inconsistent with the target configuration, it configures the hard disk array to the target configuration. The terminal device 102 then executes step S110 to configure the hard disk array 112 according to the target configuration via the network 110.
Optionally, in this embodiment, the terminal device 102 may be a terminal device on which the target client is deployed, and may include, but is not limited to, at least one of a mobile phone (such as an Android phone or an iOS phone), a notebook computer, a tablet computer, a palmtop computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart TV, and the like. The network may include, but is not limited to, wired networks (including local area networks, metropolitan area networks, and wide area networks) and wireless networks (including Bluetooth, Wi-Fi, and other networks that enable wireless communication). The network may also refer to the physical connection between the terminal device 102 and the hard disk array 112.
Alternatively, the terminal device 102 may be a RAID controller. The RAID controller may be a hardware device located inside or outside the hard disk array, and assumes the core responsibilities of executing instructions from the policy control layer, managing RAID level migration and striping, and monitoring the health of the hard disks. The RAID controller is the physical component that carries out the intelligent policies.
The terminal device 102 may also be a software management system on a storage server, which typically runs on the storage server rather than directly in the hard disk array. It is the main component of the policy control layer and is responsible for data monitoring, pattern recognition, predictive modeling, and decision generation. The software management system indirectly controls storage operations in the array by communicating with the RAID controller, thereby realizing dynamic RAID configuration adjustment.
The terminal device 102 may also be a cloud storage service that is not itself part of the hard disk array but can be connected to the data center's storage system via a network as an additional storage resource. When local storage resources are strained or potential failures need to be handled, the intelligent dynamic RAID system can automatically call on the resources of the cloud storage service to temporarily migrate data or store it redundantly.
It should be noted that the terminal device 102 may belong to the same physical product as the hard disk array 112, i.e., the terminal device 102 may be located in the hard disk array as part of the hard disk array 112 and control that single hard disk array. The terminal device 102 may also be a separate external terminal responsible for configuring different hard disk arrays.
Alternatively, the hard disk array 112 (RAID) may be configured to provide higher performance, greater storage capacity, and/or enhanced data reliability by combining multiple physical hard disks into a logically single storage device via specific data distribution and redundancy policies. It may be an array of magnetic disks or an array of hard disks.
The hard disk array 112 includes physical hard disks, which may be conventional hard disk drives (HDD) or solid state drives (SSD), connected to the RAID controller by data lines and collectively providing the data storage function.
An embodiment of the present application provides a method for configuring a hard disk array, and FIG. 2 is a flowchart of an alternative method for configuring a hard disk array according to an embodiment of the present application. As shown in FIG. 2, the method for configuring a hard disk array includes:
Step S202, obtaining reference performance data of a hard disk array, wherein the reference performance data is used for indicating the performance data of the hard disk array in a time period;
it should be noted that, the hard disk array is a set formed by a plurality of hard disks, which can provide faster read-write speed, larger storage capacity or data redundancy, and is a common form of RAID array. The reference performance data is a set of performance metrics exhibited by the hard disk array over a particular period of time, and may include IOPS (input output operations per second), throughput, latency, SMART data, and other key metrics that measure the performance of the hard disk or array. The time period is a time window for collecting and analyzing performance data, and may be seconds, minutes, hours or longer, depending on the settings of the monitoring and management system.
In an alternative embodiment, the system periodically collects performance data for the hard disk array to reflect the operational status of the array over a specified period of time. These data include not only basic read-write performance metrics, but may also cover disk health information, such as SMART data, as well as the pattern and frequency of data access. The collection of performance data is fundamental to subsequent analysis and prediction, which provides information about the current state of the array and past performance that will be used in decision algorithms to decide whether and how to adjust RAID configurations.
It should be noted that the reference performance data may include IOPS (input/output operations per second), which reflects the capability of the hard disk array to handle read and write operations. Throughput is a measure of the total amount of data transferred per unit time, typically in MB/s or GB/s. Latency is the time required for a data request to complete its response, and is used to evaluate how quickly the hard disk array processes I/O requests. SMART (Self-Monitoring, Analysis and Reporting Technology) data provides hard disk health metrics, including error rate, read-ahead errors, head flying time, and the like, used to monitor the hard disk for potential failure. The cache hit rate reflects the effectiveness of the storage system's cache; a higher cache hit rate means that more data requests can be served directly from the cache, reducing disk accesses and improving response speed.
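For concreteness, the following minimal Python sketch shows one way such reference performance data might be gathered into a single record over an observation window. The class name, field names, and sample values are illustrative assumptions, not part of the application.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReferencePerformanceData:
    """Performance metrics of a hard disk array collected over one time window."""
    window_seconds: int        # length of the observation window
    iops: float                # input/output operations per second
    throughput_mb_s: float     # data transferred per second, in MB/s
    latency_ms: float          # average time to complete an I/O request
    cache_hit_rate: float      # fraction of requests served from cache (0.0-1.0)
    sequential_ratio: float    # share of sequential read/write operations (0.0-1.0)
    smart_error_rates: List[float] = field(default_factory=list)  # per-disk SMART error rates

# Example record for a 15-minute observation window (illustrative values only).
sample = ReferencePerformanceData(
    window_seconds=900,
    iops=12_500,
    throughput_mb_s=480.0,
    latency_ms=8.5,
    cache_hit_rate=0.72,
    sequential_ratio=0.81,
    smart_error_rates=[0.001, 0.0, 0.004, 0.0, 0.0],
)
```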
In an alternative embodiment, the reference performance data may be predicted data derived from the performance data of the hard disk array. After the performance data of each period of the hard disk array is obtained, the collected data can be classified and analyzed. For example, based on real-time data from the last 15 minutes and the pattern recognition results, historical data and predictive algorithms (such as time series analysis, regression analysis, or machine learning models, e.g., ARIMA or LSTM) may be used to predict the load trend and hard disk health for the next 5 minutes. The prediction may involve two prediction modes (a simple prediction sketch follows the two modes below):
Load trend prediction: predicting the intensity and type of future I/O operations helps the system prepare in advance for possible high load or pattern changes, such as increasing the stripe size to accommodate sequential reads and writes, or switching to a more redundant RAID configuration to cope with possible failures.
Hard disk health prediction: the health state of the hard disk is evaluated and its failure probability predicted by analyzing SMART data and other performance indicators. For example, if the error rate of a hard disk is found to be increasing, even while still within the normal range, this may indicate a risk of failure within the next few hours.
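As a deliberately simple stand-in for the ARIMA/LSTM models mentioned above, the sketch below fits a linear trend to recent per-minute IOPS samples and extrapolates it five minutes ahead. The sample values and the choice of a linear model are assumptions for illustration only.

```python
import numpy as np

def predict_load_trend(recent_iops: list, horizon: int = 5) -> list:
    """Fit a least-squares line to recent per-minute IOPS samples and
    extrapolate it for the next `horizon` minutes."""
    t = np.arange(len(recent_iops))
    slope, intercept = np.polyfit(t, recent_iops, deg=1)   # coefficients, highest power first
    future_t = np.arange(len(recent_iops), len(recent_iops) + horizon)
    return (slope * future_t + intercept).tolist()

# 15 one-minute IOPS samples -> predicted IOPS for the next 5 minutes.
history = [900, 950, 1000, 1080, 1150, 1230, 1300, 1390,
           1480, 1560, 1650, 1750, 1840, 1940, 2050]
print(predict_load_trend(history, horizon=5))
```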
It should be noted that in intelligent storage systems, it is entirely feasible to use historical data and current data for analysis and decision making, and prediction is not necessary. The use of predictive data is more of a decision to look ahead, such as predicting future loads, risk of failure, etc., to respond in advance, but this does not mean that prediction is the only or necessary means of intelligent management of the storage system.
In alternative embodiments, the reference performance data may be historical data and current data, which may provide past and present operating conditions of the storage system, including I/O patterns, performance metrics, data access frequency, disk health, and the like. By monitoring and collecting this data, it can be analyzed without prediction to optimize the current configuration and performance of the system.
By carrying out statistical analysis on the historical data, the conventional workload mode, performance bottleneck and common fault points of the system can be identified, so that more effective storage strategies and fault recovery plans are formulated.
The current state of the storage system, including IOPS, throughput, latency, disk error rate, and the like, can be monitored in real time, so that sudden performance degradation or fault conditions can be responded to immediately. Measures such as isolation, adjusting the stripe size, or switching the RAID level can then be taken to optimize performance and ensure data security.
In alternative embodiments, the use of predictions, histories, and current data in combination may provide a more comprehensive basis for decision making in some scenarios. For example, predictive data may be used to predict future possible load changes or fault risks, allowing the system to readily adjust configuration, while historical and current data may be used to verify the accuracy of the predictions, and to optimize in real-time based on system state, ensuring real-time and accuracy of decisions.
Step S204, determining candidate decision conditions which are met by the reference performance data in at least one candidate decision condition as target decision conditions, wherein the candidate decision conditions are used for indicating the range of the performance data of the hard disk array, and the at least one candidate decision condition corresponds to the candidate configuration of one hard disk array;
It should be noted that the candidate decision conditions are a set of possible decision conditions preset, according to performance indicators and system states, for use in the system optimization or decision process. These conditions specify what range the data should fall within to trigger the corresponding action; for example, when the read-write latency is below a certain threshold, the system may tend toward a more efficient RAID configuration, whereas otherwise a configuration that is more focused on data redundancy and security may be selected.
The target decision conditions are those of a series of candidate decision conditions that are determined by the system to be most suitable as decision basis in the current state. Once determined, the system will follow the guidelines under this condition to make corresponding configuration adjustments.
Candidate configurations for a hard disk array may include RAID level, stripe size, redundancy mechanisms, etc., which directly determine the performance, reliability, and storage efficiency of the array. The following is a detailed analysis of these elements:
RAID levels define the manner in which data is stored and redundant in a hard disk array. Different RAID levels address different performance and reliability requirements, for example:
RAID0: data striping with no redundancy; provides the highest read-write performance, but failure of any single hard disk results in loss of all data in the array.
RAID1: mirroring; data is fully copied to two hard disks, providing data redundancy and high data reliability, but the usable storage capacity is only half of the total capacity of the hard disks in the array.
RAID5: striping with distributed parity; requires at least 3 hard disks, provides data redundancy and relatively high read-write performance, and can continue to operate in the event of a single hard disk failure.
RAID6: similar to RAID5 but with double parity, adding further fault tolerance; data remains protected even when two hard disks fail simultaneously.
RAID10: a combination of RAID1 and RAID0; provides both striping and mirroring, ensuring high read-write performance and data redundancy, and is suitable for scenarios requiring both high performance and high reliability.
Stripe size refers to the unit size of data blocks distributed among different hard disks in a RAID array. The choice of stripe size has a direct impact on performance, with smaller stripe sizes (e.g., 8KB or 16 KB) being advantageous for random read and write operations because it can read or write small data blocks from or to multiple hard disks faster, reducing seek time. A larger stripe size (e.g., 256KB or greater) is suitable for continuous large data block reads and writes because it can reduce data transfer delay per I/O operation, improving data throughput.
The data redundancy mechanism concerns how additional data is stored in the hard disk array to prevent data loss. Redundancy methods include parity, as used in RAID5 and RAID6: by computing parity blocks and distributing them across the array, the system ensures that lost data can be reconstructed in the event of a hard disk failure. Another method is mirroring, as used in RAID1: the data is fully replicated on two hard disks, providing data redundancy; when a single hard disk fails, the system can immediately switch to the mirror hard disk, ensuring continuous access to the data.
In an alternative embodiment, the system checks all candidate decision conditions to see which conditions' performance data ranges match the currently collected reference performance data. The candidate decision conditions that match are taken as the target decision conditions, i.e., the optimal decision scheme the system should follow next.
Candidate decision conditions may cover various workload and performance states that the hard disk array may face, each condition corresponding to one possible optimal configuration or policy. For example, if the current read-write latency is low, the system may determine that it is currently a good time to make RAID level upgrades or striping parameter adjustments to improve performance, whereas if the detected disk error rate increases, the system may prefer a RAID configuration that adds redundancy protection, such as RAID5 or RAID6, to ensure data security and system stability.
The process embodies dynamic adaptation and optimization, and through real-time analysis and decision making, the system can flexibly adjust the storage configuration according to the actual load and performance requirements, so that the utilization of storage resources is maximized, and meanwhile, the reliability of data and the continuity of service are ensured. Such a mechanism is particularly useful in situations where workload or performance demands change frequently, such as data centers, cloud computing environments, or high performance computing clusters.
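The matching of reference performance data against candidate decision conditions can be sketched as follows; the predicates, threshold values, and configuration fields are illustrative assumptions rather than values prescribed by the application.

```python
from typing import Optional

# Each candidate decision condition pairs a predicate over the reference
# performance data with the candidate configuration it corresponds to.
CANDIDATE_CONDITIONS = [
    # (name, predicate over the reference data, candidate configuration)
    ("high_error_rate",
     lambda d: d["error_rate"] > 0.01,
     {"raid_level": "RAID6", "stripe_kb": 64}),
    ("sequential_heavy",
     lambda d: d["sequential_ratio"] > 0.8 and d["latency_ms"] < 100,
     {"raid_level": "RAID0", "stripe_kb": 1024}),
    ("high_write_latency",
     lambda d: d["latency_ms"] > 100,
     {"raid_level": "RAID10", "stripe_kb": 16}),
]

def select_target(reference: dict) -> Optional[dict]:
    """Return the candidate configuration of the first condition that the
    reference performance data satisfies (the target decision condition)."""
    for name, predicate, config in CANDIDATE_CONDITIONS:
        if predicate(reference):
            return {"condition": name, **config}
    return None  # no condition matched: keep the current configuration

print(select_target({"error_rate": 0.002, "sequential_ratio": 0.85, "latency_ms": 12}))
# -> {'condition': 'sequential_heavy', 'raid_level': 'RAID0', 'stripe_kb': 1024}
```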
Step S206, determining candidate configuration corresponding to the target decision condition as target configuration;
As previously mentioned, the target decision condition refers to the particular criteria on which the system relies when making RAID configuration selections, and may include real-time I/O patterns, load conditions, failure rates, and the like. The target configuration is the RAID configuration that the system, after evaluation by the policy processing model and priority calculation among a plurality of candidate configurations, determines to be the best or most suitable under the current decision condition.
In an alternative embodiment, the system analyzes and compares all candidate configurations, and determines an optimal RAID configuration as a target configuration based on current decision conditions, such as disk error rate, I/O optimization requirements, storage utilization or write latency, and priorities of these conditions. The system is able to quantitatively compare the potential benefits and costs of each candidate configuration. Eventually, the system will determine the RAID configuration that balances performance, security, and cost at the current decision conditions and assign it as the target configuration. This configuration will then be applied to the virtualized storage layer to achieve intelligent tuning and optimization of storage resources, thereby improving overall system performance and data redundancy while reducing resource consumption and maintenance costs.
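One hedged way to picture this quantitative comparison of potential benefits and costs is the toy scoring function below; the factors, weights, and candidate parameters are made-up assumptions for illustration, not values from the application.

```python
def score_candidate(config: dict, conditions: dict) -> float:
    """Illustrative benefit-minus-cost score for one candidate configuration."""
    benefit = (config["expected_iops_gain"] * conditions["io_pressure"]
               + config["redundancy_gain"] * conditions["failure_risk"])
    cost = (config["migration_gb"] / 1000.0            # data that must be moved
            + config["capacity_overhead"] * conditions["space_pressure"])
    return benefit - cost

candidates = {
    "RAID0_1MB":  {"expected_iops_gain": 0.6, "redundancy_gain": 0.0,
                   "migration_gb": 800, "capacity_overhead": 0.0},
    "RAID6_64KB": {"expected_iops_gain": 0.1, "redundancy_gain": 0.9,
                   "migration_gb": 1200, "capacity_overhead": 0.25},
}
conditions = {"io_pressure": 0.8, "failure_risk": 0.1, "space_pressure": 0.3}

# The candidate with the best balance of benefit and cost becomes the target configuration.
target = max(candidates, key=lambda name: score_candidate(candidates[name], conditions))
print(target)
```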
In step S208, in the case where the current configuration of the hard disk array is inconsistent with the target configuration, the hard disk array is configured as the target configuration.
It should be noted that, when the current hard disk array configuration fails to meet the predicted performance requirement or the service continuity requirement, the system actively adjusts the configuration to the target configuration. This means that if the monitoring and analysis finds that the hard disk array under the current configuration cannot reach the desired performance level, or that there is a potential data security risk, the system will automatically or manually initiate the configuration change process to ensure that the storage system can adapt to the changing environment, providing optimal performance and guaranteeing the security of the data.
In alternative embodiments, details of the target configuration, including but not limited to RAID level, stripe size, redundancy mechanism, etc., are first clarified before any configuration changes are initiated. These parameters should be based on real-time analysis of I/O patterns, load conditions, disk health and traffic requirements, i.e., determining target configurations.
After determining the target configuration, a RAID level migration may be performed, e.g., from RAID5 to RAID6, or from RAID0 to RAID10. This typically involves a redistribution of data and reconstruction of parity blocks. During migration, the system may limit data write operations to prevent data inconsistencies while ensuring that there is enough free space to store the generated parity data.
If the stripe size needs to be adjusted, the system reorganizes the data blocks through a background task, and ensures that each data block is distributed on different hard disks according to the new stripe size. This may affect the read-write performance of the data and therefore requires reasonable arrangements to be performed during periods of low load as much as possible.
When adjusting redundancy mechanisms, such as adding redundant hard disks or changing the parity policy, a balance between data integrity and storage efficiency needs to be considered. The system may need to enforce a dual-write or triple-write mechanism during data migration to maintain redundancy and data consistency, and must initialize and synchronize the newly added hard disks or the modified parity policy to ensure that all data is stored correctly.
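A minimal sketch of this configuration-change flow is given below, assuming a hypothetical array-management API; none of the object or method names come from the application.

```python
def apply_target_configuration(array, target: dict, rate_limit_mb_s: int = 500) -> None:
    """Sketch of the configuration-change flow described above.  The `array`
    object and all of its methods are hypothetical placeholders for a RAID
    controller or storage-management API."""
    if array.current_configuration() == target:
        return  # nothing to do: the current configuration already matches the target

    with array.write_throttle(rate_limit_mb_s):              # limit writes during migration
        if target["raid_level"] != array.raid_level():
            array.migrate_raid_level(target["raid_level"])   # e.g. RAID5 -> RAID6
        if target["stripe_kb"] != array.stripe_kb():
            array.restripe(target["stripe_kb"])              # background re-striping task
        array.rebuild_redundancy()                           # parity / mirror resynchronization

    # Verify afterwards that the new configuration actually took effect.
    assert array.current_configuration() == target
```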
Example 1:
Assume that in step S202 the system, through continuous performance data collection, observes a sudden increase in the write latency of the current hard disk array (RAID5), exceeding a threshold of 100 ms, and predicts, according to the predictive model, that this high-latency state will last for a period of time. Furthermore, the system recognizes that the current I/O pattern consists mainly of continuous large-scale data writes, with the proportion of sequential writes already exceeding 80%, and this trend is expected not to change for the next ten minutes.
In steps S204 through S206, the system performs in-depth data analysis, including assessment of disk health, consideration of the service level, and calculation of migration costs. Based on the above situation, the system decides to switch to a RAID0 configuration and increase the stripe size to 1 MB, which can significantly improve write performance, and judges that the current health state of the disks and the service continuity requirements can tolerate such a configuration change.
In step S208, the system confirms that the current configuration (RAID5) is inconsistent with the target configuration (RAID0 with a 1 MB stripe) and then starts the configuration migration process. The system migrates the hard disk array from RAID5 to RAID0 step by step while increasing the stripe size. During this process, the system monitors the performance impact on the service to ensure that the migration does not result in data inconsistencies or service disruption. For example, the system may rate-limit the migration during peak business hours to ensure the continuity and consistency of the data.
After the configuration migration is completed, the system continues to monitor the performance of the hard disk array under the new RAID configuration to ensure that the target configuration indeed improves performance and resolves the original problem. If the monitored data shows a significant decrease in write latency and no data inconsistency occurs during the next few hours, this indicates that the system has successfully optimized the hard disk array and achieved the desired gains in performance and stability. Conversely, if the monitoring results are not satisfactory, the system may need to readjust the policy, or even roll back to the previous configuration, to find a better solution.
The method comprises the steps of: obtaining reference performance data of a hard disk array; determining, among at least one candidate decision condition, the candidate decision condition satisfied by the reference performance data as the target decision condition, where a candidate decision condition indicates a range of performance data of the hard disk array and each of the at least one candidate decision condition corresponds to a candidate configuration of the hard disk array; determining the candidate configuration corresponding to the target decision condition as the target configuration; and, when the current configuration of the hard disk array is inconsistent with the target configuration, configuring the hard disk array to the target configuration. The performance data of the currently running hard disk array can thus be processed to obtain reference performance data, the candidate decision condition satisfied by the reference performance data can be determined, and the candidate configuration corresponding to that decision condition can be determined as the target configuration. The target configuration is the configuration to which the hard disk array is expected to be converted; therefore, when the current configuration of the hard disk array is inconsistent with the target configuration, the hard disk array is configured to the target configuration. In this way, a suitable configuration of the hard disk array can be determined and applied in real time, which solves the technical problem of low storage efficiency in conventional hard disk array configuration methods and achieves the technical effect of improving the storage efficiency of the storage system.
In an alternative embodiment, determining the candidate decision condition satisfied by the reference performance data among the at least one candidate decision condition as the target decision condition includes: determining the target decision condition to be the first candidate decision condition when the reference performance data indicates that the reliability index of the hard disk array does not meet the first index range, and determining the target decision condition to be the second candidate decision condition when the reference performance data indicates that the efficiency index of the hard disk array does not meet the second index range.
It should be noted that the first index range generally refers to a set of criteria or thresholds set for the reliability of the storage system. This may include disk error rates, failure prediction indicators, data redundancy levels, and the like, to ensure the data security and stability of the storage system. The second index range relates to the efficiency of the storage system and involves key indicators such as read-write speed, I/O operation performance, and data throughput, so as to optimize the working efficiency and response speed of the storage system.
In an alternative embodiment, based on the collected reference performance data, the system screens out, from a plurality of preset candidate strategies, the decision condition that matches the current performance characteristics and uses it as the strategy to execute next. When a decline in the reliability index is detected, for example when the disk error rate exceeds the preset first index range, the system selects the first candidate decision condition, which focuses on improving reliability, as the target decision condition. Conversely, if an efficiency indicator such as IOPS or throughput falls below the second index range, the system tends to select the second candidate decision condition, which focuses on improving efficiency.
The key to step S204 is to determine the target decision condition based on the reference performance data, thereby guiding the adjustment of the system configuration. The system can check whether the reliability index of the hard disk array is in an ideal state and monitor the health of the array through signals such as SMART data and disk error rates. If the index is found to fall below the first index range, the system infers that there may be a risk of data loss or service interruption, and at that point measures to enhance reliability, such as switching to a RAID6 configuration or increasing disk redundancy, should be taken with priority to ensure data security.
On the other hand, if the monitored efficiency indicator, such as IOPS, throughput, or latency, fails to reach the preset second indicator range, indicating that the response speed or data processing capacity of the storage system may not meet the traffic demand, the system will turn to the target decision condition for improving efficiency. This may include adjusting the RAID configuration to increase read and write speed, such as switching from RAID5 to RAID0, or adjusting stripe size to optimize data access patterns, particularly when the system recognizes predominantly contiguous large blocks of data read and write.
According to this embodiment of the application, the most suitable target decision condition is selected dynamically based on real-time performance data and preset policy conditions, and the configuration is adjusted accordingly. The system can thus dynamically adjust the RAID configuration, while preserving data integrity and system stability, to meet continuously changing business requirements and performance challenges, ensuring that the storage system remains efficient and stable.
In an alternative embodiment, in case the reference performance data indicates that the reliability index of the hard disk array does not meet the first index range, determining the target decision condition as the first candidate decision condition comprises at least one of:
1) The error rate of the hard disk array is greater than a preset error rate, where the error rate indicates the ratio of the number of read-write errors to the total number of read-write operations of the hard disk array per unit time;
2) The error count of the hard disk array is greater than a preset error count, where the error count indicates the total number of operational errors of the hard disk array within a period of time;
3) The number of hard disks in the hard disk array issuing warning information is greater than a preset number of hard disks, where the warning information indicates that a hard disk has a fault.
It should be noted that the preset error rate may be an error rate threshold defined in the system, used as a basis for determining whether the hard disk array needs to move to a more redundant RAID level (e.g., RAID6). The preset error count may be the maximum number of errors allowed per unit time, used to trigger the system to adjust the configuration of the hard disk array. The warning information may be a warning signal sent by a hard disk, typically surfaced in SMART data, indicating that the hard disk may have begun to fail and that attention or precautions are required. The preset number of hard disks may be a hard disk failure warning threshold set by the system; when this number is reached or exceeded, the system considers the reliability of the hard disk array to be compromised.
In an alternative embodiment, the system continuously monitors the performance data of the hard disk array; in particular, when the reliability index is observed in real time to fall below the preset first index range, the focus shifts to strategies that enhance data redundancy and improve fault tolerance. This typically means that the system needs to switch from the current RAID configuration (e.g., RAID5) to a more reliable RAID level (e.g., RAID6), or take other actions to address the possible failure risk of the hard disks, so as to ensure the security of the stored data and the continued operation of the system.
When the error rate of the hard disk array is found to be too high, the error count exceeds the safety limit, or too many hard disks show signs of failure, the system automatically selects the first candidate decision condition as the target decision condition, i.e., it prioritizes improving the redundancy and safety of the storage. This determination may trigger an adjustment of the RAID configuration, such as switching from a low-redundancy RAID level to a high-redundancy RAID level, or initiating a replacement mechanism for the failing hard disk. Through real-time analysis and dynamic adjustment of the strategy, the system can find the best balance between data security and storage efficiency, effectively addressing potential storage faults, reducing the risk of data loss, and ensuring the continuity of service and the integrity of data.
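A compact sketch of checking these three reliability conditions is given below; the default thresholds are placeholders, not values fixed by the application.

```python
def meets_first_candidate_condition(stats: dict,
                                    max_error_rate: float = 0.01,
                                    max_error_count: int = 30,
                                    max_warning_disks: int = 1) -> bool:
    """Return True when the reliability index falls outside the first index
    range, i.e. when any of the three reliability conditions above holds."""
    return (stats["error_rate"] > max_error_rate            # condition 1: error rate too high
            or stats["error_count"] > max_error_count       # condition 2: too many errors
            or stats["warning_disks"] > max_warning_disks)  # condition 3: too many warning disks

print(meets_first_candidate_condition(
    {"error_rate": 0.002, "error_count": 45, "warning_disks": 0}))  # True (condition 2)
```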
In an alternative embodiment, in case the reference performance data indicates that the efficiency index of the hard disk array does not meet the second index range, determining the target decision condition as the second candidate decision condition comprises at least one of:
1) The proportion of sequential read-write operations of the hard disk array is greater than a preset proportion, where a sequential read-write instruction of the hard disk array reads or writes data continuously;
2) The delay time of the hard disk array is larger than the preset delay time, wherein the delay time is used for indicating the time of finishing one write operation of the hard disk array.
It should be noted that the preset proportion is used to judge whether sequential read-write operations are currently dominant. If the current proportion of sequential read-write operations is higher than this value, it may imply that the storage system should adjust its configuration to better support such operations. The delay time refers to the time required from the issuance of a data access request to the return of its response, including but not limited to data retrieval, processing, and return to the requester. A high delay may indicate that the storage system has a performance bottleneck in handling the current workload. The preset delay time is a standard or threshold set by the system for judging whether the current delay time is within an acceptable range. If the actual delay exceeds the preset value, the system may need to take steps to increase storage efficiency or reduce latency, such as adjusting the RAID configuration.
In alternative embodiments, if an efficiency indicator of the storage system, such as latency of I/O operations or performance of sequential read and write operations, is below an expected standard or threshold, the system may need to take action to improve efficiency.
When the system observes that sequential read and write operations account for a large portion of the total operation, if this ratio exceeds a preset threshold, the system may tend to optimize configuration to support continuous data access. For example, this may include adjusting to RAID0 configurations or increasing stripe sizes, as these configurations are generally advantageous for improving the efficiency of sequential read and write operations.
If the average time to complete a write operation exceeds the delay threshold set by the system, the system may also consider the efficiency indicator to be substandard. In such a case, the system may need to be reconfigured to reduce latency, such as switching to a more efficient but possibly less redundant RAID configuration, or to relieve the burden on the storage nodes by optimizing the data distribution policy.
According to the embodiment of the application, the storage strategy can be flexibly adjusted according to the real-time efficiency index, the reliability index and the preset threshold value so as to adapt to different business scenes and performance requirements and ensure the smoothness of the service and the optimization of user experience.
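Putting the two index checks together, a sketch of how the system might pick between the first and second candidate decision conditions is shown below (the reliability result could come from a check like the one in the earlier sketch; all thresholds are illustrative assumptions).

```python
def choose_target_decision_condition(reliability_ok: bool,
                                     sequential_ratio: float,
                                     write_latency_ms: float,
                                     min_sequential_ratio: float = 0.8,
                                     max_latency_ms: float = 100.0) -> str:
    """Pick between the reliability-oriented (first) and efficiency-oriented
    (second) candidate decision conditions."""
    if not reliability_ok:                           # reliability index outside the first index range
        return "first_candidate"                     # prioritise redundancy and data safety
    if (sequential_ratio > min_sequential_ratio      # efficiency condition 1: sequential-heavy workload
            or write_latency_ms > max_latency_ms):   # efficiency condition 2: writes too slow
        return "second_candidate"                    # prioritise throughput and latency
    return "keep_current_configuration"

print(choose_target_decision_condition(True, sequential_ratio=0.85, write_latency_ms=12.0))
# -> second_candidate
```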
In an alternative embodiment, after determining that the target decision condition is a first candidate decision condition, the method includes configuring the hard disk array to be a first target configuration, wherein the first target configuration is a configuration corresponding to the first candidate decision condition;
After configuring the hard disk array to the first target configuration, one of:
1) Transmitting target data to a first hard disk in a hard disk array, and establishing a mirror image of the target data on a second hard disk in the hard disk array;
2) Sending the target data to a third hard disk in the hard disk array, and sending verification information of the target data to the third hard disk.
It should be noted that the target data may be a specific data set that needs to be stored or processed in the RAID array, which is a main object of system management and optimization.
In a RAID configuration, mirroring refers to a copy of data that is fully replicated between two or more disks. Mirroring is the basis of RAID1 (a mirrored array) and is used for data redundancy, fast failure recovery, and read acceleration. The verification information consists of check codes or metadata used to ensure data integrity and consistency. In RAID configurations, particularly RAID5 or RAID6, the verification information is used to detect or repair data errors, ensuring the correctness of the data.
In an alternative embodiment, if the new configuration is a mirrored one, such as RAID1, the system will ensure that data is written to both hard disks at the same time to guarantee data consistency and redundancy. If the configuration involves parity, such as RAID5 or RAID6, the system will calculate and store the verification information based on the locations of the data blocks and the parity policy, to prevent corruption or loss of data and to enable quick recovery of data when needed.
Upon identifying that the hard disk array is facing a serious reliability challenge, the system converts the hard disk array to a first target configuration, which may include switching to RAID1 (mirror array) or RAID6 (stripe array with double parity) to accommodate the current high risk environment. After configuration conversion, the system begins to implement specific redundancy and protection policies. In the case of a mirrored array (e.g., RAID 1), the system will ensure that data is written to at least two copies, a first hard disk and a second hard disk, simultaneously, so that even if one of the hard disks fails, the data on the other hard disk is still available and traffic is not affected. For RAID configurations using parity (e.g., RAID5 or RAID 6), the system may store the parity information for the data on a separate hard disk (the third hard disk), so that even if one or more of the hard disks in the array fail, the system can recover the lost data by the parity information, avoiding data loss or service interruption.
FIG. 3 is a schematic diagram of a first target configuration of an alternative hard disk array according to an embodiment of the application. In the architecture shown in FIG. 3, when target data is received, the data is written simultaneously to two or more disks (e.g., the first hard disk and the second hard disk shown in FIG. 3), each of which holds a complete copy of the data.
This architecture provides data redundancy: if one disk fails, the data can be read from the other disk, enabling immediate recovery. However, disk space utilization is low, because the data is completely replicated, requiring at least twice the storage space of the original data. It is suitable for applications with extremely high requirements on data security and frequent read operations, such as financial transaction systems and critical data backup.
FIG. 4 is a schematic diagram of a first target configuration of another alternative hard disk array according to an embodiment of the present application. As shown in FIG. 4, the target data (data 1 to data 15) may be stored across five third hard disks, each of which may store corresponding data and verification information. For example, data 1 to data 3 may be sub-data of one data block, and verification information 1 and verification information 2 may be two pieces of verification information generated based on data 1 to data 3.
This architecture uses two sets of verification information, allowing data recovery even when two disks fail at the same time. It provides higher data redundancy and fault tolerance, but disk space utilization is further reduced. It is suitable for applications with extremely high requirements on data security, such as disaster recovery systems and high-availability server clusters.
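To make the verification information concrete, the sketch below computes a single XOR parity block, the basic operation behind RAID5-style distributed parity (RAID6 additionally uses a second, independent parity, which this toy example does not show).

```python
def xor_parity(blocks: list) -> bytes:
    """Compute a parity block as the bytewise XOR of equal-sized data blocks."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

# Three data blocks (data 1..3 in FIG. 4 terms) and their parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]
parity = xor_parity(data)

# If one block is lost, XOR-ing the parity with the remaining blocks rebuilds it.
recovered = xor_parity([parity, data[0], data[2]])
assert recovered == data[1]
```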
In an alternative embodiment, after determining that the target decision condition is a second candidate decision condition, the method includes configuring the hard disk array to be in a second target configuration, wherein the second target configuration is a configuration corresponding to the second candidate decision condition;
after configuring the hard disk array to the second target configuration, one of:
1) Dividing the target data into a plurality of sub-data according to a preset stripe, and transmitting the sub-data one by one to a plurality of hard disks contained in the hard disk array;
2) Dividing the target data into a plurality of sub-data according to a preset stripe, transmitting the sub-data one by one to a fifth hard disk contained in the hard disk array, and establishing a mirror of the sub-data on a sixth hard disk in the hard disk array.
It should be noted that the preset stripe may be a predefined stripe size that determines how data is divided and distributed across the disks in the hard disk array. Larger preset stripes are suitable for continuous large-block data reads and writes, while smaller stripes suit high-frequency random access. The sub-data are the smaller data units obtained by dividing the data according to the preset stripe, and they are distributed to different disks of the hard disk array for storage.
After determining the second candidate decision condition as the target decision condition, the system adjusts the configuration of the hard disk array to a second target configuration that matches this decision to optimize storage efficiency or response performance challenges. Once the configuration is adjusted in place, the system reorganizes the manner in which the data is stored based on new configuration parameters, such as stripe size, ensuring that the data can be read and written more efficiently.
In an alternative embodiment, after identifying that the efficiency index (e.g. read/write speed, I/O delay) is lower than the preset second index range, the target decision condition is determined to be a second candidate decision condition. This typically means that the system needs to take measures to increase storage efficiency, such as switching to a RAID0 configuration, to take advantage of the parallel read and write benefits to reduce data access latency. After the second target configuration is determined, the system will perform configuration changes, including adjusting stripe sizes, changing data distribution rules, etc., to accommodate the new RAID level.
After configuration adjustment is completed, the system needs to reorganize the storage mode of the data, so that the data can be effectively divided and distributed on each disk in the array according to the preset stripe size. This step is critical to maximize read-write performance, because reasonably distributing data can reduce single point bottlenecks, fully utilizing the read-write capabilities of all disks. If the system selects a RAID0 configuration, this typically means that the data will be split into multiple sub-data which will then be sent in parallel to multiple disks in the array to enable fast reading and writing of the data.
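The splitting and round-robin distribution of sub-data can be sketched as follows; the stripe size, disk count, and sample payload are arbitrary illustrative choices.

```python
def stripe(data: bytes, stripe_size: int, num_disks: int) -> list:
    """Split `data` into stripe-sized chunks and distribute them round-robin
    across `num_disks` disks (RAID0-style striping, no redundancy)."""
    disks = [[] for _ in range(num_disks)]
    chunks = [data[i:i + stripe_size] for i in range(0, len(data), stripe_size)]
    for index, chunk in enumerate(chunks):
        disks[index % num_disks].append(chunk)
    return disks

# 6 chunks of 4 bytes spread over 3 disks: disk 0 gets chunks 0 and 3, and so on.
layout = stripe(b"AAAABBBBCCCCDDDDEEEEFFFF", stripe_size=4, num_disks=3)
print(layout)
# [[b'AAAA', b'DDDD'], [b'BBBB', b'EEEE'], [b'CCCC', b'FFFF']]
```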
FIG. 5 is a schematic diagram of a second target configuration of an alternative hard disk array according to an embodiment of the application. As shown in FIG. 5, this architecture may use a data chunking technique in which data is split into chunks of the same size and written to at least two disks simultaneously; the target data may be divided into data 1 to data 6 and stored one by one on the respective hard disks.
This architecture provides the fastest read-write speed, because the data is distributed over multiple disks and read-write operations can be performed in parallel. However, failure of any single disk causes failure of the entire RAID array, so the risk of data loss is high. It is suitable for applications with high performance requirements and low data security requirements, such as temporary data processing and caching.
FIG. 6 is a schematic diagram of a second target configuration of another alternative hard disk array according to an embodiment of the present application. As shown in FIG. 6, this architecture combines mirroring and striping: the target data (data 1 to data 6) are respectively stored on a fifth hard disk, and the mirrored copies are stored on a corresponding sixth hard disk.
The architecture provides high performance read and write operations as well as data redundancy, because mirroring provides data protection, while striping increases data access speed. Disk space utilization is not high because mirroring requires additional storage space. The method is suitable for applications with high requirements on performance and redundancy, such as high-end servers, database storage, virtualization platforms and the like.
According to the embodiment of the application, the storage configuration can be dynamically adjusted according to the actual performance requirements and service scenes, so that the efficient and safe data storage and access can be achieved. The capability is particularly important for environments where a large number of concurrent read-write operations, high-frequency data access or potential hardware fault risks need to be processed, the overall performance and reliability of the storage system can be remarkably improved, and the disaster resistance capability of the system is improved.
Example 2:
the candidate decision condition may be preset, and the corresponding candidate configuration is also predetermined, specifically, the relationship between the decision condition and the configuration may be as follows:
1. The target decision condition is when the error frequency of the hard disk exceeds 5 times per minute, or the number of errors of an I/O (input/output) operation exceeds 30 times per second.
The target configuration is that the system automatically initiates RAID6 level migration to increase redundancy and fault tolerance. In order to prevent the business from being influenced in the data migration process, the system limits the speed to 500MB/s, and ensures that the migration is performed stably.
2. The target decision condition is that the system recognizes that the ratio of sequential read and write operations (e.g., sequential file reads) to total I/O operations has exceeded 80% within a 300-second window.
The target configuration is to switch the hard disk array to RAID0 level and increase the stripe size to 1MB to maximize the performance of continuous data access and improve the read-write speed.
3. The target decision condition is when the system detects that the utilization of the storage space is continuously lower than 40% and this state continues for more than 1 hour.
The target configuration is that the system can start the integration of RAID arrays, reduce the number of RAID groups and perform defragmentation at the same time so as to improve the use efficiency of storage space and clear useless or scattered data.
It should be noted that RAID group integration refers to merging existing multiple RAID groups into one larger RAID group to improve storage efficiency and resource utilization. When a plurality of RAID groups exist, but the storage space is not fully utilized due to the independent configuration, the unused space can be reduced by integration, and the overall storage utilization rate is improved. The presence of multiple RAID groups increases management difficulties, particularly when performing data backup, recovery, and troubleshooting. Integrating RAID groups can simplify the storage architecture and reduce the operation and maintenance cost.
In alternative embodiments, integrating RAID groups may involve analyzing the load, data distribution, and health of the existing RAID groups, and designing a new integration scheme according to the service demand and the predicted load. Data migration is then performed to reallocate the data dispersed across different RAID groups into a new, larger RAID group, which typically requires additional storage space and processing time. The configuration of the RAID controller or software RAID manager is updated to ensure that the data is distributed according to the new structure. The integration process is monitored to ensure data consistency and to handle possible data anomalies or performance fluctuations in a timely manner.
It should be noted that defragmentation on hard disk is a maintenance operation aimed at optimizing data storage layout, reducing seek time in reading and writing, and thus improving performance. In hard disk arrays, random write operations may result in fragmentation of data due to the wide distribution of data, i.e., the data blocks are stored in separate locations on the hard disk rather than contiguously. Excessive fragmentation can reduce read-write efficiency because each I/O operation may require access to multiple discrete locations, increasing seek time and I/O latency.
4. The target decision condition is that when the latency of writing data to the hard disk array exceeds 100 milliseconds, this typically indicates that the system is faced with a complex I/O operation or bottleneck.
The target configuration is that the system enables a strategy for coordinating RAID0 and RAID1, namely, on one hand, the read-write speed is improved through RAID0, and on the other hand, data redundancy and quick recovery are provided by utilizing RAID1 so as to cope with performance challenges in a hybrid read-write mode.
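For readability, the four decision-condition/configuration pairs of Example 2 can be written down as a small rule table; the field names and structure below are assumptions introduced only for illustration.

```python
# Illustrative encoding of the decision-condition / configuration pairs of Example 2.
EXAMPLE_2_RULES = [
    {"condition": "disk errors > 5 per minute or I/O errors > 30 per second",
     "configuration": {"raid_level": "RAID6", "migration_rate_mb_s": 500}},
    {"condition": "sequential operations > 80% of I/O within a 300 s window",
     "configuration": {"raid_level": "RAID0", "stripe_kb": 1024}},
    {"condition": "space utilisation < 40% for more than 1 hour",
     "configuration": {"action": "merge RAID groups and defragment"}},
    {"condition": "write latency > 100 ms",
     # coordinated use of RAID0 (speed) and RAID1 (redundancy), as described above
     "configuration": {"strategy": "RAID0 + RAID1 coordination"}},
]

for rule in EXAMPLE_2_RULES:
    print(rule["condition"], "->", rule["configuration"])
```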
Example 3:
The candidate decision condition may also be the description of a scenario; for example, the preset scenario in which the hard disk array is located may be determined from the reference performance data:
1. A burst random write scenario;
The target configuration is that, because random write operations place high demands on dispersed disk reads and writes, a RAID10 configuration is selected with a smaller stripe size, so that data can be quickly and evenly distributed across multiple disks, improving random write speed and efficiency.
2. A long-duration sequential read scenario;
The target configuration is that, in a scenario requiring a large number of continuous read operations over a long period, the system switches to RAID0 and increases the stripe size, which facilitates high-speed reading of continuous data, reduces disk seek time, and improves read performance.
3. A scenario where the bandwidth occupancy is around 30%;
The target configuration is that, when the bandwidth occupancy of the storage system is observed to be at a low level (30%) and concurrent write operations exist, the system considers reducing the number of RAID groups to improve data processing efficiency and bandwidth utilization, and attempts to change the write mode from concurrent to sequential so as to reduce disk seek operations and optimize write performance.
4. A multi-disk early-warning scenario;
The target configuration is that, when the system receives early warning signals from multiple hard disks indicating potential failure risks, it immediately switches to a RAID6 configuration and starts the array rebuilding process in order to ensure data security and system availability. RAID6 provides a higher level of data redundancy: even if two hard disks fail at the same time, no data is lost, which offers greater protection in a multi-disk warning situation.
In an alternative embodiment, configuring the hard disk array into a target configuration comprises calculating a target priority of the target configuration based on a current configuration of the hard disk array, determining a target configuration time according to the target priority, and configuring the hard disk array into the target configuration within the target configuration time if the current configuration of the hard disk array is inconsistent with the target configuration.
It should be noted that the target priority may be a result of quantitative evaluation of importance and urgency of the configuration transition by the system according to various factors. These factors may include data redundancy requirements, current I/O operation types, business continuity requirements, resource consumption estimates, and the like. The target configuration time may be a window of time that the system plans to complete the transition of the hard disk array from the current configuration to the target configuration. The length of this time window needs to be carefully planned to ensure that configuration changes have minimal impact on traffic, and to reach new configurations as quickly as possible to improve performance or enhance data protection.
In an alternative embodiment, the system first analyzes the current status of the hard disk array, including the current RAID level, stripe size, read-write load pattern, etc., and then calculates the importance ranking of the target configuration based on these information and traffic requirements. This step ensures the rationality of configuration transition and avoids unnecessary adjustment or resource waste. Based on the priority calculation result, it is determined when to perform the configuration change. This may be done immediately in an emergency situation or may be scheduled during low traffic peak hours to reduce the impact on the running service.
In an alternative embodiment, determining the target configuration time according to the target priority may be done using preset weights. If the target priority is greater than a first weight (e.g., 0.9), the urgency of the hard disk array configuration may be considered "fatal", with high performance impact and high reliability impact. If the target priority is greater than a second weight (e.g., 0.7) and less than or equal to the first weight, the urgency may be considered "severe", with lower performance impact but high reliability impact. If the target priority is greater than a third weight (e.g., 0.4) and less than or equal to the second weight, the urgency may be considered "normal", with low performance impact and low reliability impact. If the target priority is greater than a fourth weight (e.g., 0.2) and less than or equal to the third weight, the urgency may be considered a "hint", with low performance impact and low reliability impact. If the target priority is less than or equal to the fourth weight, the configuration task may be deferred.
Regarding the urgency levels: "fatal" indicates that the event's impact on performance and reliability is very severe, and countermeasures must be executed immediately to prevent data loss or system crashes. "Severe" indicates that the event has less impact on performance but still a significant impact on reliability; action should be taken within 15 minutes to prevent potentially serious consequences. "Normal" indicates that the event has a low impact on both performance and reliability, but should be handled within 1 hour for long-term system health. "Hint" indicates that the influence of the event is slight; it serves mainly as a reminder and can be processed within 24 hours depending on system resources.
The corresponding time may be determined by determining the degree of urgency, and the target configuration time may be preset. For example, when the level of urgency reaches "deadly", the system should immediately perform a recovery action whenever a problem is detected, which is typically within seconds to minutes. For events of the "severity" level, the system should perform the corresponding process flow within 15 minutes to reduce risk, as it may pose a greater threat to data integrity. The task labeled "general", while not significantly affecting current operation, should be handled within 1 hour as a preventative maintenance to keep the system stable for long periods of time. For events at the "hint" level, which are typically slight changes to system state, or low priority demands on resources, the processing can be scheduled within 24 hours without affecting the high priority tasks, which provides the system with sufficient flexibility. The configuration time (immediately executing, 15 minutes, 1 hour and 24 hours) can be set in advance according to the requirements, and tasks with different degrees can be distinguished.
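As a minimal sketch (assuming the example thresholds of 0.9, 0.7, 0.4 and 0.2 and the time windows described above; all names are illustrative rather than part of the claimed method), the mapping from target priority to urgency level and configuration time window could be written as follows:

from datetime import timedelta

# Example thresholds (first to fourth weights); in practice these would be
# produced by the adaptive curves described below.
FIRST_WEIGHT, SECOND_WEIGHT, THIRD_WEIGHT, FOURTH_WEIGHT = 0.9, 0.7, 0.4, 0.2

def classify_urgency(target_priority):
    """Map a target priority to (urgency level, configuration time window)."""
    if target_priority > FIRST_WEIGHT:
        return "fatal", timedelta(seconds=0)      # execute immediately
    if target_priority > SECOND_WEIGHT:
        return "severe", timedelta(minutes=15)    # act within 15 minutes
    if target_priority > THIRD_WEIGHT:
        return "normal", timedelta(hours=1)       # act within 1 hour
    if target_priority > FOURTH_WEIGHT:
        return "hint", timedelta(hours=24)        # act within 24 hours
    return "deferred", None                       # do not perform the task for now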
In an alternative embodiment, the first to fourth weights may be determined according to performance data, including but not limited to key indicators such as disk error rate, data access frequency, service level, migration cost, and the like. Through the historical data, a set of self-adaptive curves can be constructed, and the set of curves can be automatically adjusted according to actual running conditions, so that the target priority threshold (first weight, second weight, third weight and fourth weight) at the current moment is determined.
In an alternative embodiment, the system continuously monitors the performance and health status of the hard disk array, and collects real-time data for various metrics, such as disk error rate, type and frequency of I/O operations, storage utilization, service demand level, etc. This portion of the data is periodically aggregated to form a historical dataset. Based on the historical dataset, the system uses data analysis methods (e.g., time series analysis, regression analysis) to identify data trends and patterns. For example, with the ARIMA model, the system can predict the trend of disk error rates over a period of time in the future, which helps to prevent possible hardware failures in advance. Likewise, models regarding data access frequency and other performance metrics may also be constructed to predict their dynamic changes.
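As an illustrative sketch only, assuming the ARIMA implementation from the statsmodels library and a hypothetical series of daily disk error rates (the model order (1, 1, 1) and the threshold are examples, not values prescribed by this embodiment), the trend prediction step could look like this:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical history: daily disk error rate (%) aggregated from monitoring data.
error_rate_history = np.array([0.010, 0.011, 0.012, 0.014, 0.015,
                               0.018, 0.021, 0.025, 0.030, 0.036])

# Fit a simple ARIMA model and forecast the error rate for the next 7 days.
fitted = ARIMA(error_rate_history, order=(1, 1, 1)).fit()
forecast = fitted.forecast(steps=7)

# If any forecast value crosses the preset threshold, raise an early warning
# so that a possible hardware failure can be prevented in advance.
THRESHOLD = 0.02
if (forecast > THRESHOLD).any():
    print("Early warning: disk error rate expected to exceed", THRESHOLD)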
In an alternative embodiment, calculating the target priority of the target configuration based on the current configuration of the hard disk array comprises determining a first parameter according to the current configuration of the hard disk array, wherein the first parameter is used for indicating the level of errors occurring in the current configuration, determining the importance of data currently processed by the hard disk array as a second parameter, predicting the bandwidth occupancy rate used by configuring the target configuration and determining the negative of the bandwidth occupancy rate as a third parameter, and weighting and summing the first parameter, the second parameter, the third parameter, and the data access frequency of the hard disk array to obtain the target priority.
It should be noted that the first parameter indicates the severity of a problem or error that may exist in the current configuration. This parameter directly reflects the health level and potential risk of the current state of the system, and is an important basis for assessing the necessity of configuration adjustment. The second parameter may represent the importance of the data, measuring how critical the data stored in the hard disk array is to the operation of the service or system. In determining the configuration priority, high-importance data will encourage the system to take more action to protect the integrity and availability of the data. The third parameter may represent the predicted bandwidth occupancy of configuring the target configuration, typically expressed as a negative number so that bandwidth savings contribute positively to the weighted sum. The evaluation of bandwidth occupancy helps the system avoid configuration changes that would add load when resources are scarce.
For example: target priority = α × first parameter (range 0-1) + β × data access frequency + γ × second parameter (e.g., level 1-5) - δ × migration cost (bandwidth occupancy).
Based on the current configuration of the hard disk array, the system decides the order and importance of configuration adjustments by calculating the target priority of the target configuration. This process involves determining three key parameters: the error level of the current configuration, the importance of the data being processed, and the bandwidth occupancy that the target configuration is predicted to use. Finally, by weighting and summing these parameters together with the data access frequency, the system derives a target priority, which helps the system intelligently select and execute the optimal configuration adjustment strategy.
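A minimal sketch of this weighted sum, with illustrative weight values for α, β, γ and δ and a normalized access frequency (both are assumptions for illustration; the actual weights would be tuned to the deployment), could look as follows:

def target_priority(error_level, access_frequency, service_level, bandwidth_occupancy):
    """Weighted sum of the three parameters and the data access frequency.

    error_level:          first parameter, range 0-1
    access_frequency:     data access frequency normalized to 0-1
    service_level:        second parameter, e.g. level 1-5
    bandwidth_occupancy:  predicted migration cost, range 0-1 (subtracted so
                          that bandwidth savings raise the priority)
    """
    alpha, beta, gamma, delta = 0.4, 0.2, 0.1, 0.3   # illustrative weights
    return (alpha * error_level
            + beta * access_frequency
            + gamma * service_level
            - delta * bandwidth_occupancy)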
In an alternative embodiment, the first parameter may indicate the fault risk level of the current configuration. The fault risk level of a given configuration may likewise be determined in advance, and the correspondence between configurations and fault risk levels may be stored in a lookup table or the like.
In an alternative embodiment, the second parameter (service level) may indicate the importance of the service served by the data currently handled by the hard disk array. The second parameter may be graded based on the importance of the business within the project. Critical services, such as online transaction systems, customer relationship management (CRM) systems, and data repositories, are typically assigned a higher service level because they are directly tied to customer quality of service and decision support. The level may also be determined based on the performance requirements of the business, such as the required IOPS, bandwidth, latency, and throughput. For example, real-time data analysis and video streaming services require low latency and high bandwidth, and should therefore have a higher service level than simple data storage or backup.
In alternative embodiments, the system may communicate with business departments to learn the basic requirements of each business, including performance, reliability, cost, and legal compliance. These requirements are translated into specific quantitative metrics such as the required IOPS, throughput, latency, redundancy level, and failure recovery time. Based on the collected information, the businesses are classified and rated. For example, business levels may be defined from 1 to 5, where 5 represents the highest priority and 1 the lowest. The rating should take into account the immediate economic benefit, long-term value, potential risk, and legal compliance requirements of the business.
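A minimal sketch of translating such quantified requirements into a level from 1 to 5 (the cut-off values and the flag for business-critical services are illustrative assumptions):

def service_level(required_iops, max_latency_ms, business_critical):
    """Grade a business from 1 (lowest priority) to 5 (highest priority)."""
    level = 1
    if required_iops > 50_000 or max_latency_ms < 5:
        level = 4
    elif required_iops > 10_000 or max_latency_ms < 20:
        level = 3
    elif required_iops > 1_000:
        level = 2
    if business_critical:                 # e.g. online transactions, CRM
        level = min(5, level + 1)
    return level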
Example 4:
Assume that in a financial transaction system the hard disk array is operating in a RAID5 configuration, but recent performance monitoring finds that the disk error rate has gradually risen to 0.05%, above the preset error rate threshold of 0.02%. This indicates that the stability and reliability of the current configuration have degraded, so the determined first parameter is high, indicating that the existing RAID configuration carries a higher risk of failure.
At the same time, considering that the transaction data stored in the system is critical to the operation of the service, the importance of the data is determined as a second parameter, and this parameter is set to a higher value, which means that the integrity and availability of the data should be protected preferentially by the system when deciding.
The system predicts that if the configuration is migrated from RAID5 to RAID6 to increase data redundancy, the predicted bandwidth occupancy is 65%, so the negative of the predicted bandwidth occupancy is weighted as a third parameter, reflecting the positive contribution of bandwidth savings to the configuration priority calculation.
The first parameter, the second parameter, the third parameter, and the data access frequency (assuming an average of 100,000 I/O operations per day) are weighted and summed, and the system obtains a target priority which indicates that the adjustment of the RAID configuration to RAID6 should be performed preferentially. The system then switches the configuration of the hard disk array from RAID5 to RAID6 and, through dynamic striping and data migration strategies, minimizes the impact on existing traffic while significantly enhancing the redundancy and reliability of the data.
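Using the sketch given earlier with figures in the spirit of Example 4 (the mapping of the 0.05% error rate to an error level of 0.8 and the normalization of 100,000 daily I/O operations to 0.7 are assumptions for illustration):

priority = target_priority(
    error_level=0.8,            # 0.05% error rate, well above the 0.02% threshold
    access_frequency=0.7,       # 100,000 I/O operations per day, normalized
    service_level=5,            # critical financial transaction data
    bandwidth_occupancy=0.65,   # predicted cost of migrating from RAID5 to RAID6
)
# priority = 0.4*0.8 + 0.2*0.7 + 0.1*5 - 0.3*0.65 = 0.765, i.e. "severe",
# so the switch to RAID6 would be scheduled within the 15-minute window.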
According to the embodiment of the application, the efficient and accurate management of the storage tasks can be realized, and the most important requirements can be preferentially responded when the system processes various events with different properties.
When a disk failure or other event that may immediately affect data integrity and system stability is detected, the system marks it as a "fatal" degree of urgency. This means that the system will immediately perform predefined countermeasures such as data migration, fault isolation or redundancy reconstruction to ensure data security and traffic continuity. This immediate response mechanism can significantly reduce the risk of data loss and speed up the time for failure recovery.
For events with an emergency level marked as "severe", such as a dramatic drop in performance or a deterioration in disk health, the system requires action within 15 minutes. The method ensures that resource scheduling, such as RAID level switching, stripe size adjustment or backup strategy starting, can be timely performed under the condition of serious performance bottleneck or impending failure, so that performance problems are rapidly relieved, and the risk of system downtime or data recovery delay is reduced.
For tasks labeled "normal", such as storage utilization optimization, stripe size adjustment, or disk health monitoring, the system will execute within 1 hour. These tasks typically have less impact on current traffic but are critical to long-term system efficiency and data security. By executing such tasks within a reasonable time range, the storage configuration can be continuously optimized without affecting the normal operation of the service, keeping the system in an optimal state.
Tasks at the "hint" level, such as periodic maintenance, performance fine-tuning, or data sorting, may be completed within 24 hours. These tasks facilitate long-term maintenance and performance optimization of the system but do not require immediate execution. Giving such tasks a relatively loose time window allows the system to execute them automatically during low-load periods, avoiding interference with real-time services and ensuring the continuity and effectiveness of system maintenance.
Through the mechanism, dynamic optimization and fault prevention of storage resources can be realized, and meanwhile, service continuity and high performance are maintained. The efficient resource scheduling and task priority management can remarkably improve the overall efficiency of the storage system, reduce manual intervention of operation and maintenance personnel, reduce operation cost and enhance the reliability and safety of data.
In an alternative embodiment, configuring the hard disk array as a target configuration comprises the steps of acquiring current performance data of the hard disk array, and isolating the target configuration under the condition that the current performance data meets a prompt condition;
in the case that the current performance data meets the prompt condition, isolating the target configuration includes at least one of:
1) Under the condition that the time of the configuration progress deviation larger than the first deviation meets a preset period, isolating target configuration;
2) Under the condition that the verification failure times of the hard disk array are larger than the preset times, isolating target configuration;
3) Under the condition that the delay of the hard disk array is larger than a preset delay, isolating the target configuration;
4) And under the condition that the bandwidth utilization rate of the hard disk array does not meet the preset range, isolating the target configuration.
It should be noted that the prompt condition may be a set of predefined performance index thresholds; when the actual performance data of the hard disk array meets or exceeds them, the system will trigger a particular alarm or action. These conditions are the basis on which the system decides whether emergency measures, such as isolating the target configuration, need to be taken to protect the data or restore performance.
In an alternative embodiment, the system first obtains real-time performance data, such as the disk error rate, read-write speed, delay time, and number of verification errors, and then compares this performance data with the preset prompt conditions. If the current performance data is found to meet or exceed a prompt condition, indicating that the system faces a potential performance risk or failure, the system will take steps to isolate the target configuration to prevent the business impact of the configuration adjustment from expanding.
In an alternative embodiment, a configuration progress deviation (migration progress deviation) of less than 5% is considered a normal state, indicating that the process of migrating data from one RAID level to another is proceeding as expected, without significant delay. If the migration progress deviation exceeds 10% (first deviation) in three consecutive monitoring periods (preset period), this may mean resource contention, hardware performance degradation or software level problems, requiring immediate intervention for checking and adjustment to avoid data migration failure or unnecessary traffic impact.
In an alternative embodiment, a verification failure rate of less than 0.001% indicates that the data verification mechanism in the RAID group is operating well, and data consistency and integrity are effectively ensured. If the number of failures of a single verification task exceeds 3, this is typically indicative of a potential problem with the storage medium, such as a disk beginning to have a bad track or RAID controller failure. Under the condition, the system should automatically start fault detection and data recovery processes, prevent further data damage and ensure data safety.
In an alternative embodiment, normal business delay fluctuation is within 20% of the baseline, where the baseline is the average delay of the system in the absence of anomalies. This means that the system should be able to maintain a relatively stable performance level even during resource tuning or optimization. Once traffic latency increases by more than 50%, this may be due to performance bottlenecks caused by data migration, RAID reconstruction, or disk health issues. At this point, measures should be taken immediately, such as halting data migration, resizing the stripes, or adding redundancy, to reduce latency and avoid serious degradation of the business experience.
In alternative embodiments, the desired bandwidth utilization may range from 70% to 90%; utilization that is either too high or too low is unfavorable to stable operation of the system and efficient utilization of resources. If the bandwidth utilization remains above 95%, the system may be in an overload condition where data transmission or processing capacity has reached its limit, which may affect storage performance and other services that rely on storage resources. Conversely, if the utilization is below 50%, resources may be wasted, indicating that the storage capacity is not fully utilized. In both cases, the policy control layer should dynamically adjust the striping policy, RAID level, or load allocation to achieve better resource balancing and utilization efficiency.
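As a minimal sketch of the isolation checks described above, using the example thresholds from this embodiment (10% progress deviation over three consecutive monitoring periods, more than 3 verification failures, a 50% latency increase, and a 70% to 90% bandwidth utilization range), the decision could be written as follows; the data structure and names are illustrative:

from dataclasses import dataclass

@dataclass
class PerformanceSample:
    progress_deviation: float      # migration progress deviation, 0.12 means 12%
    deviation_periods: int         # consecutive monitoring periods above the limit
    verify_failures: int           # failures of a single verification task
    latency_increase: float        # latency increase over baseline, 0.5 means +50%
    bandwidth_utilization: float   # 0.0 - 1.0

def should_isolate(sample):
    """Return True if the target configuration should be isolated."""
    if sample.progress_deviation > 0.10 and sample.deviation_periods >= 3:
        return True                          # condition 1: progress deviation
    if sample.verify_failures > 3:
        return True                          # condition 2: verification failures
    if sample.latency_increase > 0.50:
        return True                          # condition 3: delay threshold
    if not 0.70 <= sample.bandwidth_utilization <= 0.90:
        return True                          # condition 4: bandwidth out of range
    return False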
According to the embodiment of the application, through a fine-grained monitoring and early-warning mechanism combined with an intelligent dynamic strategy, optimization adjustments are carried out automatically, so that abnormal performance can be responded to immediately while a RAID configuration adjustment is in progress, avoiding negative impact on the service. Under abnormal conditions, the target configuration is rapidly isolated and configuration changes under unstable conditions are avoided, which guarantees data safety, stable system performance, and maximized resource utilization, thereby providing strong storage support and service guarantees for various business scenarios.
In an alternative embodiment, during the process of configuring the target configuration of the hard disk array, 5% of bandwidth can be reserved as an emergency dedicated channel to cope with emergency events, such as a large amount of data migration, sudden read-write peaks or data recovery operations. This reserved bandwidth ensures that in emergency situations, the system has enough resources to quickly respond and handle problems without causing a decrease in data processing speed or an increase in delay due to resource shortage. The ongoing RAID configuration migration is automatically aborted when the system detects an increase in traffic delay of more than 30% over normal. This is done to avoid possible instabilities in the configuration change process further degrading the service performance while giving time for the system and service personnel to diagnose and solve the problem.
In an alternative embodiment, after the hard disk array is configured using the target configuration, if a problem occurs, a rollback operation may be performed on the hard disk array.
Specifically, upon detection of performance degradation or data consistency problems caused by abnormal conditions or configuration changes, the system automatically or manually reverts to the previously known healthy or steady state. When the system automatically or manually attempts to change the RAID level, striping strategy, or data distribution mode, if the change causes significant performance degradation or makes the system unstable, the rollback mechanism can be quickly started to restore the state before the change, ensuring that the service is not affected. In the process of migrating data from one RAID group to another, if data loss, verification failure, or abnormal migration progress is detected, the system immediately stops the migration and performs data recovery or rollback to the configuration state before the migration so as to ensure data security. If dynamic health monitoring finds a failure of a hard disk or RAID controller and the system attempts to replace hardware or reconstruct data, and the new hardware or the reconstruction process itself has a problem, the rollback mechanism ensures that a standby hard disk or data backup is used and the system state before the failure is restored until the problem is thoroughly resolved.
In alternative embodiments, the system may immediately activate the dual write verification mechanism if a problem is encountered during migration, such as a data consistency verification failure or a disk failure exacerbation. This means that all data write operations will be recorded simultaneously on both the original RAID configuration and the new RAID configuration until it is confirmed that the data is fully synchronized and consistency is guaranteed. Once the new configuration is found to be unsuitable or causes additional problems, the system can seamlessly roll back to the original RAID configuration without losing any data.
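A minimal sketch of such a dual-write guard, assuming hypothetical write interfaces on the original and new RAID configurations (the class and method names are illustrative, not part of the scheme):

class DualWriteGuard:
    """Mirror writes to the original and the new configuration until the
    migration is confirmed consistent, so that a rollback loses no data."""

    def __init__(self, old_config, new_config):
        self.old_config = old_config
        self.new_config = new_config
        self.synchronized = False

    def write(self, block_id, data):
        # While dual-write is active, every write lands on both configurations.
        self.old_config.write(block_id, data)
        self.new_config.write(block_id, data)

    def confirm_synchronized(self):
        # Called once the migrated data has been fully verified as consistent.
        self.synchronized = True

    def rollback(self):
        # The original configuration holds a complete, consistent copy, so
        # falling back to it is safe at any point before confirmation.
        return self.old_config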
In an alternative embodiment, if the system switches to a RAID0 configuration with a 1MB stripe size because a high proportion of sequential I/O operations has been detected, but performance does not improve or even degrades afterwards (e.g., due to an unexpected increase in random I/O), the system will automatically monitor the performance change and, based on the monitoring result, automatically fall back to the previous RAID configuration within a certain time window (e.g., 24 hours). This mechanism relies on continuous performance monitoring: when performance is observed to deviate from the normal range, the system automatically initiates the rollback procedure and ensures the consistency and integrity of the data during the rollback.
In an alternative embodiment, when storage utilization remains low, the system initiates RAID group consolidation and defragmentation to optimize the use of storage space; however, if during or shortly after execution the business demand suddenly increases and storage pressure rises sharply, the system retains the original configuration settings for at least 24 hours. This means that even after a capacity reclamation strategy has been applied, the system reserves a buffer period to ensure that the original RAID configuration can be rolled back quickly if necessary to cope with the sudden increase in storage space requirements.
In an alternative embodiment, the system enables coordinated use of RAID0 and RAID1 based on the detected write latency condition to address performance challenges in the hybrid read-write mode. However, if the load mode is changed again, such as turning to the read-only mode, the dynamic load balancing algorithm built in the system will adjust the usage proportion of the RAID configuration according to the latest load state. The system will automatically detect the change in load mode and reconfigure the RAID level as necessary to accommodate the new workload, ensuring the best state of performance. This mechanism ensures that the system remains in the most efficient operating state even in the event of extreme load changes.
Fig. 7 is a schematic diagram of an alternative configuration method of a hard disk array according to an embodiment of the present application. As shown in fig. 7, the configuration method of a hard disk array may be applied to an environment comprising a policy control layer, a virtualized storage layer, and a physical device layer. Specifically:
The policy control layer is responsible for driving the intelligent decision and response mechanism of the whole system. At this level there are two key components: the data monitoring recognition prediction model and the dynamic policy engine.
The data monitoring recognition prediction model is responsible for collecting and analyzing real-time performance data of the hard disk array, such as IOPS, throughput, response time, SMART data (magnetic disk health index), cache hit rate and the like. Through pattern recognition techniques, the model is able to distinguish between different I/O types (sequential or random) and data block sizes, providing detailed data support for subsequent decisions. The prediction module predicts future load trend and hard disk health state based on the collected historical data by adopting a machine learning or statistical prediction method, and generates an adaptive curve dynamic threshold value, which provides a prospective view for real-time decision of a strategy control layer.
The dynamic policy engine dynamically generates and executes a storage policy based on the output of the data monitoring recognition prediction model and the comprehensive evaluation of three dimensions (first parameter, second parameter, third parameter) of failure risk level, service level and migration cost. The failure risk assessment helps identify potential failure points of the disk array, the service level adjusts storage configuration according to the priority and demand of the current service, and the migration cost dimension considers the influence of configuration change on system performance and resources. Through the three-dimensional decisions, the dynamic policy engine can intelligently select the most appropriate RAID level, stripe size and redundancy mechanism (target configuration), and storage efficiency and data security are optimized while meeting business requirements.
The virtualized storage layer serves as a bridge between the policy control layer and the physical device layer, and is mainly responsible for executing policy control layer decisions to realize seamless RAID level migration, distributed object storage management, and dynamic striping policies. When the policy control layer decides to switch RAID configurations, the virtualized storage layer is responsible for the actual data reconstruction and migration process, ensuring a smooth transition from one RAID level to another while minimizing the impact on traffic. The distributed object storage module allows data to be stored in a distributed manner on multiple physical disks, supports large-scale data sets and highly concurrent read-write requirements, and enhances the overall performance and scalability of the storage system through data slicing and parallel processing. The dynamic striping engine is responsible for dynamically adjusting the stripe size according to the current workload and access mode to optimize the efficiency of read and write operations. For example, in a concurrent read-write intensive scenario, the striping engine may select a smaller stripe size to speed up data positioning, while under a large number of sequential read-write operations it may increase the stripe size to reduce seek time and improve throughput.
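A minimal sketch of the stripe-size selection performed by the dynamic striping engine (the thresholds and the 64 KB / 256 KB / 1 MB values are illustrative examples, not fixed by this embodiment):

def choose_stripe_size_kb(sequential_ratio, avg_request_kb):
    """Pick a stripe size in KB from the observed I/O pattern."""
    if sequential_ratio > 0.8 and avg_request_kb >= 512:
        return 1024      # mostly large sequential I/O: big stripes reduce seek time
    if sequential_ratio < 0.3:
        return 64        # mostly concurrent random I/O: small stripes speed up positioning
    return 256           # mixed workload: intermediate stripe size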
The physical device layer comprises a RAID/SAS controller and the connected hard disk devices, and is the hardware basis for realizing the intelligent dynamic RAID strategy. At this level, the dynamic health monitoring and early warning systems are critical.
The dynamic health monitoring engine monitors health states of the RAID controller and the hard disk in real time, and the health states comprise key indexes such as stability of controller voltage, bad track rate of the hard disk, temperature change, disk rotation speed and the like. The dynamic health monitoring not only monitors the current state, but also can predict future health trend through data analysis, thereby providing more comprehensive decision basis for the policy control layer. The physical layer monitoring system consists of a dynamic health monitoring engine and an early warning system and is used for detecting voltage fluctuation of a controller, increase of hard disk damage rate and the like in real time and ensuring that hardware equipment is in an optimal running state. The early warning system can predict equipment faults 48 hours in advance, and provides sufficient time for a system administrator to perform preventive maintenance or resource allocation adjustment, so that data loss and service interruption are avoided.
In an alternative embodiment, deep learning and multidimensional sensor technology can be introduced, so that not only can the health state and I/O behavior of the hard disk be monitored and predicted in real time, but also potential system-level faults such as power supply fluctuation, temperature abnormality and the like can be identified in advance, and resource management and fault coping strategies are further optimized.
An environment sensing layer is additionally arranged on the physical device layer, integrating temperature sensors, humidity sensors, smoke detectors, vibration sensors, and a power supply state monitoring module. These sensors not only monitor the health states of the hard disks and the RAID controller, but also detect the machine room environment and the power supply condition, thereby providing a more comprehensive view of system health. Deep neural network models, such as convolutional neural networks (CNNs) and long short-term memory networks (LSTMs), are trained using historical performance data, environmental monitoring data, and system event logs to accurately predict the impact of hard disk failures, power instability, and other environmental factors on the storage system. Such models can identify complex correlations and patterns, and their prediction accuracy is higher than that of traditional statistical models.
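A minimal skeleton of such a predictor, assuming the Keras API from TensorFlow and hypothetical feature windows built from performance, environmental, and event-log data (the architecture, window length, and training data here are placeholders for illustration):

import numpy as np
import tensorflow as tf

TIMESTEPS, FEATURES = 48, 12   # e.g. 48 hourly samples of 12 monitoring metrics

# Hypothetical training windows: X holds monitoring sequences, y marks whether
# a failure occurred within the following 48 hours.
X = np.random.rand(1000, TIMESTEPS, FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(TIMESTEPS, FEATURES)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # probability of failure
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.fit(X, y, epochs=5, batch_size=32, verbose=0)

# A predicted probability above a chosen threshold triggers the self-healing path.
failure_probability = float(model.predict(X[:1], verbose=0)[0, 0])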
When the deep learning model predicts a possible failure, the system automatically triggers a self-healing mechanism. In addition to the original RAID level adjustment, dynamic striping and data migration strategies, the newly added self-healing strategy also comprises automatic power supply voltage stabilization, environmental condition regulation (such as starting a cooling system) and isolation and replacement of a fault hardware unit.
The system can also learn the demands that different time periods and service types place on storage resources by collecting the operating patterns of the business. This enables the intelligent dynamic RAID system to adjust the configuration more finely, such as switching the relevant storage pool to RAID0 in advance to increase read-write speed before a predicted video-on-demand peak period, and automatically switching to RAID1 to enhance data security when a data backup period is predicted.
Example 5:
It is assumed that in a data center of a large company, it is predicted that grid fluctuations due to nearby construction may affect the stability of the storage system. Based on this prediction, the system automatically turns on the voltage stabilizing function of the UPS and reduces the bandwidth occupancy of the RAID configuration migration to reduce the dependency on power and possible delay impact. Meanwhile, the environment sensing layer detects that the temperature in the machine room slightly rises, and the system immediately starts a cooling system to prevent the performance of the storage hardware from being reduced or from being failed due to the fact that the temperature is too high.
The early warning information received by the intelligent operation and maintenance platform shows that a wave of large-scale data backup operations is expected within 48 hours, which may put pressure on storage bandwidth and hard disk wear. Through its learning of user behaviors and business patterns, the system transfers part of the non-critical data to cloud storage in advance and releases local storage resources in preparation for the upcoming peak tasks. When the storage resources in the data center reach the early warning threshold, the intelligent system also automatically pulls resources from cloud storage, providing additional storage space and ensuring that service continuity is not affected.
According to the scheme, through deep learning, environment perception and integration of the intelligent operation and maintenance platform, the robustness and response speed of a storage system are improved, the resistance to environmental factors is enhanced, and a solid foundation is provided for efficient operation of a data center.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general-purpose hardware platform, or of course by means of hardware, but in many cases the former is the preferred implementation.
An embodiment of the present application further provides a configuration device of a hard disk array, and fig. 8 is a block diagram of a configuration device of an alternative hard disk array according to an embodiment of the present application, as shown in fig. 8, where the device includes:
a performance data obtaining module 802, configured to obtain reference performance data of the hard disk array, where the reference performance data is used to indicate performance data of the hard disk array in a period of time;
a decision condition determining module 804, configured to determine, as a target decision condition, a candidate decision condition that is met by the reference performance data in at least one candidate decision condition, where the candidate decision condition is used to indicate a range of performance data of the hard disk array, and the at least one candidate decision condition corresponds to a candidate configuration of one hard disk array;
a target configuration determining module 806, configured to determine a candidate configuration corresponding to the target decision condition as a target configuration;
a configuration module 808, configured to configure the hard disk array to the target configuration in case the current configuration of the hard disk array is inconsistent with the target configuration.
Optionally, the decision condition determining module 804 is further configured to determine that the target decision condition is a first candidate decision condition if the reference performance data indicates that the reliability index of the hard disk array does not meet the first index range, and determine that the target decision condition is a second candidate decision condition if the reference performance data indicates that the efficiency index of the hard disk array does not meet the second index range.
Optionally, the decision condition determining module 804 is further configured to determine that an error rate of the hard disk array is greater than a preset error rate, where the error rate is used to indicate a ratio of the number of read/write errors of the hard disk array to the total number of read/write operations of the hard disk array in a unit time; that the number of errors of the hard disk array is greater than a preset number of errors, where the number of errors is used to indicate the total number of operation errors of the hard disk array occurring in a period of time; that the number of hard disks sending a prompt message in the hard disk array is greater than a preset number of hard disks, where the prompt message is used to indicate that a hard disk fails; that the ratio of sequential read/write operations of the hard disk array is greater than a preset ratio, where sequential read/write indicates that the hard disk array continuously reads or writes data; and that the delay time of the hard disk array is greater than a preset delay time, where the delay time is used to indicate the time taken by the hard disk array to complete a write operation.
Optionally, the decision condition determining module 804 is further configured to configure the hard disk array to be a first target configuration, where the first target configuration is a configuration corresponding to the first candidate decision condition, send target data to a first hard disk in the hard disk array, and establish a mirror image of the target data on a second hard disk in the hard disk array, send the target data to a third hard disk in the hard disk array, and send verification information of the target data to the third hard disk.
Optionally, the decision condition determining module 804 is further configured to configure the hard disk array as a second target configuration, where the second target configuration is a configuration corresponding to a second candidate decision condition, divide the target data into a plurality of sub-data according to a preset stripe, and send the plurality of sub-data to a plurality of hard disks included in the hard disk array one by one, divide the target data into a plurality of sub-data according to the preset stripe, send the plurality of sub-data to a fifth hard disk included in the hard disk array one by one, and establish a mirror image of the sub-data on a sixth hard disk included in the hard disk array.
Optionally, the configuration module 808 is further configured to calculate a target priority of the target configuration based on the current configuration of the hard disk array, determine a target configuration time according to the target priority, and configure the hard disk array to be the target configuration within the target configuration time if the current configuration of the hard disk array is inconsistent with the target configuration.
Optionally, the configuring module 808 is further configured to determine a first parameter according to a current configuration of the hard disk array, where the first parameter is used to indicate a level of an error occurring in the current configuration, determine an importance of data currently processed by the hard disk array as a second parameter, predict a bandwidth occupancy rate used by a configuration target configuration, determine a negative number of the bandwidth occupancy rate as a third parameter, and weight and sum the first parameter, the second parameter, the third parameter and a data access frequency of the hard disk array to obtain a target priority.
Optionally, the configuration module 808 is further configured to obtain current performance data of the hard disk array, isolate the target configuration if the current performance data meets a prompt condition, isolate the target configuration if a time of a configuration progress deviation greater than the first deviation meets a preset period, isolate the target configuration if a number of verification failures of the hard disk array is greater than a preset number of verification failures, isolate the target configuration if a delay of the hard disk array is greater than a preset delay, and isolate the target configuration if a bandwidth utilization of the hard disk array does not meet a preset range.
For a description of the features in the embodiment corresponding to the configuration device of the hard disk array, reference may be made to the related description of the embodiment corresponding to the configuration method of the hard disk array, which is not repeated in detail herein.
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the above described embodiments of the method of configuring a hard disk array.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform, when run, the steps of any of the above-described embodiments of the method of configuring a hard disk array.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media in which a computer program may be stored.
The embodiments of the present application also provide a computer program product, which includes a computer program, and the computer program when executed by a processor implements the steps in any of the embodiments of the method for configuring a hard disk array.
Embodiments of the present application also provide another computer program product comprising a non-volatile computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the configuration method embodiments of a hard disk array described above.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and the device for configuring the hard disk array, the storage medium and the electronic equipment provided by the application are described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.