CN118981620A

CN118981620A - A reinforcement learning energy management method for fuel cell vehicles considering operating condition prediction

Info

Publication number: CN118981620A
Application number: CN202411462619.4A
Authority: CN
Inventors: 朱仲文; 张梓睿; 张梓迟; 佟强; 李丞; 王维志
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2024-10-19
Filing date: 2024-10-19
Publication date: 2024-11-19
Anticipated expiration: 2044-10-19
Also published as: CN118981620B

Abstract

The invention provides a reinforcement learning energy management method for fuel cell vehicles that considers working condition prediction, comprising the following steps: constructing comprehensive driving condition data from four different types of classical driving cycles, dividing the data into segments, performing principal component analysis on the extracted working condition characteristic parameters, and clustering the driving cycle segments by the K-means method; after three different types of driving cycle segment sets are obtained, training offline a Bi-LSTM-based working condition identification module and a Markov-based vehicle speed prediction module, wherein the vehicle speed prediction module uses data sets representing three different road characteristics to train three different Markov matrices respectively; extracting the working condition characteristic parameters of the predicted vehicle speed sequence and converting them into an equivalent factor adjustment coefficient λ; and using λ as a state input to the DDPG energy management algorithm, with the fuel cell power P_fc as the action, and setting an appropriate reward function so that multiple objectives are optimized together.

Description

Fuel cell automobile reinforcement learning energy management method considering working condition prediction

Technical Field

The invention relates to the technical field of fuel cells, and in particular to a reinforcement learning energy management method for fuel cell vehicles that considers working condition prediction.

Background

Energy depletion and environmental problems are becoming increasingly serious worldwide, and automobiles are a major source of energy consumption and air pollution, so improving the energy efficiency of automobiles and reducing their emissions are active research topics. The hydrogen fuel cell is an emerging power generation technology which, when applied to automobiles, offers zero emissions, no pollution, low noise, high energy efficiency, use of renewable energy and rapid refuelling. This technical route therefore has broad prospects in the field of new energy vehicles and is one of the viable schemes for reducing the consumption of non-renewable energy and the emission of polluting gases.

At present, domestic hydrogen fuel cell vehicle technology still needs improvement, and a technical gap remains in energy management schemes. The energy management system (EMS) is one of the core technologies of the hydrogen fuel cell vehicle; it strongly influences the fuel economy, service life, power performance and comfort of the whole vehicle, and is both a difficulty and a focus of research in the field of fuel cell vehicles. Fuel cell vehicles generally adopt an all-electric power system, but if the fuel cell alone supplies the power of the whole vehicle, its response lags and braking energy cannot be recovered, so the power demand cannot be met in time. In current fuel cell vehicle power system architectures, a battery or a supercapacitor is therefore added as an auxiliary power source to cope with abrupt power demand and to recover braking energy. The power system thus changes from a single power source to multiple power sources, and an efficient energy management strategy is needed to distribute the required power among the energy sources so that the economy and power performance of the whole vehicle are ensured and its overall performance is improved.

If the three factors of equivalent hydrogen consumption, SOC deviation and fuel cell service life are considered together, a rule-based strategy, which is calibrated from expert experience, cannot achieve global optimization. An optimization-based energy management strategy has poor real-time performance and is limited in handling multi-objective optimization problems. A new strategy is therefore needed that meets the requirements of multi-objective, real-time, optimal energy management.

When fuel cells are applied to trucks, the vehicle must switch frequently between complex working conditions such as urban roads, rural roads and expressways, and a single energy management algorithm cannot meet these requirements well.

Many energy management strategies exist, but they focus on the optimization algorithm itself and do not fully exploit the advantages of combining vehicle speed prediction with energy management. If the future vehicle speed is taken into account in the energy management algorithm, and a bidirectional long short-term memory network (Bi-LSTM) is used to consider the driving state information both before and after the current moment, the vehicle gains foresight while driving; the accuracy of working condition identification can be effectively improved when the vehicle speed condition changes abruptly, thereby improving energy efficiency. The invention therefore provides a fuel cell vehicle reinforcement learning energy management method considering working condition prediction.

Disclosure of Invention

The invention provides a fuel cell automobile reinforcement learning energy management method considering working condition prediction, which solves the technical problems mentioned in the background art.

The technical scheme adopted for solving the technical problems is as follows:

A fuel cell vehicle reinforcement learning energy management method considering working condition prediction, the method comprising the following steps:

S01, constructing comprehensive driving condition data that reflect the characteristics of various road working conditions from four different types of classical driving cycles, dividing the data into segments of 100 s each, extracting the working condition characteristic parameters and performing principal component analysis (PCA) on them, and clustering the driving cycle segments by the K-means method into three working conditions: low speed, medium speed and high speed;

S02, after the three different types of driving cycle segment sets are obtained, training offline a Bi-LSTM-based working condition identification module and a Markov-based vehicle speed prediction module;

S03, extracting working condition characteristic parameters of the predicted vehicle speed sequence, such as the average vehicle speed and the standard deviation of the vehicle speed, and converting them into an equivalent factor adjustment coefficient λ; λ is input as a state to the DDPG energy management algorithm, the fuel cell power P_fc is the action, and an appropriate reward function covering hydrogen consumption, SOC fluctuation and fuel cell degradation is set, so that multiple objectives are optimized together.

As a further technical scheme of the invention, the step of constructing comprehensive driving condition data capable of reflecting various road condition characteristics by adopting four different types of classical driving conditions comprises the following steps of:

preprocessing working condition data: the four typical road working conditions NEDC, WLTC, CLTC-P and FTP75 are combined into comprehensive driving condition data that reflect the characteristics of various road working conditions, and these data are used to train the subsequent working condition identification and prediction algorithms;

The NEDC (New European Driving Cycle) is the European standard test cycle for driving range and consists of 4 urban cycles and 1 suburban cycle; the urban part lasts 780 seconds with a maximum speed of 50 km/h, and the suburban part lasts 400 seconds with a maximum speed of 120 km/h;

The WLTC standard cycle is closer to actual road driving; its complete test cycle consists of 4 phases (low speed, medium speed, high speed and extra-high speed) with a total duration of 1800 s, of which the idle time is 235 s; the distance is 23266 m, the average vehicle speed is 46.5 km/h and the maximum vehicle speed is 131.3 km/h;

The CLTC-P (China Light-duty Vehicle Test Cycle - Passenger Car) cycle comprises 3 speed intervals (low speed, medium speed and high speed), with a total duration of 1800 s, a total distance of 14480 m, a maximum vehicle speed of 114 km/h and an average vehicle speed of 28.96 km/h;

FTP75 (Federal Test Procedure) is a standard issued by the United States Environmental Protection Agency for assessing the emissions and fuel economy of light vehicles and light trucks under urban conditions; a complete FTP75 cycle lasts 1874 s, with a theoretical driving distance of 17.77 km, an average vehicle speed of 34.12 km/h and a maximum vehicle speed of 91.25 km/h, and comprises three parts: a cold-start transient phase, a stabilized phase and a hot-start transient phase.

As a further technical scheme of the invention, the steps of dividing the comprehensive driving working condition into segments according to each segment of 100s and carrying out Principal Component Analysis (PCA) by extracting working condition characteristic parameters of the working condition comprise the following steps:

The comprehensive driving condition data are divided into segments of 100 s each; the working condition characteristic parameters reflect the characteristic information of each driving cycle segment; each driving cycle segment is described by 14 parameters, namely the average speed, maximum speed, average acceleration, maximum acceleration, average deceleration, maximum deceleration, idle time ratio, acceleration time ratio, deceleration time ratio, constant-speed time ratio, speed standard deviation, acceleration standard deviation, driving distance and running time;
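As an illustration only, the 14 characteristic parameters of one segment could be computed as in the following minimal sketch. The function name `segment_features`, the 1 s sampling interval, the acceleration threshold `eps` and the definition of an idle step (zero speed and negligible acceleration) are assumptions made for this sketch and are not specified above.

```python
import statistics

def segment_features(speeds_kmh, dt=1.0, eps=0.1):
    """Return 14 characteristic parameters of one driving-cycle segment.

    speeds_kmh: speed samples in km/h taken every dt seconds.
    eps: assumed acceleration threshold (m/s^2) separating constant-speed
         steps from accelerating or decelerating steps.
    """
    v = [s / 3.6 for s in speeds_kmh]                       # km/h -> m/s
    acc = [(v[i + 1] - v[i]) / dt for i in range(len(v) - 1)]
    pos = [a for a in acc if a > eps]                       # accelerating steps
    neg = [a for a in acc if a < -eps]                      # decelerating steps
    n = len(acc)
    # Assumed idle definition: stationary and not accelerating.
    idle = sum(1 for i, a in enumerate(acc) if v[i] == 0 and abs(a) <= eps)
    cruise = n - len(pos) - len(neg) - idle
    return {
        "mean_speed": statistics.fmean(speeds_kmh),
        "max_speed": max(speeds_kmh),
        "mean_accel": statistics.fmean(pos) if pos else 0.0,
        "max_accel": max(pos, default=0.0),
        "mean_decel": statistics.fmean(neg) if neg else 0.0,
        "max_decel": min(neg, default=0.0),
        "idle_ratio": idle / n,
        "accel_ratio": len(pos) / n,
        "decel_ratio": len(neg) / n,
        "cruise_ratio": cruise / n,
        "speed_std": statistics.pstdev(speeds_kmh),
        "accel_std": statistics.pstdev(acc),
        "distance_m": sum(v[:-1]) * dt,
        "duration_s": n * dt,
    }
```

Applying such a function to every 100 s segment yields one 14-element feature vector per segment, which is the input to the dimension reduction that follows.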

However, 14 working condition characteristic parameters are too many: the parameters overlap and correlate with one another, and the computational load is large, which hinders analysis. To reduce the complexity of the analysis, principal component analysis (PCA) is used for dimension reduction;

The comprehensive driving condition data, divided into 100 s segments described by 14 working condition characteristic parameters, are reduced to 3 dimensions; the principal component analysis dimension reduction comprises the following steps:

1) Arranging the original data by columns into a matrix X of 100 rows and 14 columns;

2) Zero-centering each row of X, i.e. subtracting the mean of that row;

3) Obtaining the covariance matrix $C = \frac{1}{m} X X^{T}$, where m is the number of columns of X;

4) Obtaining the eigenvalues and corresponding eigenvectors of the covariance matrix;

5) Arranging the eigenvectors as the rows of a matrix, from top to bottom in descending order of their eigenvalues, and taking the first 3 rows to form the matrix P;

6) $Y = PX$ is the data after dimension reduction to 3 dimensions;

To preserve the information after dimension reduction, the selected principal components must have a cumulative contribution rate of more than 80%; the 3 principal components with the largest contribution rates are finally selected, and their cumulative contribution rate exceeds 80%, so that most of the working condition characteristic information is covered.
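The dimension reduction steps can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the function name `pca_reduce` is hypothetical, the data are stored with one segment per row and one characteristic parameter per column (so each parameter is centred over the segments), and the demonstration data are synthetic rather than real driving data.

```python
import numpy as np

def pca_reduce(features, n_components=3):
    """features: array of shape (n_segments, n_features)."""
    X = np.asarray(features, dtype=float)
    Xc = X - X.mean(axis=0)                     # zero-mean each characteristic parameter
    cov = Xc.T @ Xc / Xc.shape[0]               # covariance matrix of the parameters
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :n_components].T             # rows are the leading eigenvectors
    Y = Xc @ P.T                                # reduced data, shape (n_segments, n_components)
    ratio = eigvals[:n_components].sum() / eigvals.sum()   # cumulative contribution rate
    return Y, ratio

# Synthetic demonstration: 100 segments, 14 parameters generated from 3 latent factors.
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 14))
Y, ratio = pca_reduce(demo)
```

The returned `ratio` is the quantity that is compared against the 80% threshold when deciding whether 3 components are sufficient.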

As a further technical scheme of the invention, the step of clustering the driving cycle segments by the K-means clustering method comprises the following steps:

The driving cycle segments are classified with the K-means clustering algorithm. In this process, the number of classes is set to K = 3, an initial cluster center is assigned to each class, and the Euclidean distance from each sample point $x_i$ to each cluster center $\mu_j$ is calculated:

$d(x_i, \mu_j) = \sqrt{\sum_{k=1}^{3} (x_{ik} - \mu_{jk})^{2}}$;

Each point is assigned to a class according to the minimum-distance principle. After the first clustering is completed, the cluster center of each cluster is recalculated; the new cluster center is the mean of all data points in the cluster, calculated as:

$\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i$;

where $C_j$ is the set of data points of the j-th cluster and $|C_j|$ is the number of data points in that set; the cluster centers are updated and the points reassigned until the cluster centers become stable.

Finally, after the working condition characteristic parameters are reduced by PCA to working condition data containing 3 principal components, the 100 road driving cycle segments are divided into 3 classes by the K-means clustering algorithm, and a Markov transition matrix is trained specifically for each class for vehicle speed prediction.
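The assignment and update steps of K-means can be sketched in plain Python as follows. This is a minimal sketch, not a tuned implementation: the function name `kmeans`, the random choice of initial centers from the data and the handling of empty clusters are assumptions of the sketch. Points are expected as tuples of coordinates (for example the 3 principal components of each segment).

```python
import math
import random

def kmeans(points, k=3, iters=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)             # assumed initialisation: k random data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to the nearest center (Euclidean distance).
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center becomes the mean of the points assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep the center of an empty cluster
        if new_centers == centers:              # stop once the centers are stable
            break
        centers = new_centers
    return centers, labels
```

The resulting labels partition the segments into three sets, and each set can then serve as the training data for one Markov transition matrix.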

As a further technical scheme of the invention, after the three different types of driving cycle segment sets are obtained, the step of performing offline training on the Bi-LSTM-based working condition identification module and the Markov-based vehicle speed prediction module comprises the following steps:

Bi-LSTM is composed of a forward LSTM and a backward LSTM. The forward LSTM processes the input sequence in chronological order and captures information from the past to the future; the backward LSTM processes the input sequence from the latest time point to the earliest and captures information from the future to the past. Each LSTM of the bidirectional structure contains an input gate, a forget gate, an output gate and a cell state. The input of the model is the 14 working condition characteristic parameters, and the output is the working condition mode, which is one of three types: low speed, medium speed and high speed. The LSTM architecture is:

Forget gate: the forget gate decides which information is forgotten, and its expression is:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$;

where $\sigma$ is the sigmoid function, $W_f$ is the weight coefficient of the forget gate, $[h_{t-1}, x_t]$ represents the concatenation of the hidden state of the previous moment $h_{t-1}$ and the current input $x_t$, where $x_t$ is the 14 working condition characteristic parameters, including the real-time speed and acceleration of the vehicle, and $b_f$ is a bias term;

Input gate: the input gate determines which new information needs to be added to the cell state;

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$;

where $W_i$ and $W_C$ are the weight matrices of the input gate, whose dimensions are adjusted according to the number of input parameters, $b_i$ and $b_C$ are bias terms, and $\tilde{C}_t$ is the candidate cell state;

Cell state update: the cell state is updated by combining the results of the forget gate and the input gate;

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$;

Output gate: the output gate decides what value to output based on the updated cell state;

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(C_t)$;

where $W_o$ is the weight parameter of the output gate, $b_o$ is a bias term, and $h_t$ is the output state at the current moment;

Bi-LSTM computes a bidirectional hidden state by running LSTM layers forward and backward along the time axis: the forward LSTM proceeds from the first element of the sequence to the last, and the backward LSTM from the last element to the first. The two hidden states are concatenated to form the final bidirectional hidden state, so the Bi-LSTM captures the context before and after each time step in the sequence and thereby provides a more comprehensive feature representation;
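The gate computations and the bidirectional concatenation can be sketched with NumPy as follows. This is an untrained forward pass only, written under assumptions: the function names, the hidden size of 8, the packing of the four gate weight matrices into a single matrix `W`, and the random weights are all choices of this sketch, and the final classification layer that maps the hidden state to the three working condition modes is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps the concatenation [h_prev, x] to the four gate pre-activations."""
    z = W @ np.concatenate([h_prev, x]) + b
    H = h_prev.size
    f = sigmoid(z[:H])                # forget gate
    i = sigmoid(z[H:2 * H])           # input gate
    g = np.tanh(z[2 * H:3 * H])       # candidate cell state
    o = sigmoid(z[3 * H:])            # output gate
    c = f * c_prev + i * g            # cell state update
    h = o * np.tanh(c)                # hidden (output) state
    return h, c

def run_lstm(seq, W, b, hidden):
    h, c = np.zeros(hidden), np.zeros(hidden)
    outs = []
    for x in seq:
        h, c = lstm_step(x, h, c, W, b)
        outs.append(h)
    return outs

def bilstm(seq, Wf, bf, Wb, bb, hidden):
    fwd = run_lstm(seq, Wf, bf, hidden)
    bwd = run_lstm(seq[::-1], Wb, bb, hidden)[::-1]   # re-align backward outputs to time order
    return [np.concatenate([hf, hb]) for hf, hb in zip(fwd, bwd)]

# Demonstration with random weights: 5 time steps of 14 characteristic parameters.
rng = np.random.default_rng(0)
hidden, n_in = 8, 14
seq = rng.normal(size=(5, n_in))
Wf, Wb = (rng.normal(scale=0.1, size=(4 * hidden, hidden + n_in)) for _ in range(2))
bf, bb = np.zeros(4 * hidden), np.zeros(4 * hidden)
out = bilstm(seq, Wf, bf, Wb, bb, hidden)
```

Each element of `out` has twice the hidden size, because the forward and backward hidden states for the same time step are concatenated.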

In the vehicle speed prediction module, a Markov vehicle speed prediction method is adopted. Working condition identification distinguishes three working condition modes (low speed, medium speed and high speed), and each mode corresponds to one Markov transition matrix. The three Markov matrices (the low-speed, medium-speed and high-speed Markov transition matrices) are each trained offline on the corresponding set of driving cycle segments clustered in step S01; the prediction horizon is set to p steps, and the Markov model outputs p steps of predicted vehicle speed;

after the vehicle speed prediction that takes working condition identification into account is completed, a p-step predicted vehicle speed sequence is output.
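A minimal sketch of estimating one transition matrix and generating a multi-step speed prediction is given below. Several details are assumptions of this sketch and are not specified above: speeds are discretised into equal-width bins, transitions are first-order (speed to next speed), an unseen state is assumed to stay in its own bin, and the predicted speed is sampled from the transition probabilities and reported as the bin center. In use, one matrix would be fitted per working condition mode and the matrix of the identified mode would be selected.

```python
import random

def fit_transition_matrix(speeds, n_states, bin_width):
    """Estimate a row-stochastic transition matrix from a speed trace."""
    counts = [[0] * n_states for _ in range(n_states)]
    states = [min(int(v // bin_width), n_states - 1) for v in speeds]
    for a, b in zip(states, states[1:]):
        counts[a][b] += 1
    matrix = []
    for a, row in enumerate(counts):
        total = sum(row)
        if total:
            matrix.append([c / total for c in row])
        else:
            # Unseen state: assume the speed stays in the same bin.
            matrix.append([1.0 if j == a else 0.0 for j in range(n_states)])
    return matrix

def predict_speeds(matrix, v0, steps, bin_width, rng=None):
    """Sample a speed sequence of the given length starting from speed v0."""
    rng = rng or random.Random(0)
    n_states = len(matrix)
    state = min(int(v0 // bin_width), n_states - 1)
    out = []
    for _ in range(steps):
        state = rng.choices(range(n_states), weights=matrix[state])[0]
        out.append((state + 0.5) * bin_width)   # bin center as the predicted speed
    return out

# One matrix per working condition mode; the identified mode selects the matrix.
low_speed_trace = [0, 5, 10, 5, 0, 5, 10, 5, 0]
matrices = {"low": fit_transition_matrix(low_speed_trace, n_states=3, bin_width=5)}
prediction = predict_speeds(matrices["low"], v0=0, steps=5, bin_width=5)
```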

As a further technical scheme of the present invention, in step S03, the DDPG algorithm is based on the Actor-Critic framework and therefore contains an Actor network and a Critic network, each with a corresponding target network, so that the DDPG algorithm comprises four networks: the Actor network $\mu(s|\theta^{\mu})$, the Critic network $Q(s,a|\theta^{Q})$, the Target Actor network $\mu'(s|\theta^{\mu'})$ and the Target Critic network $Q'(s,a|\theta^{Q'})$. The update process of the DDPG algorithm and the update mode of the target networks, which are introduced to stabilize training, are as follows;

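The DDPG training process includes:

Initializing the critic network $Q(s,a|\theta^{Q})$ and the actor network $\mu(s|\theta^{\mu})$ with parameters $\theta^{Q}$ and $\theta^{\mu}$;

Initializing the corresponding target network parameters: $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$;

Initializing the experience replay buffer D, setting its capacity to K and the number of training episodes to M;

Inputting the current state $s_t$ to the current network and selecting an action according to the current policy and the noise $\mathcal{N}_t$: $a_t = \mu(s_t|\theta^{\mu}) + \mathcal{N}_t$; after the action $a_t$ is executed, obtaining the reward $r_t$ and moving to the next state $s_{t+1}$, and storing the transition $(s_t, a_t, r_t, s_{t+1})$ in the experience replay buffer D;

Sampling a random mini-batch of N transitions $(s_i, a_i, r_i, s_{i+1})$ from D and computing the target value $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1}|\theta^{\mu'})|\theta^{Q'})$;

where $Q'$ is the value function at the next moment, $\mu'$ is the policy function at the next moment, $\theta^{\mu'}$ and $\theta^{Q'}$ are the network weight parameters at the next moment, and $\gamma$ is the discount factor;

Updating the critic parameters by minimizing the loss: $L = \frac{1}{N}\sum_{i}\left(y_i - Q(s_i, a_i|\theta^{Q})\right)^{2}$;

Updating the actor network with the policy gradient: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N}\sum_{i} \nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_i}$;

Soft-updating the target networks: $\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$.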
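Three of the scalar computations in the training process (the target value, the critic loss and the soft update of the target parameters) can be sketched in plain Python as follows. The sketch assumes the standard DDPG forms of these quantities; the function names and the default values of the discount factor `gamma` and the soft-update rate `tau` are illustrative assumptions, and network parameters are represented as flat lists of floats instead of real neural networks.

```python
def td_target(reward, next_q, gamma=0.99):
    """Target value: reward plus the discounted target-critic value of the next state."""
    return reward + gamma * next_q

def critic_loss(targets, q_values):
    """Mean squared error between target values and current critic estimates."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [tau * w + (1.0 - tau) * wt for w, wt in zip(online_params, target_params)]

# Illustrative numbers only.
y = td_target(reward=1.0, next_q=2.0, gamma=0.5)        # 1.0 + 0.5 * 2.0
loss = critic_loss([1.0, 3.0], [0.0, 1.0])              # (1 + 4) / 2
new_target = soft_update([0.0], [1.0], tau=0.1)
```

A small `tau` makes the target networks change slowly, which keeps the target value from moving too quickly while the critic is being fitted to it.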

As a further technical scheme of the invention, in step S03, the DDPG deep reinforcement learning energy management strategy fuses working condition identification and vehicle speed prediction;

Setting the state set:

；

The state quantities are the power required by the whole vehicle, the state of charge (SOC) of the power battery, and the fuel cell life FOH;

Setting an action set:

$A = \{P_{fc}\}$;

where $P_{fc}$ is the fuel cell power;

Setting a reward function:

；

R is the reward function. Its terms involve the following quantities: the hydrogen consumption rate of the fuel cell; the equivalent factor adjustment coefficient λ, which is calculated from the p-step predicted vehicle speed sequence output by the vehicle speed prediction module, using the standard deviation of the vehicle speed and the average vehicle speed of the predicted sequence; the output power of the power battery; the lower heating value of hydrogen, LHV; the fuel cell life degradation value; and a weighting factor for the fuel cell life degradation. During training, an optimal energy management strategy is obtained by selecting a suitable value of k; after training, λ is adjusted in real time during actual driving according to the predicted vehicle speed sequence that is received;

In the DDPG-based energy management process, the environment first inputs the state to the Actor of the Agent, and the Agent selects the corresponding action, receives a real-time reward and transitions to the state S' at the next moment. The experience sample is stored in the sample pool, and once the samples in the pool reach the preset number, the earliest experience is removed. The DDPG algorithm has the Agent keep executing actions and collecting data from its interaction with the environment, including states and rewards, and updates the network parameters so that the Agent gradually learns the optimal action policy and optimal energy efficiency is achieved.
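The sample pool behaviour described above (store each experience, discard the earliest one once the preset number is reached, and draw random batches for training) can be sketched with a bounded deque. The class name `ReplayBuffer` and its method names are hypothetical and chosen for this sketch.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity, seed=0):
        # A deque with maxlen silently drops its oldest item when full.
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random mini-batch without replacement.
        return self.rng.sample(list(self.data), batch_size)

    def __len__(self):
        return len(self.data)

# Demonstration: a pool of capacity 3 receiving 5 experiences keeps the last 3.
pool = ReplayBuffer(capacity=3)
for t in range(5):
    pool.push(state=t, action=0.0, reward=-1.0, next_state=t + 1)
batch = pool.sample(2)
```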

The beneficial effects of the invention are as follows. Compared with rule-based and optimization-based energy management algorithms, the method adopts an energy management strategy based on the deep deterministic policy gradient (DDPG) algorithm. Rule-based algorithms typically rely on preset fixed rules and the experience of engineers; the objectives and constraints of optimization-based algorithms are also relatively fixed, and their real-time performance is relatively poor. The DDPG algorithm can adapt its policy through continual interaction with and learning from the environment. Energy management involves many mutually coupled factors and uncertainties; the DDPG algorithm can handle high-dimensional state spaces and continuous action spaces and can continuously control the output power of the fuel cell, so it is well suited to the energy management of fuel cell vehicles. An optimization-based energy management strategy may be limited by complexity when solving for the optimum, and in some cases obtains only a local optimum. Moreover, optimization-based algorithms require an accurate mathematical model, whereas a DDPG-based algorithm can learn and optimize without knowledge of the system model.

In addition, compared with the traditional DDPG algorithm, a Bi-LSTM-based working condition identification module and a Markov vehicle speed prediction module are added, so that the method can adapt to complex road working conditions. Compared with the traditional LSTM classification algorithm, Bi-LSTM consists of two LSTMs and captures information in both directions, so that the identification process can consider past and future vehicle speed information at the same time. For example, when identifying an urban working condition with frequent starts and stops, the Bi-LSTM considers both the preceding start-stop behaviour and the likely subsequent trend, so the type of working condition is identified more accurately and less information is missed. After the working condition type is identified by the Bi-LSTM working condition identification model, the three trained Markov state transition matrices (low speed, medium speed and high speed) are used to predict the vehicle speed specifically for the identified road working condition, which improves the vehicle speed prediction accuracy and thereby the energy efficiency of the energy management system.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

FIG. 1 shows a schematic diagram of a fuel cell vehicle power system provided by an embodiment of the present invention.

FIG. 2 shows a diagram of a reinforcement learning energy management strategy incorporating working condition identification and prediction provided by an embodiment of the present invention.

FIG. 3 shows a Bi-LSTM architecture diagram provided by an embodiment of the present invention.

FIG. 4 illustrates a DDPG-based fuel cell vehicle energy management architecture diagram provided by an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

1.1 Fuel cell automobile Power System architecture:

The invention relates to a fuel cell vehicle that is powered by a power battery and a fuel cell and driven by a drive motor. The fuel cell system is connected to the main circuit after its voltage is stabilized by a DC/DC converter, and the power battery is connected directly to the main circuit; the power output by the fuel cell and the power battery is used by the motor controller to control the drive motor, which drives the vehicle, as shown in FIG. 1.

1.2 longitudinal dynamics model:

The invention addresses the energy management problem of a fuel cell vehicle; lateral stability is not considered for the time being, and the longitudinal dynamics of the vehicle need to be analyzed and modeled. During driving the vehicle is subject to four kinds of resistance: air resistance, ramp resistance, rolling resistance and acceleration resistance.

The driving force of the whole vehicle $F_t$ can be expressed as:

$F_t = F_w + F_i + F_f + F_j$;

where $F_w$ is the air resistance, $F_i$ is the ramp resistance, $F_f$ is the rolling resistance and $F_j$ is the acceleration resistance; they are calculated respectively as:

$F_w = \frac{1}{2}\rho C_D A v^{2}$, $F_i = m g \sin\alpha$, $F_f = m g f \cos\alpha$, $F_j = \delta m \frac{dv}{dt}$;

where $m$ is the mass of the fuel cell vehicle; $g$ is the gravitational acceleration; $\alpha$ is the road grade; $\rho$ is the air density; $A$ is the frontal area of the vehicle; $C_D$ is the air resistance coefficient; $v$ is the vehicle speed; $f$ is the rolling resistance coefficient; and $\delta$ is the rotating mass conversion coefficient.
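As a worked illustration, the driving force and the corresponding power demand at the wheels could be computed as in the sketch below. It assumes the standard textbook expressions for the four resistances; the function names and all default parameter values (air density, frontal area, drag coefficient, rolling resistance coefficient, rotating mass coefficient) are placeholder assumptions and not values from this document. Drivetrain efficiency is not modelled.

```python
import math

def driving_force(v, a, m, grade=0.0, rho=1.2, area=2.5, cd=0.3,
                  f=0.012, delta=1.05, g=9.81):
    """Total longitudinal driving force in N.

    v: speed (m/s), a: acceleration (m/s^2), m: vehicle mass (kg),
    grade: road slope angle (rad). Other defaults are placeholder values.
    """
    f_air = 0.5 * rho * cd * area * v ** 2       # air resistance
    f_grade = m * g * math.sin(grade)            # ramp resistance
    f_roll = m * g * f * math.cos(grade)         # rolling resistance
    f_acc = delta * m * a                        # acceleration resistance
    return f_air + f_grade + f_roll + f_acc

def demand_power(v, a, m, **kwargs):
    """Power at the wheels in W."""
    return driving_force(v, a, m, **kwargs) * v

# Example: a 2000 kg vehicle cruising at 20 m/s on a flat road.
p_req = demand_power(20.0, 0.0, 2000.0)
```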

1.3 Lithium battery model:

The lithium battery model adopts a first-order Rint model, in which the lithium battery is treated as an ideal voltage source in series with an internal resistance; the polarization of the battery is ignored, so the structure is simple. In this model, the output power of the power battery $P_{bat}$ is:

$P_{bat} = U_{oc} I_{bat} - I_{bat}^{2} R_{int}$;

The battery current $I_{bat}$ is:

$I_{bat} = \frac{U_{oc} - \sqrt{U_{oc}^{2} - 4 R_{int} P_{bat}}}{2 R_{int}}$;

The state of charge (SOC) of the battery is updated by the ampere-hour integration method:

$SOC(t) = SOC(t_0) - \frac{1}{Q_{bat}} \int_{t_0}^{t} I_{bat}\, d\tau$;

where $U_{oc}$ is the open-circuit voltage; $I_{bat}$ is the current; $R_{int}$ is the internal resistance of the battery; and $Q_{bat}$ is the battery capacity.
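The internal resistance model and the ampere-hour integration can be sketched as follows. The sketch assumes the usual form of the model, in which the output power equals the open-circuit voltage times the current minus the ohmic loss, and takes the smaller root of the resulting quadratic as the current; the function names, the positive-discharge sign convention and the capacity in ampere-hours are assumptions of the sketch.

```python
import math

def battery_current(p_bat, u_oc, r_int):
    """Current (A, positive when discharging) that delivers power p_bat (W)."""
    disc = u_oc ** 2 - 4.0 * r_int * p_bat
    if disc < 0:
        raise ValueError("requested power exceeds what the battery can deliver")
    return (u_oc - math.sqrt(disc)) / (2.0 * r_int)

def soc_step(soc, current, capacity_ah, dt=1.0):
    """Ampere-hour integration over one time step of dt seconds."""
    return soc - current * dt / (capacity_ah * 3600.0)

# Example: 300 V open-circuit voltage, 0.1 ohm internal resistance, 29 kW demand.
i_bat = battery_current(29000.0, 300.0, 0.1)
soc_next = soc_step(0.6, i_bat, capacity_ah=50.0, dt=1.0)
```

A negative `p_bat` (charging, for example during braking energy recovery) yields a negative current, which increases the SOC in `soc_step`.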

1.4 Fuel cell model:

In constructing the fuel cell hydrogen consumption model, the hydrogen consumption rate of the fuel cell is expressed as:

$\dot{m}_{H_2} = \frac{P_{fc}}{\eta_{fc} \cdot LHV}$;

where $P_{fc}$ is the power of the fuel cell system, $\eta_{fc}$ is the efficiency of the fuel cell system, and LHV is the lower heating value of hydrogen.
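A worked example of the hydrogen consumption calculation is sketched below, assuming that the consumption rate is the system power divided by the product of system efficiency and lower heating value. The function name is hypothetical, and the value of about 120 MJ/kg used for the lower heating value of hydrogen is the commonly quoted approximate figure, not a value taken from this document.

```python
LHV_H2 = 120e6   # J/kg, approximate lower heating value of hydrogen (assumed constant)

def hydrogen_rate(p_fc_w, efficiency, lhv=LHV_H2):
    """Hydrogen mass flow in kg/s for a fuel cell system power in W."""
    if p_fc_w <= 0:
        return 0.0                      # no consumption when the system delivers no power
    return p_fc_w / (efficiency * lhv)

# Example: 60 kW at 50% system efficiency.
rate = hydrogen_rate(60e3, 0.5)         # kg/s
grams_per_100s = rate * 100 * 1000      # hydrogen used over one 100 s segment, in grams
```

In practice the efficiency depends on the operating point, so it would be looked up from an efficiency map as a function of the fuel cell power instead of being held constant.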

1.5 Fuel cell degradation model:

Fuel cell degradation is taken into account by evaluating the degree of degradation of the fuel cell stack under four major degradation conditions: dynamic load cycling, start/stop cycling, idling and high-power load. It is expressed by the formula:

；

Degradation mechanism under start/stop cycle:

；

degradation mechanism under dynamic cycling:

；

Degradation mechanism under idle conditions:

；

Degradation mechanism under high power load:

；

Wherein a, b, c, d are constants, which are empirical parameters obtained by calibration in experiments. The remaining life of the fuel cell after normalization is expressed as:

；

Referring to FIG. 2, as an embodiment of the present invention, a fuel cell vehicle reinforcement learning energy management method considering working condition prediction is provided, comprising the following steps:

In this embodiment, the step of constructing the comprehensive driving condition data capable of reflecting various road condition features by adopting four different types of classical driving conditions includes:

Preprocessing working condition data: the driving cycle of a vehicle is obtained by post-processing actual driving data and reflects how the speed changes over time. First, to reflect road driving conditions comprehensively and improve the accuracy of identifying all road working conditions, a sample driving cycle is constructed: the four typical road working conditions NEDC, WLTC, CLTC-P and FTP75 are combined into comprehensive driving condition data that reflect the characteristics of various road working conditions, and these data are used to train the subsequent working condition identification and prediction algorithms;

In this embodiment, the step of performing the segment division on the comprehensive driving condition according to each segment of 100s, and performing the Principal Component Analysis (PCA) by extracting the condition characteristic parameters of the condition includes:

The comprehensive driving condition data constructed from the 4 typical road working conditions cannot be applied directly to training the working condition identification model and the Markov transition matrices for vehicle speed prediction; the data must first be segmented, for example into segments of 100 s each. The working condition characteristic parameters reflect the characteristic information of each driving cycle segment; each driving cycle segment is described by 14 parameters, namely the average speed, maximum speed, average acceleration, maximum acceleration, average deceleration, maximum deceleration, idle time ratio, acceleration time ratio, deceleration time ratio, constant-speed time ratio, speed standard deviation, acceleration standard deviation, driving distance and running time;

3) Obtaining the covariance matrix $C = \frac{1}{m} X X^{T}$, where X is the zero-centered data matrix and m is the number of columns of X;

6) $Y = PX$ is the data after dimension reduction to 3 dimensions, where P is the matrix formed from the eigenvectors of the 3 largest eigenvalues;

In this embodiment, the step of clustering the driving cycle segments by the K-means clustering method includes:

Assuming a comprehensive driving cycle of 10000 s, it can be divided into 100 driving cycle segments. After PCA dimension reduction, the 100 driving cycle segments are divided into three types of working conditions (low speed, medium speed and high speed) so that three Markov transition matrices can be trained in the vehicle speed prediction module. The driving cycle segments are classified with the K-means clustering algorithm. In this process, the number of classes is set to K = 3, an initial cluster center is assigned to each class, and the Euclidean distance from each sample point $x_i$ to each cluster center $\mu_j$ is calculated:

$d(x_i, \mu_j) = \sqrt{\sum_{k=1}^{3} (x_{ik} - \mu_{jk})^{2}}$;

In this embodiment, after the three different types of driving cycle segment sets are obtained, the step of offline training the Bi-LSTM-based condition recognition module and the Markov-based vehicle speed prediction module includes:

Considering the characteristics of complex road conditions, a condition recognition module is added before the vehicle speed prediction module, which can effectively improve the vehicle speed prediction accuracy. The LSTM network was designed to solve the problem that a traditional RNN (recurrent neural network) is prone to vanishing or exploding gradients when processing long sequences. The road condition recognition model of the invention adopts a bidirectional long short-term memory network (Bi-LSTM), whose architecture is shown in figure 3. The Bi-LSTM consists of a forward LSTM and a backward LSTM: the forward LSTM processes the input sequence in chronological order and captures information from the past to the future, while the backward LSTM processes the input sequence from the latest time point to the earliest and captures information from the future to the past. This means that the model can not only identify the current vehicle speed working condition from the past running state but also refer to the running state information later in the sequence, capture the variation trend of parameters such as vehicle speed and acceleration more comprehensively, adapt to abrupt changes in road conditions, and make more accurate judgments. Each LSTM in the bidirectional structure contains an input gate, a forget gate, an output gate and a cell state. The input of the model is the 14 condition characteristic parameters, and the output is the working condition mode, which is one of three types: low speed, medium speed and high speed. Compared with a traditional LSTM, the recognition is more accurate and adapts better to complex road conditions. The LSTM architecture is:

；

Output gate: determines what value to output based on the updated cell state;

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(C_t)$;
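The gate structure and the bidirectional pass can be sketched in a few lines of NumPy. This is only an illustrative forward pass under assumptions: the standard LSTM gate equations are used, the weights are random, the hidden size of 8 is arbitrary, and the names `lstm_step`, `init_params`, `run_lstm` and `bilstm` are hypothetical. Training and the final three-class output layer are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step: forget gate, input gate, cell state update, output gate."""
    z = np.concatenate([h_prev, x_t])           # [h_{t-1}, x_t]
    f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate
    i = sigmoid(p["W_i"] @ z + p["b_i"])        # input gate
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate cell state
    c = f * c_prev + i * c_tilde                # cell state update
    o = sigmoid(p["W_o"] @ z + p["b_o"])        # output gate
    h = o * np.tanh(c)                          # hidden state / output
    return h, c

def init_params(n_in, n_hidden, rng):
    """Random weights and zero biases for the four gates (illustration only)."""
    p = {}
    for g in "fico":
        p[f"W_{g}"] = rng.normal(scale=0.1, size=(n_hidden, n_hidden + n_in))
        p[f"b_{g}"] = np.zeros(n_hidden)
    return p

def run_lstm(xs, p, n_hidden):
    """Run one LSTM over the sequence xs and return all hidden states."""
    h, c = np.zeros(n_hidden), np.zeros(n_hidden)
    out = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        out.append(h)
    return np.stack(out)

def bilstm(xs, p_fwd, p_bwd, n_hidden):
    """Concatenate forward hidden states with time-aligned backward hidden states."""
    h_fwd = run_lstm(xs, p_fwd, n_hidden)
    h_bwd = run_lstm(xs[::-1], p_bwd, n_hidden)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)
```

For a sequence of 5 time steps with 14 features each, `bilstm` returns a 5 x 16 array: the first 8 columns come from the forward LSTM and the last 8 from the backward LSTM.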

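Referring to fig. 4, in the present embodiment, in step S03, the DDPG algorithm is based on the Actor-Critic framework, so the algorithm contains an Actor network and a Critic network, each with a corresponding target network. The DDPG algorithm therefore includes four networks: the Actor network $\mu(s \mid \theta^{\mu})$, the Critic network $Q(s, a \mid \theta^{Q})$, the Target Actor network $\mu'(s \mid \theta^{\mu'})$ and the Target Critic network $Q'(s, a \mid \theta^{Q'})$. The target networks are introduced to keep the target values stable during training; the update process of the DDPG algorithm and the update mode of the target networks are as follows;

The DDPG training process includes:

Initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor network $\mu(s \mid \theta^{\mu})$ with parameters $\theta^{Q}$ and $\theta^{\mu}$;

Initialize the corresponding target network parameters, $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$;

Take a random batch $(s_i, a_i, r_i, s_{i+1})$, $i = 1, \dots, N$, from D and compute the target value $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$, where $\gamma$ is the discount factor and $N$ is the batch size;

Update the critic parameters, with the loss: $L = \frac{1}{N} \sum_{i} (y_i - Q(s_i, a_i \mid \theta^{Q}))^2$;

Update the actor network: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s=s_i}$;

Soft update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau) \theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau) \theta^{\mu'}$, where $\tau$ is the soft update coefficient.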

In this embodiment, in step S03, a DDPG deep reinforcement learning energy management strategy integrating working condition identification and vehicle speed prediction is used;

Setting the state set:

；

Setting the action set:

$A = \{P_{fc}\}$;

$P_{fc}$ is the fuel cell power. Compared with energy management algorithms based on DQN or Q-learning, which optimize Q-values over a discrete set of actions, the action space of the DDPG-based energy management algorithm is continuous, which better matches the power output requirements of an actual fuel cell;
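The difference between a continuous and a discrete action space can be illustrated with a small sketch. The power bounds of 0 to 60 kW, the four discrete levels, and the function names `scale_action` and `nearest_discrete` are hypothetical values chosen for the example and do not come from the invention.

```python
import math

def scale_action(raw, p_min_kw=0.0, p_max_kw=60.0):
    """Map an unbounded actor output to a fuel cell power within [p_min_kw, p_max_kw]."""
    squashed = math.tanh(raw)  # bounded to [-1, 1]
    return p_min_kw + 0.5 * (squashed + 1.0) * (p_max_kw - p_min_kw)

def nearest_discrete(p_kw, levels):
    """What a discrete-action method would have to choose instead: the closest level."""
    return min(levels, key=lambda level: abs(level - p_kw))
```

With these assumed bounds, a continuous policy can request any power in the range, whereas a discrete policy restricted to levels such as 0, 20, 40 and 60 kW has to round the request to the nearest level.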

Setting a reward function:

；

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the present application is not limited to performing the functions in the order shown or discussed, but may also include performing the functions in a substantially simultaneous manner or in an opposite order depending on the functions involved, e.g., the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Those having ordinary skill in the art may devise many other forms without departing from the spirit of the present application and the scope of the claims, all of which fall within the protection of the present application.

Claims

1. A fuel cell vehicle reinforcement learning energy management method considering operating condition prediction, characterized in that the method comprises the following steps:

S01. Four different types of classic driving conditions are used to construct comprehensive driving condition data that can reflect the characteristics of various road conditions. The comprehensive driving condition data is divided into segments of 100 seconds each. The principal component analysis is performed by extracting the characteristic parameters of the driving conditions. The driving cycle segments are clustered by the K-means clustering method and clustered into three types: low-speed, medium-speed, and high-speed conditions.

S02. After the three different types of driving cycle segment sets are obtained, offline training is performed on the Bi-LSTM-based working condition recognition module and the Markov-based vehicle speed prediction module;

S03. Extract the working condition characteristic parameters of the predicted vehicle speed sequence and convert them into an equivalent factor adjustment coefficient λ; λ is used as a state input to the DDPG energy management algorithm, the fuel cell power $P_{fc}$ is used as the action, and a suitable reward function covering hydrogen consumption, SOC fluctuation and fuel cell degradation is set, so that multiple objectives are optimized together.

2. A fuel cell vehicle reinforcement learning energy management method considering operating condition prediction according to claim 1, characterized in that the step of using four different types of classic driving conditions to construct comprehensive driving condition data that can reflect various road condition characteristics comprises:

The four typical road conditions NEDC, WLTC, CLTC-P and FTP75 are combined into comprehensive driving condition data that reflects the characteristics of various road conditions and is used for subsequent training of the condition recognition and prediction algorithms;

The NEDC (New European Driving Cycle) is the European standard test cycle, which includes 4 urban cycles and 1 suburban cycle. The urban part lasts 780 seconds in total with a maximum speed of 50 km/h; the suburban cycle lasts 400 seconds with a maximum speed of 120 km/h.

The WLTC standard working condition is closer to the actual road driving conditions. Its complete test cycle consists of four stages: low speed, medium speed, high speed and ultra-high speed, which lasts a total of 1800s, including 235s of idling time, 23266m of travel, an average speed of 46.5km/h and a maximum speed of 131.3km/h.

CLTC-P passenger car test conditions include three speed ranges: low speed, medium speed and high speed, with a total duration of 1800s, a total mileage of 14480m, a maximum speed of 114km/h and an average speed of 28.96km/h;

FTP75 is a test procedure issued by the U.S. Environmental Protection Agency (EPA) for evaluating the emissions and fuel economy of light cars and light trucks under urban conditions. The complete FTP75 operating cycle takes 1874 seconds, with a theoretical driving distance of 17.77 km, an average speed of 34.12 km/h and a maximum speed of 91.25 km/h. It includes three parts: a cold start transient stage, a steady state stage and a hot start transient stage.
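The cycle figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the durations, distances and average speeds stated in the text; the variable and function names are hypothetical, and the NEDC duration is taken as the sum of the 780 s urban part and the 400 s suburban part.

```python
def average_speed_kmh(distance_m, duration_s):
    """Average speed in km/h from distance in metres and duration in seconds."""
    return distance_m / duration_s * 3.6

# Figures as stated in the text.
cycles = {
    "WLTC": {"duration_s": 1800, "distance_m": 23266, "stated_avg_kmh": 46.5},
    "CLTC-P": {"duration_s": 1800, "distance_m": 14480, "stated_avg_kmh": 28.96},
    "FTP75": {"duration_s": 1874, "distance_m": 17770, "stated_avg_kmh": 34.12},
}

computed = {
    name: average_speed_kmh(c["distance_m"], c["duration_s"])
    for name, c in cycles.items()
}

nedc_duration_s = 780 + 400
total_duration_s = nedc_duration_s + sum(c["duration_s"] for c in cycles.values())
n_segments = total_duration_s // 100   # number of complete 100 s segments
```

The computed averages reproduce the stated values to within about 0.1 km/h, and the combined data spans 6654 s, that is, 66 complete 100 s segments if the four cycles are simply concatenated once.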

3. A fuel cell vehicle reinforcement learning energy management method considering operating condition prediction according to claim 1, characterized in that the step of dividing the comprehensive driving condition data into segments of 100 seconds each and performing principal component analysis by extracting the operating condition characteristic parameters of the operating condition comprises:

The comprehensive driving condition data is set to 100s per segment; the condition characteristic parameters reflect the characteristic information of each driving cycle segment; the characteristics of each driving cycle segment are described by 14 parameters including average speed, maximum speed, average acceleration, maximum acceleration, average deceleration, maximum deceleration, idle time ratio, acceleration time ratio, deceleration time ratio, uniform speed time ratio, speed standard deviation, acceleration standard deviation, mileage, and running time;

For the 100 driving cycle segments, each described by 14 condition characteristic parameters, the dimension is reduced from 14 to 3. The steps of principal component analysis for dimension reduction include:

1) Organize the original data into a matrix X with 14 rows and 100 columns, one row per characteristic parameter and one column per driving cycle segment;

2) Zero-mean each row of X, that is, subtract the mean of that row;

3) Find the covariance matrix $C = \frac{1}{m} X X^{T}$, where $m = 100$ is the number of columns of $X$;

4) Find the eigenvalues and corresponding eigenvectors of the covariance matrix;

5) Arrange the eigenvectors as the rows of a matrix, from top to bottom in order of decreasing eigenvalue, and take the first 3 rows to form the matrix P;

6) $Y = PX$ is the data after dimension reduction to 3 dimensions;

In this process, in order to ensure the integrity of the information after dimensionality reduction, the selected principal components must satisfy the cumulative contribution rate of more than 80%. Finally, the top three principal components with the largest contribution rates are selected, and the cumulative contribution rate reaches more than 80%, which is sufficient to cover most of the operating condition characteristic information.
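The PCA steps and the contribution-rate check can be sketched with NumPy as follows. This is an illustrative sketch, assuming the data matrix is arranged with one row per characteristic parameter and one column per segment so that zero-meaning each row centers each parameter; the function name `pca_reduce` is hypothetical, and `np.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

def pca_reduce(X, n_components=3):
    """PCA on a (features x samples) matrix.

    Returns the reduced data Y (n_components x samples), the projection
    matrix P (n_components x features) and the contribution rate of every
    principal component, sorted from largest to smallest.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)   # step 2: zero-mean each row
    C = Xc @ Xc.T / Xc.shape[1]              # step 3: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # step 4: eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]        # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    P = eigvecs[:, :n_components].T          # step 5: top eigenvectors as rows
    Y = P @ Xc                               # step 6: reduced data
    contribution = eigvals / eigvals.sum()
    return Y, P, contribution
```

The cumulative contribution rate of the first three components is `contribution[:3].sum()`; the criterion in the text corresponds to requiring this value to exceed 0.8.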

4. The fuel cell vehicle reinforcement learning energy management method considering operating condition prediction according to claim 1, characterized in that the step of clustering driving cycle segments by using the K-means clustering method comprises:

Use the K-means clustering algorithm to classify the driving cycle segments: set the number of categories K=3, assign an initial cluster center to each category, and calculate the Euclidean distance from each sample point $x_i$ to each cluster center $c_j$:

$d(x_i, c_j) = \sqrt{\sum_{k=1}^{3} (x_{ik} - c_{jk})^2}$;

Each point is assigned to a category according to the minimum distance principle. After the first clustering is completed, the center of each cluster is recalculated as the mean of all data points in the cluster:

$c_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i$;

where $S_j$ is the set of data points assigned to cluster $j$ and $|S_j|$ is the number of data points in the set. The assignment and the center update are repeated until the cluster centers become stable.
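The clustering loop described above can be sketched as follows. It is a minimal illustration rather than the implementation of the invention: the function name `kmeans`, the stopping tolerance `tol`, the iteration limit and the random choice of initial centers when none are supplied are assumptions.

```python
import numpy as np

def kmeans(points, k=3, init_centers=None, max_iter=100, tol=1e-6, seed=0):
    """Plain K-means: assign by minimum Euclidean distance, then update centers as means."""
    points = np.asarray(points, dtype=float)
    if init_centers is None:
        rng = np.random.default_rng(seed)
        centers = points[rng.choice(len(points), size=k, replace=False)]
    else:
        centers = np.asarray(init_centers, dtype=float)
    for _ in range(max_iter):
        # Distance from every point to every center, shape (n_points, k).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # New center is the mean of its members; an empty cluster keeps its old center.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:   # centers have stabilized
            break
    return labels, centers
```

Because K-means can settle in a local optimum, the result depends on the initial centers; passing `init_centers` explicitly (for example one slow, one medium and one fast segment) makes the outcome reproducible.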

5. The fuel cell vehicle reinforcement learning energy management method considering operating condition prediction according to claim 4, characterized in that, after the three different types of driving cycle segment sets are obtained, the step of offline training the Bi-LSTM-based operating condition recognition module and the Markov-based vehicle speed prediction module comprises:

Bi-LSTM consists of forward LSTM and backward LSTM. Forward LSTM processes the input sequence in chronological order and captures information from the past to the future, while backward LSTM processes the input sequence from the latest time point to the earliest time point and captures information from the future to the past. Each LSTM in this bidirectional structure contains an input gate, a forget gate, an output gate, and a unit state. The input of the model is 14 operating condition feature parameters, and the output is the operating mode, which is divided into three types: low speed, medium speed, and high speed. The architecture of LSTM is:

Forget gate: The forget gate determines which information is forgotten. Its expression is:

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$;

where $\sigma$ is the sigmoid function, $W_f$ is the weight coefficient of the forget gate, $[h_{t-1}, x_t]$ is the concatenation of the hidden state at the previous moment and the current input, where $x_t$ contains the 14 working condition characteristic parameters, and $b_f$ is the bias term;

Input gate: The input gate determines what new information needs to be added to the cell state;

$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$, $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$;

where $W_i$ and $W_C$ are the weight matrices of the input gate, whose dimensions are set according to the number of input parameters, $b_i$ and $b_C$ are the bias terms, and $\tilde{C}_t$ is the candidate cell state;

Cell state update: The cell state is updated by combining the results of the forget gate and the input gate;

$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$;

Output gate: determines what value to output based on the updated cell state;

$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$, $h_t = o_t \odot \tanh(C_t)$;

where $W_o$ is the weight parameter of the output gate, $b_o$ is the bias term, and $h_t$ is the output state at the current moment;

Bi-LSTM calculates bidirectional hidden states by running the LSTM layer forward and backward along the time axis. The forward LSTM calculates sequentially from the first element to the last element of the sequence, while the backward LSTM does the opposite. The two hidden states are concatenated together to form the final bidirectional hidden state. The bidirectional LSTM captures the contextual information before and after each time step in the sequence, thereby providing a more comprehensive feature representation.

In the vehicle speed prediction module, the Markov vehicle speed prediction method is applied to the three identified working conditions of low speed, medium speed and high speed, each of which corresponds to its own Markov transfer matrix. The three sets of driving cycle segments clustered in step S01 are used to train the three Markov matrices offline, one per working condition. The prediction step length is set to n, and the Markov model outputs the predicted vehicle speed information for n steps.

After completing the vehicle speed prediction considering the working condition identification, a p-step predicted vehicle speed sequence will be output.
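A Markov transfer matrix and a multi-step speed prediction can be sketched as below. The sketch rests on assumptions not stated in the text: the chain states are equal-width speed bins, the prediction is the expected bin-center speed after propagating the state distribution, and the names `to_state`, `fit_transition_matrix` and `predict_speed` are hypothetical.

```python
import numpy as np

def to_state(v_kmh, n_states, v_max_kmh):
    """Index of the equal-width speed bin that contains v_kmh."""
    width = v_max_kmh / n_states
    return int(np.clip(np.floor(v_kmh / width), 0, n_states - 1))

def fit_transition_matrix(speed_kmh, n_states, v_max_kmh):
    """Estimate transition probabilities by counting bin-to-bin transitions."""
    states = [to_state(v, n_states, v_max_kmh) for v in speed_kmh]
    counts = np.zeros((n_states, n_states))
    for s, s_next in zip(states[:-1], states[1:]):
        counts[s, s_next] += 1
    row_sums = counts.sum(axis=1, keepdims=True)
    # Unvisited states fall back to a uniform row so every row sums to 1.
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / n_states)

def predict_speed(T, v0_kmh, n_steps, v_max_kmh):
    """Propagate the state distribution n_steps ahead; return expected speeds."""
    n_states = T.shape[0]
    centers = (np.arange(n_states) + 0.5) * v_max_kmh / n_states
    p = np.zeros(n_states)
    p[to_state(v0_kmh, n_states, v_max_kmh)] = 1.0
    predicted = []
    for _ in range(n_steps):
        p = p @ T
        predicted.append(float(p @ centers))
    return predicted
```

One matrix would be fitted per working condition, for example `matrices = {"low": ..., "medium": ..., "high": ...}`, and the matrix selected by the condition recognition module would then be passed to `predict_speed`.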

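6. A fuel cell vehicle reinforcement learning energy management method considering operating condition prediction according to claim 1, characterized in that, in step S03, the DDPG algorithm is based on the Actor-Critic framework and contains an Actor network and a Critic network, each with a corresponding target network. The DDPG algorithm thus includes four networks: the Actor network $\mu(s \mid \theta^{\mu})$, the Critic network $Q(s, a \mid \theta^{Q})$, the Target Actor network $\mu'(s \mid \theta^{\mu'})$ and the Target Critic network $Q'(s, a \mid \theta^{Q'})$. The target networks are introduced to keep the target values stable during training; the update process of the DDPG algorithm and the update method of the target networks are as follows;

The DDPG training process includes:

Initialize the critic network $Q(s, a \mid \theta^{Q})$ and the actor network $\mu(s \mid \theta^{\mu})$ with parameters $\theta^{Q}$ and $\theta^{\mu}$;

Initialize the corresponding target network parameters, $\theta^{Q'} \leftarrow \theta^{Q}$, $\theta^{\mu'} \leftarrow \theta^{\mu}$;

Initialize the experience replay D, set its capacity to K, and set the number of training episodes to M;

The current state $s_t$ is input to the current network, and the action is selected according to the current network policy and the exploration noise $\mathcal{N}_t$: $a_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t$. After the action $a_t$ is executed, the reward $r_t$ is obtained and the next state $s_{t+1}$ is entered, and the transition $(s_t, a_t, r_t, s_{t+1})$ is stored in the experience replay D;

Take a random batch $(s_i, a_i, r_i, s_{i+1})$, $i = 1, \dots, N$, from D and compute the target value $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$;

where $Q'$ is the value function at the next moment, $\mu'$ is the policy function at the next moment, $\theta^{Q'}$ and $\theta^{\mu'}$ are the corresponding target network weight parameters, $\gamma$ is the discount factor and $N$ is the batch size;

Update the critic parameters, with the loss: $L = \frac{1}{N} \sum_{i} (y_i - Q(s_i, a_i \mid \theta^{Q}))^2$;

Update the actor network: $\nabla_{\theta^{\mu}} J \approx \frac{1}{N} \sum_{i} \nabla_{a} Q(s, a \mid \theta^{Q}) \big|_{s=s_i, a=\mu(s_i)} \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big|_{s=s_i}$;

Soft update the target networks: $\theta^{Q'} \leftarrow \tau \theta^{Q} + (1-\tau) \theta^{Q'}$, $\theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1-\tau) \theta^{\mu'}$, where $\tau$ is the soft update coefficient.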
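Three of the update steps (target value, critic loss and soft update) can be sketched independently of any neural network library. In the sketch below the networks are passed in as plain Python callables and the parameters as dictionaries of NumPy arrays; the function names and the default values of `gamma` and `tau` are assumptions, and the actor gradient step is omitted because it needs automatic differentiation.

```python
import numpy as np

def td_targets(rewards, next_states, target_actor, target_critic, gamma=0.99):
    """Target value: reward plus discounted target-critic value of the target-actor action."""
    next_actions = target_actor(next_states)
    return rewards + gamma * target_critic(next_states, next_actions)

def critic_loss(targets, states, actions, critic):
    """Mean squared error between the targets and the critic's current estimates."""
    return float(np.mean((targets - critic(states, actions)) ** 2))

def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward the online parameter."""
    return {
        name: tau * online_params[name] + (1.0 - tau) * target_params[name]
        for name in target_params
    }
```

With a small `tau`, the target parameters change slowly, so the target values used in the critic loss drift only gradually between updates.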

7. A fuel cell vehicle reinforcement learning energy management method considering operating condition prediction according to claim 1, characterized in that, in step S03, a DDPG deep reinforcement learning energy management strategy integrating operating condition identification and vehicle speed operating condition prediction is used;

Set the state set:

;

where the state variables include the power required by the vehicle, the state of the power battery and the fuel cell life FOH;

Set the action set:

$A = \{P_{fc}\}$;

where $P_{fc}$ is the fuel cell power;

Set the reward function:

;

where R is the reward and the remaining quantities are: the fuel cell hydrogen consumption rate; the equivalent factor adjustment coefficient λ, calculated from the p-step predicted speed sequence output by the speed prediction part using a coefficient k, the standard deviation of the predicted speed sequence and the average speed of the predicted speed sequence; the output power of the power battery; the lower heating value of hydrogen LHV; the fuel cell life degradation value; and the weight factor applied to the fuel cell life degradation. During training, the optimal energy management strategy is obtained by selecting an appropriate value of k; after training, during actual operation, λ is adjusted in real time according to the predicted vehicle speed sequence received;

In implementing DDPG-based energy management, the environment first inputs the state S into the agent's Actor. The agent selects the corresponding action, obtains the immediate reward, and transitions to the next state S'. The experience samples are stored in the sample pool; once the number of samples in the pool reaches the preset capacity, the earliest experience is removed. The DDPG algorithm repeatedly lets the agent act, collects data from its interaction with the environment, including states and rewards, and updates the network parameters, so that the agent gradually learns the optimal action strategy and achieves optimal energy efficiency.
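The sample pool behaviour described above, in which the earliest experience is dropped once the preset capacity is reached, can be sketched with a bounded double-ended queue. The class name `ReplayBuffer` and its method names are hypothetical.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity experience pool that discards the oldest sample when full."""

    def __init__(self, capacity):
        # A deque with maxlen drops items from the opposite end automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        """Store one transition (S, action, reward, S')."""
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random batch of distinct stored transitions."""
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Training would typically start only once `len(buffer)` is at least the batch size, since `sample` cannot draw more transitions than are stored.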