Disclosure of Invention
In view of the above, the invention provides a financial risk prediction method based on multi-modal data fusion, which integrates multiple data sources, comprehensively captures the operational dynamics and market information of a target company, predicts multiple risk indexes simultaneously, and achieves good prediction comprehensiveness and accuracy.
The technical scheme of the invention is realized in such a way that the invention provides a financial risk prediction method based on multi-mode data fusion, which comprises the following steps:
S1, extracting an audio feature vector from the financial report teleconference of the target company;
S2, extracting a text summary feature vector from the financial report teleconference of the target company;
S3, summarizing the text summary of the financial report teleconference of the target company by using a large language model to obtain an embedded vector corresponding to the summary of the text summary;
S4, extracting news text features of the target company to obtain a news text feature vector;
S5, extracting features from the time sequence transaction data of the target company over a period before the target date to obtain a feature vector of the time sequence data;
S6, fusing the obtained audio feature vector, text summary feature vector, embedded vector corresponding to the summary of the text summary, news text feature vector and feature vector of the time sequence data to obtain a joint representation vector; and
S7, inputting the joint representation vector into a multi-task learning framework to predict the risk indexes.
Based on the above technical solution, preferably, the specific content of step S1 is:
S11, extracting an embedded vector of the audio of the financial report teleconference by using the WENETSPEECH pre-trained model: audio preprocessing is performed on a section of audio A_c = {a_1, a_2, ..., a_n}, wherein a_i represents the i-th frame of the audio and n represents the number of frames in the audio; each frame of audio is converted into a vector representation e_i, thereby obtaining the audio embedded vector E_ac = {e_1, e_2, ..., e_n} of the whole audio A_c;
S12, sending the audio embedded vector E_ac into a multi-head self-attention module MHSA to further extract the feature vector of the audio, T_ac = MHSA(E_ac);
S13, sending the feature vector T_ac to an average pooling layer (Average Pooling Layer) to obtain the compressed audio feature vector T_a, T_a = AveragePooling(T_ac).
Preferably, the vector representation e_i of each frame of audio is 512-dimensional, the dimension of the compressed audio feature vector T_a is 512, the dimension of the text summary feature vector is 768, the dimension of the embedded vector corresponding to the summary of the text summary is 768, and the dimension of the joint representation vector is 512.
Preferably, the multi-head self-attention module MHSA processes the input audio embedded vector in parallel using a plurality of attention heads, and calculates the attention weight of the audio embedded vector using the following formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, wherein the query vector Q = E_ac W_Q, the key vector K = E_ac W_K and the value vector V = E_ac W_V; W_Q, W_K and W_V are linear projection matrices, d_k is the dimension of the attention heads, and the softmax function is used for normalization; the outputs of the attention heads are spliced and passed through a linear transformation to obtain the audio feature vector T_ac = Concat(head_1, head_2, ..., head_h) W_O, wherein the subscripts 1, 2, ..., h distinguish the different attention heads, W_O is the output linear transformation matrix, and Concat is the concatenation function.
Further preferably, the specific content of step S2 is:
S21, preprocessing the text summary of the financial report teleconference of the target company to obtain a sentence collection T_c = {t_c^1, t_c^2, ..., t_c^L}, wherein t_c^l represents the l-th sentence in the text summary, l = 1, 2, ..., L, and L represents the number of sentences in the text summary; using the Sentence-BERT pre-trained language model, each sentence t_c^l is mapped to an embedded vector e_t^l of the financial report teleconference text summary, thereby obtaining the vector representation E_tc = {e_t^1, e_t^2, ..., e_t^L} of the text summary of the whole financial report teleconference;
S22, sending the vector representation E_tc of the text summary of the whole financial report teleconference to a multi-head self-attention module MHSA to further extract the feature vector of the text summary, T_tc = MHSA(E_tc);
S23, sending the text summary feature vector T_tc into an average pooling layer (Average Pooling Layer) to obtain the compressed text summary feature vector T_t = AveragePooling(T_tc).
Still more preferably, the specific content of step S3 is:
S31, paragraph segmentation: segmenting the text summary of the financial report teleconference of the target company according to logical paragraphs to obtain paragraphs P = {p_1, p_2, ..., p_M}, m = 1, 2, ..., M, wherein M represents the number of paragraphs; each paragraph p_m is input into the large language model LLM to extract a paragraph-level abstract s_m = LLM(p_m);
S32, merging all paragraph summaries {s_1, s_2, ..., s_M} into a whole text and inputting it into the large language model to generate a comprehensive summary S = LLM({s_1, s_2, ..., s_M});
S33, vectorizing the comprehensive summary S by using the Sentence-BERT pre-trained language model to generate the embedded vector corresponding to the summary of the text summary, T_l = SBERT(S).
Still more preferably, the specific content of step S4 is:
S41, collecting news text data of the target company from several days before the target transaction day, represented as N = {n_1, n_2, ..., n_K}, wherein n_k represents the k-th news text and K is the total number of news items;
S42, analyzing each news item n_k by using the large language model LLM to extract its metadata m_k = LLM(n_k), and integrating the metadata of all news items into a whole metadata set M_N = {m_1, m_2, ..., m_K};
S43, finding, in the historical news dataset, the k historical news groups N_1, N_2, ..., N_k whose metadata are most similar to that of the news text data N from several days before the target transaction day; calculating the semantic relevance between N and each historical news group from their embedded vectors, wherein f is the function converting news texts into embedded vectors; and selecting the group of historical news H with the highest semantic similarity to the news text data N from several days before the target transaction day;
S44, splicing N, H and the market-trend-related texts observed after H occurred, and then converting the spliced text into an embedded vector T_n using the SBERT model, which serves as the news text feature vector of the target company for the days before the target transaction day.
Still more preferably, the specific content of step S5 is:
S51, collecting the time sequence transaction data D of the target company for the 30 days before the target date, including the daily closing price and daily deal volume, expressed as D = {(p_1, v_1), (p_2, v_2), ..., (p_d, v_d)}, wherein p_f represents the closing price on the f-th day and v_f represents the deal volume on the f-th day, f = 1, 2, ..., d;
S52, inputting the time sequence transaction data D into a bi-directional long short-term memory network Bi-LSTM, which captures the temporal dynamic characteristics of the transaction data and outputs the feature vector of the time sequence data, T_v = BiLSTM(D);
S53, capturing dynamic relations among different time sequence features through a vector autoregressive VAR model:
log(σ_{3,t}) = α_3 + β_{1,1} log(σ_{-3,t}) + β_{1,2} log(σ_{-7,t}) + β_{1,3} log(σ_{-15,t}) + β_{1,4} log(σ_{-30,t}) + u_{3,t};
log(σ_{7,t}) = α_7 + β_{2,1} log(σ_{-3,t}) + β_{2,2} log(σ_{-7,t}) + β_{2,3} log(σ_{-15,t}) + β_{2,4} log(σ_{-30,t}) + u_{7,t};
log(σ_{15,t}) = α_{15} + β_{3,1} log(σ_{-3,t}) + β_{3,2} log(σ_{-7,t}) + β_{3,3} log(σ_{-15,t}) + β_{3,4} log(σ_{-30,t}) + u_{15,t};
log(σ_{30,t}) = α_{30} + β_{4,1} log(σ_{-3,t}) + β_{4,2} log(σ_{-7,t}) + β_{4,3} log(σ_{-15,t}) + β_{4,4} log(σ_{-30,t}) + u_{30,t};
wherein σ_{z,t} represents the volatility of the stock price of the target company over z days, z ∈ {3, 7, 15, 30}, and σ_{-z,t} is the corresponding volatility over the preceding z days; u_{z,t} is a white noise term, β_{a,b} (a, b = 1, 2, 3, 4) is the coefficient matrix of the dynamic relationship, and α_z is an intercept term; the volatility of the stock price is defined as the standard deviation of the daily return rate of the target company over the z days: σ_{z,t} = sqrt((1/(z−1)) Σ_{i=1}^{z} (r_{t−i} − r̄)^2), wherein r_{t−i} is the daily return rate and r̄ is its mean over the z days.
Still more preferably, the joint representation vector E of step S6 is fused by the following formula: E = w_0 + w_1 T_a + w_2 T_t + w_3 T_l + w_4 T_n + w_5 T_v + ε, where w_0 is the bias term, w_1, w_2, w_3, w_4, w_5 are the fusion weights, and ε is the error term representing random noise.
Still further preferably, the specific content of step S7 is: using the joint representation vector E generated in step S6 as the input of a multi-task learning framework, which predicts risk indexes including the volatility σ_{3,t}, σ_{7,t}, σ_{15,t} and σ_{30,t} of the stock price of the target company over 3 days, 7 days, 15 days and 30 days, and the single-day risk value VAR;
constructing an independent first prediction sub-network MLP for the stock price volatility of each time span, the prediction result of the stock price volatility being σ̂_{z,t} = F_{MLP,z}(E), wherein F_{MLP,z}(·) is the first prediction sub-network MLP for the time span z;
for the single-day risk value VAR, constructing an independent second prediction sub-network MLP whose input is the joint representation vector E and whose output is the predicted value of VAR, V̂ = F_{MLP,VAR}(E), wherein F_{MLP,VAR}(·) is the second prediction sub-network MLP;
the joint loss function simultaneously optimizes the stock price volatility and VAR prediction tasks, and its formula is: L = L_vol + μ·L_VAR, wherein μ is a weight hyperparameter for balancing the stock price volatility prediction error and the VAR prediction error; L_vol = (1/J) Σ_{j=1}^{J} (y_j − ŷ_j)^2 represents the mean square error of the stock price volatility prediction, with y_j and ŷ_j representing the true value and the predicted value of the stock price volatility index; L_VAR = max(q(V − V̂), (q − 1)(V − V̂)) represents the quantile regression loss function of the single-day risk value prediction task, wherein q represents the quantile threshold of the single-day risk value, and V and V̂ respectively represent the real single-day risk value and the predicted single-day risk value.
Compared with the prior art, the financial risk prediction method based on multi-mode data fusion has the following beneficial effects:
(1) The invention constructs a brand-new financial risk prediction framework by fusing multi-modal data, including financial report teleconference audio, text, news text and time sequence data. This multi-modal feature fusion method markedly improves the perception of financial market risk, can capture multi-level information about company operations and the market environment, and solves the problem of insufficient prediction accuracy caused by the single data sources of the prior art;
(2) The prior art often ignores the non-explicit features contained in the audio of the financial report teleconference, such as the speaker's intonation, speech speed and emotion. By introducing the multi-head self-attention mechanism MHSA and the large language model LLM to jointly extract and deeply analyze the features of the audio and text data, the invention can reveal latent information that is difficult to capture with traditional text analysis methods, thereby improving the accuracy and insight of the prediction;
(3) The invention adopts a multi-task learning framework, can simultaneously predict various risk indexes, including stock price volatility and daily risk value VAR of different time spans, and the technical scheme not only improves the prediction efficiency of the model, but also effectively relieves the dependence of the traditional method on a single task and enhances the adaptability of the model to complex task scenes. In addition, by introducing the joint loss function, the multi-task learning process is optimized, and the prediction performance is further improved.
Detailed Description
The following description of the embodiments of the present invention will clearly and fully describe the technical aspects of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, are intended to fall within the scope of the present invention.
Most of the existing financial risk predictions adopt a single data source, cannot capture complex association between enterprise internal operation dynamics and external market environments, are not comprehensive and accurate, and cannot predict multiple risk indexes simultaneously. In view of this, as shown in fig. 1, the present invention provides a financial risk prediction method based on multi-modal data fusion, which includes the following steps:
S1, extracting an audio feature vector from the financial report teleconference of the target company.
The specific content of the step S1 is as follows:
S10, audio preprocessing: segmenting the audio of the financial report teleconference frame by frame, with a fixed length for each frame, for example 25 milliseconds, and a 10-millisecond overlap between adjacent frames to ensure that the temporal characteristics of continuous speech are captured; normalizing the audio of the financial report teleconference by normalizing the waveform amplitude to eliminate the influence of recording equipment or volume differences; and converting the audio format of the financial report teleconference, for example to WAV or FLAC, for subsequent feature extraction.
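As a concrete illustration, the framing and normalization of step S10 can be sketched as follows; the 16 kHz sampling rate and the sine-wave input are assumed example values, not part of the scheme:

```python
import numpy as np

def preprocess_audio(waveform, sr=16000, frame_ms=25.0, overlap_ms=10.0):
    """Split a mono waveform into fixed-length overlapping frames.

    Frame length (25 ms) and overlap (10 ms) follow step S10; the
    sampling rate is an assumed example value.
    """
    peak = np.max(np.abs(waveform))
    if peak > 0:                       # amplitude normalization removes
        waveform = waveform / peak     # recording-device / volume differences
    frame_len = int(sr * frame_ms / 1000)            # samples per frame
    hop = frame_len - int(sr * overlap_ms / 1000)    # stride between frame starts
    n = 1 + max(0, (len(waveform) - frame_len) // hop)
    return np.stack([waveform[i * hop:i * hop + frame_len] for i in range(n)])

t = np.arange(16000) / 16000.0                        # 1 s of audio at 16 kHz
frames = preprocess_audio(0.5 * np.sin(2 * np.pi * 440 * t))
```

Each row of `frames` is one 25 ms frame, ready to be embedded frame by frame in step S11.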
S11, extracting an embedded vector of the audio of the financial report teleconference by using the WENETSPEECH pre-trained model: audio preprocessing is performed on a section of audio A_c = {a_1, a_2, ..., a_n}, wherein a_i represents the i-th frame of the audio and n represents the number of frames in the audio; each frame of audio is converted into a vector representation e_i, thereby obtaining the audio embedded vector E_ac = {e_1, e_2, ..., e_n} of the whole audio A_c. The pre-trained model WENETSPEECH adopts a Transformer architecture and is trained on a large-scale voice dataset, so that it can effectively extract hidden features of speech and is well suited to this natural language processing task.
S12, the audio embedded vector E_ac is sent to a multi-head self-attention module MHSA to further extract the feature vector of the audio, T_ac = MHSA(E_ac). The multi-head self-attention module (Multi-Head Self-Attention) is a conventional technical means in the art; it usually adopts 8 or 16 attention heads to process the input audio embedded vector in parallel, with each attention head attending in a different way to the global dependency relationships among audio frames.
The multi-head self-attention module MHSA processes the input audio embedded vector in parallel using a plurality of attention heads, and calculates the attention weight of the audio embedded vector using the following formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, wherein the query vector Q = E_ac W_Q, the key vector K = E_ac W_K and the value vector V = E_ac W_V; W_Q, W_K and W_V are linear projection matrices, d_k is the dimension of the attention heads, and the softmax function is used for normalization; the outputs of the attention heads are spliced and passed through a linear transformation to obtain the audio feature vector T_ac = Concat(head_1, head_2, ..., head_h) W_O, wherein the subscripts 1, 2, ..., h distinguish the different attention heads, W_O is the output linear transformation matrix, and Concat is the concatenation function.
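The attention computation above can be sketched in a minimal form; the dimensions, random weights and head count of 8 are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mhsa(E_ac, W_Q, W_K, W_V, W_O, h=8):
    """Multi-head self-attention over the audio embedded vectors E_ac.

    E_ac: (n_frames, d_model). Each head i uses a d_k = d_model // h slice
    of the projected Q, K, V and computes softmax(Q K^T / sqrt(d_k)) V;
    the head outputs are concatenated and multiplied by W_O.
    """
    n, d_model = E_ac.shape
    d_k = d_model // h
    Q, K, V = E_ac @ W_Q, E_ac @ W_K, E_ac @ W_V
    heads = []
    for i in range(h):
        s = slice(i * d_k, (i + 1) * d_k)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))   # attention weights
        heads.append(A @ V[:, s])
    return np.concatenate(heads, axis=1) @ W_O            # T_ac

rng = np.random.default_rng(0)
d = 64
W = [rng.normal(0, 0.1, (d, d)) for _ in range(4)]        # W_Q, W_K, W_V, W_O
E_ac = rng.normal(size=(10, d))                           # 10 frames, 64-dim
T_ac = mhsa(E_ac, *W, h=8)
```

The output T_ac keeps one row per frame, so the subsequent average pooling layer can compress it to a single vector.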
S13, sending the feature vector T_ac to an average pooling layer (Average Pooling Layer) to obtain the compressed audio feature vector T_a, T_a = AveragePooling(T_ac).
In order to preserve a sufficient amount of feature information, as a preferred embodiment, the vector representation e_i of each frame of audio is 512-dimensional and the dimension of the compressed audio feature vector T_a is 512: the average pooling layer takes the mean over each dimension of the audio feature vector T_ac, compressing it into a single 512-dimensional vector. This pooling operation effectively reduces the feature dimension while retaining global voice information, and the resulting compressed audio feature vector T_a can represent implicit information contained in the audio data, such as intonation, emotion and speech speed, providing high-quality input for the subsequent multi-modal feature fusion.
In addition, the dimension of the text summary feature vector is 768, the dimension of the embedded vector corresponding to the summary of the text summary is 768, and the dimension of the joint representation vector is 512.
The invention fully utilizes the unstructured data characteristics of the financial teleconference audio to extract the audio characteristics with high dimension and information density, thereby providing important support for the financial risk prediction model.
S2, extracting a text summary feature vector from the financial report teleconference of the target company.
The specific content of the step S2 is as follows:
S21, preprocessing the text summary of the financial report teleconference of the target company to obtain a sentence collection T_c = {t_c^1, t_c^2, ..., t_c^L}, wherein t_c^l represents the l-th sentence in the text summary, l = 1, 2, ..., L, and L represents the number of sentences in the text summary; using the Sentence-BERT pre-trained language model, abbreviated SBERT, each sentence t_c^l is mapped to a 768-dimensional embedded vector e_t^l of the financial report teleconference text summary, thereby obtaining the vector representation E_tc = {e_t^1, e_t^2, ..., e_t^L} of the text summary of the whole financial report teleconference. SBERT is a sentence-level semantic representation model based on the BERT architecture; sentence pairs are contrastively learned through a Siamese network, optimizing its ability to capture semantic similarity between sentences.
The text preprocessing standardizes the text summary of the financial report teleconference of the target company: redundant characters are cleared and formats are unified; meaningless stop words, such as modal particles and filler sounds ("uh", "um"), are removed; and the text summary is split into sentences according to punctuation marks such as periods and semicolons, yielding the sentence collection.
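A minimal sketch of this text preprocessing, assuming English text and an illustrative filler-word list (the scheme itself targets whatever language the summary is in):

```python
import re

FILLERS = {"uh", "um", "er"}   # illustrative modal/filler stop words

def preprocess_summary(text):
    """Normalize a teleconference text summary and split it into sentences.

    Sentence boundaries follow the punctuation named in the text (periods,
    semicolons); the filler-word list is a stand-in example.
    """
    text = re.sub(r"\s+", " ", text).strip()              # unify format
    sentences = [s.strip() for s in re.split(r"[.;]", text) if s.strip()]
    cleaned = []
    for s in sentences:
        words = [w for w in s.split() if w.strip(",").lower() not in FILLERS]
        if words:
            cleaned.append(" ".join(words))
    return cleaned

sents = preprocess_summary("Uh, revenue grew 12%; margins, um, held steady.  Guidance unchanged.")
```

Each cleaned sentence would then be passed to SBERT to obtain its embedded vector e_t^l.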
S22, the multi-head self-attention module MHSA here is similar to that in step S12, but its function is to capture the global dependency relationships among sentences: a plurality of attention heads are used, each focusing on different semantic relationships, and after the outputs of the attention heads are spliced, a linear transformation generates the context-enhanced text summary feature vector T_tc.
S23, sending the text summary feature vector T_tc into an average pooling layer (Average Pooling Layer) to obtain the compressed text summary feature vector T_t = AveragePooling(T_tc).
By taking the average over each dimension of the text summary feature vector T_tc, the method compresses it into a single 768-dimensional text summary feature vector that represents the comprehensive semantic information of the text summary of the whole financial report teleconference. The semantic and contextual relationships in the text are fully preserved, including the key financial information, the semantic sentiment and the logical associations between sentences, providing high-quality text input for the subsequent multi-modal feature fusion. By combining contextual features, the method has a notable advantage in capturing the implicit information of the financial report teleconference text summary, providing an important data basis for financial risk prediction.
S3, summarizing the text summary of the financial report teleconference of the target company by using a large language model to obtain an embedded vector corresponding to the summary of the text summary: the core content is extracted by segmenting and summarizing the text summary of the financial report teleconference, and the embedded vector corresponding to the summary of the text summary is generated for subsequent use.
The specific content of the step S3 is as follows:
S31, paragraph segmentation: segmenting the text summary of the financial report teleconference of the target company according to logical paragraphs to obtain paragraphs P = {p_1, p_2, ..., p_M}, m = 1, 2, ..., M, wherein M represents the number of paragraphs; each paragraph p_m is input into the large language model LLM to extract a paragraph-level abstract s_m = LLM(p_m), where s_m is the abstract of paragraph p_m.
S32, merging all paragraph summaries {s_1, s_2, ..., s_M} into a whole text and inputting it into the large language model to generate a comprehensive summary S = LLM({s_1, s_2, ..., s_M}); after the comprehensive summary is generated, its granularity can be further adjusted as required, for example by controlling the length of a longer text summary or removing redundant information.
S33, vectorizing the comprehensive summary S by using the Sentence-BERT pre-trained language model to generate the embedded vector corresponding to the summary of the text summary, T_l = SBERT(S). The embedded vector T_l is a 768-dimensional feature vector representing the semantic information of the text summary; it fuses the core content and the global information of the financial report teleconference text and provides important support for the multi-modal feature fusion.
Through the steps, after paragraph segmentation, paragraph level summarization and comprehensive summarization are carried out on the summary of the financial and newspaper teleconference text, the key content of the text is effectively summarized by the generated embedded vector T l, and the utilization efficiency and prediction precision of the financial risk prediction model on the text data can be remarkably improved.
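The two-level summarization of steps S31-S33 can be sketched as follows; `llm` is a placeholder callable standing in for any large language model, and the toy stand-in below only illustrates the data flow:

```python
def summarize_hierarchically(paragraphs, llm):
    """Paragraph-level then global summarization, mirroring steps S31-S33.

    `llm` is a placeholder callable standing in for the large language
    model; any text-in/text-out function fits this sketch.
    """
    para_summaries = [llm("Summarize this paragraph: " + p) for p in paragraphs]
    merged = "\n".join(para_summaries)                 # merge into whole text
    return llm("Combine into one comprehensive summary: " + merged)

# toy stand-in "LLM": keeps the first sentence of the text after the prompt
toy_llm = lambda prompt: prompt.split(": ", 1)[1].split(".")[0]
S = summarize_hierarchically(
    ["Revenue rose sharply. More detail follows.",
     "Costs fell slightly. More detail follows."], toy_llm)
```

The comprehensive summary S is then vectorized with SBERT to obtain T_l as in step S33.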
S4, extracting news text features of the target company to obtain a news text feature vector: semantic analysis and similarity calculation are performed on news texts related to the target company, generating the news text feature vector for the days before the target transaction day.
The specific content of the step S4 is as follows:
S41, collecting news text data of the target company from several days before the target transaction day, represented as N = {n_1, n_2, ..., n_K}, wherein n_k represents the k-th news text and K is the total number of news items.
S42, analyzing each news item n_k by using the large language model LLM to extract its metadata m_k = LLM(n_k); the extracted metadata includes the sentiment tendency (positive, negative or neutral), the financial indexes related to the news (such as income and profit) and other key information related to the target company. The metadata of all news items is integrated into one whole metadata set M_N = {m_1, m_2, ..., m_K}.
S43, finding, in the historical news dataset, the k historical news groups N_1, N_2, ..., N_k whose metadata are most similar to that of the news text data N from several days before the target transaction day; calculating the semantic relevance between N and each historical news group from their embedded vectors, where f is the function converting news texts into embedded vectors; and selecting the group of historical news H with the highest semantic similarity to the news text data N from several days before the target transaction day.
S44, splicing N, H and the market-trend-related texts observed after H occurred, and then converting the spliced text into an embedded vector T_n using the SBERT model, which serves as the news text feature vector of the target company for the days before the target transaction day.
The above corresponds to texts from three sources: (1) the news text data N from several days before the target transaction day; (2) the historical news H; and (3) the market-trend-related texts after the historical news H occurred, that is, news published after H. For example, if the historical news H of January 1 reports the debt crisis of company A, and the news of January 2 reports that the stock price of company A dropped sharply, then the January 2 news is the market-trend-related text after H occurred.
Through the step S4, the generated news text feature vector T n of the target company several days before the target transaction day effectively fuses semantic information, emotion analysis results and similar relations with the historical news of the target news, and provides high-quality input features for subsequent multi-modal feature fusion.
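The retrieval of the most similar historical news (step S43) can be sketched with a toy embedding; cosine similarity is used here as a common choice of relevance measure, since the exact formula is not reproduced above:

```python
import numpy as np

VOCAB = ["debt", "crisis", "profit", "growth"]   # toy vocabulary

def embed(text):
    """Toy bag-of-words embedding standing in for the f(.) / SBERT mapping."""
    return np.array([text.count(w) for w in VOCAB], float) + 1e-9

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_history(N, history_groups, f=embed):
    """Pick the historical news group H most semantically similar to N
    (step S43); cosine similarity between embeddings is an assumed choice."""
    scores = [cosine(f(N), f(g)) for g in history_groups]
    return history_groups[int(np.argmax(scores))]

H = select_history("debt crisis at company A",
                   ["profit growth announcement",
                    "debt crisis and default fears"])
```

In a full implementation, `embed` would be replaced by the SBERT model and the candidates by metadata-filtered historical news groups.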
And S5, carrying out feature extraction on time sequence transaction data of the target company in a period before the target date to obtain feature vectors of the time sequence data. As a preferred embodiment, this step extracts key features by processing time series transaction data of the target company 30 days before the target date, and captures dynamic relationships between different time series features.
The specific content of the step S5 is as follows:
S51, collecting the time sequence transaction data D of the target company for the 30 days before the target date, including the daily closing price and daily deal volume, expressed as D = {(p_1, v_1), (p_2, v_2), ..., (p_d, v_d)}, wherein p_f represents the closing price on the f-th day and v_f represents the deal volume on the f-th day, f = 1, 2, ..., d;
S52, inputting the time sequence transaction data D into a bi-directional long short-term memory network Bi-LSTM, which models the time sequence data, captures the temporal dynamic characteristics of the transaction data, and outputs a 128-dimensional feature vector of the time sequence data, T_v = BiLSTM(D);
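A minimal numpy sketch of the Bi-LSTM feature extraction; the random weights and synthetic 30-day series are illustrative, and a production system would use a trained library implementation:

```python
import numpy as np

def lstm_last_hidden(X, Wx, Wh, b):
    """Run a single-direction LSTM over X (T, d_in); return the final hidden state."""
    d_h = Wh.shape[0]
    h, c = np.zeros(d_h), np.zeros(d_h)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for x in X:
        z = x @ Wx + h @ Wh + b                 # stacked gates: i, f, o, g
        i, f = sig(z[:d_h]), sig(z[d_h:2 * d_h])
        o, g = sig(z[2 * d_h:3 * d_h]), np.tanh(z[3 * d_h:])
        c = f * c + i * g                       # cell state update
        h = o * np.tanh(c)                      # hidden state update
    return h

def bilstm_features(D, fwd, bwd):
    """T_v for the 30-day (price, volume) series D: concatenation of the
    forward and backward final hidden states (a minimal Bi-LSTM sketch)."""
    return np.concatenate([lstm_last_hidden(D, *fwd),
                           lstm_last_hidden(D[::-1], *bwd)])

rng = np.random.default_rng(1)
d_in, d_h = 2, 64                    # (closing price, deal volume) -> 128-dim T_v
make = lambda: (rng.normal(0, 0.1, (d_in, 4 * d_h)),
                rng.normal(0, 0.1, (d_h, 4 * d_h)),
                np.zeros(4 * d_h))
D = rng.normal(size=(30, d_in))      # 30 days of normalized transaction data
T_v = bilstm_features(D, make(), make())
```

Concatenating the two directions yields the 128-dimensional T_v mentioned in step S52 when each direction has 64 hidden units.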
S53, capturing dynamic relations among different time sequence features through a vector autoregressive VAR model:
log(σ_{3,t}) = α_3 + β_{1,1} log(σ_{-3,t}) + β_{1,2} log(σ_{-7,t}) + β_{1,3} log(σ_{-15,t}) + β_{1,4} log(σ_{-30,t}) + u_{3,t};
log(σ_{7,t}) = α_7 + β_{2,1} log(σ_{-3,t}) + β_{2,2} log(σ_{-7,t}) + β_{2,3} log(σ_{-15,t}) + β_{2,4} log(σ_{-30,t}) + u_{7,t};
log(σ_{15,t}) = α_{15} + β_{3,1} log(σ_{-3,t}) + β_{3,2} log(σ_{-7,t}) + β_{3,3} log(σ_{-15,t}) + β_{3,4} log(σ_{-30,t}) + u_{15,t};
log(σ_{30,t}) = α_{30} + β_{4,1} log(σ_{-3,t}) + β_{4,2} log(σ_{-7,t}) + β_{4,3} log(σ_{-15,t}) + β_{4,4} log(σ_{-30,t}) + u_{30,t};
wherein σ_{z,t} represents the volatility of the stock price of the target company over z days, z ∈ {3, 7, 15, 30}, and σ_{-z,t} is the corresponding volatility over the preceding z days; u_{z,t} is a white noise term, β_{a,b} (a, b = 1, 2, 3, 4) is the coefficient matrix of the dynamic relationship, and α_z is an intercept term; the volatility of the stock price is defined as the standard deviation of the daily return rate of the target company over the z days: σ_{z,t} = sqrt((1/(z−1)) Σ_{i=1}^{z} (r_{t−i} − r̄)^2), wherein r_{t−i} is the daily return rate and r̄ is its mean over the z days. The numbers of days z = 3, 7, 15 and 30 here are merely exemplary and do not limit the scheme itself.
Through this step S5, the time series transaction data is converted into the high-dimensional feature vector T v and the parameters of the dynamic relationship, providing key time series information and dynamic relationship support for financial risk prediction.
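The volatility definition and the fitting of one VAR equation can be sketched as follows; the synthetic returns are illustrative, and ordinary least squares stands in for whatever estimator an implementation would use:

```python
import numpy as np

def realized_vol(returns, z):
    """sigma_z: standard deviation of the daily return rate over the
    trailing z days (the volatility definition in step S53)."""
    return float(np.std(returns[-z:], ddof=1))

rng = np.random.default_rng(2)
r = rng.normal(0.0, 0.02, 250)                   # ~one year of synthetic daily returns
sigmas = {z: realized_vol(r, z) for z in (3, 7, 15, 30)}

# Fit one VAR equation, e.g. for log(sigma_{3,t}), by ordinary least squares
# on the four lagged log-volatilities; a full VAR stacks all four equations.
rows, ys = [], []
for t in range(60, 250):
    rows.append([1.0] + [np.log(realized_vol(r[:t], z)) for z in (3, 7, 15, 30)])
    ys.append(np.log(realized_vol(r[:t + 1], 3)))
coef, *_ = np.linalg.lstsq(np.array(rows), np.array(ys), rcond=None)
# coef = [alpha_3, beta_{1,1}, beta_{1,2}, beta_{1,3}, beta_{1,4}]
```

The remaining three equations are fitted the same way with their respective targets log(σ_{7,t}), log(σ_{15,t}) and log(σ_{30,t}).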
And S6, fusing the obtained audio feature vector, the text summary feature vector, the embedded vector corresponding to the summary of the text summary, the news text feature vector and the feature vector of the time sequence data to obtain a joint representation vector.
The joint representation vector E described in step S6 is fused by the following formula: E = w_0 + w_1 T_a + w_2 T_t + w_3 T_l + w_4 T_n + w_5 T_v + ε, where w_0 is the bias term, w_1, w_2, w_3, w_4, w_5 are the fusion weights, and ε is the error term representing random noise.
The dimension of the joint representation vector E is fixed to 512 dimensions, representing the integrated features of the multimodal data. The vector retains important information of each modal feature and can be used for subsequent multi-task prediction.
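A sketch of the fusion step; because the modal vectors have different dimensions, the weighted sum here assumes per-modality projections into the 512-dimensional space, a detail the fusion formula leaves implicit:

```python
import numpy as np

def fuse(T_a, T_t, T_l, T_n, T_v, weights, projections, w0):
    """Joint representation E as a weighted sum of the modal feature vectors.

    The modal vectors have different dimensions (512, 768, 768, 768, 128),
    so each is first projected into the common 512-dim space; these
    projection matrices are an assumption, not stated in the formula.
    """
    E = w0.copy()
    for w, P, T in zip(weights, projections, [T_a, T_t, T_l, T_n, T_v]):
        E += w * (P @ T)                  # scalar fusion weight * projection
    return E                              # 512-dimensional joint representation

rng = np.random.default_rng(3)
dims = [512, 768, 768, 768, 128]          # T_a, T_t, T_l, T_n, T_v
Ts = [rng.normal(size=d) for d in dims]
Ps = [rng.normal(0, 0.01, (512, d)) for d in dims]
E = fuse(*Ts, weights=[0.2] * 5, projections=Ps, w0=np.zeros(512))
```

In training, the weights, projections and bias would be learned jointly with the downstream prediction sub-networks.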
And S7, inputting the joint representation vector into a multi-task learning framework to predict the risk index.
The specific content of step S7 is as follows: the joint representation vector E generated in step S6 is used as the input of the multi-task learning framework, which simultaneously predicts the following risk indexes: the volatility σ_{3,t}, σ_{7,t}, σ_{15,t} and σ_{30,t} of the stock price of the target company over 3 days, 7 days, 15 days and 30 days, and the single-day risk value VAR;
constructing an independent first prediction sub-network MLP for the stock price volatility of each time span, the prediction result of the stock price volatility being σ̂_{z,t} = F_{MLP,z}(E), wherein F_{MLP,z}(·) is the first prediction sub-network MLP for the time span z;
for the single-day risk value VAR, constructing an independent second prediction sub-network MLP whose input is the joint representation vector E and whose output is the predicted value of VAR, V̂ = F_{MLP,VAR}(E), wherein F_{MLP,VAR}(·) is the second prediction sub-network MLP;
the joint loss function simultaneously optimizes the stock price volatility and VAR prediction tasks, and its formula is: L = L_vol + μ·L_VAR, wherein μ is a weight hyperparameter for balancing the stock price volatility prediction error and the VAR prediction error; L_vol = (1/J) Σ_{j=1}^{J} (y_j − ŷ_j)^2 represents the mean square error of the stock price volatility prediction, with y_j and ŷ_j representing the true value and the predicted value of the stock price volatility index; L_VAR = max(q(V − V̂), (q − 1)(V − V̂)) represents the quantile regression loss function of the single-day risk value prediction task, wherein q represents the quantile threshold of the single-day risk value, and V and V̂ respectively represent the real single-day risk value and the predicted single-day risk value.
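The joint loss can be sketched with the standard pinball loss for the quantile term; this concrete form is an assumption where the original formula images are not reproduced:

```python
import numpy as np

def quantile_loss(V, V_hat, q):
    """Pinball (quantile regression) loss for the single-day risk value VAR;
    written in its standard form max(q*u, (q-1)*u) with u = V - V_hat."""
    u = np.asarray(V) - np.asarray(V_hat)
    return float(np.mean(np.maximum(q * u, (q - 1.0) * u)))

def joint_loss(y, y_hat, V, V_hat, q=0.05, mu=1.0):
    """L = MSE over the volatility predictions + mu * quantile loss on VAR."""
    mse = float(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))
    return mse + mu * quantile_loss(V, V_hat, q)

# perfect predictions give zero loss; an over-pessimistic VAR is penalized by q
L0 = joint_loss([0.02, 0.03, 0.04, 0.05], [0.02, 0.03, 0.04, 0.05], [-0.01], [-0.01])
L1 = joint_loss([0.02], [0.02], [0.0], [-1.0])
```

The asymmetry of the pinball loss makes under- and over-estimation of the risk value cost differently, which is what makes it suitable for quantile (VAR) targets.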
Through the step S7, the multi-mode features can be effectively integrated by the multi-task learning framework, the fluctuation rate and the single-day risk value of different time spans can be accurately predicted, and reliable support is provided for financial risk management.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.