US20180189572A1 - Method and System for Multi-Modal Fusion Model - Google Patents
Method and System for Multi-Modal Fusion Model
- Publication number: US20180189572A1 (application US 15/472,797)
- Authority
- US
- United States
- Prior art keywords
- vectors
- modal
- vector
- content
- weights
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V20/41 — Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G06K9/00744
- G06F16/783 — Retrieval characterised by using metadata automatically derived from the content
- G06F17/28
- G06F40/40 — Processing or translation of natural language
- G06K9/00718
- G06K9/4671
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N3/045 — Combinations of networks
- G06N3/0455 — Auto-encoder networks; Encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/09 — Supervised learning
- G06V10/806 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G10L25/30 — Speech or voice analysis techniques characterised by the analysis technique using neural networks
- G10L25/57 — Speech or voice analysis techniques specially adapted for comparison or discrimination for processing of video signals
- H04N21/234336 — Processing of video elementary streams involving reformatting operations by media transcoding, e.g. video is transformed into a slideshow of still pictures or audio is converted into text
- H04N21/4394 — Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
- H04N21/8549 — Creating video summaries, e.g. movie trailer
Definitions
- This invention generally relates to a method and system for describing multi-modal data, and more specifically to a method and system for video description.
- Video captioning refers to the automatic generation of a natural language description (e.g., a sentence) that narrates an input video.
- Video description has widespread applications, including video retrieval, automatic description of home movies or online uploaded video clips, video description for the visually impaired, warning generation for surveillance systems, and scene understanding for knowledge sharing between human and machine.
- Video description systems extract salient features from the video data, which may be multimodal: image features representing objects, motion features representing actions, and audio features indicating events. The systems then generate a description narrating events so that the words of the description are relevant to the extracted features and are ordered appropriately as natural language.
- Some embodiments of the present disclosure are based on generating content vectors from input data that include multiple modalities.
- The modalities may be audio signals, video (image) signals, and motion signals contained in video signals.
- The present disclosure is based on a multimodal fusion system that generates the content vectors from the input data that include multiple modalities.
- The multimodal fusion system receives input signals including image (video) signals, motion signals and audio signals, and generates a description narrating events relevant to the input signals.
- A system for generating a word sequence from multi-modal input vectors includes one or more processors and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations that include receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator from the weighted content vector.
- Some embodiments of the present disclosure provide a non-transitory computer-readable medium storing software comprising instructions executable by one or more processors which, upon such execution, cause the one or more processors to perform operations.
- The operations include receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator from the weighted content vector.
- A method for generating a word sequence from multi-modal input vectors includes receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator from the weighted content vector.
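The chain of operations above can be sketched end to end. The following is an illustrative NumPy sketch under assumed dimensions, not the patented implementation: the feature extractors are replaced by fixed random feature matrices, and the attention estimators use simple bilinear and additive scores (all matrices `W`, `U`, `q` and all sizes are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_attention(features, context, W):
    """Attention within one modality: weight each extracted feature
    vector by its relevance to the pre-step context vector, then
    return the weighted sum (the content vector)."""
    scores = features @ (W @ context)        # one score per feature vector
    weights = softmax(scores)                # set of attention weights
    return weights @ features                # content vector

N, D_ctx = 8, 4                              # common dimension, context size
feats1 = rng.normal(size=(5, 12))            # e.g. image feature vectors
feats2 = rng.normal(size=(7, 20))            # e.g. audio feature vectors
context = rng.normal(size=D_ctx)             # pre-step context vector

# Content vector per modality (first and second sets of weights)
c1 = temporal_attention(feats1, context, rng.normal(size=(12, D_ctx)))
c2 = temporal_attention(feats2, context, rng.normal(size=(20, D_ctx)))

# Transform each content vector into the common N-dimensional space
d1 = rng.normal(size=(N, 12)) @ c1
d2 = rng.normal(size=(N, 20)) @ c2

# Modal attention: one weight per modality, from the context and the
# modal content vectors
U, q = rng.normal(size=(N, D_ctx)), rng.normal(size=N)
beta = softmax(np.array([q @ np.tanh(d + U @ context) for d in (d1, d2)]))

# Weighted content vector fed to the sequence generator
g = beta[0] * d1 + beta[1] * d2
assert g.shape == (N,)
```

Because both modal content vectors already share the predetermined dimension N, the final combination is a single weighted sum regardless of how many modalities are present.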
- FIG. 1 is a block diagram illustrating a multimodal fusion system according to some embodiments of the present disclosure
- FIG. 2A is a block diagram illustrating a simple multimodal method according to embodiments of the present disclosure
- FIG. 2B is a block diagram illustrating a multimodal attention method according to embodiments of the present disclosure
- FIG. 3 is a block diagram illustrating an example of the LSTM-based encoder-decoder architecture according to embodiments of the present disclosure
- FIG. 4 is a block diagram illustrating an example of the attention-based sentence generator from video according to embodiments of the present disclosure
- FIG. 5 is a block diagram illustrating an extension of the attention-based sentence generator from video according to embodiments of the present disclosure
- FIG. 6 is a diagram illustrating a simple feature fusion approach (simple multimodal method) according to embodiments of the present disclosure
- FIG. 7 is a diagram illustrating an architecture of a sentence generator according to embodiments of the present disclosure.
- FIG. 8 shows comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure
- FIGS. 9A, 9B, 9C and 9D show comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure.
- individual embodiments may be described as a process, which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may be terminated when its operations are completed, but may have additional steps not discussed or included in a figure. Furthermore, not all operations in any particularly described process may occur in all embodiments.
- a process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, the function's termination can correspond to a return of the function to the calling function or the main function.
- embodiments of the subject matter disclosed may be implemented, at least in part, either manually or automatically.
- Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof.
- the program code or code segments to perform the necessary tasks may be stored in a machine-readable medium.
- a processor(s) may perform the necessary tasks.
- A system for generating a word sequence from multi-modal input vectors includes one or more processors in connection with one or more memories and one or more storage devices storing instructions that are operable. When the instructions are executed by the one or more processors, the instructions cause the one or more processors to perform operations that include receiving first and second input vectors according to first and second sequential intervals, extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors, estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a pre-step context vector of a sequence generator, calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors, transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension, estimating a set of modal attention weights from the pre-step context vector and the first and second modal content vectors, generating a weighted content vector from the set of modal attention weights and the first and second modal content vectors, and generating a predicted word using the sequence generator from the weighted content vector.
- The first modal content vector, the second modal content vector and the weighted content vector have the same predetermined dimension.
- Those vectors can be handled easily in the data processing of the multi-modal fusion model because they are expressed in an identical data format having the identical dimension.
- The multi-modal fusion method or system according to embodiments of the present disclosure can thus reduce central processing unit usage and power consumption when generating a word sequence from multi-modal input vectors.
- The number of the vectors may be changed to a predetermined number N according to the requirements of the system design.
- In some embodiments, the predetermined N is set to three.
- In that case, the three input vectors can be image features, motion features and audio features obtained from image data, video signals and audio signals received via an input/output interface included in the system.
- The first and second sequential intervals may be an identical interval, and the first and second input vectors may be of different modalities.
- FIG. 1 shows a block diagram of a multimodal fusion system 100 according to some embodiments of the present disclosure.
- The multimodal fusion system 100 can include a human machine interface (HMI) with input/output (I/O) interface 110 connectable with a keyboard 111 and a pointing device/medium 112, a microphone 113, a receiver 114, a transmitter 115, a 3D sensor 116, a global positioning system (GPS) 117, one or more I/O interfaces 118, a processor 120, a storage device 130, a memory 140, a network interface controller (NIC) 150 connectable with a network 155 including local area networks and internet network (not shown), a display interface 160 connected to a display device 165, an imaging interface 170 connectable with an imaging device 175, and a printer interface 180 connectable with a printing device 185.
- The HMI with I/O interface 110 may include analog/digital and digital/analog converters.
- In some cases, the HMI with I/O interface 110 includes a wireless communication interface that can communicate with other systems or computers via wireless internet connections or wireless local area networks.
- The multimodal fusion system 100 can include a power source 190.
- The power source 190 may be a battery rechargeable from an external power source (not shown) via the I/O interface 118. Depending upon the application, the power source 190 may optionally be located outside of the system 100.
- The HMI and I/O interface 110 and the I/O interface 118 can be adapted to connect to another display device (not shown) including a computer monitor, camera, television, projector, or mobile device, among others.
- The multimodal fusion system 100 can receive electric text/imaging documents 195 including speech data via the network 155 connected to the NIC 150.
- The storage device 130 includes a sequence generation model 131, a feature extraction model 132 and a multimodal fusion model 200, in which the algorithms of the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200 are stored in the storage device 130 as program code data.
- The algorithms of the models 131-132 and 200 may be stored on a computer-readable recording medium (not shown) so that the processor 120 can execute the algorithms of the models 131-132 and 200 by loading the algorithms from the medium.
- The pointing device/medium 112 may include modules that read and perform programs stored on a computer-readable recording medium.
- Instructions may be transmitted to the system 100 using the keyboard 111, the pointing device/medium 112 or via the wireless network or the network 190 connected to other computers (not shown).
- The algorithms of the models 131-132 and 200 may be started in response to receiving an acoustic signal of a user by the microphone 113 using a pre-installed conventional speech recognition program stored in the storage device 130.
- The system 100 includes a turn-on/off switch (not shown) to allow the user to start/stop operating the system 100.
- The HMI and I/O interface 110 may include an analog-to-digital (A/D) converter, a digital-to-analog (D/A) converter and a wireless signal antenna for connecting to the network 190.
- The one or more I/O interfaces 118 may be connectable to a cable television (TV) network or a conventional television (TV) antenna receiving TV signals.
- The signals received via the interface 118 can be converted into digital images and audio signals, which can be processed according to the algorithms of the models 131-132 and 200 in connection with the processor 120 and the memory 140 so that video scripts are generated and displayed on the display device 165 with picture frames of the digital images while the sound of the TV signals is output via a speaker 19.
- The speaker may be included in the system 100, or an external speaker may be connected via the interface 110 or the I/O interface 118.
- The processor 120 may be a plurality of processors including one or more graphics processing units (GPUs).
- The storage device 130 may include speech recognition algorithms (not shown) that can recognize speech signals obtained via the microphone 113.
- The multimodal fusion model 200, the sequence generation model 131 and the feature extraction model 132 may be formed by neural networks.
- FIG. 2A is a block diagram illustrating a simple multimodal method according to embodiments of the present disclosure.
- The simple multimodal method can be performed by the processor 120 executing programs of the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200 stored in the storage device 130.
- The sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200 may be stored on a computer-readable recording medium, so that the simple multimodal method can be performed when the processor 120 loads and executes their algorithms.
- The simple multimodal method is performed in combination with the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200.
- The simple multimodal method uses feature extractors 211, 221 and 231 (feature extractors 1~K), attention estimators 212, 222 and 232 (attention estimators 1~K), weighted sum processors 213, 223 and 233 (weighted sum processors (calculators) 1~K), feature transformation modules 214, 224 and 234 (feature transformation modules 1~K), a simple sum processor (calculator) 240 and a Sequence Generator 250.
- FIG. 2B is a block diagram illustrating a multimodal attention method according to embodiments of the present disclosure.
- Compared to the simple multimodal method, the multimodal attention method further includes a modal attention estimator 255 and a weighted sum processor 245 instead of the simple sum processor 240.
- The multimodal attention method is performed in combination with the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200. In both methods, the sequence generation model 131 provides the Sequence Generator 250 and the feature extraction model 132 provides the feature extractors 1~K. Further, the feature transformation modules 1~K, the modal attention estimator 255, the weighted sum processors 1~K and the weighted sum processor 245 may be provided by the multimodal fusion model 200.
- Modal-1 data are converted to a fixed-dimensional content vector using the feature extractor 211 , the attention estimator 212 and the weighted-sum processor 213 for the data, where the feature extractor 211 extracts multiple feature vectors from the data, the attention estimator 212 estimates each weight for each extracted feature vector, and the weighted-sum processor 213 outputs (generates) the content vector computed as a weighted sum of the extracted feature vectors with the estimated weights.
- Modal-2 data are converted to a fixed-dimensional content vector using the feature extractor 221 , the attention estimator 222 and the weighted-sum processor 223 for the data.
- The same processing is applied through to the Modal-K data, for which the feature extractor 231, the attention estimator 232 and the weighted-sum processor 233 are used, so that K fixed-dimensional content vectors are obtained in total.
- Each of the Modal-1, Modal-2, . . . , Modal-K data may be sequential data in time-sequential order with an interval, or in other predetermined orders with predetermined time intervals.
- Each of the K content vectors is then transformed (converted) into an N-dimensional vector by the feature transformation modules 214, 224 and 234, and K transformed N-dimensional vectors are obtained, where N is a predefined positive integer.
- The K transformed N-dimensional vectors are summed into a single N-dimensional content vector in the simple multimodal method of FIG. 2A, whereas the vectors are converted into a single N-dimensional content vector using the modal attention estimator 255 and the weighted-sum processor 245 in the multimodal attention method of FIG. 2B. The modal attention estimator 255 estimates a weight for each transformed N-dimensional vector, and the weighted-sum processor 245 outputs (generates) the N-dimensional content vector computed as a weighted sum of the K transformed N-dimensional vectors with the estimated weights.
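The difference between FIG. 2A and FIG. 2B reduces to how the K transformed N-dimensional vectors are combined. A minimal sketch of both combinations (the dimensions and the stand-in attention scores are illustrative, not taken from the patent):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K, N = 3, 6
transformed = rng.normal(size=(K, N))   # K transformed N-dimensional vectors

# FIG. 2A (simple multimodal): plain sum over the K modalities
fused_simple = transformed.sum(axis=0)

# FIG. 2B (multimodal attention): context-dependent weight per modality
scores = rng.normal(size=K)             # stand-in for the modal attention estimator
beta = softmax(scores)                  # one weight per modality; sums to 1
fused_attn = beta @ transformed         # weighted sum, still N-dimensional

assert fused_simple.shape == fused_attn.shape == (N,)
```

The simple sum is the special case where every modality receives the same weight; the attention variant lets the weights change at every decoding step with the context.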
- The Sequence Generator 250 receives the single N-dimensional content vector and predicts one label corresponding to a word of a sentence that describes the video data. For predicting the next word, the Sequence Generator 250 provides contextual information of the sentence, such as a vector that represents the previously generated words, to the attention estimators 212, 222 and 232 and the modal attention estimator 255 for estimating the attention weights to obtain appropriate content vectors.
- This vector may be referred to as a pre-step (or prestep) context vector.
- The Sequence Generator 250 predicts the next word beginning with the start-of-sentence token, "<sos>," and generates a descriptive sentence or sentences by predicting the next word (predicted word) iteratively until a special symbol "<eos>" corresponding to "end of sentence" is predicted. In other words, the Sequence Generator 250 generates a word sequence from multi-modal input vectors. In some cases, the multi-modal input vectors may be received via different input/output interfaces such as the HMI and I/O interface 110 or the one or more I/O interfaces 118.
- A predicted word is generated as the word with the highest probability among all possible words, given the weighted content vector and the pre-step context vector. Further, the predicted words can be accumulated in the memory 140, the storage device 130 or other storage devices (not shown) to generate the word sequence, and this accumulation process can be continued until the special symbol (end of sentence) is predicted.
- The system 100 can transmit the predicted words generated by the Sequence Generator 250 via the NIC 150 and the network 190, the HMI and I/O interface 110 or the one or more I/O interfaces 118, so that the data of the predicted words can be used by other computers 195 or other output devices (not shown).
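The iterative prediction loop described above can be illustrated with a toy greedy decoder. The four-word vocabulary and the stub probability model below are hypothetical stand-ins for the Sequence Generator 250:

```python
import numpy as np

def generate(step_probs, vocab, max_len=10):
    """Greedy decoding: starting from <sos>, repeatedly pick the
    highest-probability next word until <eos> is predicted or the
    length limit is reached. `step_probs(prev_words)` stands in for
    the sequence generator network."""
    words = ["<sos>"]
    while len(words) < max_len:
        p = step_probs(words)                # next-word distribution
        nxt = vocab[int(np.argmax(p))]       # word with highest probability
        if nxt == "<eos>":                   # end-of-sentence symbol
            break
        words.append(nxt)                    # accumulate the predicted word
    return words[1:]                         # drop the start label

vocab = ["a", "dog", "runs", "<eos>"]
script = {1: 0, 2: 1, 3: 2, 4: 3}            # toy deterministic model
def step_probs(prev):
    p = np.zeros(len(vocab))
    p[script[len(prev)]] = 1.0
    return p

sentence = generate(step_probs, vocab)
assert sentence == ["a", "dog", "runs"]
```

In the actual system the distribution at each step would additionally be conditioned on the weighted content vector and the pre-step context vector.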
- This multimodal attention method can utilize different features inclusively or selectively, using attention weights over different modalities or features to infer each word of the description.
- The multimodal fusion model 200 in the system 100 includes a data distribution module (not shown), which receives multiple time-sequential data via the I/O interface 110 or 118, distributes the received data into Modal-1, Modal-2, . . . , Modal-K data, divides each distributed time-sequential data stream according to a predetermined interval or intervals, and then provides the Modal-1, Modal-2, . . . , Modal-K data to the feature extractors 1~K, respectively.
- The multiple time-sequential data may be video signals and audio signals included in a video clip.
- The video clip is provided to the feature extractors 211, 221 and 231 in the system 100 via the I/O interface 110 or 118.
- The feature extractors 211, 221 and 231 receive Modal-1 data, Modal-2 data and Modal-3 data according to first, second and third intervals, respectively, from the data stream of the video clip.
- The data distribution module may divide the multiple time-sequential data with predetermined different time intervals, respectively, when image features, motion features, or audio features can be captured with different time intervals.
- An approach to video description can be based on sequence-to-sequence learning, in which the input sequence (i.e., the image sequence) is encoded and the output sequence (i.e., the word sequence) is decoded.
- Both the encoder and the decoder are usually modeled as Long Short-Term Memory (LSTM) networks.
- FIG. 3 shows an example of the LSTM-based encoder-decoder architecture.
- a feature extractor which can be a pretrained Convolutional Neural Network (CNN) for an image or video classification task such as GoogLeNet, VGGNet, or C3D.
- the sequence of feature vectors is then fed to the LSTM encoder, and the hidden state of the LSTM is given by h t = o t ⊙ tanh(c t ), where i t = σ(W xi x t + W hi h t−1 + b i ), f t = σ(W xf x t + W hf h t−1 + b f ), o t = σ(W xo x t + W ho h t−1 + b o ), and c t = f t ⊙ c t−1 + i t ⊙ tanh(W xc x t + W hc h t−1 + b c ).
- σ( ) is the element-wise sigmoid function, and i t , f t , o t and c t are, respectively, the input gate, forget gate, output gate, and cell activation vectors for the t-th input vector.
- the weight matrices W zz′ and the bias vectors b z are identified by the subscript z ∈ {x, h, i, f, o, c}.
- W hi is the hidden-input gate matrix
- W xo is the input-output gate matrix. Peephole connections are not used in this procedure.
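The gate equations above can be sketched in numpy. This is a minimal, illustrative implementation without peephole connections; the shapes, the stacked-weight layout, and the initialization are assumptions of this sketch, not the patent's actual parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step without peephole connections. W maps the
    concatenated [x_t, h_prev] to the stacked pre-activations of the
    input gate, forget gate, output gate, and candidate cell."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    d = h_prev.size
    i_t = sigmoid(z[0*d:1*d])            # input gate
    f_t = sigmoid(z[1*d:2*d])            # forget gate
    o_t = sigmoid(z[2*d:3*d])            # output gate
    g_t = np.tanh(z[3*d:4*d])            # candidate cell activation
    c_t = f_t * c_prev + i_t * g_t       # cell state update
    h_t = o_t * np.tanh(c_t)             # hidden state
    return h_t, c_t

# run a short random sequence through the cell (illustrative sizes)
rng = np.random.default_rng(0)
dx, dh = 3, 4
W = rng.standard_normal((4 * dh, dx + dh)) * 0.1
b = np.zeros(4 * dh)
h, c = np.zeros(dh), np.zeros(dh)
for _ in range(5):
    h, c = lstm_step(rng.standard_normal(dx), h, c, W, b)
print(h.shape)  # (4,)
```

The cell state c_t carries long-term information across steps, while h_t is the per-step output fed to the decoder or the next layer.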
- the decoder predicts the next word iteratively beginning with the start-of-sentence token, “ ⁇ sos>” until it predicts the end-of-sentence token, “ ⁇ eos>.”
- the start-of-sentence token may be referred to as a start label
- the end-of sentence token may be referred to as an end label.
- the decoder network ⁇ D infers the next word probability distribution as
- the decoder state is updated using the LSTM network of the decoder as
- the most likely word sequence is Ŷ = argmax_{Y ∈ V*} P(Y | X) = argmax_{y 1 , . . . , y M ∈ V*} P(y 1 , . . . , y M | X).
- a beam search in the test phase can be used to keep multiple states and hypotheses with the highest cumulative probabilities at each m-th step, and select the best hypothesis from those having reached the end-of-sentence token.
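A beam search of this kind can be sketched as follows; `next_word_probs` is a hypothetical stand-in for the decoder network, and the token ids, beam width, and maximum length are illustrative assumptions:

```python
import numpy as np

def beam_search(next_word_probs, beam_width=3, max_len=10, sos=0, eos=1):
    """Keep the hypotheses with the highest cumulative log-probabilities
    at each step, and return the best hypothesis among those that
    reached the end-of-sentence token."""
    beams = [([sos], 0.0)]   # (token sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            probs = next_word_probs(seq)   # stand-in for the decoder
            for w in np.argsort(probs)[::-1][:beam_width]:
                cand = (seq + [int(w)], score + float(np.log(probs[w] + 1e-12)))
                (finished if w == eos else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: -c[1])[:beam_width]
        if not beams:
            break
    pool = finished if finished else beams
    return max(pool, key=lambda c: c[1])[0]

# toy decoder: word 2 is likely, word 1 (<eos>) less so
toy = lambda seq: np.array([0.05, 0.10, 0.80, 0.05])
print(beam_search(toy))  # [0, 1]
```

In the toy example, ending immediately (one log(0.1) penalty) scores higher than emitting word 2 first and then ending, so the short hypothesis wins; real decoders often add length normalization to counter this bias toward short outputs.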
- Another approach to video description can be an attention based sequence generator, which enables the network to emphasize features from specific times or spatial regions depending on the current context, enabling the next word to be predicted more accurately.
- the attention-based generator can exploit input features selectively according to the input and output contexts. The efficacy of attention models has been shown in many tasks such as machine translation.
- FIG. 4 is a block diagram illustrating an example of the attention-based sentence generator from video, which has a temporal attention mechanism over the input image sequence.
- the input image sequence may be in time-sequential order with predetermined time intervals.
- the input sequence of feature vectors is obtained using one or more feature extractors.
- attention-based generators may employ an encoder based on a bidirectional LSTM (BLSTM) or Gated Recurrent Units (GRU) to further convert the feature vector sequence as in FIG. 5 so that each vector contains its contextual information.
- CNN-based features may be used directly, or one more feed-forward layer may be added to reduce the dimensionality.
- the activation vectors (i.e., encoder states) are obtained by concatenating the forward and backward hidden activation vectors h t (f) and h t (b).
- the attention mechanism is realized by using attention weights to the hidden activation vectors throughout the input sequence. These weights enable the network to emphasize features from those time steps that are most important for predicting the next output word.
- let α i,t be an attention weight between the i-th output word and the t-th input feature vector.
- the vector representing the relevant content of the input sequence is obtained as a weighted sum of hidden unit activation vectors: c i = Σ t α i,t h t .
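In numpy terms, the weighted sum above is a single matrix-vector product; this small sketch (with made-up numbers) illustrates it:

```python
import numpy as np

def content_vector(alpha_i, H):
    """c_i = sum over t of alpha[i, t] * h_t: a weighted sum of the
    hidden activation vectors (rows of H) with attention weights."""
    return alpha_i @ H

# three encoder states of dimension 2, with attention weights summing to 1
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
alpha = np.array([0.2, 0.3, 0.5])
print(content_vector(alpha, H))  # [0.7 0.8]
```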
- the decoder network is an Attention-based Recurrent Sequence Generator (ARSG) that generates an output label sequence with content vectors c i .
- the network also has an LSTM decoder network, where the decoder state can be updated in the same way as Equation (9).
- word y i is generated according to
- the probability distribution is conditioned on the content vector ci, which emphasizes specific features that are most relevant to predicting each subsequent word.
- One more feed-forward layer can be inserted before the softmax layer.
- the probabilities are computed as follows:
- the attention weights may be computed as α i,t = exp(e i,t ) / Σ τ exp(e i,τ ), where e i,t = w A T tanh(W A s i−1 + V A h t + b A ); here W A and V A are matrices, w A and b A are vectors, and e i,t is a scalar.
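A minimal numpy sketch of this attention computation; the parameter names follow the text, while the shapes in the usage example are illustrative assumptions:

```python
import numpy as np

def temporal_attention(s_prev, H, W_A, V_A, w_A, b_A):
    """Attention weights over the input sequence:
    e[t]     = w_A . tanh(W_A s_prev + V_A h_t + b_A)   (a scalar per step)
    alpha[t] = exp(e[t]) / sum_tau exp(e[tau])          (softmax over t)"""
    e = np.array([w_A @ np.tanh(W_A @ s_prev + V_A @ h_t + b_A) for h_t in H])
    e = e - e.max()                     # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()
    return alpha

# illustrative shapes: decoder state dim 4, six encoder states of dim 5
rng = np.random.default_rng(0)
alpha = temporal_attention(rng.standard_normal(4),
                           rng.standard_normal((6, 5)),
                           rng.standard_normal((3, 4)),
                           rng.standard_normal((3, 5)),
                           rng.standard_normal(3),
                           np.zeros(3))
print(alpha)  # six nonnegative weights summing to 1 (up to floating point)
```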
- Embodiments of the present disclosure provide an attention model to handle fusion of multiple modalities, where each modality has its own sequence of feature vectors.
- multimodal inputs such as image features, motion features, and audio features are available.
- combining multiple features from different feature extraction methods is often effective for improving description accuracy.
- for example, content vectors may be obtained from VGGNet image features and from C3D spatiotemporal motion features.
- let K be the number of modalities, i.e., the number of sequences of input feature vectors.
- the following activation vector is computed instead of Eq. (19)
- c k,i is the k-th content vector corresponding to the k-th feature extractor or modality.
- these content vectors are combined with weight matrices W c1 and W c2 , which are commonly used in the sentence generation step. Consequently, the content vectors from each feature type (or one modality) are always fused using the same weights, independent of the decoder state.
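The state-independent fusion described here can be sketched as follows; the matrices W_c1 and W_c2 are fixed after training and do not depend on the decoder state (a simplified illustration that omits the bias and nonlinearity of the full generator step):

```python
import numpy as np

def simple_fusion(c1, c2, W_c1, W_c2):
    """State-independent fusion: the two modality content vectors are
    always combined with the same trained matrices W_c1 and W_c2,
    regardless of the decoder state or the current word."""
    return W_c1 @ c1 + W_c2 @ c2

out = simple_fusion(np.ones(2), np.ones(3), np.eye(2), np.ones((2, 3)))
print(out)  # [4. 4.]
```

Because the same matrices are applied at every step, this scheme cannot shift emphasis between modalities as the sentence progresses, which motivates the multimodal attention mechanism described next.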
- This architecture may introduce the ability to exploit multiple types of features effectively for allowing the relative weights of each feature type (of each modality) to change based on the context.
- the attention mechanism can be extended to multimodal fusion.
- the decoder network can selectively attend to specific modalities of input (or specific feature types) to predict the next word.
- the attention-based feature fusion in accordance with embodiments of the present disclosure may be performed using
- the multimodal attention weights ⁇ k,i are obtained in a similar way to the temporal attention mechanism:
- where W B and V Bk are matrices, w B and b Bk are vectors, and v k,i is a scalar.
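A sketch of the multimodal attention fusion in numpy; the projection matrices that map each content vector to the common dimension, and all shapes in the usage example, are assumptions of this illustration:

```python
import numpy as np

def multimodal_attention(s_prev, C, proj, W_B, V_B, w_B, b_B):
    """Modality-level attention:
    v[k]  = w_B . tanh(W_B s_prev + V_B[k] c_k + b_B[k])
    beta  = softmax(v) over the K modalities
    fused = sum_k beta[k] * (proj[k] @ c_k), all in a common dimension."""
    K = len(C)
    v = np.array([w_B @ np.tanh(W_B @ s_prev + V_B[k] @ C[k] + b_B[k])
                  for k in range(K)])
    beta = np.exp(v - v.max())
    beta = beta / beta.sum()
    fused = sum(beta[k] * (proj[k] @ C[k]) for k in range(K))
    return beta, fused

# two modalities with content vectors of different sizes (3 and 6),
# projected to a common dimension of 5; attention dimension 3
rng = np.random.default_rng(1)
C = [rng.standard_normal(3), rng.standard_normal(6)]
proj = [rng.standard_normal((5, 3)), rng.standard_normal((5, 6))]
V_B = [rng.standard_normal((3, 3)), rng.standard_normal((3, 6))]
b_B = [np.zeros(3), np.zeros(3)]
beta, fused = multimodal_attention(rng.standard_normal(4), C, proj,
                                   rng.standard_normal((3, 4)), V_B,
                                   rng.standard_normal(3), b_B)
print(beta.shape, fused.shape)  # (2,) (5,)
```

Unlike the simple fusion, the weights beta depend on the decoder state s_prev, so the relative emphasis on each modality can change from word to word.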
- FIG. 7 shows the architecture of the sentence generator according to embodiments of the present disclosure, including the multimodal attention mechanism.
- the feature-level attention weights can change according to the decoder state and the content vectors, which enables the decoder network to pay attention to a different set of features and/or modalities when predicting each subsequent word in the description.
- the dataset has 1,970 video clips with multiple natural language descriptions. Each video clip is annotated with multiple parallel sentences provided by different Mechanical Turkers. There are 80,839 sentences in total, with about 41 annotated sentences per clip. Each sentence on average contains about 8 words. The words contained in all the sentences constitute a vocabulary of 13,010 unique lexical entries.
- the dataset is open-domain and covers a wide range of topics including sports, animals and music. The dataset is split into a training set of 1,200 video clips, a validation set of 100 clips, and a test set consisting of the remaining 670 clips.
- the image data are extracted from each video clip, which consists of 24 frames per second, and rescaled to 224×224-pixel images.
- a pretrained GoogLeNet CNN (M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013) is applied for image feature extraction using Caffe (Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014).
- Features are extracted from the hidden layer pool5/7x7_s1. We select one frame out of every 16 frames from each video clip and feed them into the CNN to obtain 1024-dimensional frame-wise feature vectors.
- VGGNet (K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014), pretrained on the ImageNet dataset (A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012), is also used as an image feature extractor.
- the hidden activation vectors of fully connected layer fc7 are used for the image features, which produces a sequence of 4096-dimensional feature vectors.
- Audio features are incorporated for use in the attention-based feature fusion method according to embodiments of the present disclosure. Since the YouTube2Text corpus does not contain audio tracks, we extracted the audio data via the original video URLs. Although a subset of the videos were no longer available on YouTube, we were able to collect the audio data for 1,649 video clips, which covers 84% of the corpus.
- the 44 kHz-sampled audio data are down-sampled to 16 kHz, and Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from each 50 ms time window with 25 ms shift.
- the sequence of 13-dimensional MFCC features are then concatenated into one vector from every group of 20 consecutive frames, which results in a sequence of 260-dimensional vectors.
- the MFCC features are normalized so that the mean and variance vectors are 0 and 1 in the training set.
- the validation and test sets are also adjusted with the original mean and variance vectors of the training set.
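The MFCC preprocessing described above (stacking 20 consecutive 13-dimensional frames into 260-dimensional vectors, then normalizing with training-set statistics) can be sketched as:

```python
import numpy as np

def stack_frames(mfcc, group=20, ndim=13):
    """Concatenate every group of 20 consecutive 13-dimensional MFCC
    frames into one 260-dimensional vector (dropping a ragged tail)."""
    n = (len(mfcc) // group) * group
    return mfcc[:n].reshape(-1, group * ndim)

def normalize_mfcc(train, val, test):
    """Zero-mean, unit-variance normalization computed on the training
    set; validation and test sets are adjusted with the same vectors."""
    mu = train.mean(axis=0)
    sd = train.std(axis=0) + 1e-8      # guard against zero variance
    return (train - mu) / sd, (val - mu) / sd, (test - mu) / sd

rng = np.random.default_rng(2)
frames = rng.standard_normal((200, 13)) * 5.0 + 2.0   # fake MFCC frames
stacked = stack_frames(frames)
print(stacked.shape)  # (10, 260)
train, val, test = normalize_mfcc(stacked[:8], stacked[8:9], stacked[9:])
```

Reusing the training-set mean and variance for the validation and test splits avoids leaking test statistics into the preprocessing.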
- a BLSTM encoder for the MFCC features is trained jointly with the decoder network. If audio data are missing for a video clip, then we feed in a sequence of dummy MFCC features, which is simply a sequence of zero vectors.
- the caption generation model (i.e., the decoder network) is trained to minimize the cross entropy criterion using the training set.
- Image features are fed to the decoder network through one projection layer of 512 units, while audio features, i.e. MFCCs, are fed to the BLSTM encoder followed by the decoder network.
- the encoder network has one projection layer of 512 units and bidirectional LSTM layers of 512 cells.
- the decoder network has one LSTM layer with 512 cells. Each word is embedded to a 256-dimensional vector when it is fed to the LSTM layer.
- the AdaDelta optimizer (M. D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012) is used to train the networks.
- the results are evaluated using the CIDEr metric (R. Vedantam, C. L. Zitnick, and D. Parikh. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, Mass., USA, Jun. 7-12, 2015, pages 4566-4575, 2015). We used the publicly available evaluation script prepared for the image captioning challenge (X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015).
- FIG. 8 shows comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure regarding the Youtube2text data set.
- the conventional methods compared are a simple additive multimodal fusion (Simple Multimodal), unimodal models with temporal attention (Unimodal), and baseline systems that use temporal attention.
- the first three rows of the table use temporal attention but only one modality (one feature type).
- the next two rows do multimodal fusion of two modalities (image and spatiotemporal) using either Simple Multimodal fusion (see FIG. 6 ) or our proposed Multimodal Attention mechanism (see FIG. 7 ).
- the next two rows also perform multimodal fusion, this time of three modalities (image, spatiotemporal, and audio features).
- the scores of the top two methods are shown in boldface.
- the Simple Multimodal model performed better than the Unimodal models.
- the Multimodal Attention model outperformed the Simple Multimodal model.
- the audio feature degrades the performance of the baseline because some YouTube data includes noise such as background music, which is unrelated to the video content.
- the Multimodal Attention model mitigated the impact of the noise of the audio features.
- combining the audio features using our proposed method achieved the best CIDEr score among all experimental conditions.
- the Multimodal Attention model improves upon the Simple Multimodal model.
- FIGS. 9A, 9B, 9C and 9D show comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure.
- FIGS. 9A-9C show three example video clips, for which the attention-based multimodal fusion method (Temporal & Multimodal attention with VGG and C3D) outperformed the single modal method (Temporal attention with VGG) and the simple modal fusion method (Temporal attention with VGG and C3D) in CIDEr measure.
- FIG. 9D shows an example video clip, for which the attention-based multimodal fusion method (Temporal & Multimodal attention) including audio features outperformed the single modal method (Temporal attention with VGG) and the simple modal fusion method (Temporal attention with VGG and C3D) with and without audio features.
- These examples show the efficacy of the multimodal attention mechanism.
- when the multi-modal fusion model described above is installed in a computer system, a video script can be generated effectively with less computing power; thus, the use of the multi-modal fusion model method or system can reduce central processing unit usage and power consumption.
- embodiments according to the present disclosure provide an effective method for performing the multimodal fusion model; thus, the use of a method and system using the multimodal fusion model can reduce central processing unit (CPU) usage, power consumption and/or network bandwidth usage.
- the embodiments can be implemented in any of numerous ways.
- the embodiments may be implemented using hardware, software or a combination thereof.
- the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers.
- processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component.
- a processor may be implemented using circuitry in any suitable format.
- the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
- embodiments of the present disclosure may be embodied as a method, of which an example has been provided.
- the acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts concurrently, even though shown as sequential acts in illustrative embodiments.
- the use of ordinal terms such as first and second in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for use of the ordinal term).
Abstract
A system for generating a word sequence includes one or more processors in connection with a memory and one or more storage devices storing instructions causing operations that include receiving first and second input vectors, extracting first and second feature vectors, estimating a first set of weights and a second set of weights, calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector, transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension, estimating a set of modal attention weights, generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors, and generating a predicted word using the sequence generator.
Description
- This invention generally relates to a method and system for describing multi-modal data, and more specifically to a method and system for video description.
- Automatic video description, known as video captioning, refers to the automatic generation of a natural language description (e.g., a sentence) that narrates an input video. Video description can have widespread applications including video retrieval, automatic description of home movies or online uploaded video clips, video descriptions for the visually impaired, warning generation for surveillance systems, and scene understanding for knowledge sharing between human and machine.
- Video description systems extract salient features from the video data, which may be multimodal features such as image features representing some objects, motion features representing some actions, and audio features indicating some events, and generate a description narrating events so that the words in the description are relevant to those extracted features and ordered appropriately as natural language.
- One inherent problem in video description is that the sequence of video features and the sequence of words in the description are not synchronized. In fact, objects and actions may appear in the video in a different order than they appear in the sentence. When choosing the right words to describe something, only the features that directly correspond to that object or action are relevant, and the other features are a source of clutter. In addition, some events are not always observed in all features.
- Accordingly, there is a need to use different features inclusively or selectively to infer each word of the description to achieve high-quality video description.
- Some embodiments of a present disclosure are based on generating content vectors from input data including multiple modalities. In some cases, the modalities may be audio signals, video signals (image signals) and motion signals contained in video signals.
- The present disclosure is based on a multimodal fusion system that generates the content vectors from the input data that include multiple modalities. In some cases, the multimodal fusion system receives input signals including image (video) signals, motion signals and audio signals and generates a description narrating events relevant to the input signals.
- According to embodiments of the present disclosure, a system for generating a word sequence from multi-modal input vectors includes one or more processors and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations that include receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
- Further, some embodiments of the present disclosure provide a non-transitory computer-readable medium storing software comprising instructions executable by one or more processors which, upon such execution, cause the one or more processors to perform operations. The operations include receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
- According to another embodiment of the present disclosure, a method for generating a word sequence from multi-modal input vectors includes receiving first and second input vectors according to first and second sequential intervals; extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors; estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator; calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors; transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension; estimating a set of modal attention weights from the pre-step context vector and the first and second content vectors or the first and second modal content vectors; generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
- The presently disclosed embodiments will be further explained with reference to the attached drawings. The drawings shown are not necessarily to scale, with emphasis instead generally being placed upon illustrating the principles of the presently disclosed embodiments.
-
FIG. 1 is a block diagram illustrating a multimodal fusion system according to some embodiments of the present disclosure; -
FIG. 2A is a block diagram illustrating a simple multimodal method according to embodiments of the present disclosure; -
FIG. 2B is a block diagram illustrating a multimodal attention method according to embodiments of the present disclosure; -
FIG. 3 is a block diagram illustrating an example of the LSTM-based encoder-decoder architecture according to embodiments of the present disclosure; -
FIG. 4 is a block diagram illustrating an example of the attention-based sentence generator from video according to embodiments of the present disclosure; -
FIG. 5 is a block diagram illustrating an extension of the attention-based sentence generator from video according to embodiments of the present disclosure; -
FIG. 6 is a diagram illustrating a simple feature fusion approach (simple multimodal method) according to embodiments of the present disclosure; -
FIG. 7 is a diagram illustrating an architecture of a sentence generator according to embodiments of the present disclosure; -
FIG. 8 shows comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure; -
FIGS. 9A, 9B, 9C and 9D show comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure; and - While the above-identified drawings set forth presently disclosed embodiments, other embodiments are also contemplated, as noted in the discussion. This disclosure presents illustrative embodiments by way of representation and not limitation. Numerous other modifications and embodiments can be devised by those skilled in the art which fall within the scope and spirit of the principles of the presently disclosed embodiments.
- The following description provides exemplary embodiments only, and is not intended to limit the scope, applicability, or configuration of the disclosure. Rather, the following description of the exemplary embodiments will provide those skilled in the art with an enabling description for implementing one or more exemplary embodiments. Contemplated are various changes that may be made in the function and arrangement of elements without departing from the spirit and scope of the subject matter disclosed as set forth in the appended claims.
- Specific details are given in the following description to provide a thorough understanding of the embodiments. However, it will be understood by one of ordinary skill in the art that the embodiments may be practiced without these specific details. For example, systems, processes, and other elements in the subject matter disclosed may be shown as components in block diagram form in order not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail in order to avoid obscuring the embodiments. Further, like reference numbers and designations in the various drawings indicate like elements.
- Also, individual embodiments may be described as a process, which is depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be re-arranged. A process may be terminated when its operations are completed, but may have additional steps not discussed or included in a figure. Furthermore, not all operations in any particularly described process may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, etc. When a process corresponds to a function, the function's termination can correspond to a return of the function to the calling function or the main function.
- Furthermore, embodiments of the subject matter disclosed may be implemented, at least in part, either manually or automatically. Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the necessary tasks may be stored in a machine-readable medium. A processor(s) may perform the necessary tasks.
- In accordance with embodiments of the present disclosure, a system for generating a word sequence from multi-modal input vectors includes one or more processors in connection with one or more memories and one or more storage devices storing instructions that are operable. When the instructions are executed by the one or more processors, the instructions cause the one or more processors to perform operations that include receiving first and second input vectors according to first and second sequential intervals, extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors, estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator, calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors, transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension, estimating a set of modal attention weights from the pre-step context vector and the first and second modal content vectors, generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors, and generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
- In this case, the first modal content vector, the second modal content vector and the weighted content vector have the same predetermined dimension. This makes it possible for the system to perform a multi-modal fusion model. In other words, by designing or determining dimensions of the input vectors and the weighted content vectors to have an identical dimension, those vectors can be easily handled in data processing of the multi-modal fusion model because those vectors are expressed by use of an identical data format having the identical dimension. As the data processing is simplified by using data transformed to have the identical dimension, the multi-modal fusion model method or system according to embodiments of the present disclosure can reduce central processing unit usage and power consumption for generating a word sequence from multi-modal input vectors.
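The sequence of operations described in this embodiment can be sketched end-to-end for one decoding step; everything below (parameter names, dimensions, the `make_params` helper) is an illustrative assumption rather than the actual implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def generate_step(feats, s_prev, params):
    """One decoding step: temporal attention per modality -> content
    vectors -> projection to a common dimension -> modal attention
    weights -> weighted content vector for the sequence generator."""
    projected, scores = [], []
    for X, p in zip(feats, params):            # loop over the modalities
        # temporal attention over this modality's feature vectors
        e = np.array([p['w_A'] @ np.tanh(p['W_A'] @ s_prev + p['V_A'] @ x)
                      for x in X])
        c_k = softmax(e) @ X                   # content vector
        projected.append(p['P'] @ c_k)         # modal content vector (common dim)
        scores.append(p['w_B'] @ np.tanh(p['W_B'] @ s_prev + p['V_B'] @ c_k))
    beta = softmax(np.array(scores))           # modal attention weights
    g = sum(b * m for b, m in zip(beta, projected))  # weighted content vector
    return g, beta

def make_params(rng, d_x, d_s=4, d_att=3, d_out=5):
    """Random illustrative parameters for one modality."""
    return {'W_A': rng.standard_normal((d_att, d_s)),
            'V_A': rng.standard_normal((d_att, d_x)),
            'w_A': rng.standard_normal(d_att),
            'P':   rng.standard_normal((d_out, d_x)),
            'W_B': rng.standard_normal((d_att, d_s)),
            'V_B': rng.standard_normal((d_att, d_x)),
            'w_B': rng.standard_normal(d_att)}

rng = np.random.default_rng(3)
feats = [rng.standard_normal((7, 10)), rng.standard_normal((5, 6))]  # two modalities
params = [make_params(rng, 10), make_params(rng, 6)]
g, beta = generate_step(feats, rng.standard_normal(4), params)
print(g.shape, beta)  # common-dimension weighted content vector; weights sum to 1
```

Because every modal content vector is projected to the same dimension before fusion, the weighted content vector g has a fixed data format regardless of how many modalities or feature sizes are used.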
- Of course, the number of the vectors may be changed to predetermined N vectors according to the requirements of the system design. For instance, when the predetermined N is set to be three, the three input vectors can be image features, motion features and audio features obtained from image data, video signals and audio signals received via an input/output interface included in the system.
- In some cases, the first and second sequential intervals may be an identical interval, and the first and second input vectors may come from different modalities.
-
FIG. 1 shows a block diagram of a multimodal fusion system 100 according to some embodiments of the present disclosure. The multimodal fusion system 100 can include a human machine interface (HMI) with input/output (I/O) interface 110 connectable with a keyboard 111 and a pointing device/medium 112, a microphone 113, a receiver 114, a transmitter 115, a 3D sensor 116, a global positioning system (GPS) 117, one or more I/O interfaces 118, a processor 120, a storage device 130, a memory 140, a network interface controller (NIC) 150 connectable with a network 155 including local area networks and an internet network (not shown), a display interface 160 connected to a display device 165, an imaging interface 170 connectable with an imaging device 175, and a printer interface 180 connectable with a printing device 185. The HMI with I/O interface 110 may include analog/digital and digital/analog converters. The HMI with I/O interface 110 includes a wireless communication interface that can communicate with other multimodal fusion systems or other computers via wireless internet connections or wireless local area networks. The multimodal fusion system 100 can include a power source 190. The power source 190 may be a battery rechargeable from an external power source (not shown) via the I/O interface 118. Depending upon the application, the power source 190 may optionally be located outside of the system 100.
- The HMI and I/O interface 110 and the I/O interface 118 can be adapted to connect to another display device (not shown), including a computer monitor, camera, television, projector, or mobile device, among others.
- The multimodal fusion system 100 can receive electric text/imaging documents 195, including speech data, via the network 155 connected to the NIC 150. The storage device 130 includes a sequence generation model 131, a feature extraction model 132 and a multimodal fusion model 200, in which the algorithms of the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200 are stored in the storage 130 as program code data. The algorithms of the models 131-132 and 200 may also be stored on a computer readable recording medium (not shown) so that the processor 120 can execute them by loading them from the medium. Further, the pointing device/medium 112 may include modules that read and perform programs stored on a computer readable recording medium.
- In order to start performing the algorithms of the models 131-132 and 200, instructions may be transmitted to the system 100 using the keyboard 111, the pointing device/medium 112, or via the wireless network or the network 155 connected to other computers (not shown). The algorithms may also be started in response to receiving an acoustic signal of a user via the microphone 113, using a pre-installed conventional speech recognition program stored in the storage 130. Further, the system 100 includes a turn-on/off switch (not shown) to allow the user to start/stop operating the system 100.
- The HMI and I/O interface 110 may include an analog-digital (A/D) converter, a digital-analog (D/A) converter and a wireless signal antenna for connecting to the network 155. Further, the one or more I/O interfaces 118 may be connectable to a cable television (TV) network or a conventional TV antenna receiving TV signals. The signals received via the interface 118 can be converted into digital images and audio signals, which can be processed according to the algorithms of the models 131-132 and 200 in connection with the processor 120 and the memory 140, so that video scripts are generated and displayed on the display device 165 with picture frames of the digital images while the audio of the TV signals is output via a speaker 19. The speaker may be included in the system 100, or an external speaker may be connected via the interface 110 or the I/O interface 118.
- The processor 120 may be a plurality of processors including one or more graphics processing units (GPUs). The storage 130 may include speech recognition algorithms (not shown) that can recognize speech signals obtained via the microphone 113.
- The multimodal fusion model 200, the sequence generation model 131 and the feature extraction model 132 may be formed by neural networks.
-
FIG. 2A is a block diagram illustrating a simple multimodal method according to embodiments of the present disclosure. The simple multimodal method can be performed by the processor 120 executing programs of the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200 stored in the storage 130. The sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200 may be stored on a computer-readable recording medium, so that the simple multimodal method can be performed when the processor 120 loads and executes their algorithms. The simple multimodal method is performed by the combination of the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200. Further, the simple multimodal method uses feature extractors 211, 221 and 231 (feature extractors 1˜K), attention estimators 212, 222 and 232 (attention estimators 1˜K), weighted sum processors (calculators) 213, 223 and 233 (weighted sum processors 1˜K), feature transformation modules 214, 224 and 234 (feature transformation modules 1˜K), a simple sum processor (calculator) 240 and a Sequence Generator 250.
-
FIG. 2B is a block diagram illustrating a multimodal attention method according to embodiments of the present disclosure. In addition to the feature extractors 1˜K, the attention estimators 1˜K, the weighted sum processors 1˜K, the feature transformation modules 1˜K and the Sequence Generator 250, the multimodal attention method further includes a modal attention estimator 255 and a weighted sum processor 245, which are used instead of the simple sum processor 240. The multimodal attention method is performed by the combination of the sequence generation model 131, the feature extraction model 132 and the multimodal fusion model 200. In both methods, the sequence generation model 131 provides the Sequence Generator 250 and the feature extraction model 132 provides the feature extractors 1˜K. Further, the feature transformation modules 1˜K, the modal attention estimator 255, the weighted sum processors 1˜K and the weighted sum processor 245 may be provided by the multimodal fusion model 200.
- Given multimodal video data including K modalities, such that K≥2 and some of the modalities may be the same, Modal-1 data are converted to a fixed-dimensional content vector using the feature extractor 211, the attention estimator 212 and the weighted-sum processor 213, where the feature extractor 211 extracts multiple feature vectors from the data, the attention estimator 212 estimates a weight for each extracted feature vector, and the weighted-sum processor 213 outputs (generates) the content vector computed as a weighted sum of the extracted feature vectors with the estimated weights. Modal-2 data are converted to a fixed-dimensional content vector using the feature extractor 221, the attention estimator 222 and the weighted-sum processor 223. Proceeding in the same manner up to the Modal-K data, for which the feature extractor 231, the attention estimator 232 and the weighted-sum processor 233 are used, K fixed-dimensional content vectors are obtained. Each of the Modal-1, Modal-2, . . . , Modal-K data may be sequential data in a time sequential order with an interval, or in other predetermined orders with predetermined time intervals.
- Each of the K content vectors is then transformed (converted) into an N-dimensional vector by the corresponding feature transformation module 214, 224 or 234, and K transformed N-dimensional vectors are obtained, where N is a predefined positive integer.
- The K transformed N-dimensional vectors are summed into a single N-dimensional content vector in the simple multimodal method of FIG. 2A, whereas the vectors are converted to a single N-dimensional content vector using the modal attention estimator 255 and the weighted-sum processor 245 in the multimodal attention method of FIG. 2B, wherein the modal attention estimator 255 estimates a weight for each transformed N-dimensional vector, and the weighted-sum processor 245 outputs (generates) the N-dimensional content vector computed as a weighted sum of the K transformed N-dimensional vectors with the estimated weights.
- The Sequence Generator 250 receives the single N-dimensional content vector and predicts one label corresponding to a word of a sentence that describes the video data. For predicting the next word, the Sequence Generator 250 provides contextual information of the sentence, such as a vector that represents the previously-generated words, to the attention estimators 212, 222, 232 and the modal attention estimator 255 for estimating the attention weights used to obtain appropriate content vectors. This vector may be referred to as a pre-step (or prestep) context vector.
- The Sequence Generator 250 predicts the next word beginning with the start-of-sentence token, "<sos>," and generates a descriptive sentence or sentences by predicting the next word (predicted word) iteratively until a special symbol "<eos>" corresponding to "end of sentence" is predicted. In other words, the Sequence Generator 250 generates a word sequence from multi-modal input vectors. In some cases, the multi-modal input vectors may be received via different input/output interfaces such as the HMI and I/O interface 110 or the one or more I/O interfaces 118.
- In each generating process, a predicted word is generated to have the highest probability among all possible words, given the weighted content vector and the pre-step context vector. Further, the predicted word can be accumulated into the
memory 140, the storage device 130 or other storage devices (not shown) to generate the word sequence, and this accumulation process can be continued until the special symbol (end of sequence) is received. The system 100 can transmit the predicted words generated by the Sequence Generator 250 via the NIC 150 and the network 155, the HMI and I/O interface 110 or the one or more I/O interfaces 118, so that the data of the predicted words can be used by other computers 195 or other output devices (not shown).
- When each of the K content vectors comes from distinct modality data and/or through a distinct feature extractor, modality or feature fusion with the weighted sum of the K transformed vectors enables a better prediction of each word by paying attention to different modalities and/or different features according to the contextual information of the sentence. Thus, this multimodal attention method can utilize different features inclusively or selectively, using attention weights over different modalities or features, to infer each word of the description.
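The two fusion variants described above, the simple sum of FIG. 2A and the modal-attention weighted sum of FIG. 2B, can be sketched in a few lines of NumPy. This is an illustrative sketch only; the `fuse` helper and all vector values and weights below are hypothetical, not part of the disclosed system.

```python
import numpy as np

def fuse(modal_vectors, weights=None):
    """Combine K transformed N-dimensional content vectors into one.

    With weights=None this mimics the simple sum processor 240 (FIG. 2A);
    with a length-K weight vector it mimics the modal attention estimator 255
    plus the weighted sum processor 245 (FIG. 2B).
    """
    d = np.stack(modal_vectors)             # shape (K, N)
    if weights is None:                     # simple multimodal method
        return d.sum(axis=0)
    w = np.asarray(weights).reshape(-1, 1)  # modal attention weights, sum to 1
    return (w * d).sum(axis=0)              # attention-weighted fusion

# Hypothetical example with K=3 modalities and N=4
d1, d2, d3 = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
simple = fuse([d1, d2, d3])                     # -> [6., 6., 6., 6.]
attended = fuse([d1, d2, d3], [0.5, 0.3, 0.2])  # -> [1.7, 1.7, 1.7, 1.7]
```

Because every modality is first transformed to the same dimension N, both variants reduce to simple array arithmetic, which is the data-format advantage discussed above.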
- Further, the
multimodal fusion model 200 in the system 100 includes a data distribution module (not shown), which receives multiple time-sequential data via the I/O interface 110 or 118, distributes the received data into Modal-1, Modal-2, . . . , Modal-K data, divides each distributed time-sequential data stream according to a predetermined interval or intervals, and then provides the Modal-1, Modal-2, . . . , Modal-K data to the feature extractors 1˜K, respectively.
system 100 uses the 211, 221 and 231 (set K=3) infeature extractors FIG. 2B . The video clip is provided to the 211, 221 and 231 in thefeature extractors system 100 via the I/ 110 or 118. The feature extractors 211, 221 and 231 can extract image data, audio data and motion data, respectively, from the video clip as Modal-1 data, Modal-2 data and Modal-3 (e.g. K=3 inO interface FIG. 2B ). In this case, the 211, 221 and 231 receive Modal-1 data, Modal-2 data and Modal-3 according to first, second and third intervals, respectively, from data stream of the video clip.feature extractors - In some cases, the data distribution module may divide the multiple time-sequential data with predetermined different time intervals, respectively, when image features, motion features, or audio features can be captured with different time intervals.
- Encoder-Decoder-Based Sentence Generator
- An approach to video description can be based on sequence-to-sequence learning. The input sequence, i.e., image sequence, is first encoded to a fixed-dimensional semantic vector. Then the output sequence, i.e., word sequence, is generated from the semantic vector. In this case, both the encoder and the decoder (or generator) are usually modeled as Long Short-Term Memory (LSTM) networks.
-
FIG. 3 shows an example of the LSTM-based encoder-decoder architecture. Given a sequence of images, X = x_1, x_2, . . . , x_L, each image is first fed to a feature extractor, which can be a pretrained Convolutional Neural Network (CNN) for an image or video classification task such as GoogLeNet, VGGNet, or C3D. The sequence of image features, X′ = x′_1, x′_2, . . . , x′_L, is obtained by extracting the activation vector of a fully-connected layer of the CNN for each input image. The sequence of feature vectors is then fed to the LSTM encoder, and the hidden state of the LSTM is given by

$$h_t = \mathrm{LSTM}(h_{t-1}, x'_t; \lambda_E), \tag{1}$$

- where the LSTM function of the encoder network λ_E is computed as
$$\mathrm{LSTM}(h_{t-1}, x_t; \lambda) = o_t \tanh(c_t), \tag{2}$$
- where
$$o_t = \sigma(W_{xo}^{(\lambda)} x_t + W_{ho}^{(\lambda)} h_{t-1} + b_o^{(\lambda)}), \tag{3}$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc}^{(\lambda)} x_t + W_{hc}^{(\lambda)} h_{t-1} + b_c^{(\lambda)}), \tag{4}$$
$$f_t = \sigma(W_{xf}^{(\lambda)} x_t + W_{hf}^{(\lambda)} h_{t-1} + b_f^{(\lambda)}), \tag{5}$$
$$i_t = \sigma(W_{xi}^{(\lambda)} x_t + W_{hi}^{(\lambda)} h_{t-1} + b_i^{(\lambda)}), \tag{6}$$
- The decoder predicts the next word iteratively beginning with the start-of-sentence token, “<sos>” until it predicts the end-of-sentence token, “<eos>.” The start-of-sentence token may be referred to as a start label, and the end-of sentence token may be referred to as an end label.
- Given decoder state si-1, the decoder network λD infers the next word probability distribution as
-
$$P(y \mid s_{i-1}) = \mathrm{softmax}(W_s^{(\lambda_D)} s_{i-1} + b_s^{(\lambda_D)}), \tag{7}$$
-
- where V denotes the vocabulary. The decoder state is updated using the LSTM network of the decoder as
-
$$s_i = \mathrm{LSTM}(s_{i-1}, y'_i; \lambda_D), \tag{9}$$
- where y′_i is a word-embedding vector of y_i, and the initial state s_0 is obtained from the final encoder state h_L and y′_0 = Embed(<sos>), as shown in FIG. 3.
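A minimal sketch of the prediction step of Eqs. (7) and (8): a softmax over the vocabulary followed by an argmax. The three-word vocabulary and the weight values below are hypothetical.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decode_step(s_prev, W_s, b_s, vocab):
    """Next-word prediction: Eq. (7) gives the distribution,
    Eq. (8) picks the most probable word."""
    p = softmax(W_s @ s_prev + b_s)     # Eq. (7)
    return vocab[int(np.argmax(p))], p  # Eq. (8)

# Hypothetical 3-word vocabulary and a 2-dimensional decoder state
vocab = ['<eos>', 'dog', 'runs']
W_s = np.array([[0.0, 0.0],
                [1.0, 0.0],
                [0.0, 1.0]])
word, p = decode_step(np.array([0.2, 2.0]), W_s, np.zeros(3), vocab)
# The logits are [0, 0.2, 2.0], so the argmax selects 'runs'
```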
-
- Accordingly, a beam search in the test phase can be used to keep multiple states and hypotheses with the highest cumulative probabilities at each m-th step, and select the best hypothesis from those having reached the end-of-sentence token.
- Attention-Based Sentence Generator
- Another approach to video description can be an attention based sequence generator, which enables the network to emphasize features from specific times or spatial regions depending on the current context, enabling the next word to be predicted more accurately. Compared to the basic approach described above, the attention-based generator can exploit input features selectively according to the input and output contexts. The efficacy of attention models has been shown in many tasks such as machine translation.
-
FIG. 4 is a block diagram illustrating an example of the attention-based sentence generator from video, which has a temporal attention mechanism over the input image sequence. The input image sequence may be in a time sequential order with predetermined time intervals. The input sequence of feature vectors is obtained using one or more feature extractors. In this case, attention-based generators may employ an encoder based on a bidirectional LSTM (BLSTM) or Gated Recurrent Units (GRU) to further convert the feature vector sequence, as in FIG. 5, so that each vector contains its contextual information.
- If an BLSTM encoder is used following the feature extraction as in
FIG. 5 , then the activation vectors (i.e., encoder states) can be obtained as -
- where ht (f) and ht (b) are the forward and backward hidden activation vectors:
-
$$h_t^{(f)} = \mathrm{LSTM}(h_{t-1}^{(f)}, x'_t; \lambda_E^{(f)}), \tag{13}$$
$$h_t^{(b)} = \mathrm{LSTM}(h_{t+1}^{(b)}, x'_t; \lambda_E^{(b)}). \tag{14}$$
-
$$h_t = \tanh(W_p x'_t + b_p), \tag{15}$$
- The attention mechanism is realized by using attention weights to the hidden activation vectors throughout the input sequence. These weights enable the network to emphasize features from those time steps that are most important for predicting the next output word.
- Let αi,t be an attention weight between the ith output word and the tth input feature vector. For the ith output, the vector representing the relevant content of the input sequence is obtained as a weighted sum of hidden unit activation vectors:
-
- The decoder network is an Attention-based Recurrent Sequence Generator (ARSG) that generates an output label sequence with content vectors ci. The network also has an LSTM decoder network, where the decoder state can be updated in the same way as Equation (9).
- Then, the output label probability is computed as
-
$$P(y \mid s_{i-1}, c_i) = \mathrm{softmax}(W_s^{(\lambda_D)} s_{i-1} + W_c^{(\lambda_D)} c_i + b_s^{(\lambda_D)}), \tag{17}$$
-
- In contrast to Equations (7) and (8) of the basic encoder-decoder, the probability distribution is conditioned on the content vector ci, which emphasizes specific features that are most relevant to predicting each subsequent word. One more feed-forward layer can be inserted before the softmax layer. In this case, the probabilities are computed as follows:
-
$$g_i = \tanh(W_s^{(\lambda_D)} s_{i-1} + W_c^{(\lambda_D)} c_i + b_s^{(\lambda_D)}), \tag{19}$$
and
$$P(y \mid s_{i-1}, c_i) = \mathrm{softmax}(W_g^{(\lambda_D)} g_i + b_g^{(\lambda_D)}). \tag{20}$$
-
- and
-
$$e_{i,t} = w_A^{\top} \tanh(W_A s_{i-1} + V_A h_t + b_A), \tag{22}$$
- Attention-Based Multimodal Fusion
- Embodiments of the present disclosure provide an attention model to handle fusion of multiple modalities, where each modality has its own sequence of feature vectors. For video description, multimodal inputs such as image features, motion features, and audio features are available. Furthermore, combination of multiple features from different feature extraction methods are often effective to improve the description accuracy.
- In some cases, content vectors from VGGNet (image features) and C3D (spatiotemporal motion features) may be combined into one vector, which is used to predict the next word. This can be performed in the fusion layer. Let K be the number of modalities, i.e., the number of sequences of input feature vectors, the following activation vector is computed instead of Eq. (19),
-
- where
-
$$d_{k,i} = W_{ck}^{(\lambda_D)} c_{k,i}, \tag{24}$$
-
FIG. 6 shows the simple feature fusion approach (simple multimodal method) assuming K=2, in which content vectors are obtained with attention weights for the individual input sequences x_{11}, . . . , x_{1L} and x_{21}, . . . , x_{2L′}, respectively. These content vectors are then combined with the weight matrices W_{c1} and W_{c2}, which are commonly used in the sentence generation step. Consequently, the content vectors from each feature type (or one modality) are always fused using the same weights, independent of the decoder state. This architecture may limit the ability to exploit multiple types of features effectively, because it does not allow the relative weights of each feature type (of each modality) to change based on the context.
-
- where
-
$$d_{k,i} = W_{ck}^{(\lambda_D)} c_{k,i} + b_{ck}^{(\lambda_D)}. \tag{26}$$
-
- where
-
$$v_{k,i} = w_B^{\top} \tanh(W_B s_{i-1} + V_{Bk} c_{k,i} + b_{Bk}), \tag{28}$$
-
FIG. 7 shows the architecture of the sentence generator according to embodiments of the present disclosure, including the multimodal attention mechanism. Unlike the simple multimodal fusion method inFIG. 6 , inFIG. 7 , the feature-level attention weights can change according to the decoder state and the content vectors, which enables the decoder network to pay attention to a different set of features and/or modalities when predicting each subsequent word in the description. - Dataset for Evaluation
- Some experimental results are described below, discussing the feature fusion according to an embodiment of the present disclosure using the YouTube2Text video corpus. This corpus is well suited for training and evaluating automatic video description generation models. The dataset has 1,970 video clips with multiple natural language descriptions. Each video clip is annotated with multiple parallel sentences provided by different Mechanical Turkers. There are 80,839 sentences in total, with about 41 annotated sentences per clip. Each sentence contains about 8 words on average. The words contained in all the sentences constitute a vocabulary of 13,010 unique lexical entries. The dataset is open-domain and covers a wide range of topics, including sports, animals and music. The dataset is split into a training set of 1,200 video clips, a validation set of 100 clips, and a test set consisting of the remaining 670 clips.
- Video Preprocessing
- The image data are extracted from each video clip, which consists of 24 frames per second, and rescaled to 224×224 pixel images. For extracting image features, a pretrained GoogLeNet CNN (M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.) is used to extract a fixed-length representation with the help of the popular implementation in Caffe (Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.). Features are extracted from the hidden layer pool5/7x7 s1. We select one frame out of every 16 frames from each video clip and feed it into the CNN to obtain 1024-dimensional frame-wise feature vectors.
- We also use a VGGNet (K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.) that was pretrained on the ImageNet dataset (A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.). The hidden activation vectors of fully connected layer fc7 are used for the image features, which produces a sequence of 4096-dimensional feature vectors. Furthermore, to model motion and short-term spatiotemporal activity, we use the pretrained C3D (D. Tran, L. D. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, Dec. 7-13, 2015, pages 4489-4497, 2015.) (which was trained on the Sports-1M dataset (A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1725-1732, 2014.)). The C3D network reads sequential frames in the video and outputs a fixed-length feature vector every 16 frames. Activation vectors were extracted from fully-connected layer fc6-1, which has 4096-dimensional features.
- Audio Processing
- Audio features are incorporated to be used in the attention-based feature fusion method according to embodiments of the present disclosure. Since the YouTube2Text corpus does not contain audio tracks, we extracted the audio data via the original video URLs. Although a subset of the videos was no longer available on YouTube, we were able to collect the audio data for 1,649 video clips, which covers 84% of the corpus. The 44 kHz-sampled audio data are down-sampled to 16 kHz, and Mel-Frequency Cepstral Coefficients (MFCCs) are extracted from each 50 ms time window with a 25 ms shift. The 13-dimensional MFCC features are then concatenated into one vector from every group of 20 consecutive frames, which results in a sequence of 260-dimensional vectors. The MFCC features are normalized so that the mean and variance vectors are 0 and 1 in the training set. The validation and test sets are also adjusted with the original mean and variance vectors of the training set. Unlike with the image features, we apply a BLSTM encoder network to the MFCC features, which is trained jointly with the decoder network. If audio data are missing for a video clip, then we feed in a sequence of dummy MFCC features, which is simply a sequence of zero vectors.
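The MFCC grouping described above, 20 consecutive 13-dimensional frames concatenated into one 260-dimensional vector, amounts to a single reshape. The helper below is a sketch; the zero frames stand in for real MFCC output from any feature-extraction library.

```python
import numpy as np

def stack_mfcc(mfcc, group=20):
    """Concatenate every group of consecutive MFCC frames into one vector;
    a ragged tail shorter than one group is dropped."""
    n = (len(mfcc) // group) * group
    return mfcc[:n].reshape(-1, group * mfcc.shape[1])

mfcc = np.zeros((100, 13))   # 100 dummy 13-dimensional MFCC frames
stacked = stack_mfcc(mfcc)   # 5 vectors of dimension 20 * 13 = 260
```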
- Setup for Describing Multi-Modal Data
- The caption generation model, i.e. the decoder network, is trained to minimize the cross entropy criterion using the training set. Image features are fed to the decoder network through one projection layer of 512 units, while audio features, i.e. MFCCs, are fed to the BLSTM encoder followed by the decoder network. The encoder network has one projection layer of 512 units and bidirectional LSTM layers of 512 cells. The decoder network has one LSTM layer with 512 cells. Each word is embedded into a 256-dimensional vector when it is fed to the LSTM layer. We apply the AdaDelta optimizer (M. D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.) to update the parameters, which is widely used for optimizing attention models. The LSTM and attention models were implemented using Chainer (S. Tokui, K. Oono, S. Hido, and J. Clayton. Chainer: a next generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS), 2015).
- The similarity between the ground truth and the automatic video description results is evaluated using machine-translation-motivated metrics: BLEU (K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Jul. 6-12, 2002, Philadelphia, Pa., USA, pages 311-318, 2002.), METEOR (M. J. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation, WMT@ACL 2014, Jun. 26-27, 2014, Baltimore, Md., USA, pages 376-380, 2014.), and another metric for image description, CIDEr (R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, Mass., USA, Jun. 7-12, 2015, pages 4566-4575, 2015.). We used the publicly available evaluation script prepared for the image captioning challenge (X. Chen, H. Fang, T. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015.).
- Evaluation Results
-
FIG. 8 shows comparisons of performance results obtained by conventional methods and by the multimodal attention method according to embodiments of the present disclosure on the YouTube2Text data set. The conventional methods compared are a simple additive multimodal fusion (Simple Multimodal), unimodal models with temporal attention (Unimodal), and baseline systems that use temporal attention.
FIG. 6 ) or our proposed Multimodal Attention mechanism (seeFIG. 7 ). The next two rows also perform multimodal fusion, this time of three modalities (image, spatiotemporal, and audio features). In each column, the scores of the top two methods are shown in boldface. - The Simple Multimodal model performed better than the Unimodal models. However, the Multimodal Attention model outperformed the Simple Multimodal model. The audio feature degrades the performance of the baseline because some YouTube data includes noise such as background music, which is unrelated to the video content. The Multimodal Attention model mitigated the impact of the noise of the audio features. Moreover, combining the audio features using our proposed method reached the best performance of CIDEr for all experimental conditions.
- Accordingly, Multimodal Attention model improves upon the Simple Multimodal.
-
FIGS. 9A, 9B, 9C and 9D show comparisons of performance results obtained by conventional methods and the multimodal attention method according to embodiments of the present disclosure. -
FIGS. 9A-9C show three example video clips for which the attention-based multimodal fusion method (Temporal & Multimodal attention with VGG and C3D) outperformed the single modal method (Temporal attention with VGG) and the simple modal fusion method (Temporal attention with VGG and C3D) in the CIDEr measure. FIG. 9D shows an example video clip for which the attention-based multimodal fusion method (Temporal & Multimodal attention) including audio features outperformed the single modal method (Temporal attention with VGG) and the simple modal fusion method (Temporal attention with VGG and C3D), with and without audio features. These examples show the efficacy of the multimodal attention mechanism.
- Further, embodiments according to the present disclosure provide effective method for performing the multimodal fusion model, thus, the use of a method and system using the multimodal fusion model can reduce central processing unit (CPU) usage, power consumption and/or network band width usage.
- The above-described embodiments of the present disclosure can be implemented in any of numerous ways. For example, the embodiments may be implemented using hardware, software, or a combination thereof. When implemented in software, the software code can be executed on any suitable processor or collection of processors, whether provided in a single computer or distributed among multiple computers. Such processors may be implemented as integrated circuits, with one or more processors in an integrated circuit component. Alternatively, a processor may be implemented using circuitry in any suitable format.
- Also, the various methods or processes outlined herein may be coded as software that is executable on one or more processors that employ any one of a variety of operating systems or platforms. Additionally, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and also may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.
- Further, the embodiments of the present disclosure may be embodied as a method, of which an example has been provided. The acts performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which acts are performed in an order different than illustrated, which may include performing some acts concurrently, even though they are shown as sequential acts in illustrative embodiments. Further, the use of ordinal terms such as "first" and "second" in the claims to modify a claim element does not by itself connote any priority, precedence, or order of one claim element over another, or the temporal order in which acts of a method are performed; such terms are used merely as labels to distinguish one claim element having a certain name from another element having the same name (but for the use of the ordinal term).
- Although the present disclosure has been described with reference to certain preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the present disclosure. Therefore, it is the object of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the present disclosure.
Claims (20)
1. A system for generating a word sequence from multi-modal input vectors, comprising:
one or more processors in connection with a memory and one or more storage devices storing instructions that are operable, when executed by the one or more processors, to cause the one or more processors to perform operations comprising:
receiving first and second input vectors according to first and second sequential intervals;
extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors;
estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator;
calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors;
transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension;
estimating a set of modal attention weights from the prestep context vector and the first and second content vectors or the first and second modal content vectors;
generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and
generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
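The sequence of operations recited in claim 1 can be sketched numerically as follows. This is an illustrative reconstruction, not the claimed implementation: the dimensions, the random placeholder parameters, the dot-product attention scoring, and the softmax normalization are all assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# First and second feature vectors (e.g., image and audio features), one per
# sequential interval; the two modalities may differ in dimensionality.
feats1 = rng.normal(size=(8, 512))   # 8 intervals, 512-d features
feats2 = rng.normal(size=(6, 128))   # 6 intervals, 128-d features
context = rng.normal(size=64)        # prestep context vector of the generator

# Estimate a set of weights per modality from the feature vectors and the
# prestep context vector (dot-product scoring is an assumption).
W1 = rng.normal(size=(512, 64)) * 0.05
W2 = rng.normal(size=(128, 64)) * 0.05
alpha1 = softmax(feats1 @ W1 @ context)
alpha2 = softmax(feats2 @ W2 @ context)

# Calculate content vectors as attention-weighted sums of the features.
c1 = alpha1 @ feats1                 # (512,)
c2 = alpha2 @ feats2                 # (128,)

# Transform each content vector into a modal content vector having a
# predetermined common dimension (here 64).
P1 = rng.normal(size=(64, 512)) * 0.05
P2 = rng.normal(size=(64, 128)) * 0.05
m1, m2 = P1 @ c1, P2 @ c2

# Estimate modal attention weights from the prestep context vector and the
# modal content vectors, then form the weighted content vector.
beta = softmax(np.array([m1 @ context, m2 @ context]))
weighted = beta[0] * m1 + beta[1] * m2   # same predetermined dimension

# A predicted word would then be drawn from a distribution conditioned on
# the weighted content vector (the LSTM sequence generator is omitted).
vocab_logits = rng.normal(size=(1000, 64)) @ weighted
predicted_word_id = int(np.argmax(vocab_logits))
```

In a full system the random projections would be trained parameters, and the predicted word would be fed back to update the context vector before the next decoding step.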
2. The system of claim 1 , wherein the first and second sequential intervals are an identical interval.
3. The system of claim 1 , wherein the first and second input vectors are different modalities.
4. The system of claim 1 , wherein the operations further comprise:
accumulating the predicted word into the memory or the one or more storage devices to generate the word sequence.
5. The system of claim 4 , wherein the accumulating is continued until an end label is received.
6. The system of claim 1 , wherein the operations further comprise:
transmitting the predicted word generated from the sequence generator.
7. The system of claim 1 , wherein the first and second feature extractors are pretrained Convolutional Neural Networks (CNNs) having been trained for an image or a video classification task.
8. The system of claim 1 , wherein the feature extractors are Long Short-Term Memory (LSTM) networks.
9. The system of claim 1 , wherein the predicted word is determined as a word having a highest probability among all possible words, given the weighted content vector and the prestep context vector.
10. The system of claim 1 , wherein the sequence generator employs a Long Short-Term Memory (LSTM) network.
11. The system of claim 1 , wherein the first input vector is received via a first input/output (I/O) interface and the second input vector is received via a second I/O interface.
12. A non-transitory computer-readable medium storing software comprising instructions executable by one or more processors which, upon such execution, cause the one or more processors in connection with a memory to perform operations comprising:
receiving first and second input vectors according to first and second sequential intervals;
extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors;
estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator;
calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors;
transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension;
estimating a set of modal attention weights from the prestep context vector and the first and second content vectors or the first and second modal content vectors;
generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and
generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
13. The computer-readable medium of claim 12 , wherein the first and second sequential intervals are an identical interval.
14. The computer-readable medium of claim 12 , wherein the first and second input vectors are different modalities.
15. The computer-readable medium of claim 12 , wherein the operations further comprise:
accumulating the predicted word into the memory or the one or more storage devices to generate the word sequence.
16. The computer-readable medium of claim 15 , wherein the accumulating is continued until an end label is received.
17. The computer-readable medium of claim 12 , wherein the operations further comprise:
transmitting the predicted word generated from the sequence generator.
18. The computer-readable medium of claim 12 , wherein the first and second feature extractors are pretrained Convolutional Neural Networks (CNNs) having been trained for an image or a video classification task.
19. A method for generating a word sequence from multi-modal input, comprising:
receiving first and second input vectors according to first and second sequential intervals;
extracting first and second feature vectors using first and second feature extractors, respectively, from the first and second input vectors;
estimating a first set of weights and a second set of weights, respectively, from the first and second feature vectors and a prestep context vector of a sequence generator;
calculating a first content vector from the first set of weights and the first feature vectors, and calculating a second content vector from the second set of weights and the second feature vectors;
transforming the first content vector into a first modal content vector having a predetermined dimension and transforming the second content vector into a second modal content vector having the predetermined dimension;
estimating a set of modal attention weights from the prestep context vector and the first and second content vectors or the first and second modal content vectors;
generating a weighted content vector having the predetermined dimension from the set of modal attention weights and the first and second modal content vectors; and
generating a predicted word using the sequence generator for generating the word sequence from the weighted content vector.
20. The method of claim 19 , wherein the first and second sequential intervals are an identical interval.
Priority Applications (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US15/472,797 US10417498B2 (en) | 2016-12-30 | 2017-03-29 | Method and system for multi-modal fusion model |
| DE112017006685.9T DE112017006685B4 (en) | 2016-12-30 | 2017-12-25 | Method and system for a multimodal fusion model |
| PCT/JP2017/047417 WO2018124309A1 (en) | 2016-12-30 | 2017-12-25 | Method and system for multi-modal fusion model |
| CN201780079516.1A CN110168531B (en) | 2016-12-30 | 2017-12-25 | Method and system for multi-modal fusion model |
| JP2019513858A JP6719663B2 (en) | 2016-12-30 | 2017-12-25 | Method and system for multimodal fusion model |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US201662440433P | 2016-12-30 | 2016-12-30 | |
| US15/472,797 US10417498B2 (en) | 2016-12-30 | 2017-03-29 | Method and system for multi-modal fusion model |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20180189572A1 true US20180189572A1 (en) | 2018-07-05 |
| US10417498B2 US10417498B2 (en) | 2019-09-17 |
Family
ID=61094562
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US15/472,797 Active 2037-11-15 US10417498B2 (en) | 2016-12-30 | 2017-03-29 | Method and system for multi-modal fusion model |
Country Status (5)
| Country | Link |
|---|---|
| US (1) | US10417498B2 (en) |
| JP (1) | JP6719663B2 (en) |
| CN (1) | CN110168531B (en) |
| DE (1) | DE112017006685B4 (en) |
| WO (1) | WO2018124309A1 (en) |
Cited By (73)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN108875708A (en) * | 2018-07-18 | 2018-11-23 | 广东工业大学 | Video-based behavior analysis method, device, equipment, system and storage medium |
| US20190043379A1 (en) * | 2017-08-03 | 2019-02-07 | Microsoft Technology Licensing, Llc | Neural models for key phrase detection and question generation |
| US10366292B2 (en) * | 2016-11-03 | 2019-07-30 | Nec Corporation | Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction |
| CN110826397A (en) * | 2019-09-20 | 2020-02-21 | 浙江大学 | Video description method based on high-order low-rank multi-modal attention mechanism |
| CN110851641A (en) * | 2018-08-01 | 2020-02-28 | 杭州海康威视数字技术股份有限公司 | Cross-modal retrieval method, apparatus and readable storage medium |
| CN110858232A (en) * | 2018-08-09 | 2020-03-03 | 阿里巴巴集团控股有限公司 | Search method, apparatus, system and storage medium |
| US20200134398A1 (en) * | 2018-10-29 | 2020-04-30 | Sri International | Determining intent from multimodal content embedded in a common geometric space |
| CN111274440A (en) * | 2020-01-19 | 2020-06-12 | 浙江工商大学 | Video recommendation method based on visual and audio content relevancy mining |
| CN111275085A (en) * | 2020-01-15 | 2020-06-12 | 重庆邮电大学 | Online short video multi-modal emotion recognition method based on attention fusion |
| CN111291804A (en) * | 2020-01-22 | 2020-06-16 | 杭州电子科技大学 | Multi-sensor time series analysis model based on attention mechanism |
| CN111325323A (en) * | 2020-02-19 | 2020-06-23 | 山东大学 | Power transmission and transformation scene description automatic generation method fusing global information and local information |
| US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
| CN111639748A (en) * | 2020-05-15 | 2020-09-08 | 武汉大学 | Watershed pollutant flux prediction method based on LSTM-BP space-time combination model |
| CN111814844A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | A Dense Video Description Method Based on Positional Coding Fusion |
| US20200342236A1 (en) * | 2019-04-29 | 2020-10-29 | Tencent America LLC | End-to-end video captioning with multi-task reinforcement learning |
| CN112001437A (en) * | 2020-08-19 | 2020-11-27 | 四川大学 | Modal non-complete alignment-oriented data clustering method |
| CN112069361A (en) * | 2020-08-27 | 2020-12-11 | 新华智云科技有限公司 | A video description text generation method based on multimodal fusion |
| CN112115601A (en) * | 2020-09-10 | 2020-12-22 | 西北工业大学 | A Reliable Representation Model for User Attention Monitoring Estimation |
| CN112241008A (en) * | 2019-07-18 | 2021-01-19 | Aptiv技术有限公司 | Method and system for object detection |
| CN112468888A (en) * | 2020-11-26 | 2021-03-09 | 广东工业大学 | Video abstract generation method and system based on GRU network |
| US20210089968A1 (en) * | 2017-02-06 | 2021-03-25 | Deepmind Technologies Limited | Memory augmented generative temporal models |
| CN112651417A (en) * | 2019-10-12 | 2021-04-13 | 杭州海康威视数字技术股份有限公司 | License plate recognition method, device, equipment and storage medium |
| CN112765959A (en) * | 2020-12-31 | 2021-05-07 | 康佳集团股份有限公司 | Intention recognition method, device, equipment and computer readable storage medium |
| CN112861945A (en) * | 2021-01-28 | 2021-05-28 | 清华大学 | Multi-mode fusion lie detection method |
| CN112954312A (en) * | 2021-02-07 | 2021-06-11 | 福州大学 | No-reference video quality evaluation method fusing spatio-temporal characteristics |
| WO2021129181A1 (en) * | 2019-12-23 | 2021-07-01 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Portrait segmentation method, model training method and electronic device |
| CN113139121A (en) * | 2020-01-20 | 2021-07-20 | 阿里巴巴集团控股有限公司 | Query method, model training method, device, equipment and storage medium |
| CN113205148A (en) * | 2021-05-20 | 2021-08-03 | 山东财经大学 | Medical image frame interpolation method and terminal for iterative interlayer information fusion |
| US20210247201A1 (en) * | 2020-02-06 | 2021-08-12 | Mitsubishi Electric Research Laboratories, Inc. | Method and System for Scene-Aware Interaction |
| US20210256977A1 (en) * | 2019-04-02 | 2021-08-19 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating video description information, and method and apparatus for video processing |
| CN113326703A (en) * | 2021-08-03 | 2021-08-31 | 国网电子商务有限公司 | Emotion recognition method and system based on multi-modal confrontation fusion in heterogeneous space |
| CN113474818A (en) * | 2019-02-11 | 2021-10-01 | 西门子股份公司 | Apparatus and method for performing data-driven pairwise registration of three-dimensional point clouds |
| CN113537566A (en) * | 2021-06-16 | 2021-10-22 | 广东工业大学 | An ultra-short-term wind power prediction method based on DCCSO optimized deep learning model |
| CN113569610A (en) * | 2021-02-09 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Video content identification method and device, storage medium and electronic equipment |
| CN113569975A (en) * | 2021-08-04 | 2021-10-29 | 华南师范大学 | A method and device for rating sketches based on model fusion |
| US11170508B2 (en) * | 2018-01-03 | 2021-11-09 | Ramot At Tel-Aviv University Ltd. | Systems and methods for the segmentation of multi-modal image data |
| WO2021225841A1 (en) * | 2020-05-07 | 2021-11-11 | Nec Laboratories America, Inc. | Fault detection in cyber-physical systems |
| CN113821687A (en) * | 2021-06-30 | 2021-12-21 | 腾讯科技(深圳)有限公司 | A content retrieval method, apparatus, and computer-readable storage medium |
| CN113990473A (en) * | 2021-10-28 | 2022-01-28 | 上海昆亚医疗器械股份有限公司 | Medical equipment operation and maintenance information collecting and analyzing system and using method thereof |
| CN114120044A (en) * | 2021-12-08 | 2022-03-01 | 马上消费金融股份有限公司 | Image classification method, image classification network training method and device and electronic equipment |
| CN114332573A (en) * | 2021-12-18 | 2022-04-12 | 中国科学院深圳先进技术研究院 | Multimodal information fusion recognition method and system based on attention mechanism |
| CN114387567A (en) * | 2022-03-23 | 2022-04-22 | 长视科技股份有限公司 | Video data processing method and device, electronic equipment and storage medium |
| CN114400007A (en) * | 2021-12-31 | 2022-04-26 | 联想(北京)有限公司 | Voice processing method and device |
| CN114663733A (en) * | 2022-02-18 | 2022-06-24 | 北京百度网讯科技有限公司 | Method, device, device, medium and product for fusion of multimodal features |
| US20220223037A1 (en) * | 2021-01-14 | 2022-07-14 | Baidu Usa Llc | Machine learning model to fuse emergency vehicle audio and visual detection |
| CN114821255A (en) * | 2022-04-20 | 2022-07-29 | 北京百度网讯科技有限公司 | Method, apparatus, device, medium and product for fusion of multimodal features |
| WO2022164191A1 (en) * | 2021-01-29 | 2022-08-04 | Samsung Electronics Co., Ltd. | System and method for microgenre-based hyper-personalization with multi-modal machine learning |
| CN115034327A (en) * | 2022-06-22 | 2022-09-09 | 支付宝(杭州)信息技术有限公司 | External data application, method, device and device for user identification |
| CN115134676A (en) * | 2022-09-01 | 2022-09-30 | 有米科技股份有限公司 | Video reconstruction method and device for audio-assisted video completion |
| US11475254B1 (en) * | 2017-09-08 | 2022-10-18 | Snap Inc. | Multimodal entity identification |
| CN115590481A (en) * | 2022-12-15 | 2023-01-13 | 北京鹰瞳科技发展股份有限公司(Cn) | A device and computer-readable storage medium for predicting cognitive impairment |
| CN115705415A (en) * | 2021-08-09 | 2023-02-17 | 腾讯科技(深圳)有限公司 | Data processing method, device, storage medium and equipment |
| CN116128863A (en) * | 2023-03-01 | 2023-05-16 | 北京医准智能科技有限公司 | Medical image processing method, device and equipment |
| CN116127408A (en) * | 2023-02-21 | 2023-05-16 | 安徽大学 | Cross-modal enhancement-based multi-modal self-adaptive fusion method and system |
| WO2023124110A1 (en) * | 2021-12-30 | 2023-07-06 | 深圳市检验检疫科学研究院 | Label perception-based gated recurrent acquisition method |
| CN116401357A (en) * | 2023-03-31 | 2023-07-07 | 清华大学 | Multimodal document retrieval method and device based on cross-modal mutual attention mechanism |
| CN116414456A (en) * | 2023-01-19 | 2023-07-11 | 杭州知存智能科技有限公司 | Weighted fusion conversion component in memory chip, memory circuit and cooperative computing method |
| CN116543795A (en) * | 2023-06-29 | 2023-08-04 | 天津大学 | Sound scene classification method based on multi-mode feature fusion |
| US11842259B1 (en) * | 2022-07-12 | 2023-12-12 | University Of Chinese Academy Of Sciences | Intelligent information parsing method based on cross-modal data fusion |
| CN117312864A (en) * | 2023-11-30 | 2023-12-29 | 国家计算机网络与信息安全管理中心 | Training method and device for deformed word generation model based on multi-modal information |
| EP4207771A4 (en) * | 2020-12-22 | 2024-02-21 | Shanghai Hode Information Technology Co., Ltd. | VIDEO PROCESSING METHOD AND APPARATUS |
| CN117671438A (en) * | 2023-10-31 | 2024-03-08 | 上海人工智能创新中心 | Multimodal knowledge fusion methods, storage media and applications based on knowledge transfer |
| CN117668762A (en) * | 2024-01-31 | 2024-03-08 | 新疆三联工程建设有限责任公司 | Monitoring and early warning system and method for residential underground leakage |
| US20240303970A1 (en) * | 2021-02-25 | 2024-09-12 | Alibaba Group Holding Limited | Data processing method and device |
| US12106214B2 (en) * | 2017-05-17 | 2024-10-01 | Samsung Electronics Co., Ltd. | Sensor transformation attention network (STAN) model |
| US20240331370A1 (en) * | 2022-02-25 | 2024-10-03 | Suzhou Metabrain Intelligent Technology Co., Ltd. | Multi-modal model training method and apparatus, image recognition method and apparatus, and electronic device |
| CN118939125A (en) * | 2024-10-11 | 2024-11-12 | 四川物通科技有限公司 | A multimodal interaction method and system based on metaverse and brain-computer interface |
| CN118940760A (en) * | 2024-07-23 | 2024-11-12 | 广东工业大学 | A method and system for extracting symptom features from electronic medical records based on multimodal enhancement |
| US20240380949A1 (en) * | 2023-05-08 | 2024-11-14 | Lemon Inc. | Video captioning generation system and method |
| WO2025054081A1 (en) * | 2023-09-05 | 2025-03-13 | Qualcomm Incorporated | Faithful generation of output text for multimodal applications |
| US12269511B2 (en) | 2021-01-14 | 2025-04-08 | Baidu Usa Llc | Emergency vehicle audio and visual detection post fusion |
| US20250335497A1 (en) * | 2024-04-24 | 2025-10-30 | Dell Products L.P. | Method, device, and product for retrieval |
| CN121277365A (en) * | 2025-12-10 | 2026-01-06 | 宁波金晟芯影像技术股份有限公司 | An interactive control method for smart glasses |
Families Citing this family (40)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US11551042B1 (en) * | 2018-08-27 | 2023-01-10 | Snap Inc. | Multimodal sentiment classification |
| US11010559B2 (en) * | 2018-08-30 | 2021-05-18 | International Business Machines Corporation | Multi-aspect sentiment analysis by collaborative attention allocation |
| CN109871736B (en) * | 2018-11-23 | 2023-01-31 | 腾讯科技(深圳)有限公司 | Method and device for generating natural language description information |
| CN110162799B (en) * | 2018-11-28 | 2023-08-04 | 腾讯科技(深圳)有限公司 | Model training method, machine translation method, and related devices and equipment |
| CN109543824B (en) | 2018-11-30 | 2023-05-23 | 腾讯科技(深圳)有限公司 | A processing method and device for a sequence model |
| JP7206898B2 (en) * | 2018-12-25 | 2023-01-18 | 富士通株式会社 | LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM |
| CN110020596B (en) * | 2019-02-21 | 2021-04-30 | 北京大学 | Video content positioning method based on feature fusion and cascade learning |
| CN111640424B (en) * | 2019-03-01 | 2024-02-13 | 北京搜狗科技发展有限公司 | Voice recognition method and device and electronic equipment |
| CN110163091B (en) * | 2019-04-13 | 2023-05-26 | 天津大学 | 3D Model Retrieval Method Based on Multimodal Information Fusion of LSTM Network |
| CN110503636B (en) * | 2019-08-06 | 2024-01-26 | 腾讯医疗健康(深圳)有限公司 | Parameter adjustment method, lesion prediction method, parameter adjustment device and electronic equipment |
| CN110557447B (en) * | 2019-08-26 | 2022-06-10 | 腾讯科技(武汉)有限公司 | User behavior identification method and device, storage medium and server |
| CN110473529B (en) * | 2019-09-09 | 2021-11-05 | 北京中科智极科技有限公司 | Stream type voice transcription system based on self-attention mechanism |
| US11264009B2 (en) * | 2019-09-13 | 2022-03-01 | Mitsubishi Electric Research Laboratories, Inc. | System and method for a dialogue response generation system |
| US11270123B2 (en) * | 2019-10-22 | 2022-03-08 | Palo Alto Research Center Incorporated | System and method for generating localized contextual video annotation |
| JP7205646B2 (en) * | 2019-11-14 | 2023-01-17 | 富士通株式会社 | Output method, output program, and output device |
| CN110866509B (en) * | 2019-11-20 | 2023-04-28 | 腾讯科技(深圳)有限公司 | Action recognition method, device, computer storage medium and computer equipment |
| CN111274372A (en) * | 2020-01-15 | 2020-06-12 | 上海浦东发展银行股份有限公司 | Method, electronic device, and computer-readable storage medium for human-computer interaction |
| CN111294512A (en) | 2020-02-10 | 2020-06-16 | 深圳市铂岩科技有限公司 | Image processing method, device, storage medium, and imaging device |
| WO2021183256A1 (en) * | 2020-03-10 | 2021-09-16 | Sri International | Physics-guided deep multimodal embeddings for task-specific data exploitation |
| WO2021204143A1 (en) | 2020-04-08 | 2021-10-14 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Methods for action localization, electronic device and storage medium |
| CN111523575B (en) * | 2020-04-13 | 2023-12-12 | 中南大学 | Short video recommendation method based on multi-modal features of short videos |
| CN113630302B (en) * | 2020-05-09 | 2023-07-11 | 阿里巴巴集团控股有限公司 | Junk mail identification method and device and computer readable storage medium |
| CN111767726B (en) * | 2020-06-24 | 2024-02-06 | 北京奇艺世纪科技有限公司 | Data processing method and device |
| EP4592902A3 (en) * | 2020-07-09 | 2025-10-29 | Featurespace Limited | Neural network architecture for transaction data processing |
| CN112000818B (en) * | 2020-07-10 | 2023-05-12 | 中国科学院信息工程研究所 | A text- and image-oriented cross-media retrieval method and electronic device |
| US12236192B2 (en) * | 2021-01-08 | 2025-02-25 | Meta Platforms, Inc. | Task-specific text generation based on multimodal inputs |
| CN113360514B (en) * | 2021-07-02 | 2022-05-17 | 支付宝(杭州)信息技术有限公司 | Method, device and system for jointly updating models |
| US11445267B1 (en) | 2021-07-23 | 2022-09-13 | Mitsubishi Electric Research Laboratories, Inc. | Low-latency captioning system |
| CN113986005B (en) * | 2021-10-13 | 2023-07-07 | 电子科技大学 | Multimodal Fusion Sight Estimation Framework Based on Ensemble Learning |
| KR102411278B1 (en) * | 2021-12-30 | 2022-06-22 | 주식회사 파일러 | Video surveillance system based on multi-modal video captioning and method of the same |
| CN114529790B (en) * | 2022-01-11 | 2025-05-13 | 山东师范大学 | Food nutrient content prediction method and system based on cross-modal attention mechanism |
| CN116797627A (en) * | 2022-03-10 | 2023-09-22 | 电子科技大学 | Multimodal video description generation method based on fused motion sensing information |
| US20240046085A1 (en) | 2022-08-04 | 2024-02-08 | Mitsubishi Electric Research Laboratories, Inc. | Low-latency Captioning System |
| CN115512368B (en) * | 2022-08-22 | 2024-05-10 | 华中农业大学 | A cross-modal semantic image generation model and method |
| CN116932731B (en) * | 2023-09-18 | 2024-01-30 | 上海帜讯信息技术股份有限公司 | Multimodal knowledge question and answer method and system for 5G messages |
| CN117708375B (en) * | 2024-02-05 | 2024-05-28 | 北京搜狐新媒体信息技术有限公司 | Video processing method and device and related products |
| CN117789099B (en) * | 2024-02-26 | 2024-05-28 | 北京搜狐新媒体信息技术有限公司 | Video feature extraction method and device, storage medium and electronic equipment |
| CN118821783B (en) * | 2024-06-18 | 2025-03-04 | 北京汉勃科技有限公司 | Intention recognition method, device and electronic device applied to message events |
| CN118629596B (en) * | 2024-08-13 | 2024-11-01 | 吉林大学 | Psychological state analysis system and method for patients with multiple myeloma cardiac amyloidosis |
| CN119939372B (en) * | 2025-04-03 | 2025-06-20 | 国网山东省电力公司营销服务中心(计量中心) | Multi-mode characteristic-based current transformer error state evaluation method and system |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN102937972B (en) * | 2012-10-15 | 2016-06-22 | 上海外教社信息技术有限公司 | A kind of audiovisual subtitle making system and method |
| CN103885924A (en) * | 2013-11-21 | 2014-06-25 | 北京航空航天大学 | Field-adaptive automatic open class subtitle generating system and field-adaptive automatic open class subtitle generating method |
| US9542934B2 (en) * | 2014-02-27 | 2017-01-10 | Fuji Xerox Co., Ltd. | Systems and methods for using latent variable modeling for multi-modal video indexing |
| US10909329B2 (en) | 2015-05-21 | 2021-02-02 | Baidu Usa Llc | Multilingual image question answering |
-
2017
- 2017-03-29 US US15/472,797 patent/US10417498B2/en active Active
- 2017-12-25 JP JP2019513858A patent/JP6719663B2/en active Active
- 2017-12-25 WO PCT/JP2017/047417 patent/WO2018124309A1/en not_active Ceased
- 2017-12-25 CN CN201780079516.1A patent/CN110168531B/en active Active
- 2017-12-25 DE DE112017006685.9T patent/DE112017006685B4/en active Active
Cited By (85)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10366292B2 (en) * | 2016-11-03 | 2019-07-30 | Nec Corporation | Translating video to language using adaptive spatiotemporal convolution feature representation with dynamic abstraction |
| US10402658B2 (en) * | 2016-11-03 | 2019-09-03 | Nec Corporation | Video retrieval system using adaptive spatiotemporal convolution feature representation with dynamic abstraction for video to language translation |
| US20210089968A1 (en) * | 2017-02-06 | 2021-03-25 | Deepmind Technologies Limited | Memory augmented generative temporal models |
| US11977967B2 (en) * | 2017-02-06 | 2024-05-07 | Deepmind Technologies Limited | Memory augmented generative temporal models |
| US12106214B2 (en) * | 2017-05-17 | 2024-10-01 | Samsung Electronics Co., Ltd. | Sensor transformation attention network (STAN) model |
| US10902738B2 (en) * | 2017-08-03 | 2021-01-26 | Microsoft Technology Licensing, Llc | Neural models for key phrase detection and question generation |
| US20190043379A1 (en) * | 2017-08-03 | 2019-02-07 | Microsoft Technology Licensing, Llc | Neural models for key phrase detection and question generation |
| US12164603B2 (en) | 2017-09-08 | 2024-12-10 | Snap Inc. | Multimodal entity identification |
| US11475254B1 (en) * | 2017-09-08 | 2022-10-18 | Snap Inc. | Multimodal entity identification |
| US11699236B2 (en) | 2018-01-03 | 2023-07-11 | Ramot At Tel-Aviv University Ltd. | Systems and methods for the segmentation of multi-modal image data |
| US11170508B2 (en) * | 2018-01-03 | 2021-11-09 | Ramot At Tel-Aviv University Ltd. | Systems and methods for the segmentation of multi-modal image data |
| CN108875708A (en) * | 2018-07-18 | 2018-11-23 | 广东工业大学 | Video-based behavior analysis method, device, equipment, system and storage medium |
| CN110851641A (en) * | 2018-08-01 | 2020-02-28 | 杭州海康威视数字技术股份有限公司 | Cross-modal retrieval method, apparatus and readable storage medium |
| CN110858232A (en) * | 2018-08-09 | 2020-03-03 | 阿里巴巴集团控股有限公司 | Search method, apparatus, system and storage medium |
| US20200134398A1 (en) * | 2018-10-29 | 2020-04-30 | Sri International | Determining intent from multimodal content embedded in a common geometric space |
| CN113474818A (en) * | 2019-02-11 | 2021-10-01 | 西门子股份公司 | Apparatus and method for performing data-driven pairwise registration of three-dimensional point clouds |
| US11861886B2 (en) * | 2019-04-02 | 2024-01-02 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating video description information, and method and apparatus for video processing |
| US20210256977A1 (en) * | 2019-04-02 | 2021-08-19 | Tencent Technology (Shenzhen) Company Limited | Method and apparatus for generating video description information, and method and apparatus for video processing |
| US20200342236A1 (en) * | 2019-04-29 | 2020-10-29 | Tencent America LLC | End-to-end video captioning with multi-task reinforcement learning |
| US10885345B2 (en) * | 2019-04-29 | 2021-01-05 | Tencent America LLC | End-to-end video captioning with multi-task reinforcement learning |
| CN112241008A (en) * | 2019-07-18 | 2021-01-19 | Aptiv技术有限公司 | Method and system for object detection |
| CN110826397A (en) * | 2019-09-20 | 2020-02-21 | 浙江大学 | Video description method based on high-order low-rank multi-modal attention mechanism |
| CN112651417A (en) * | 2019-10-12 | 2021-04-13 | 杭州海康威视数字技术股份有限公司 | License plate recognition method, device, equipment and storage medium |
| US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
| WO2021129181A1 (en) * | 2019-12-23 | 2021-07-01 | Guangdong Oppo Mobile Telecommunications Corp., Ltd. | Portrait segmentation method, model training method and electronic device |
| CN111275085A (en) * | 2020-01-15 | 2020-06-12 | 重庆邮电大学 | Online short video multi-modal emotion recognition method based on attention fusion |
| CN111274440A (en) * | 2020-01-19 | 2020-06-12 | 浙江工商大学 | Video recommendation method based on visual and audio content relevancy mining |
| CN113139121A (en) * | 2020-01-20 | 2021-07-20 | 阿里巴巴集团控股有限公司 | Query method, model training method, device, equipment and storage medium |
| CN111291804A (en) * | 2020-01-22 | 2020-06-16 | 杭州电子科技大学 | Multi-sensor time series analysis model based on attention mechanism |
| US11635299B2 (en) * | 2020-02-06 | 2023-04-25 | Mitsubishi Electric Research Laboratories, Inc. | Method and system for scene-aware interaction |
| US20210247201A1 (en) * | 2020-02-06 | 2021-08-12 | Mitsubishi Electric Research Laboratories, Inc. | Method and System for Scene-Aware Interaction |
| CN111325323A (en) * | 2020-02-19 | 2020-06-23 | 山东大学 | Power transmission and transformation scene description automatic generation method fusing global information and local information |
| CN111814844A (en) * | 2020-03-17 | 2020-10-23 | 同济大学 | A Dense Video Description Method Based on Positional Coding Fusion |
| WO2021225841A1 (en) * | 2020-05-07 | 2021-11-11 | Nec Laboratories America, Inc. | Fault detection in cyber-physical systems |
| CN111639748A (en) * | 2020-05-15 | 2020-09-08 | 武汉大学 | Watershed pollutant flux prediction method based on LSTM-BP space-time combination model |
| CN112001437A (en) * | 2020-08-19 | 2020-11-27 | 四川大学 | Modal non-complete alignment-oriented data clustering method |
| CN112069361A (en) * | 2020-08-27 | 2020-12-11 | 新华智云科技有限公司 | A video description text generation method based on multimodal fusion |
| CN112115601A (en) * | 2020-09-10 | 2020-12-22 | 西北工业大学 | A Reliable Representation Model for User Attention Monitoring Estimation |
| CN112468888A (en) * | 2020-11-26 | 2021-03-09 | 广东工业大学 | Video abstract generation method and system based on GRU network |
| EP4207771A4 (en) * | 2020-12-22 | 2024-02-21 | Shanghai Hode Information Technology Co., Ltd. | Video processing method and apparatus |
| CN112765959A (en) * | 2020-12-31 | 2021-05-07 | 康佳集团股份有限公司 | Intention recognition method, device, equipment and computer readable storage medium |
| US12269511B2 (en) | 2021-01-14 | 2025-04-08 | Baidu Usa Llc | Emergency vehicle audio and visual detection post fusion |
| US11620903B2 (en) * | 2021-01-14 | 2023-04-04 | Baidu Usa Llc | Machine learning model to fuse emergency vehicle audio and visual detection |
| US20220223037A1 (en) * | 2021-01-14 | 2022-07-14 | Baidu Usa Llc | Machine learning model to fuse emergency vehicle audio and visual detection |
| CN112861945A (en) * | 2021-01-28 | 2021-05-28 | 清华大学 | Multi-mode fusion lie detection method |
| WO2022164191A1 (en) * | 2021-01-29 | 2022-08-04 | Samsung Electronics Co., Ltd. | System and method for microgenre-based hyper-personalization with multi-modal machine learning |
| US12493776B2 (en) | 2021-01-29 | 2025-12-09 | Samsung Electronics Co., Ltd. | Microgenre-based hyper-personalization with multi-modal machine learning |
| CN112954312A (en) * | 2021-02-07 | 2021-06-11 | 福州大学 | No-reference video quality evaluation method fusing spatio-temporal characteristics |
| CN113569610A (en) * | 2021-02-09 | 2021-10-29 | 腾讯科技(深圳)有限公司 | Video content identification method and device, storage medium and electronic equipment |
| US12524998B2 (en) * | 2021-02-25 | 2026-01-13 | Alibaba Group Holding Limited | Data processing method and device |
| US20240303970A1 (en) * | 2021-02-25 | 2024-09-12 | Alibaba Group Holding Limited | Data processing method and device |
| CN113205148A (en) * | 2021-05-20 | 2021-08-03 | 山东财经大学 | Medical image frame interpolation method and terminal for iterative interlayer information fusion |
| CN113537566A (en) * | 2021-06-16 | 2021-10-22 | 广东工业大学 | An ultra-short-term wind power prediction method based on DCCSO optimized deep learning model |
| CN113821687A (en) * | 2021-06-30 | 2021-12-21 | 腾讯科技(深圳)有限公司 | A content retrieval method, apparatus, and computer-readable storage medium |
| CN113326703A (en) * | 2021-08-03 | 2021-08-31 | 国网电子商务有限公司 | Emotion recognition method and system based on multi-modal adversarial fusion in heterogeneous space |
| CN113569975A (en) * | 2021-08-04 | 2021-10-29 | 华南师范大学 | A method and device for rating sketches based on model fusion |
| CN115705415A (en) * | 2021-08-09 | 2023-02-17 | 腾讯科技(深圳)有限公司 | Data processing method, device, storage medium and equipment |
| CN113990473A (en) * | 2021-10-28 | 2022-01-28 | 上海昆亚医疗器械股份有限公司 | Medical equipment operation and maintenance information collecting and analyzing system and using method thereof |
| CN114120044A (en) * | 2021-12-08 | 2022-03-01 | 马上消费金融股份有限公司 | Image classification method, image classification network training method and device and electronic equipment |
| CN114332573A (en) * | 2021-12-18 | 2022-04-12 | 中国科学院深圳先进技术研究院 | Multimodal information fusion recognition method and system based on attention mechanism |
| WO2023124110A1 (en) * | 2021-12-30 | 2023-07-06 | 深圳市检验检疫科学研究院 | Label perception-based gated recurrent acquisition method |
| CN114400007A (en) * | 2021-12-31 | 2022-04-26 | 联想(北京)有限公司 | Voice processing method and device |
| CN114663733A (en) * | 2022-02-18 | 2022-06-24 | 北京百度网讯科技有限公司 | Method, apparatus, device, medium and product for fusion of multimodal features |
| US12260629B2 (en) * | 2022-02-25 | 2025-03-25 | Suzhou Metabrain Intelligent Technology Co., Ltd. | Multi-modal model training method and apparatus, image recognition method and apparatus, and electronic device |
| US20240331370A1 (en) * | 2022-02-25 | 2024-10-03 | Suzhou Metabrain Intelligent Technology Co., Ltd. | Multi-modal model training method and apparatus, image recognition method and apparatus, and electronic device |
| CN114387567A (en) * | 2022-03-23 | 2022-04-22 | 长视科技股份有限公司 | Video data processing method and device, electronic equipment and storage medium |
| CN114821255A (en) * | 2022-04-20 | 2022-07-29 | 北京百度网讯科技有限公司 | Method, apparatus, device, medium and product for fusion of multimodal features |
| CN115034327A (en) * | 2022-06-22 | 2022-09-09 | 支付宝(杭州)信息技术有限公司 | External data application, and method, apparatus and device for user identification |
| US11842259B1 (en) * | 2022-07-12 | 2023-12-12 | University Of Chinese Academy Of Sciences | Intelligent information parsing method based on cross-modal data fusion |
| CN115134676A (en) * | 2022-09-01 | 2022-09-30 | 有米科技股份有限公司 | Video reconstruction method and device for audio-assisted video completion |
| CN115590481A (en) * | 2022-12-15 | 2023-01-13 | 北京鹰瞳科技发展股份有限公司 | A device and computer-readable storage medium for predicting cognitive impairment |
| CN116414456A (en) * | 2023-01-19 | 2023-07-11 | 杭州知存智能科技有限公司 | Weighted fusion conversion component in memory chip, memory circuit and cooperative computing method |
| CN116127408A (en) * | 2023-02-21 | 2023-05-16 | 安徽大学 | Cross-modal enhancement-based multi-modal self-adaptive fusion method and system |
| CN116128863A (en) * | 2023-03-01 | 2023-05-16 | 北京医准智能科技有限公司 | Medical image processing method, device and equipment |
| CN116401357A (en) * | 2023-03-31 | 2023-07-07 | 清华大学 | Multimodal document retrieval method and device based on cross-modal mutual attention mechanism |
| US20240380949A1 (en) * | 2023-05-08 | 2024-11-14 | Lemon Inc. | Video captioning generation system and method |
| CN116543795A (en) * | 2023-06-29 | 2023-08-04 | 天津大学 | Sound scene classification method based on multi-mode feature fusion |
| WO2025054081A1 (en) * | 2023-09-05 | 2025-03-13 | Qualcomm Incorporated | Faithful generation of output text for multimodal applications |
| CN117671438A (en) * | 2023-10-31 | 2024-03-08 | 上海人工智能创新中心 | Multimodal knowledge fusion methods, storage media and applications based on knowledge transfer |
| CN117312864A (en) * | 2023-11-30 | 2023-12-29 | 国家计算机网络与信息安全管理中心 | Training method and device for deformed word generation model based on multi-modal information |
| CN117668762A (en) * | 2024-01-31 | 2024-03-08 | 新疆三联工程建设有限责任公司 | Monitoring and early warning system and method for residential underground leakage |
| US20250335497A1 (en) * | 2024-04-24 | 2025-10-30 | Dell Products L.P. | Method, device, and product for retrieval |
| CN118940760A (en) * | 2024-07-23 | 2024-11-12 | 广东工业大学 | A method and system for extracting symptom features from electronic medical records based on multimodal enhancement |
| CN118939125A (en) * | 2024-10-11 | 2024-11-12 | 四川物通科技有限公司 | A multimodal interaction method and system based on metaverse and brain-computer interface |
| CN121277365A (en) * | 2025-12-10 | 2026-01-06 | 宁波金晟芯影像技术股份有限公司 | An interactive control method for smart glasses |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2018124309A1 (en) | 2018-07-05 |
| DE112017006685T5 (en) | 2020-01-23 |
| CN110168531A (en) | 2019-08-23 |
| JP6719663B2 (en) | 2020-07-08 |
| CN110168531B (en) | 2023-06-20 |
| JP2019535063A (en) | 2019-12-05 |
| US10417498B2 (en) | 2019-09-17 |
| DE112017006685B4 (en) | 2025-06-18 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10417498B2 (en) | Method and system for multi-modal fusion model | |
| EP3857459B1 (en) | Method and system for training a dialogue response generation system | |
| EP4073787B1 (en) | System and method for streaming end-to-end speech recognition with asynchronous decoders | |
| US11526698B2 (en) | Unified referring video object segmentation network | |
| CN116050496A (en) | Method, device, medium, and equipment for determining image description information generation model | |
| US11086918B2 (en) | Method and system for multi-label classification | |
| CN111866609B (en) | Method and apparatus for generating video | |
| CN110263218B (en) | Video description text generation method, device, equipment and medium | |
| Elakkiya et al. | Subunit sign modeling framework for continuous sign language recognition | |
| Oghbaie et al. | Advances and challenges in deep lip reading | |
| CN116935287A (en) | Video understanding method and device | |
| US12198397B2 (en) | Keypoint based action localization | |
| CN115100566A (en) | Video object segmentation method, device, server and storage medium | |
| CN116684705A (en) | Method and device for generating dense video description | |
| CN120455807B (en) | Video generation method, device, equipment, medium and product based on reference image | |
| CN119922348A (en) | Video feature extraction method, video generation method, device, medium and equipment | |
| CN118314255A (en) | Display method, device, equipment, readable storage medium and computer program product | |
| CN117808517A (en) | User intention prediction method and device, electronic equipment and storage medium | |
| WO2025256268A1 (en) | Multi-modal data processing method and apparatus, electronic device, computer-readable storage medium, and computer program product | |
| CN116758912A (en) | Voice recognition method, device, equipment and storage medium | |
| CN119364148A (en) | Video generation method, device, electronic device, storage medium and product | |
| CN118366198A (en) | A tracking face-changing method, system, device and medium based on multi-person scene |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
| STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
| STCF | Information on status: patent grant |
Free format text: PATENTED CASE |
|
| MAFP | Maintenance fee payment |
Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY Year of fee payment: 4 |