S.I. - Multi-modal Transformers

With the development of the Internet, social media, mobile apps, and other digital communication technologies, the world has entered an era of multimedia big data. Millions of multimedia items, including images, text, audio, and video, are uploaded to social platforms every day. For artificial intelligence to better understand the world around us, machines must be taught to understand multimodal messages. Multimodal machine learning, which aims to build models that can process and relate information from different modalities, is a vibrant field of increasing importance and extraordinary potential. In this new and promising area, extensive efforts have been dedicated to seamlessly unifying computer vision and natural language processing in tasks such as multimedia content recognition (e.g., multimodal affect recognition), matching (e.g., cross-modal retrieval), description (e.g., image captioning), indexing (e.g., multimedia event detection), summarization (e.g., video summarization), and reasoning (e.g., visual question answering). Although deep learning-based methods have made fruitful progress, performance on these tasks is still far from users’ expectations because the data are heterogeneous, which raises several well-known challenges: (1) how to represent and summarize multimodal data; (2) how to identify and construct the connections and interactions between data of different modalities; (3) how to learn and infer adequate knowledge from multimodal data; (4) how to translate data or knowledge from one modality to another; and (5) how to understand and evaluate the heterogeneity in multimodal datasets.

Submission Guidelines: Authors should prepare their manuscript according to the Instructions for Authors available from the Multimedia Systems website. Authors should submit through the online submission site at Multimedia Systems and select “S.I. - Multi-modal Transformers” when they reach the “Article Type” step in the submission process. Submitted papers should present original, unpublished work relevant to the topics of the special issue. All submitted papers will be evaluated by at least three independent reviewers on the basis of relevance, significance of contribution, technical quality, scholarship, and quality of presentation. It is the policy of the journal that no submission, or substantially overlapping submission, be published or be under review at another journal or conference at any time during the review process. Final decisions on all papers are made by the Editor in Chief.

Journal

Multimedia Systems

Multimedia Systems is a peer-reviewed international journal publishing original research in the field of multimedia and multimedia systems.

Editors

Feifei Zhang

Feifei Zhang is currently a professor at the School of Computer Science and Engineering, Tianjin University of Technology. Her research interests include multimedia content analysis, understanding, and applications, especially cross-modal image retrieval, visual question answering, and image captioning. She has authored or co-authored over 20 academic papers in international conferences and journals, including IEEE TIP, IEEE TMM, IEEE TCSVT, ACM TOMM, IEEE CVPR, and ACM MM.

An-An Liu

Dr. An-An Liu is currently a professor in the School of Electronic Information Engineering, Tianjin University, China, and the director of the Institute of Image Information & Television, Ministry of Education. He was previously a visiting professor in the School of Computing, National University of Singapore, working with Prof. Mohan Kankanhalli, and a visiting scholar in the Robotics Institute, Carnegie Mellon University, working with Prof. Takeo Kanade. He received his B.E. and Ph.D. degrees from Tianjin University, China, in 2005 and 2010, respectively. His research interests include cross-media computing and machine learning.

Xiaoshan Yang

Xiaoshan Yang received the Ph.D. degree in pattern recognition and intelligent systems from the Institute of Automation, Chinese Academy of Sciences, in 2016. He is currently an Associate Professor with the Institute of Automation, Chinese Academy of Sciences. His research focuses on data-driven and knowledge-guided multimedia content understanding. He has authored or co-authored more than 50 journal and conference papers, most of them in IEEE/ACM transactions or CCF-A conferences, e.g., IEEE TMM, IEEE TIP, IEEE TCYB, ACM TOMM, IEEE CVPR, ACM MM, and AAAI.

Min Xu

Dr. Min Xu is an Associate Professor at the School of Electrical and Data Engineering (SEDE), Faculty of Engineering and Information Technology (FEIT), University of Technology Sydney (UTS). She is currently the Leader of the Visual and Aural Intelligence Laboratory within the Global Big Data Technologies Center (GBDTC) at UTS. Dr. Xu is a researcher in the fields of multimedia, computer vision, and machine learning. She has published 170+ research papers in prestigious international journals and conferences, including IEEE T-PAMI, IEEE T-NNLS, IEEE T-MM, IEEE T-MC, PR, ICLR, CVPR, ICCV, ACM MM, and AAAI.

Articles

A novel exponent–sine–cosine chaos map-based multiple-image encryption technique
Atul Kumar, Mohit Dua
Special Issue Paper, 04 May 2024, Article: 141

PointCMC: cross-modal multi-scale correspondences learning for point cloud understanding
Honggu Zhou, Xiaogang Peng, Zizhao Wu
Special Issue Paper, 30 April 2024, Article: 138

Personalized time-sync comment generation based on a multimodal transformer
Hei-Chia Wang, Martinus Maslim, Wei-Ting Hong
Special Issue Paper, 30 March 2024, Article: 105

GVA: guided visual attention approach for automatic image caption generation
Md. Bipul Hossen, Zhongfu Ye, Md. Imran Hossain
Special Issue Paper, 29 January 2024, Article: 50

HCNNet: hybrid convolution neural network for automatic identification of ischaemia in diabetic foot ulcer wounds
Sujit Kumar Das, Suyel Namasudra, Arun Kumar Sangaiah
Special Issue Paper, 22 January 2024, Article: 36

Yolov5s-MSD: a multi-scale ship detector for visible video image
Yan-Tong Chen, Yan-Yan Zhang, Yang Liu
Special Issue Paper, 12 January 2024, Article: 3

A comprehensive survey on deep-learning-based visual captioning
Bowen Xin, Ning Xu, An-An Liu
Special Issue Paper, 21 September 2023, Pages: 3781 - 3804

CTNet: hybrid architecture based on CNN and transformer for image inpainting detection
Fengjun Xiao, Zhuxi Zhang, Ye Yao
Special Issue Paper, 19 September 2023, Pages: 3819 - 3832

Images denoising for COVID-19 chest X-ray based on multi-scale parallel convolutional neural network
Noor Ahmed, Rozina, Abdul Raziq
Special Issue Paper, 11 September 2023, Pages: 3877 - 3890

Identification of haploid and diploid maize seeds using hybrid transformer model
Emrah Dönmez, Serhat Kılıçarslan, Abdullah Elen
Special Issue Paper, 05 September 2023, Pages: 3833 - 3845

LET-Net: locally enhanced transformer network for medical image segmentation
Na Ta, Haipeng Chen, Nuo Jin
Special Issue Paper, Open access, 05 September 2023, Pages: 3847 - 3861

Variable bit allocation method based on meta-heuristic algorithms for facial image compression
Reza Khodadadi, Gholamreza Ardeshir, Hadi Grailu
Special Issue Paper, 05 September 2023, Pages: 3903 - 3930

Inceptr: micro-expression recognition integrating inception-CBAM and vision transformer
Haoliang Zhou, Shucheng Huang, Yuqiao Xu
Special Issue Paper, 31 August 2023, Pages: 3863 - 3876

Asymmetric bi-encoder for image–text retrieval
Wei Xiong, Haoliang Liu, Yu Zhang
Special Issue Paper, 26 August 2023, Pages: 3805 - 3818

View-target relation-guided unsupervised 2D image-based 3D model retrieval via transformer
Jiacheng Chang, Lanyong Zhang, Zhuang Shao
Special Issue Paper, Open access, 24 August 2023, Pages: 3891 - 3901

Learning intra-inter-modality complementary for brain tumor segmentation
Jiangpeng Zheng, Fan Shi, Congcong Wang
Special Issue Paper, 16 July 2023, Pages: 3771 - 3780
