Program at a Glance
1 December, 2021
| Beijing UTC +8 | AEST (BNE) UTC +10 | AEDT (SYD) UTC +11 | Session | Details (Paper List) | |
| 9:00 - 9:10 a.m. | 11:00 - 11:10 a.m. | 12:00 - 12:10 p.m. | Opening | - | |
| 9:10 - 10:10 a.m. | 11:10 - 12:10 p.m. | 12:10 - 13:10 p.m. | Keynote 1: Mohan Kankanhalli | Privacy-aware Multimedia Analytics | |
| 5 mins break | |||||
| 10:15 - 11:45 a.m. | 12:15 - 13:45 p.m. | 13:15 - 14:45 p.m. | Session 1: Video Understanding in Multimedia (Chair: Hailin Shi) | Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos | |
| | | | | Blindly Predict Image and Video Quality in the Wild | |
| | | | | Hierarchical Deep Residual Reasoning for Temporal Moment Localization | |
| | | | | Video Saliency Prediction via Deep Eye Movement Learning | |
| | | | | Conditional Extreme Value Theory for Open Set Video Domain Adaptation | |
| | | | | Intra- and Inter-frame Iterative Temporal Convolutional Networks for Video Stabilization | |
| 11:45 - 12:30 p.m. | 13:45 - 14:30 p.m. | 14:45 - 15:30 p.m. | Session 2: Best Paper Candidates | Language Based Image Quality Assessment | |
| | | | | Towards Discriminative Visual Search via Semantically Cycle-consistent Hashing Networks | |
| | | | | Latent Pattern Sensing: Deepfake Video Detection via Predictive Representation Learning | |
| 12:30 - 14:00 p.m. | 14:30 - 16:00 p.m. | 15:30 - 17:00 p.m. | Session 3: Deep Learning for Multimedia (Chair: Lu Sheng) | A Local-Global Commutative Preserving Functional Map for Shape Correspondence | |
| | | | | Differentially Private Learning with Grouped Gradient Clipping | |
| | | | | Structural Knowledge Organization and Transfer for Class-Incremental Learning | |
| | | | | Improving Hyperspectral Super-Resolution via Heterogeneous Knowledge Distillation | |
| | | | | Patch-Based Deep Autoencoder for Point Cloud Geometry Compression | |
| | | | | Score Transformer: Generating Musical Score from Note-level Representation | |
| 14:00 - 15:30 p.m. | 16:00 - 17:30 p.m. | 17:00 - 18:30 p.m. | Session 4: Multimodality Learning in Multimedia (Chair: Hongyuan Zhu) | BRUSH: Label Reconstructing and Similarity Preserving Hashing for Cross-modal Retrieval | |
| | | | | Local Self-Attention on Fine-grained Cross-media Retrieval | |
| | | | | Self-Adaptive Hashing for Fine-Grained Image Retrieval | |
| | | | | Hierarchical Composition Learning for Composed Query Image Retrieval | |
| | | | | Few-shot Egocentric Multimodal Activity Recognition | |
| | | | | Inter-modality Discordance for Multimodal Fake News Detection | |
| 10 minutes break for transition from Zoom to Gather.Town | |||||
| 15:40 - 16:30 p.m. | 17:40 - 18:30 p.m. | 18:40 - 19:30 p.m. | Lightning Talk Session 1 | Brave New Idea | Discovering Social Connections using Event Images |
| | | | | | SangeetXML: An XML Format for Score Retrieval for Indic Music |
| | | | | | Holodeck: Immersive 3D Displays Using Swarms of Flying Light Specks |
| | | | | Demo Papers | RoadAtlas: Intelligent Platform for Automated Road Defect Detection and Asset Management |
| | | | | | Private-Share: A Secure and Privacy-Preserving De-Centralized Framework for Large Scale Data Sharing |
| | | | | | An Efficient Bus Crowdedness Classification System |
| | | | | Short Papers - Part 1 | *See Short Paper List - Part 1 below |
| 16:30 - 17:30 p.m. | 18:30 - 19:30 p.m. | 19:30 - 20:30 p.m. | Workshop: Multi-Modal Embedding and Understanding | Focusing Attention across Multiple Images for Multi-Modal Event Detection | |
| | | | | Adaptive Cross-stitch Graph Convolutional Networks | |
| | | | | Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method | |
| | | | | Hierarchical Graph Representation Learning with Local Capsule Pooling | |
| | | | | Deep Adaptive-Attention Triple Hashing | |
2 December, 2021
| Beijing UTC +8 | AEST (BNE) UTC +10 | AEDT (SYD) UTC +11 | Session | Details | |
| 9:00 - 9:10 a.m. | 11:00 - 11:10 a.m. | 12:00 - 12:10 p.m. | Best Paper Award Announcement | ||
| 9:10 - 10:00 a.m. | 11:10 - 12:00 p.m. | 12:10 - 13:00 p.m. | Keynote 2: Yong Rui | Artificial Intelligence: Paving a Path to Digital Economy Transformation | |
| 10:00 - 10:30 a.m. | 12:00 - 12:30 p.m. | 13:00 - 13:30 p.m. | Grand Challenges | Introduction to two Grand Challenges | |
| | | | | Paper: Hybrid Improvements in Multimodal Analysis for Deep Video Understanding | |
| 10:30 - 12:00 p.m. | 12:30 - 14:00 p.m. | 13:30 - 15:00 p.m. | Session 5: Vision and Language in Multimedia (Chair: Jing Zhang) | Semantic Enhanced Cross-modal GAN for Zero-shot Learning | |
| | | | | TS2TD: A Tree-Structured Decoder for Image Paragraph Captioning | |
| | | | | Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension | |
| | | | | Visual Storytelling with Hierarchical BERT Semantic Guidance | |
| | | | | Efficient Proposal Generation with U-shaped Network for Temporal Sentence Grounding | |
| | | | | Zero-shot Recognition with Image Attributes Generation using Hierarchical Coupled Dictionary Learning | |
| 5 mins break | |||||
| 12:05 - 13:35 p.m. | 14:05 - 15:35 p.m. | 15:05 - 16:35 p.m. | Session 6: Computer Vision in Multimedia (Chair: Tiesong Zhao) | Source-Style Transferred Mean Teacher for Source-data Free Object Detection | |
| | | | | Improving Camouflaged Object Detection with the Uncertainty of Pseudo-edge Labels | |
| | | | | MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients | |
| | | | | Learning to Decompose and Restore Low-light Images with Wavelet Transform | |
| | | | | Hard-Boundary Attention Network for Nuclei Instance Segmentation | |
| | | | | A Model-Guided Unfolding Network for Single Image Reflection Removal | |
| 5 minutes break for transition from Zoom to Gather.Town | |||||
| 13:40 - 14:55 p.m. | 15:40 - 16:55 p.m. | 16:40 - 17:55 p.m. | Workshop: Multi-Model Computing of Marine Big Data | Deep Reinforcement Learning and Docking Simulations for autonomous molecule generation in de novo Drug Design | |
| | | | | Joint label refinement and contrastive learning with hybrid memory for Unsupervised Marine Object Re-Identification | |
| | | | | Prediction of transcription factor binding sites using deep learning combined with DNA sequences and shape feature data | |
| | | | | A reinforcement learning-based reward mechanism for molecule generation that introduces activity information | |
| | | | | A Fine-Grained River Ice Semantic Segmentation based on Attentive Features and Enhancing Feature Fusion | |
| | | | | Multi-Scale Graph Convolutional Network and Dynamic Iterative Class Loss for Ship Segmentation in Remote Sensing Images | |
| 5 mins break | |||||
| 15:00 - 16:00 p.m. | 17:00 - 18:00 p.m. | 18:00 - 19:00 p.m. | Special Session | Women in Multimedia Roundtable | |
| 5 mins break | |||||
| 16:05 - 17:05 p.m. | 18:05 - 19:05 p.m. | 19:05 - 20:05 p.m. | Lightning Talk Session 2 - Short Papers - Part 2 | *See Short Paper List - Part 2 below | |
| 17:05 - 17:40 p.m. | 19:05 - 19:40 p.m. | 20:05 - 20:40 p.m. | Social Connections on Gather.Town | Posters and Q&A for all tracks | |
3 December, 2021
| Beijing UTC +8 | AEST (BNE) UTC +10 | AEDT (SYD) UTC +11 | Session | Details | |
| 9:00 - 9:10 a.m. | 11:00 - 11:10 a.m. | 12:00 - 12:10 p.m. | Introduction to ACM Multimedia Asia 2022 | - | |
| 9:10 - 10:00 a.m. | 11:10 - 12:00 p.m. | 12:10 - 13:00 p.m. | Keynote 3: Divesh Srivastava | How to do Research for Fun and Profit | |
| 10:00 - 11:00 a.m. | 12:00 - 13:00 p.m. | 13:00 - 14:00 p.m. | HDR Lightning Talks | TBA | |
| 11:00 - 12:00 p.m. | 13:00 - 14:00 p.m. | 14:00 - 15:00 p.m. | Keynote 4: Klara Nahrstedt | Navigation Models for Interactive 360-Degree Video Streaming Systems | |
| 5 mins break | |||||
| 12:05 - 14:05 p.m. | 14:05 - 16:05 p.m. | 15:05 - 17:05 p.m. | Tutorial 1: Recent Advances in Video Summarization: Conventional and Deep Learning based Approaches | Zhiyong Wang (USyd), Zhou Zhao (ZJU), Xi Li (ZJU), Kun Kuang (ZJU), and Fei Wu (ZJU) | |
| 10 mins break | |||||
| 14:15 - 16:00 p.m. | 16:15 - 18:00 p.m. | 17:15 - 19:00 p.m. | Tutorial 2: Modeling User Behavior for Vertical Search: Images, Apps and Products | Xiaohui Xie (THU), Jiaxin Mao (THU), Yiqun Liu (THU), and Maarten de Rijke (UvA) | |
| 16:00 - 16:45 p.m. | 18:00 - 18:45 p.m. | 19:00 - 19:45 p.m. | Applied Research Track | Goldeye: Enhanced Spatial Awareness for the Visually Impaired using Mixed Reality and Vibrotactile Feedback | |
| | | | | Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images | |
| | | | | CFCR: A Convolution and Fusion Model for Cross-platform Recommendation | |
| 5 minutes break for transition from Zoom to Gather.Town | |||||
| 16:50 - 17:50 p.m. | 18:50 - 19:50 p.m. | 19:50 - 20:50 p.m. | Workshop: Visual Tasks and Challenges under Low-quality Multimedia Data | Local-enhanced Multi-resolution Representation Learning for Vehicle Re-identification | |
| | | | | Dedark+Detection: A Hybrid Scheme for Object Detection under Low-light Surveillance | |
| | | | | Making Video Recognition Models Robust to Common Corruptions With Supervised Contrastive Learning | |
| | | | | Visible-Infrared Cross-Modal Person Re-identification based on Positive Feedback | |
| 17:50 p.m. | 19:50 p.m. | 20:50 p.m. | Closing | - | |
*Short Paper List - Part 1
- CMRD-Net: An Improved Method for Underwater Image Enhancement
- Deep Multiple Length Hashing via Multi-task Learning
- Color Image Denoising via Tensor Robust PCA with Nonconvex and Nonlocal Regularization
- Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features
- PBNet: Position-specific Text-to-image Generation by Boundary
- An Embarrassingly Simple Approach to Discrete Supervised Hashing
*Short Paper List - Part 2
- Towards Transferable 3D Adversarial Attack
- Delay-sensitive and Priority-aware Transmission Control for Real-time Multimedia Communications
- Impression of a Job Interview training agent that gives rationalized feedback: Should Virtual Agent Give Advice with Rationale?
- A Coarse-to-fine Approach for Fast Super-Resolution with Flexible Magnification
- Automatically Generate Rigged Character from Single Image
- Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling
- Multi-branch Semantic Learning Network for Text-to-Image Synthesis
- Attention-based Dual-Branches Localization Network for Weakly Supervised Object Localization
- Pose-aware Outfit Transfer between Unpaired in-the-wild Fashion Images
- Explore before Moving: A Feasible Path Estimation and Memory Recalling Framework for Embodied Navigation
- Utilizing Resource-Rich Language Datasets for End-to-End Scene Text Recognition in Resource-Poor Languages
- PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation
- Adaptive Viewport Margins Using Head Motion for Improving User Experience in Immersive Video
- Chinese White Dolphin Detection in the Wild
- BAND: A Benchmark Dataset for Bangla News Audio Classification
- A comparison study: the impact of age and gender distribution on age estimation
- Spherical Image Compression Using Spherical Wavelet Transform
- FQM-GC: Full-reference Quality Metric for Colored Point Cloud Based on Graph Signal Features and Color Features
- Cross-layer Navigation Convolutional Neural Network for Fine-grained Visual Classification
- NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels