Row | CMTID | Oral/Poster | 1st Presentation | 2nd Presentation | Paper Title | Author Names | Abstract | Primary Subject Area | Secondary Subject Areas | Link to Code | Keywords
---|---|---|---|---|---|---|---|---|---|---|---
2 | 13 | Oral | 2-D #01 | 3-B #01 | Condensed Movies: Story Based Retrieval with Contextual Embeddings | Max Bain (University of Oxford)*; Arsha Nagrani (Oxford University); Andrew Brown (University of Oxford); Andrew Zisserman (University of Oxford) | Our objective in this work is the long range understanding of the narrative structure of movies. Instead of considering the entire movie, we propose to learn from the 'key scenes' of the movie, providing a condensed look at the full storyline. To this end, we make the following three contributions: (i) We create the Condensed Movie Dataset (CMD) consisting of the key scenes from over 3K movies: each key scene is accompanied by a high level semantic description of the scene, character face tracks, and metadata about the movie. Our dataset is scalable, obtained automatically from YouTube, and is freely available for anybody to download and use. It is also an order of magnitude larger than existing movie datasets in the number of movies; (ii) We provide a deep network baseline for text-to-video retrieval on our dataset, combining character, speech and visual cues into a single video embedding; and finally (iii) We demonstrate how the addition of context from other video clips improves retrieval performance. | Video Analysis and Event Recognition | Datasets and Performance Analysis | https://github.com/m-bain/CondensedMovies | story understanding; computer vision; video retrieval; multimodal fusion; movie dataset;
3 | 38 | Oral | 2-C #01 | 3-A #01 | Class-incremental Learning with Rectified Feature-Graph Preservation | Cheng-Hsun Lei (National Chiao Tung University); Yi-Hsin Chen (National Chiao Tung University); Wen-Hsiao Peng (National Chiao Tung University); Wei-Chen Chiu (National Chiao Tung University)* | In this paper, we address the problem of distillation-based class-incremental learning with a single head. A central theme of this task is to learn new classes that arrive in sequential phases over time while keeping the model's capability of recognizing seen classes with only limited memory for preserving seen data samples. Many regularization strategies have been proposed to mitigate the phenomenon of catastrophic forgetting. To better understand the essence of these regularizations, we introduce a feature-graph preservation perspective. Insights into their merits and faults motivate our weighted-Euclidean regularization for old knowledge preservation. We further propose rectified cosine normalization and show how it can work with binary cross-entropy to increase class separation for effective learning of new classes. Experimental results on both CIFAR-100 and ImageNet datasets demonstrate that our method outperforms the state-of-the-art approaches in reducing classification error, easing catastrophic forgetting, and encouraging evenly balanced accuracy over different classes. Our project page is at: https://github.com/yhchen12101/FGP-ICL. | Deep Learning for Computer Vision | Applications of Computer Vision, Vision for X | | Incremental Learning; Knowledge Distillation; Data Imbalance
4 | 48 | Oral | 1-A #01 | 3-B #02 | Pre-training without Natural Images | Hirokatsu Kataoka (National Institute of Advanced Industrial Science and Technology (AIST))*; Kazushige Okayasu (National Institute of Advanced Industrial Science and Technology (AIST)); Asato Matsumoto (National Institute of Advanced Industrial Science and Technology (AIST)); Eisuke Yamagata (Tokyo Institute of Technology); Ryosuke Yamada (Tokyo Denki University); Nakamasa Inoue (Tokyo Institute of Technology); Akio Nakamura (Tokyo Denki University (TDU)); Yutaka Satoh (National Institute of Advanced Industrial Science and Technology (AIST)) | Is it possible to use convolutional neural networks pre-trained without any natural images to assist natural image understanding? The paper proposes a novel concept, Formula-driven Supervised Learning. We automatically generate image patterns and their category labels by assigning fractals, which are based on a natural law existing in the background knowledge of the real world. Theoretically, the use of automatically generated images instead of natural images in the pre-training phase allows us to generate an infinite scale dataset of labeled images. Although the models pre-trained with the proposed Fractal DataBase (FractalDB), a database without natural images, do not necessarily outperform models pre-trained with human annotated datasets at all settings, we are able to partially surpass the accuracy of ImageNet/Places pre-trained models. The image representation with the proposed FractalDB captures a unique feature in the visualization of convolutional layers and attentions. | Datasets and Performance Analysis | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | https://hirokatsukataoka16.github.io/Pretraining-without-Natural-Images/ | Pre-training; Formula-driven Supervised Learning; Self-supervised Learning; Fractal Geometry
5 | 51 | Oral | 1-A #02 | 1-C #01 | In-sample Contrastive Learning and Consistent Attention for Weakly Supervised Object Localization | Minsong Ki (Yonsei University)*; Youngjung Uh (Yonsei University); Wonyoung Lee (Yonsei University); Hyeran Byun (Yonsei University) | Weakly supervised object localization (WSOL) aims to localize the target object using only image-level supervision. Recent methods encourage the model to activate feature maps over the entire object by dropping the most discriminative parts. However, they are likely to induce excessive extension to the background, which leads to over-estimated localization. In this paper, we consider the background as an important cue that guides the feature activation to cover the sophisticated object region and propose a contrastive attention loss. The loss promotes similarity between the foreground and its dropped version, and dissimilarity between the dropped version and the background. Furthermore, we propose a foreground consistency loss that penalizes earlier layers for producing noisy attention, using the later layer as a reference to provide them with a sense of backgroundness. It guides the early layers to activate on objects rather than locally distinctive backgrounds so that their attentions become similar to that of the later layer. To better optimize the above losses, we use non-local attention blocks to replace channel-pooled attention, leading to enhanced attention maps that consider spatial similarity. Last but not least, we propose to drop background regions in addition to the most discriminative region. Our method achieves state-of-the-art performance on CUB-200-2011 and ImageNet benchmark datasets regarding top-1 localization accuracy and MaxBoxAccV2, and we provide detailed analysis on our individual components. The code will be publicly available online for reproducibility. | Deep Learning for Computer Vision | | | weakly supervised localization; contrastive learning; foreground consistency; contrastive attention; dropped foreground;
6 | 54 | Oral | 1-C #02 | 2-A #01 | Fast and Differentiable Message Passing on Pairwise Markov Random Fields | Zhiwei Xu (Australian National University)*; Thalaiyasingam Ajanthan (ANU); RICHARD HARTLEY (Australian National University, Australia) | Despite the availability of many Markov Random Field (MRF) optimization algorithms, their widespread usage is currently limited due to imperfect MRF modelling arising from hand-crafted model parameters and the selection of inferior inference algorithms. In addition to differentiability, the two main aspects that enable learning these model parameters are the forward and backward propagation time of the MRF optimization algorithm and its inference capabilities. In this work, we introduce two fast and differentiable message passing algorithms, namely, Iterative Semi-Global Matching Revised (ISGMR) and Parallel Tree-Reweighted Message Passing (TRWP), which are greatly sped up on a GPU by exploiting massive parallelism. Specifically, ISGMR is an iterative and revised version of the standard SGM for general pairwise MRFs with improved optimization effectiveness, and TRWP is a highly parallel version of Sequential TRW (TRWS) for faster optimization. Our experiments on the standard stereo and denoising benchmarks demonstrate that ISGMR and TRWP achieve much lower energies than SGM and Mean-Field (MF), and TRWP is two orders of magnitude faster than TRWS without losing effectiveness in optimization. We further demonstrate the effectiveness of our algorithms on end-to-end learning for semantic segmentation. Notably, our CUDA implementations are at least 7 and 700 times faster than PyTorch GPU implementations for forward and backward propagation respectively, enabling efficient end-to-end learning with message passing. | Optimization Methods | Applications of Computer Vision, Vision for X | https://github.com/zwxu064/MPLayers.git | Markov random field; message passing; iterative SGM revised (ISGMR); parallel TRW (TRWP); stereo; semantic segmentation;
7 | 107 | Oral | 2-B #01 | 2-D #02 | Introspective Learning by Distilling Knowledge from Online Self-explanation | Jindong Gu (University of Munich)*; Zhiliang Wu (Siemens AG and Ludwig Maximilian University of Munich); Volker Tresp (Siemens AG and Ludwig Maximilian University of Munich) | In recent years, many methods have been proposed to explain individual classification predictions of deep neural networks. However, how to leverage the created explanations to improve the learning process has been less explored. The explanations extracted from a model can be used to guide the learning process of the model itself. Another type of information used to guide the training of a model is the knowledge provided by a powerful teacher model. The goal of this work is to leverage the self-explanation to improve the learning process by borrowing ideas from knowledge distillation. We start by investigating the effective components of the knowledge transferred from the teacher network to the student network. Our investigation reveals that both the responses in non-ground-truth classes and the class-similarity information in the teacher's outputs contribute to the success of the knowledge distillation. Motivated by this conclusion, we propose an implementation of introspective learning by distilling knowledge from online self-explanations. The models trained with the introspective learning procedure outperform the ones trained with the standard learning procedure, as well as the ones trained with different regularization methods. When compared to the models learned from peer networks or teacher networks, our models also show competitive performance and require neither peers nor teachers. | Deep Learning for Computer Vision | | | Knowledge Distillation, Introspective Learning, Classification Explanations
8 | 116 | Oral | 1-C #03 | 3-A #02 | Accurate and Efficient Single Image Super-Resolution with Matrix Channel Attention Network | Hailong Ma (Xiaomi); Xiangxiang Chu (Xiaomi); Bo Zhang (Xiaomi)* | In recent years, deep learning methods have achieved impressive results with higher peak signal-to-noise ratio in single image super-resolution (SISR) tasks. However, these methods are usually computationally expensive, which constrains their application in mobile scenarios. In addition, most of the existing methods rarely take full advantage of the intermediate features which are helpful for restoration. To address these issues, we propose a moderate-size SISR network named matrix channel attention network (MCAN) by constructing a matrix ensemble of multi-connected channel attention blocks (MCAB). Several models of different sizes are released to meet various practical requirements. Extensive benchmark experiments show that the proposed models achieve better performance with much fewer multiply-adds and parameters. | Deep Learning for Computer Vision | Low-level Vision, Image Processing | https://github.com/macn3388/MCAN | image super-resolution
9 | 126 | Oral | 2-B #02 | 3-D #01 | Progressive Batching for Efficient Non-linear Least Squares | Huu Le (Chalmers University of Technology)*; Christopher Zach (Chalmers University); Edward Rosten (Snap Inc.); Oliver J. Woodford (Snap Inc) | Non-linear least squares solvers are used across a broad range of offline and real-time model fitting problems. Most improvements of the basic Gauss-Newton algorithm tackle convergence guarantees or leverage the sparsity of the underlying problem structure for computational speedup. With the success of deep learning methods leveraging large datasets, stochastic optimization methods have recently received a lot of attention. Our work borrows ideas from both stochastic machine learning and statistics, and we present an approach for non-linear least-squares that guarantees convergence while at the same time significantly reducing the required amount of computation. Empirical results show that our proposed method achieves competitive convergence rates compared to traditional second-order approaches on common computer vision problems such as essential/fundamental matrix estimation with very large numbers of correspondences. | Optimization Methods | 3D Computer Vision | https://github.com/intellhave/ProBLM | optimization; stochastic; homography; essential matrix; bundle adjustment; multiple view geometry; robust;
10 | 127 | Oral | 1-C #04 | 3-A #03 | Domain Adaptation Gaze Estimation by Embedding with Prediction Consistency | Zidong Guo (Xi'an Jiaotong University)*; Zejian Yuan (Xi'an Jiaotong University); Chong Zhang (Tencent Robotics X); Wanchao Chi (Tencent Robotics X); Yonggen Ling (Tencent); Shenghao Zhang (Tencent) | Gaze is the essential manifestation of human attention. In recent years, a series of work has achieved high accuracy in gaze estimation. However, the inter-personal difference limits the reduction of the subject-independent gaze estimation error. This paper proposes an unsupervised method for domain adaptation gaze estimation to eliminate the impact of inter-personal diversity. In domain adaptation, we design an embedding representation with prediction consistency to ensure that linear relationships between gaze directions in different domains remain consistent on gaze space and embedding space. Specifically, we employ source gaze to form a locally linear representation in the gaze space for each target domain prediction. Then the same linear combinations are applied in the embedding space to generate a hypothesis embedding for the target domain sample, maintaining prediction consistency. The deviation between the target and source domain is reduced by approximating the predicted and hypothesis embedding for the target domain sample. Guided by the proposed strategy, we design the Domain Adaptation Gaze Estimation Network (DAGEN), which learns embedding with prediction consistency and achieves state-of-the-art results on both the MPIIGaze and the EYEDIAP datasets. | Face, Pose, Action, and Gesture | Deep Learning for Computer Vision; Recognition: Feature Detection, Indexing, Matching, and Shape Representation | | gaze estimation; domain adaptation;
11 | 153 | Oral | 1-C #05 | 3-A #04 | DoFNet: Depth of Field Difference Learning for Detecting Image Forgery | Yonghyun Jeong (Samsung SDS)*; Jongwon Choi (Chung-Ang University); Doyeon Kim (Samsung SDS); Sehyeon Park (Samsung SDS); Minki Hong (Samsung SDS); Changhyun Park (Samsung SDS); Seungjai Min (Samsung SDS); Youngjune Gwon (Samsung SDS) | Recently, online transactions have had exponential growth and expanded to various cases, such as opening bank accounts and filing for insurance claims. Despite the effort of many companies requiring their own mobile applications to capture images for online transactions, it is difficult to restrict users from taking a picture of others' images displayed on a screen. To detect such cases, we propose a novel approach using paired images with different depth of field (DoF) for distinguishing real images from display images. Also, we introduce a new dataset containing 2,752 pairs of images capturing real and display objects on various types of displays, which is the largest real dataset employing DoF with multi-focus. Furthermore, we develop a new framework to concentrate on the difference of DoF in paired images, while avoiding learning individual display artifacts. Since DoF lies on the optical fundamentals, the framework can be widely utilized with any camera, and its performance shows at least 23% improvement compared to the conventional classification models. | Applications of Computer Vision, Vision for X | Datasets and Performance Analysis; Deep Learning for Computer Vision | | image forgery; depth of field
12 | 176 | Oral | 3-B #03 | 3-D #02 | Watch, read and lookup: learning to spot signs from multiple supervisors | Liliane Momeni (University of Oxford); Gul Varol (University of Oxford)*; Samuel Albanie (University of Oxford); Triantafyllos Afouras (University of Oxford); Andrew Zisserman (University of Oxford) | The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this sign spotting task, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles (readily available translations of the signed content) which provide additional weak-supervision; (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to facilitate study of this task. The dataset, models and code are available at our project page. | Applications of Computer Vision, Vision for X | Face, Pose, Action, and Gesture | https://github.com/gulvarol/bsldict | sign language; sign spotting; gesture recognition; temporal localization; contrastive learning; multiple instance learning
13 | 187 | Oral | 2-C #02 | 3-A #05 | Patch SVDD: Patch-level SVDD for Anomaly Detection and Segmentation | Jihun Yi (Seoul National University); Sungroh Yoon (Seoul National University)* | In this paper, we address the problem of image anomaly detection and segmentation. Anomaly detection involves making a binary decision as to whether an input image contains an anomaly, and anomaly segmentation aims to locate the anomaly on the pixel level. Support vector data description (SVDD) is a long-standing algorithm used for anomaly detection, and we extend its deep learning variant to a patch-based method using self-supervised learning. This extension enables anomaly segmentation and improves detection performance. As a result, anomaly detection and segmentation performances measured in AUROC on the MVTec AD dataset increased by 9.8% and 7.0%, respectively, compared to the previous state-of-the-art methods. Our results indicate the efficacy of the proposed method and its potential for industrial application. Detailed analysis of the proposed method offers insights regarding its behavior, and the code is available online. | Deep Learning for Computer Vision | Applications of Computer Vision, Vision for X | https://github.com/nuclearboy95/Anomaly-Detection-PatchSVDD-PyTorch | anomaly detection; anomaly segmentation; inspection; anomaly
14 | 196 | Oral | 1-B #01 | 2-D #03 | Meta-Learning with Context-Agnostic Initialisations | Toby Perrett (University of Bristol)*; Alessandro Masullo (University of Bristol); Tilo Burghardt (University of Bristol); Majid Mirmehdi (University of Bristol); Dima Damen (University of Bristol) | Meta-learning approaches have addressed few-shot problems by finding initialisations suited for fine-tuning to target tasks. Often there are additional properties within training data (which we refer to as context), not relevant to the target task, which act as a distractor to meta-learning, particularly when the target task contains examples from a novel context not seen during training. We address this oversight by incorporating a context-adversarial component into the meta-learning process. This produces an initialisation which is both context-agnostic and task-generalised. We evaluate our approach on three commonly used meta-learning algorithms and four case studies. We demonstrate our context-agnostic meta-learning improves results in each case. First, we report few-shot character classification on the Omniglot dataset, using alphabets as context. An average improvement of 4.3% is observed across methods and tasks when classifying characters from an unseen alphabet. Second, we perform few-shot classification on Mini-ImageNet, obtaining context from the label hierarchy, with an average improvement of 2.8%. Third, we perform few-shot classification on CUB, with annotation metadata as context, and demonstrate an average improvement of 1.9%. Fourth, we evaluate on a dataset for personalised energy expenditure predictions from video, using participant knowledge as context. We demonstrate that context-agnostic meta-learning decreases the average mean square error by 30%. | Deep Learning for Computer Vision | | | Few-shot; Meta-learning; Context
15 | 261 | Oral | 1-B #02 | 3-D #03 | D2D: Keypoint Extraction with Describe to Detect Approach | Yurun Tian (Imperial College London)*; Vassileios Balntas (Scape Technologies); Tony Ng (Imperial College London); Axel Barroso-Laguna (Imperial College London); Yiannis Demiris (Imperial College London); Krystian Mikolajczyk (Imperial College London) | In this paper, we present a novel approach that exploits the information within the descriptor space to propose keypoint locations. Detect then describe, or detect and describe jointly are two typical strategies for extracting local descriptors. In contrast, we propose an approach that inverts this process by first describing and then detecting the keypoint locations. Describe-to-Detect (D2D) leverages successful descriptor models without the need for any additional training. Our method selects keypoints as salient locations with high information content which is defined by the descriptors rather than some independent operators. We perform experiments on multiple benchmarks including image matching, camera localisation, and 3D reconstruction. The results indicate that our method improves the matching performance of various descriptors and that it generalises across methods and tasks. | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | | | image matching; local feature; detector; descriptor
16 | 264 | Oral | 1-A #03 | 2-C #03 | Backbone Based Feature Enhancement for Object Detection | Haoqin Ji (Shenzhen University); Weizeng Lu (Shenzhen University); Linlin Shen (Shenzhen University)* | FPN (Feature Pyramid Networks) and many of its variants have been widely used in state-of-the-art object detectors and made remarkable progress in detection performance. However, almost all the architectures of feature pyramid are manually designed, which requires ad hoc design and prior knowledge. Meanwhile, existing methods focus on exploring more appropriate connections to generate features with strong semantics from the inherent pyramidal hierarchy of deep ConvNets (Convolutional Networks). In this paper, we propose a simple but effective approach, named BBFE (Backbone Based Feature Enhancement), to directly enhance the semantics of shallow features from backbone ConvNets. The proposed BBFE consists of two components: reusing backbone weight and personalized feature enhancement. We also propose a fast version of BBFE, named Fast-BBFE, to achieve a better trade-off between efficiency and accuracy. Without bells and whistles, our BBFE improves different baseline methods (both anchor-based and anchor-free) by a large margin (∼2.0 points higher AP) on COCO, surpassing common feature pyramid networks including FPN and PANet. | Deep Learning for Computer Vision | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | | feature enhancement; object detection;
17 | 275 | Oral | 1-A #04 | 2-C #04 | Part-aware Attention Network for Person Re-Identification | Wangmeng Xiang (The Hong Kong Polytechnic University); Jianqiang Huang (Damo Academy, Alibaba Group); Xian-Sheng Hua (Alibaba Group); Lei Zhang (Hong Kong Polytechnic University, Hong Kong, China)* | Multi-level feature aggregation and part feature extraction are widely used to boost the performance of person re-identification (Re-ID). Most multi-level feature aggregation methods treat feature maps on different levels equally and use simple local operations for feature fusion, which neglects the long-distance connection among feature maps. On the other hand, the popular horizontal pooling part-based feature extraction methods may lead to feature misalignment. In this paper, we propose a novel Part-aware Attention Network (PAN) to connect part feature maps and middle-level features. Given a part feature map and a source feature map, PAN uses part features as queries to perform second-order information propagation from the source feature map. The attention is computed based on the compatibility of the source feature map with the part feature map. Specifically, PAN uses high-level part features of different human body parts to aggregate information from mid-level feature maps. As a part-aware feature aggregation method, PAN operates on all spatial positions of feature maps so that it can discover long-distance relations. Extensive experiments show that PAN achieves leading performance on the Re-ID benchmarks Market1501, DukeMTMC, and CUHK03. | Deep Learning for Computer Vision | | | person re-identification; attention
18 | 281 | Oral | 2-A #02 | 2-C #05 | Class-Wise Difficulty-Balanced Loss for Solving Class-Imbalance | Saptarshi Sinha (Hitachi CRL)*; Hiroki Ohashi (Hitachi Ltd); Katsuyuki Nakamura (Hitachi Ltd.) | Class-imbalance is one of the major challenges in real world datasets, where a few classes (called majority classes) constitute much more data samples than the rest (called minority classes). Learning deep neural networks using such datasets leads to performance which is typically biased towards the majority classes. Most of the prior works try to solve class-imbalance by assigning more weights to the minority classes in various manners (e.g., data re-sampling, cost-sensitive learning). However, we argue that the number of available training data may not always be a good clue to determine the weighting strategy because some of the minority classes might be sufficiently represented even by a small number of training data. Overweighting samples of such classes can lead to a drop in the model’s overall performance. We claim that the ‘difficulty’ of a class as perceived by the model is more important to determine the weighting. In this light, we propose a novel loss function named Class-wise Difficulty-Balanced loss, or CDB loss, which dynamically distributes weights to each sample according to the difficulty of the class that the sample belongs to. Note that the assigned weights dynamically change as the ‘difficulty’ for the model may change with the learning progress. Extensive experiments are conducted on both image (artificially induced class-imbalanced MNIST, long-tailed CIFAR and ImageNet-LT) and video (EGTEA) datasets. The results show that CDB loss consistently outperforms the recently proposed loss functions on class-imbalanced datasets irrespective of the data type (i.e., video or image). | Datasets and Performance Analysis | Deep Learning for Computer Vision; Video Analysis and Event Recognition | | class-imbalance; long-tail; class-wise difficulty; dynamically updated weights; dynamic difficulty; weighted loss;
19 | 293 | Oral | 1-B #03 | 2-D #04 | Visually Guided Sound Source Separation using Cascaded Opponent Filter Network | Lingyu Zhu (Tampere University)*; Esa Rahtu (Tampere University) | The objective of this paper is to recover the original component signals from a mixture audio with the aid of visual cues of the sound sources. Such a task is usually referred to as visually guided sound source separation. The proposed Cascaded Opponent Filter (COF) framework consists of multiple stages, which recursively refine the source separation. A key element in COF is a novel opponent filter module that identifies and relocates residual components between sources. The system is guided by the appearance and motion of the source, and, for this purpose, we study different representations based on video frames, optical flows, dynamic images, and their combinations. Finally, we propose a Sound Source Location Masking (SSLM) technique, which, together with COF, produces a pixel level mask of the source location. The entire system is trained in an end-to-end manner using a large set of unlabelled videos. We compare COF with recent baselines and obtain the state-of-the-art performance in three challenging datasets (MUSIC, A-MUSIC, and A-NATURAL). | Deep Learning for Computer Vision | Applications of Computer Vision, Vision for X | https://ly-zhu.github.io/cof-net | sound source separation and localisation; cascaded opponent filter; dynamic image; appearance and motion;
20 | 297 | Oral | 1-D #01 | 2-A #03 | To filter prune, or to layer prune, that is the question | Sara Elkerdawy (University of Alberta)*; Mostafa Elhoushi (Huawei Technologies); Abhineet Singh (University of Alberta); Hong Zhang (University of Alberta); Nilanjan Ray (University of Alberta) | Recent advances in pruning of neural networks have made it possible to remove a large number of filters or weights without any perceptible drop in accuracy. The number of parameters and that of FLOPs are usually the reported metrics to measure the quality of the pruned models. However, the gain in speed for these pruned models is often overlooked in the literature due to the complex nature of latency measurements. In this paper, we show the limitation of filter pruning methods in terms of latency reduction and propose the LayerPrune framework. LayerPrune presents a set of layer pruning methods based on different criteria that achieve higher latency reduction than filter pruning methods at similar accuracy. The advantage of layer pruning over filter pruning in terms of latency reduction is a result of the fact that the former is not constrained by the original model's depth and thus allows for a larger range of latency reduction. For each filter pruning method we examined, we use the same filter importance criterion to calculate a per-layer importance score in one-shot. We then prune the least important layers and fine-tune the shallower model, which obtains comparable or better accuracy than its filter-based pruning counterpart. This one-shot process allows us to remove layers from single-path networks like VGG before fine-tuning, unlike iterative filter pruning, where a minimum number of filters per layer is required to allow for data flow, which constrains the search space. To the best of our knowledge, we are the first to examine the effect of pruning methods on the latency metric instead of FLOPs for multiple networks, datasets and hardware targets. LayerPrune also outperforms handcrafted architectures such as Shufflenet, MobileNet, MNASNet and ResNet18 by 7.3%, 4.6%, 2.8% and 0.5% respectively on a similar latency budget on the ImageNet dataset. | Robot Vision | Deep Learning for Computer Vision | https://github.com/selkerdawy/filter-vs-layer-pruning | Latency Reduction; Layer Pruning; Filter Pruning; CNN Pruning
21 | 301 | Oral | 2-A #04 | 3-C #01 | Descriptor-Free Multi-View Region Matching for Instance-Wise 3D Reconstruction | Takuma Doi (Osaka University); Fumio Okura (Osaka University)*; Toshiki Nagahara (Osaka University); Yasuyuki Matsushita (Osaka University); Yasushi Yagi (Osaka University) | This paper proposes a multi-view extension of instance segmentation without relying on texture or shape descriptor matching. Multi-view instance segmentation becomes challenging for scenes with repetitive textures and shapes, e.g., plant leaves, due to the difficulty of multi-view matching using texture or shape descriptors. To this end, we propose a multi-view region matching method based on epipolar geometry, which does not rely on any feature descriptors. We further show that the epipolar region matching can be easily integrated into instance segmentation and effective for instance-wise 3D reconstruction. Experiments demonstrate the improved accuracy of multi-view instance matching and the 3D reconstruction compared to the baseline methods. | Applications of Computer Vision, Vision for X | 3D Computer Vision; Biomedical Image Analysis; Segmentation and Grouping | | region matching; 3D reconstruction; multi-view correspondence; instance segmentation; plant phenotyping
22 | 321 | Oral | 1-D #02 | 3-B #04 | Long-Term Cloth-Changing Person Re-identification | Xuelin Qian (Fudan University); Wenxuan Wang (Fudan University); Li Zhang (University of Oxford); Fangrui Zhu (Fudan University); Yanwei Fu (Fudan University)*; Tao Xiang (University of Surrey); Yu-Gang Jiang (Fudan University); Xiangyang Xue (Fudan University) | Person re-identification (Re-ID) aims to match a target person across camera views at different locations and times. Existing Re-ID studies focus on the short-term cloth-consistent setting, under which a person re-appears in different camera views with the same outfit. A discriminative feature representation learned by existing deep Re-ID models is thus dominated by the visual appearance of clothing. In this work, we focus on a much more difficult yet practical setting where person matching is conducted over long-duration, e.g., over days and months, and therefore inevitably under the new challenge of changing clothes. This problem, termed Long-Term Cloth-Changing (LTCC) Re-ID, is much understudied due to the lack of large-scale datasets. The first contribution of this work is a new LTCC dataset containing people captured over a long period of time with frequent clothing changes. As a second contribution, we propose a novel Re-ID method specifically designed to address the cloth-changing challenge. Specifically, we consider that under cloth-changes, soft-biometrics such as body shape would be more reliable. We, therefore, introduce a shape embedding module as well as a cloth-elimination shape-distillation module aiming to eliminate the now unreliable clothing appearance features and focus on the body shape information. Extensive experiments show that superior performance is achieved by the proposed model on the new LTCC dataset. The dataset is available on the project website: https://naiq.github.io/LTCC_Perosn_ReID.html. | Deep Learning for Computer Vision | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | https://naiq.github.io/LTCC_Perosn_ReID.html | Person Re-identification; Long-Term; Cloth-Changing; LTCC Dataset
23 | 327 | Oral | 2-A #05 | 2-C #06 | Dehazing Cost Volume for Deep Multi-view Stereo in Scattering Media | Yuki Fujimura (Kyoto University)*; Motoharu Sonogashira (Kyoto University); Masaaki Iiyama (Kyoto University) | We propose a learning-based multi-view stereo (MVS) method in scattering media such as fog or smoke with a novel cost volume, called the dehazing cost volume. An image captured in scattering media degrades due to light scattering and attenuation caused by suspended particles. This degradation depends on scene depth; thus it is difficult for MVS to evaluate photometric consistency because the depth is unknown before three-dimensional reconstruction. Our dehazing cost volume can solve this chicken-and-egg problem of depth and scattering estimation by computing the scattering effect using swept planes in the cost volume. Experimental results on synthesized hazy images indicate the effectiveness of our dehazing cost volume compared with the ordinary cost volume in scattering media. We also demonstrate the applicability of our dehazing cost volume to real foggy scenes. | 3D Computer Vision | Low-level Vision, Image Processing | https://github.com/yfujimura/DCV-release | multi-view stereo; scattering media; cost volume; light scattering
24 | 363 | Oral | 2-D #05 | 3-B #05 | Goal-GAN: Multimodal Trajectory Prediction Based on Goal Position Estimation | Patrick Dendorfer (TUM)*; Aljosa Osep (TUM Munich); Laura Leal-Taixé (TUM) | In this paper, we present Goal-GAN, an interpretable and end-to-end trainable model for human trajectory prediction. Inspired by human navigation, we model the task of trajectory prediction as an intuitive two-stage process: (i) goal estimation, which predicts the most likely target positions of the agent, followed by a (ii) routing module which estimates a set of plausible trajectories that route towards the estimated goal. We leverage information about the past trajectory and visual context of the scene to estimate a multi-modal probability distribution over the possible goal positions, which is used to sample a potential goal during the inference. The routing is governed by a recurrent neural network that reacts to physical constraints in the nearby surroundings and generates feasible paths that route towards the sampled goal. Our extensive experimental evaluation shows that our method establishes a new state-of-the-art on several benchmarks while being able to generate a realistic and diverse set of trajectories that conform to physical constraints. | Motion and Tracking | | https://github.com/dendorferpatrick/GoalGAN | pedestrian trajectory prediction; goal; GAN; motion modelling; visual features; gumbel softmax; multimodality
25 | 370 | Oral | 1-A #05 | 1-C #06 | RF-GAN: A Light and Reconfigurable Network for Unpaired Image-to-Image Translation | Ali Koksal (Nanyang Technological University); Shijian Lu (Nanyang Technological University)* | Generative adversarial networks (GANs) have been widely studied for unpaired image-to-image translation in recent years. On the other hand, state-of-the-art translation GANs are often constrained by large model sizes and inflexibility in translating across various domains. Inspired by the observation that the mappings between two domains are often approximately invertible, we design an innovative reconfigurable GAN (RF-GAN) that has a small size but is versatile in high-fidelity image translation either across two domains or among multiple domains. One unique feature of RF-GAN lies with its single generator which is reconfigurable and can perform bidirectional image translations by swapping its parameters. In addition, a multi-domain discriminator is designed which allows joint discrimination of original and translated samples in multiple domains. Experiments over eight unpaired image translation datasets (on various tasks such as object transfiguration, season transfer, and painters' style transfer, etc.) show that RF-GAN reduces the model size by up to 75% as compared with state-of-the-art translation GANs but consistently produces superior image translation performance with a lower Fréchet Inception Distance. | Generative models for computer vision | Deep Learning for Computer Vision | | Generative Adversarial Networks; Image-to-image Translation; Image Synthesis; Style Transfer; Adversarial Learning
26 | 420 | Oral | 1-B #04 | 1-D #03 | Efficient Large-Scale Semantic Visual Localization in 2D Maps | Tomas Vojir (CMP CTU)*; Ignas Budvytis (Department of Engineering, University of Cambridge); Roberto Cipolla (University of Cambridge) | With the emergence of autonomous navigation systems, image-based localization is one of the essential tasks to be tackled. However, most of the current algorithms struggle to scale to city-size environments, mainly because of the need to collect large (semi-)annotated datasets for CNN training and to create test-environment databases of images, keypoint-level features or image embeddings. This data acquisition is not only expensive and time-consuming but also may cause privacy concerns. In this work, we propose a novel framework for semantic visual localization in city-scale environments which alleviates the aforementioned problem by using freely available 2D maps such as OpenStreetMap. Our method does not require any images or image-map pairs for training or test environment database collection. Instead, a robust embedding is learned from a depth and building instance label information of a particular location in the 2D map. At test time, this embedding is extracted from a panoramic building instance label and depth images. It is then used to retrieve the closest match in the database. We evaluate our localization framework on two large-scale datasets consisting of Cambridge and San Francisco cities with a total length of drivable roads spanning over 500 km and including approximately 110k unique locations. To the best of our knowledge, this is the first large-scale semantic localization method which works on par with approaches that require the availability of images at train time or for test environment database creation. | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | Big Data, Large Scale Methods | | localization; map; large-scale; retrieval; embedding
27 | 455 | Oral | 1-A #06 | 3-D #04 | SpotPatch: Parameter-Efficient Transfer Learning for Mobile Object Detection | Keren Ye (University of Pittsburgh)*; Adriana Kovashka (University of Pittsburgh); Mark Sandler (Google); Menglong Zhu (UPenn); Andrew Howard (Google); Marco Fornoni (Google) | Deep learning based object detectors are commonly deployed on mobile devices to solve a variety of tasks. For maximum accuracy, each detector is usually trained to solve one single specific task, and comes with a completely independent set of parameters. While this guarantees high performance, it is also highly inefficient, as each model has to be separately downloaded and stored. In this paper we address the question: can task-specific detectors be trained and represented as a shared set of weights, plus a very small set of additional weights for each task? The main contributions of this paper are the following: 1) we perform the first systematic study of parameter-efficient transfer learning techniques for object detection problems; 2) we propose a technique to learn a model patch with a size that is dependent on the difficulty of the task to be learned, and validate our approach on 10 different object detection tasks. Our approach achieves similar accuracy as previously proposed approaches, while being significantly more compact. | Applications of Computer Vision, Vision for X | Deep Learning for Computer Vision; Recognition: Feature Detection, Indexing, Matching, and Shape Representation | | object detection; parameter-efficient transfer learning; mobile vision applications
28 | 499 | Oral | 2-B #03 | 2-D #06 | DeepSEE: Deep Disentangled Semantic Explorative Extreme Super-Resolution | Marcel C. Bühler (ETH Zürich)*; Andrés Romero (ETH Zürich); Radu Timofte (ETH Zurich) | Super-resolution (SR) is by definition ill-posed. There are infinitely many plausible high-resolution variants for a given low-resolution natural image. Most of the current literature aims at a single deterministic solution of either high reconstruction fidelity or photo-realistic perceptual quality. In this work, we propose an explorative facial super-resolution framework, DeepSEE, for Deep disentangled Semantic Explorative Extreme super-resolution. To the best of our knowledge, DeepSEE is the first method to leverage semantic maps for explorative super-resolution. In particular, it provides control of the semantic regions, their disentangled appearance and it allows a broad range of image manipulations. We validate DeepSEE on faces, for up to 32x magnification and exploration of the space of super-resolution. Our code and models are available at: https://mcbuehler.github.io/DeepSEE/ | Generative models for computer vision | Deep Learning for Computer Vision; Segmentation and Grouping | https://mcbuehler.github.io/DeepSEE/ | explorative super-resolution; face hallucination; stochastic super-resolution; extreme super-resolution; disentanglement
29 | 505 | Oral | 2-D #07 | 3-B #06 | Learning 3D Face Reconstruction with a Pose Guidance Network | Pengpeng Liu (The Chinese University of Hong Kong)*; Xintong Han (Malong Technologies); Michael Lyu (The Chinese University of Hong Kong); Irwin King (The Chinese University of Hong Kong); Jia Xu (Huya AI) | We present a self-supervised learning approach to learning monocular 3D face reconstruction with a pose guidance network (PGN). First, we unveil the bottleneck of pose estimation in prior parametric 3D face learning methods, and propose to utilize 3D face landmarks for estimating pose parameters. With our specially designed PGN, our model can learn from both faces with fully labeled 3D landmarks and unlimited unlabeled in-the-wild face images. Our network is further augmented with a self-supervised learning scheme, which exploits face geometry information embedded in multiple frames of the same person, to alleviate the ill-posed nature of regressing 3D face geometry from a single image. These three insights yield a single approach that combines the complementary strengths of parametric model learning and data-driven learning techniques. We conduct a rigorous evaluation on the challenging AFLW2000-3D, Florence and FaceWarehouse datasets, and show that our method outperforms the state-of-the-art for all metrics. | Face, Pose, Action, and Gesture | 3D Computer Vision | | 3D face reconstruction, pose guidance network, self-supervised learning, learning from videos
30 | 515 | Oral | 2-A #06 | 3-C #02 | Domain-transferred Face Augmentation Network | Hao-Chiang Shao (Fu Jen Catholic University); Kang-Yu Liu (National Tsing Hua University); Chia-Wen Lin (National Tsing Hua University)*; Jiwen Lu (Tsinghua University) | The performance of a convolutional neural network (CNN) based face recognition model largely relies on the richness of labelled training data. However, it is expensive to collect a training set with large variations of a face identity under different poses and illumination changes, so the diversity of within-class face images becomes a critical issue in practice. In this paper, we propose a 3D model-assisted domain-transferred face augmentation network (DotFAN) that can generate a series of variants of an input face based on the knowledge distilled from existing rich face datasets of other domains. Extending from StarGAN's architecture, DotFAN integrates two additional subnetworks, i.e., a face expert model (FEM) and a face shape regressor (FSR), for latent facial code control. While FSR aims to extract face attributes, FEM is designed to capture a face identity. With their aid, DotFAN can separately learn facial feature codes and effectively generate face images of various facial attributes while keeping the identity of augmented faces unaltered. Experiments show that DotFAN is beneficial for augmenting small face datasets to improve their within-class diversity so that a better face recognition model can be learned from the augmented dataset. | Applications of Computer Vision, Vision for X | Face, Pose, Action, and Gesture; Generative models for computer vision | | face augmentation; GAN; domain knowledge transfer; generative model.
31 | 523 | Oral | 2-B #04 | 3-D #05 | A Sparse Gaussian Approach to Region-Based 6DoF Object Tracking | Manuel Stoiber (German Aerospace Center (DLR))*; Martin Pfanne (German Aerospace Center); Klaus H. Strobl (DLR); Rudolph Triebel (German Aerospace Center (DLR)); Alin Albu-Schaeffer (Robotics and Mechatronics Center (RMC), German Aerospace Center (DLR)) | We propose a novel, highly efficient sparse approach to region-based 6DoF object tracking that requires only a monocular RGB camera and the 3D object model. The key contribution of our work is a probabilistic model that considers image information sparsely along correspondence lines. For the implementation, we provide a highly efficient discrete scale-space formulation. In addition, we derive a novel mathematical proof that shows that our proposed likelihood function follows a Gaussian distribution. Based on this information, we develop robust approximations for the derivatives of the log-likelihood that are used in a regularized Newton optimization. In multiple experiments, we show that our approach outperforms state-of-the-art region-based methods in terms of tracking success while being about one order of magnitude faster. The source code of our tracker is publicly available. | Motion and Tracking | Optimization Methods | https://github.com/DLR-RM/RBGT | 6DoF object tracking; pose estimation; region-based; sparse; Gaussian; real-time; Newton optimization; monocular;
32 | 542 | Oral | 1-B #05 | 3-D #06 | Encode the Unseen: Predictive Video Hashing for Scalable Mid-Stream Retrieval | Tong Yu (University of Strasbourg)*; Nicolas Padoy (University of Strasbourg) | This paper tackles a new problem in computer vision: mid-stream video-to-video retrieval. This task, which consists in searching a database for content similar to a video right as it is playing, e.g. from a live stream, exhibits challenging characteristics. Only the beginning part of the video is available as query and new frames are constantly added as the video plays out. To perform retrieval in this demanding situation, we propose an approach based on a binary encoder that is both predictive and incremental in order to (1) account for the missing video content at query time and (2) keep up with repeated, continuously evolving queries throughout the streaming. In particular, we present the first hashing framework that infers the unseen future content of a currently playing video. Experiments on FCVID and ActivityNet demonstrate the feasibility of this task. Our approach also yields a significant mAP@20 performance increase compared to a baseline adapted from the literature for this task, for instance 7.4% (2.6%) increase at 20% (50%) of elapsed runtime on FCVID using bitcodes of size 192 bits. | Deep Learning for Computer Vision | Video Analysis and Event Recognition | | video retrieval; video hashing; distillation; incremental search
33 | 554 | Oral | 2-A #07 | 3-C #03 | Image Inpainting with Onion Convolutions | Shant Navasardyan (Picsart Inc.)*; Marianna Ohanyan (Picsart Inc.) | Recently, deep learning methods have achieved great success on the image inpainting problem. However, reconstructing continuities of complex structures with non-stationary textures remains a challenging task for computer vision. In this paper, a novel approach to the image inpainting problem is presented, which adapts exemplar-based methods for deep convolutional neural networks. The concept of onion convolution is introduced with the purpose of preserving feature continuities and semantic coherence. Similar to recent approaches, our onion convolution is able to capture long-range spatial correlations. In general, the implementation of modules with such ability in low-level features leads to impractically high latency and complexity. To address these limitations, the onion convolution suggests an efficient implementation. As qualitative and quantitative comparisons show, our method with onion convolutions outperforms state-of-the-art methods by producing more realistic, visually plausible and semantically coherent results. | Deep Learning for Computer Vision | Computational Photography, Sensing, and Display; Generative models for computer vision; Low-level Vision, Image Processing | | image inpainting; onion convolution; object removal; patch matching; attention
34 | 559 | Oral | 3-B #07 | 3-D #07 | In Defense of LSTMs for Addressing Multiple Instance Learning Problems | Kaili Wang (KU Leuven, UAntwerpen)*; Jose Oramas (UAntwerp, imec-IDLab); Tinne Tuytelaars (KU Leuven) | LSTMs have a proven track record in analyzing sequential data. But what about unordered instance bags, as found under a Multiple Instance Learning (MIL) setting? While not often used for this, we show LSTMs excel under this setting too. In addition, we show that LSTMs are capable of indirectly capturing instance-level information using only bag-level annotations. Thus, they can be used to learn instance-level models in a weakly supervised manner. Our empirical evaluation on both simplified (MNIST) and realistic (Lookbook and Histopathology) datasets shows that LSTMs are competitive with or even surpass state-of-the-art methods specially designed for handling specific MIL problems. Moreover, we show that their performance on instance-level prediction is close to that of fully-supervised methods. | Deep Learning for Computer Vision | Applications of Computer Vision, Vision for X | https://github.com/shadowwkl/LSTM-for-Multiple-Instance-Learning | LSTM; multiple instance learning; weakly supervised learning
35 | 560 | Oral | 1-A #07 | 1-B #06 | Sketch-to-Art: Synthesizing Stylized Art Images From Sketches | Bingchen Liu (Rutgers, The State University of New Jersey)*; Kunpeng Song (Rutgers University); Yizhe Zhu (Rutgers University); Ahmed Elgammal (-) | We propose a new approach for synthesizing fully detailed art-stylized images from sketches. Given a sketch, with no semantic tagging, and a reference image of a specific style, the model can synthesize meaningful details with colors and textures. Based on the GAN framework, the model consists of three novel modules designed explicitly for better artistic style capturing and generation. To enforce the content faithfulness, we introduce the dual-masked mechanism which directly shapes the feature maps according to sketch. To capture more artistic style aspects, we design feature-map transformation for a better style consistency to the reference image. Finally, an inverse process of instance-normalization disentangles the style and content information and further improves the synthesis quality. Experiments demonstrate a significant qualitative and quantitative boost over baseline models based on previous state-of-the-art techniques, modified for the proposed task (17% better Fréchet Inception Distance and 18% better style classification score). Moreover, the lightweight design of the proposed modules enables the high-quality synthesis at 512 * 512 resolution. | Applications of Computer Vision, Vision for X | Deep Learning for Computer Vision; Generative models for computer vision | https://github.com/odegeasslbc/Sketch2art-pytorch | sketch to image; art image synthesis; art style transfer; image synthesis; sketch to art
36 | 604 | Oral | 2-B #05 | 3-D #08 | FreezeNet: Full Performance by Reduced Storage Costs | Paul Wimmer (Luebeck University / Robert Bosch GmbH)*; Jens Mehnert (Robert Bosch GmbH); Alexandru Condurache (Bosch) | Pruning generates sparse networks by setting parameters to zero. In this work we improve one-shot pruning methods, applied before training, without adding any additional storage costs while preserving the sparse gradient computations. The main difference to pruning is that we do not sparsify the network's weights but learn just a few key parameters and keep the other ones fixed at their random initialized value. This mechanism is called freezing the parameters. Those frozen weights can be stored efficiently with a single 32-bit random seed number. The parameters to be frozen are determined one-shot by a single forward and backward pass applied before training starts. We call the introduced method FreezeNet. In our experiments we show that FreezeNets achieve good results, especially for extreme freezing rates. Freezing weights preserves the gradient flow throughout the network and, consequently, FreezeNets train better and have an increased capacity compared to their pruned counterparts. On the classification tasks MNIST and CIFAR-10/100 we outperform SNIP, in this setting the best reported one-shot pruning method applied before training. On MNIST, FreezeNet achieves 99.2% of the performance of the baseline LeNet-5-Caffe architecture, while compressing the number of trained and stored parameters by a factor of ×157. | Deep Learning for Computer Vision | Datasets and Performance Analysis; Optimization Methods; Statistical Methods and Learning | | Network Pruning; Random Weights; Sparse Gradients; Preserved Gradient Flow; Backpropagation
37 | 608 | Oral | 2-A #08 | 2-C #07 | Chromatic Aberration Correction Using Cross-Channel Prior in Shearlet Domain | Kunyi Li (Tsinghua University); Xin Jin (Tsinghua University)* | Instead of relying on more expensive and complex optics, much research in recent years has focused on high-quality photography using lightweight cameras, such as a single-ball lens, combined with computational image processing. Traditional methods for image enhancement do not comprehensively address the blurring artifacts caused by strong chromatic aberrations in images produced by a simple optical system. In this paper, we propose a new method to correct both lateral and axial chromatic aberrations based on their different characteristics. To eliminate lateral chromatic aberration, a cross-channel prior in the shearlet domain is proposed to align the texture information of the red and blue channels to the green channel. We also propose a new PSF estimation method to better correct axial chromatic aberration using a wave propagation model, which requires the F-number of the optical system. Simulation results demonstrate that our method can provide aberration-free images while there are still some artifacts in the results of the state-of-the-art methods. PSNRs of simulation results increase by at least 2 dB and SSIM is on average 6.29% to 41.26% better than other methods. Real-captured image results prove that the proposed prior can effectively remove lateral chromatic aberration while the proposed PSF model can further correct the axial chromatic aberration. | Low-level Vision, Image Processing | Computational Photography, Sensing, and Display | chromatic aberration correction; shearlet; PSF estimation; wave propagation; deconvolution; image enhancement; | | | | | 
38 | 618 | Oral | 2-B #06 | 2-D #08 | Project to Adapt: Domain Adaptation for Depth Completion from Noisy and Sparse Sensor Data | Adrian Lopez-Rodriguez (Imperial College London)*; Benjamin Busam (Technical University of Munich); Krystian Mikolajczyk (Imperial College London) | Depth completion aims to predict a dense depth map from a sparse depth input. The acquisition of dense ground truth annotations for depth completion settings can be difficult and, at the same time, a significant domain gap between real LiDAR measurements and synthetic data has prevented successful training of models in virtual settings. We propose a domain adaptation approach for sparse-to-dense depth completion that is trained from synthetic data, without annotations in the real domain or additional sensors. Our approach simulates the real sensor noise in an RGB + LiDAR set-up, and consists of three modules: simulating the real LiDAR input in the synthetic domain via projections, filtering the real noisy LiDAR for supervision and adapting the synthetic RGB image using a CycleGAN approach. We extensively evaluate these modules against the state-of-the-art in the KITTI depth completion benchmark, showing significant improvements. | 3D Computer Vision | RGBD and Depth Image Processing | https://github.com/alopezgit/project-adapt | depth completion; LiDAR; synthetic; domain adaptation; depth | | | | 
39 | 622 | Oral | 1-A #08 | 3-C #04 | An Efficient Group Feature Fusion Residual Network for Image Super-Resolution | Pengcheng Lei (University of Shanghai for Science and Technology); Cong Liu (University of Shanghai for Science and Technology)* | Convolutional neural networks (CNNs) have made great breakthroughs in the field of image super-resolution (SR). However, most current methods usually improve their performance by simply increasing the depth of the network. Although this strategy can yield promising results, it is inefficient in many real-world scenarios because of the high computational cost. In this paper, we propose an efficient group feature fusion residual network (GFFRN) for image super-resolution. In detail, we design a novel group feature fusion residual block (GFFRB) to group and fuse the features of the intermediate layer. In this way, GFFRB can enjoy the merits of the light weight of group convolution and the high efficiency of skip connections, thus achieving better performance compared with most current residual blocks. Experiments on the benchmark test sets show that our models are more efficient than most of the state-of-the-art methods. | Low-level Vision, Image Processing | Deep Learning for Computer Vision | https://github.com/lpcccc-cv/GFFRN-ACCV2020.git | super-resolution;group convolution;skip connection;resnet;densenet;pixelshuffle;feature fusion;channel expansion;efficiency | | | | 
40 | 648 | Oral | 3-A #06 | 3-C #05 | Adversarial Image Composition with Auxiliary Illumination | Fangneng Zhan (Nanyang Technological University); Shijian Lu (Nanyang Technological University)*; Changgong Zhang (Alibaba Group); Feiying Ma (Alibaba); Xuansong Xie (Alibaba) | Dealing with the inconsistency between a foreground object and a background image is a challenging task in high-fidelity image composition. State-of-the-art methods strive to harmonize the composed image by adapting the style of foreground objects to be compatible with the background image, whereas the potential shadow that foreground objects cast within the composed image, which is critical to the composition realism, is largely neglected. In this paper, we propose an Adversarial Image Composition Net (AIC-Net) that achieves realistic image composition by considering potential shadows that the foreground object projects in the composed image. A novel branched generation mechanism is proposed, which disentangles the generation of shadows and the transfer of foreground styles to accomplish the two tasks simultaneously and optimally. A differentiable spatial transformation module is designed which bridges the local harmonization and the global harmonization to achieve their joint optimization effectively. Extensive experiments on pedestrian and car composition tasks show that the proposed AIC-Net achieves superior composition performance qualitatively and quantitatively. | Low-level Vision, Image Processing | Deep Learning for Computer Vision | image composition, illumination, GAN | | | | | 
41 | 656 | Oral | 1-B #07 | 1-D #04 | Self-Supervised Multi-View Synchronization Learning for 3D Pose Estimation | Simon Jenni (Universität Bern)*; Paolo Favaro (University of Bern) | Current state-of-the-art methods cast monocular 3D human pose estimation as a learning problem by training neural networks on costly large data sets of images and corresponding skeleton poses. In contrast, we propose an approach that can exploit small annotated data sets by fine-tuning networks pre-trained via self-supervised learning on (large) unlabeled data sets. To drive such models in the pre-training step towards supporting 3D pose estimation, we introduce a novel self-supervised feature learning task designed to focus on the 3D structure in an image. We exploit images extracted from videos captured with a multi-view camera system. The task is to classify whether two images depict two views of the same scene up to a rigid transformation. In a multi-view data set, where objects deform in a non-rigid manner, a rigid transformation occurs only between two views taken at the exact same time, i.e., when they are synchronized. We demonstrate the effectiveness of the synchronization task on the Human3.6M data set and achieve state-of-the-art results in 3D human pose estimation. | Face, Pose, Action, and Gesture | 3D Computer Vision; Deep Learning for Computer Vision; Statistical Methods and Learning | https://github.com/sjenni/multiview-sync-ssl | pose estimation; self-supervised; 3D; synchronization; multi-view; feature learning; human pose; unsupervised | | | | 
42 | 680 | Oral | 1-C #07 | 2-A #09 | Adversarial Refinement Network for Human Motion Prediction | Xianjin Chao (The City University of Hong Kong)*; Yanrui Bin (HUST); Wenqing Chu (Tencent); Xuan Cao (Tencent); Yanhao Ge (Tencent); Chengjie Wang (Tencent); Jilin Li (Tencent); Feiyue Huang (Tencent); Howard Leung (City University of Hong Kong) | Human motion prediction aims to predict future 3D skeletal sequences given limited observed human motion as input. Two popular methods, recurrent neural networks and feed-forward deep networks, are able to predict the rough motion trend, but motion details such as limb movement may be lost. To predict more accurate future human motion, we propose an Adversarial Refinement Network (ARNet) following a simple yet effective coarse-to-fine mechanism with novel adversarial error augmentation. Specifically, we take both the historical motion sequences and the coarse prediction as input of our cascaded refinement network to predict refined human motion and strengthen the refinement network with adversarial error augmentation. During training, we deliberately introduce the error distribution by learning through the adversarial mechanism among different subjects. In testing, our cascaded refinement network alleviates the prediction error from the coarse predictor, resulting in robustly accurate predictions. This adversarial error augmentation provides rich error cases as input to our refinement network, leading to better generalization performance on the testing dataset. We conduct extensive experiments on three standard benchmark datasets and show that our proposed ARNet outperforms other state-of-the-art methods, especially on challenging aperiodic actions in both short-term and long-term predictions. | Motion and Tracking | 3D Computer Vision; Face, Pose, Action, and Gesture | https://github.com/Xianjin111/ARNet-for-human-motion-prediction | Human motion prediction; Coarse-to-fine; Adversarial learning; 3D; | | | | 
43 | 697 | Oral | 1-A #09 | 1-D #05 | 3D Human Motion Estimation via Motion Compression and Refinement | Zhengyi Luo (Carnegie Mellon University)*; S. Alireza Golestaneh (Carnegie Mellon University); Kris M. Kitani (Carnegie Mellon University) | We develop a technique for generating smooth and accurate 3D human pose and motion estimates from RGB video sequences. Our technique, which we call Motion Estimation via Variational Autoencoder (MEVA), decomposes a temporal sequence of human motion into a smooth motion representation using auto-encoder-based motion compression and a residual representation learned through motion refinement. This two-step encoding process of human motion can represent a wide variety of general human motions while also retaining person-specific motion details. Experiments show that our method produces both smooth and accurate 3D human pose and motion estimates. | Face, Pose, Action, and Gesture | Generative models for computer vision | https://github.com/ZhengyiLuo/MEVA.git | 3D human pose estimation; human motion estimation; variational autoencoder; generative models; temporal smoothness | ||||
44 | 705 | Oral | 2-B #07 | 3-D #09 | Lossless Image Compression Using a Multi-Scale Progressive Statistical Model | Honglei Zhang (Nokia Technologies)*; Francesco Cricri (Nokia Technologies); Hamed R. Tavakoli (Nokia Technologies); Nannan Zou (Tampere University); Emre Aksu (Nokia Technologies); Miska M. Hannuksela (Nokia Technologies) | Lossless image compression is an important technique for image storage and transmission when information loss is not allowed. With the fast development of deep learning techniques, deep neural networks have been used in this field to achieve a higher compression rate. Methods based on pixel-wise autoregressive statistical models have shown good performance. However, their sequential processing prevents these methods from being used in practice. Recently, multi-scale autoregressive models have been proposed to address this limitation. Multi-scale approaches can use parallel computing systems efficiently and build practical systems. Nevertheless, these approaches sacrifice compression performance in exchange for speed. In this paper, we propose a multi-scale progressive statistical model that takes advantage of the pixel-wise approach and the multi-scale approach. We developed a flexible mechanism where the processing order of the pixels can be adjusted easily. Our proposed method outperforms the state-of-the-art lossless image compression methods on two large benchmark datasets by a significant margin without dramatically degrading the inference speed. | Statistical Methods and Learning | Big Data, Large Scale Methods; Deep Learning for Computer Vision | Image Compression; Statistical Model; Lossless Image Compression; Deep Learning | | | | | 
45 | 717 | Oral | 1-D #06 | 2-A #10 | A cost-effective method for improving and re-purposing large, pre-trained GANs by fine-tuning their class-embeddings | Qi Li (Auburn University); Long Mai (Adobe Research); Michael A. Alcorn (Auburn University); Anh Nguyen (Auburn University)* | Large, pre-trained generative models have been increasingly popular and useful to both the research and wider communities. Specifically, BigGAN, a class-conditional Generative Adversarial Network trained on ImageNet, achieved excellent, state-of-the-art capability in generating realistic photos. However, fine-tuning or training BigGANs from scratch is practically impossible for most researchers and engineers because (1) GAN training is often unstable and suffers from mode collapse; and (2) the training requires a significant amount of computation: 256 Google TPUs for 2 days or 8xV100 GPUs for 15 days. Importantly, many pre-trained generative models both in NLP and image domains were found to contain biases that are harmful to society. Thus, we need computationally-feasible methods for modifying and re-purposing these huge, pre-trained models for downstream tasks. In this paper, we propose a cost-effective optimization method for improving and re-purposing BigGANs by fine-tuning only the class-embedding layer. We show the effectiveness of our model-editing approach in three tasks: (1) significantly improving the realism and diversity of samples of complete mode-collapse classes; (2) re-purposing ImageNet BigGANs for generating images for Places365; and (3) de-biasing or improving the sample diversity for selected ImageNet classes. | Generative models for computer vision | Big Data, Large Scale Methods; Deep Learning for Computer Vision | https://github.com/qilimk/biggan-am | Activation Maximization; Class embeddings; Mode-collapse; Re-purposing BigGAN; Improving the sample diversity | | | | 
46 | 719 | Oral | 1-A #10 | 1-C #08 | Dense Pixel-wise Micro-motion Estimation of Object Surface by using Low Dimensional Embedding of Laser Speckle Pattern | Ryusuke Sagawa ("AIST, Japan")*; Yusuke Higuchi (Kyushu University); Hiroshi Kawasaki (Kyushu univ.); Ryo Furukawa (Hiroshima city univ.); Takahiro Ito (AIST) | This paper proposes a method of estimating micro-motion of an object at each pixel that is too small to detect under a common setup of camera and illumination. The method introduces an active-lighting approach to make the motion visually detectable. The approach is based on the speckle pattern, which is produced by the mutual interference of laser light on the object's surface and continuously changes its appearance according to the out-of-plane motion of the surface. In addition, the speckle pattern becomes uncorrelated with large motion. To compensate for such micro and large motions, the method estimates the motion parameters up to scale at each pixel by nonlinear embedding of the speckle pattern into a low-dimensional space. The out-of-plane motion is calculated by making the motion parameters spatially consistent across the image. In the experiments, the proposed method is compared with other measuring devices to prove its effectiveness. | Computational Photography, Sensing, and Display | 3D Computer Vision; Physics-based Vision and Shape from X | micro-motion estimation; laser speckle; low dimensional embedding | | | | | 
47 | 751 | Oral | 2-C #08 | 3-A #07 | Learning More Accurate Features for Semantic Segmentation in CycleNet | Linzi Qu (Xidian University)*; Lihuo He (Xidian University); JunJie Ke (Xidian University); Xinbo Gao (Xidian University); Wen Lu (Xidian University) | Contextual information is essential for computer vision tasks, especially semantic segmentation. Previous works generally focus on how to collect contextual information by enlarging the size of receptive field, such as PSPNet, DenseASPP. In contrast to previous works, this paper proposes a new network -- CycleNet, which considers assigning a more accurate representative for every pixel. It consists of two modules, Cycle Atrous Spatial Pyramid Pooling (CycleASPP) and Alignment with Deformable Convolution (ADC). The former realizes dense connections between a series of atrous convolution layers with different dilation rates. Not only the forward connections can aggregate more contextual information, but also the backward connections can pay more attention to important information by transferring high-level features to low-level layers. Besides, ADC generates accurate information during the decoding process. It draws support from deformable convolution to select and recombine features from different blocks, thus improving the misalignment issues caused by simple interpolation. A set of experiments have been conducted on Cityscapes and ADE20K to demonstrate the effectiveness of CycleNet. In particular, our model achieved 46.14% mIoU on ADE20K validation set. | Deep Learning for Computer Vision | Segmentation and Grouping | Semantic segmentation; recurrent connection; misalignment issue; deformable convolution | |||||
48 | 775 | Oral | 3-A #08 | 3-C #06 | Deep Snapshot HDR Imaging Using Multi-Exposure Color Filter Array | Takeru Suda (Tokyo Institute of Technology); Masayuki Tanaka (Tokyo Institute of Technology); Yusuke Monno (Tokyo Institute of Technology)*; Masatoshi Okutomi (Tokyo Institute of Technology) | In this paper, we propose a deep snapshot high dynamic range (HDR) imaging framework that can effectively reconstruct an HDR image from the RAW data captured using a multi-exposure color filter array (ME-CFA), which consists of a mosaic pattern of RGB filters with different exposure levels. To effectively learn the HDR image reconstruction network, we introduce the idea of luminance normalization that simultaneously enables effective loss computation and input data normalization by considering relative local contrasts in the "normalized-by-luminance'' HDR domain. This idea makes it possible to equally handle the errors in both bright and dark areas regardless of absolute luminance levels, which significantly improves the visual image quality in a tone-mapped domain. Experimental results using two public HDR image datasets demonstrate that our framework outperforms other snapshot methods and produces high-quality HDR images with fewer visual artifacts. | Computational Photography, Sensing, and Display | http://www.ok.sc.e.titech.ac.jp/res/DSHDR/ | high dynamic range imaging; HDR; demosaicking; color filter array; CFA | |||||
49 | 803 | Oral | 1-D #07 | 3-A #09 | A Benchmark and Baseline for Language-Driven Image Editing | Jing Shi (University of Rochester)*; Ning Xu (Adobe Research); Trung Bui (Adobe Research); Franck Dernoncourt (Adobe Research); Zheng Wen (DeepMind); Chenliang Xu (University of Rochester) | Language-driven image editing can significantly reduce laborious image editing work and is friendly to photography novices. However, most similar work can only deal with a specific image domain or can only perform global retouching. To solve this new task, we first present a new language-driven image editing dataset that supports both local and global editing with editing operation and mask annotations. Besides, we also propose a baseline method that fully utilizes the annotations to solve this problem. Our new method treats each editing operation as a sub-module and can automatically predict operation parameters. Not only does our approach perform well on challenging user data, it is also highly interpretable. We believe our work, including both the benchmark and the baseline, will advance the image editing area towards a more general and free-form level. | Deep Learning for Computer Vision | Applications of Computer Vision, Vision for X; Datasets and Performance Analysis | https://github.com/jshi31/LDIE_ACCV | Computational Photography; Vision-and-Language | | | | 
50 | 817 | Oral | 1-A #11 | 3-B #08 | Bidirectional Pyramid Networks for Semantic Segmentation | Dong Nie (UNC)*; Jia Xue (Rutgers University); Xiaofeng Ren (Alibaba group) | Semantic segmentation is a fundamental problem in computer vision that has attracted a lot of attention. Recent efforts have been devoted to network architecture innovations for efficient semantic segmentation that can run in real-time for autonomous driving and other applications. Information flow between scales is crucial because accurate segmentation needs both large context and fine detail. However, most existing approaches still rely on pretrained backbone models (e.g. ResNet on ImageNet). In this work, we propose to open up the backbone and design a simple yet effective multiscale network architecture, Bidirectional Pyramid Network (BPNet). BPNet takes the shape of a pyramid: information flows from bottom (high-resolution, small receptive field) to top (low-resolution, large receptive field), and from top to bottom, in a systematic manner, at every step of the processing. More importantly, fusion needs to be efficient; this is done through an add-and-multiply module with learned weights. We also apply a unary-pairwise attention mechanism to balance position sensitivity and context aggregation. Auxiliary loss is applied at multiple steps of the pyramid bottom. The resulting network achieves high accuracy with efficiency, without the need of pre-training. On the standard Cityscapes dataset, we achieve test mIoU 76.3 with 5.1M parameters and 36 fps (on Nvidia 2080 Ti), competitive with state-of-the-art real-time models. Meanwhile, our design is general and can be used to build heavier networks: a ResNet-101 equivalent version of BPNet achieves mIoU 81.9 on Cityscapes, competitive with the best published results. We further demonstrate the flexibility of BPNet on a prostate MRI segmentation task, achieving the state of the art with a 45x speed-up. | Segmentation and Grouping | Big Data, Large Scale Methods; Deep Learning for Computer Vision | https://github.com/ginobilinie/BPNet | semantic segmentation; high resolution; feature fusion; attention; multi-scale | | | | 
51 | 826 | Oral | 3-B #09 | 3-D #10 | Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation | Alina Marcu (University "Politehnica" of Bucharest)*; Vlad Licaret (Autonomous Systems); Dragos Costea (University "Politehnica" of Bucharest); Marius Leordeanu (University "Politehnica" of Bucharest) | Semantic segmentation is a crucial task for robot navigation and safety. However, current supervised methods require a large amount of pixelwise annotations to yield accurate results. Labeling is a tedious and time consuming process that has hampered progress in low altitude UAV applications. This paper makes an important step towards automatic annotation by introducing SegProp, a novel iterative flow-based method, with a direct connection to spectral clustering in space and time, to propagate the semantic labels to frames that lack human annotations. The labels are further used in semi-supervised learning scenarios. Motivated by the lack of a large video aerial dataset, we also introduce Ruralscapes, a new dataset with high resolution (4K) images and manually annotated dense labels every 50 frames - the largest of its kind, to the best of our knowledge. Our novel SegProp automatically annotates the remaining unlabeled 98% of frames with an accuracy exceeding 90% (F-measure), significantly outperforming other state-of-the-art label propagation methods. Moreover, when integrating other methods as modules inside SegProp's iterative label propagation loop, we achieve a significant boost over the baseline labels. Finally, we test SegProp in a full semi-supervised setting: we train several state-of-the-art deep neural networks on the SegProp-automatically-labeled training frames and test them on completely novel videos. We convincingly demonstrate, every time, a significant improvement over the supervised scenario. | Deep Learning for Computer Vision | Datasets and Performance Analysis; Robot Vision; Segmentation and Grouping | semantic segmentation; semi-supervised learning; video scene understanding; label propagation; spectral clustering; UAVs; | |||||
52 | 829 | Oral | 1-D #08 | 2-B #08 | GAN-based Noise Model for Denoising Real Images | Linh Duy Tran (Teikyo University)*; Son Minh Nguyen (Teikyo University); Masayuki Arai (Teikyo Univ.) | In the present paper, we propose a new approach for realistic image noise modeling based on a generative adversarial network (GAN). The model aims to boost the performance of a deep network denoiser for real-world denoising. Although deep network denoisers, such as a denoising convolutional neural network, can achieve state-of-the-art results on synthetic noise, they perform poorly on real-world noisy images. To address this, we propose a two-step model. First, the images are converted to raw image data before adding noise. We then trained a GAN to estimate the noise distribution over a large collection of images (1 million). The estimated noise was used to train a deep neural network denoiser. Extensive experiments demonstrated that our new noise model achieves state-of-the-art performance on real raw images from the Smartphone Image Denoising Dataset benchmark. | Generative models for computer vision | Deep Learning for Computer Vision | deep learning; denoiser; generative network; real-world noisy images | | | | | 
53 | 839 | Oral | 2-A #11 | 2-D #09 | Mapping of Sparse 3D Data using Alternating Projection | Siddhant Ranade (University of Utah); Xin Yu (University of Utah); Shantnu Kakkar (Trimble); Pedro Miraldo (Instituto Superior Técnico, Lisboa); Srikumar Ramalingam (University of Utah)* | We propose a novel technique to register sparse 3D scans in the absence of texture. While existing methods such as KinectFusion or Iterative Closest Points (ICP) heavily rely on dense point clouds, this task is particularly challenging under sparse conditions without RGB data. Sparse texture-less data does not come with high-quality boundary signal, and this prohibits the use of correspondences from corners, junctions, or boundary lines. Moreover, in the case of sparse data, it is incorrect to assume that the same point will be captured in two consecutive scans. We take a different approach and first re-parameterize the point-cloud using a large number of line segments. In this re-parameterized data, there exists a large number of line intersection (and not correspondence) constraints that allow us to solve the registration task. We propose the use of a two-step alternating projection algorithm by formulating the registration as the simultaneous satisfaction of intersection and rigidity constraints. Despite the simplicity, the proposed approach outperforms other top-scoring algorithms on both Kinect and LiDAR datasets. In Kinect, we can use 100X downsampled sparse data and still outperform competing methods operating on full-resolution data. | 3D Computer Vision | Motion and Tracking; RGBD and Depth Image Processing | https://github.com/SiddhantRanade/SparseLiDAR | LiDAR, 3D registration, line intersection, generalized relative pose estimation | ||||
54 | 863 | Oral | 2-D #10 | 3-B #10 | FKAConv: Feature-Kernel Alignment for Point Cloud Convolution | Alexandre Boulch (valeo.ai)*; Gilles Puy (Valeo); Renaud Marlet (Ecole des Ponts ParisTech) | Recent state-of-the-art methods for point cloud semantic segmentation are based on convolutions defined for point clouds, and the interest in such convolutions goes beyond semantic segmentation. We propose a formulation of the convolution for point clouds directly inspired by the discrete convolution in image processing. The resulting formulation underlines the separation between the discrete kernel space and the geometric space where the points lie. Several existing methods fall under this formulation. The two spaces are linked with a space change matrix $\mathbf{A}$, estimated with a neural network. $\mathbf{A}$ softly assigns the input features to the convolution kernel. Finally, we show competitive results on several semantic segmentation benchmarks while being efficient both in computation time and memory. | Deep Learning for Computer Vision | 3D Computer Vision | https://github.com/valeoai/FKAConv | point cloud; convolution; semantic segmentation; neural networks; convolutional neural networks, CNN; 3D processing | | | | 
55 | 865 | Oral | 1-B #08 | 1-D #09 | Raw-Guided Enhancing Reprocess of Low-Light Image via Deep Exposure Adjustment | Haofeng Huang (Peking University)*; Wenhan Yang (Peking University); Yueyu Hu (Peking University); Jiaying Liu (Peking University) | Enhancement of images captured in low-light conditions remains a challenging problem even with advanced machine learning techniques. The challenges include the ambiguity of the ground truth for a low-light image and the loss of information during the RAW image processing. To tackle these problems, in this paper, we take a novel view and regard low-light image enhancement as an exposure time adjustment problem, proposing a corresponding explicit and mathematical definition. Based on that, we construct a RAW-Guiding exposure time adjustment Network (RGNET), which overcomes RGB images' nonlinearity and RAW images' inaccessibility. That is, RGNET is only trained with RGB images and corresponding RAW images, which helps project nonlinear RGB images into a linear domain, without using RAW images in the testing phase. Furthermore, our network consists of three individual sub-modules for unprocessing, reconstruction and processing, respectively. To the best of our knowledge, the proposed sub-net for unprocessing is the first learning-based unprocessing method. After the joint training of the three parts, each pre-trained separately with RAW image guidance, experimental results demonstrate that RGNET outperforms state-of-the-art low-light image enhancement methods. | Low-level Vision, Image Processing | Computational Photography, Sensing, and Display | Low light enhancement; RAW guidance; Deep learning; Exposure adjustment | | | | | 
56 | 872 | Oral | 1-D #10 | 2-B #09 | Depth-Adapted CNN for RGB-D cameras | Zongwei WU (Univ. Bourgogne Franche-Comte, France)*; Guillaume Allibert (Université Côte d’Azur, CNRS, I3S, France ); Christophe Stolz (Univ. Bourgogne Franche-Comte, France); Cedric Demonceaux (Univ. Bourgogne Franche-Comte, France) | Conventional 2D Convolutional Neural Networks (CNN) extract features from an input image by applying linear filters. These filters compute the spatial coherence by weighting the photometric information on a fixed neighborhood without taking into account the geometric information. We tackle the problem of improving the classical RGB CNN methods by using the depth information provided by RGB-D cameras. State-of-the-art approaches use depth as an additional channel or image (HHA) or pass from 2D CNNs to 3D CNNs. This paper proposes a novel and generic procedure to articulate both photometric and geometric information in CNN architecture. The depth data is represented as a 2D offset to adapt spatial sampling locations. The new model presented is invariant to scale and rotation around the X and Y axes of the camera coordinate system. Moreover, when depth data is constant, our model is equivalent to a regular CNN. Experiments on benchmarks validate the effectiveness of our model. | Deep Learning for Computer Vision | RGBD and Depth Image Processing | Geometry in CNN; Generic Model | | | | | 
57 | 904 | Oral | 1-C #09 | 3-A #10 | Sparse Convolutions on Continuous Domains for Point Cloud and Event Stream Networks | Dominic Jack (Queensland University of Technology)*; Frederic Maire (Queensland University of Technology); SIMON DENMAN (Queensland University of Technology, Australia); Anders Eriksson (University of Queensland ) | Image convolutions have been a cornerstone of a great number of deep learning advances in computer vision. However, the research community has yet to settle on an equivalent operator for sparse, unstructured continuous data like point clouds and event streams. We present an elegant sparse matrix-based interpretation of the convolution operator for these cases, which is consistent with the mathematical definition of convolution and efficient during training. On benchmark point cloud classification problems we demonstrate that networks built with these operations can train an order of magnitude or more faster than top existing methods, whilst maintaining comparable accuracy and requiring a tiny fraction of the memory. We also apply our operator to event stream processing, achieving state-of-the-art results on multiple tasks with streams of hundreds of thousands of events. | Deep Learning for Computer Vision | 3D Computer Vision | https://github.com/jackd/sccd | convolution; point clouds; event cameras; event streams; | | | | 
58 | 914 | Oral | 1-D #11 | 2-B #10 | Unified Application of Style Transfer for Face Swapping and Reenactment | Le Minh Ngo (University of Amsterdam)*; Christian aan de Wiel (3DUniversum); Sezer Karaoglu (University of Amsterdam); Theo Gevers (University of Amsterdam) | Face reenactment and face swap have gained a lot of attention due to their broad range of applications in computer vision. Although both tasks share similar objectives (e.g. manipulating expression and pose), existing methods do not explore the benefits of combining these two tasks. In this paper, we introduce a unified end-to-end pipeline for face swapping and reenactment. We propose a novel approach to isolated disentangled representation learning of specific visual attributes in an unsupervised manner. A combination of the proposed training losses allows us to synthesize results in a one-shot manner. The proposed method does not require subject-specific training. We compare our method against state-of-the-art methods for multiple public datasets of different complexities. The proposed method outperforms other SOTA methods in terms of producing realistic-looking face images. | Face, Pose, Action, and Gesture | Applications of Computer Vision, Vision for X; Deep Learning for Computer Vision; Generative models for computer vision | face swap; face reenactment; face; generative adversarial networks; | | | | | 
59 | 916 | Oral | 1-A #12 | 2-B #11 | Self-supervised Learning of Orc-Bert Augmentator for Recognizing Few-Shot Oracle Characters | Wenhui Han (Fudan University); Xinlin Ren (Fudan University); Hangyu Lin (Fudan University); Yanwei Fu (Fudan University)*; Xiangyang Xue (Fudan University) | This paper studies the recognition of oracle characters, the earliest known hieroglyphs in China. Essentially, oracle character recognition suffers from the problem of data limitation and imbalance. Recognizing oracle characters from extremely limited samples should naturally be treated as a few-shot learning task. Different from the standard few-shot learning setting, our model only has access to large-scale unlabeled source Chinese characters and a few labeled oracle characters. In such a setting, meta-based or metric-based few-shot methods cannot be efficiently trained on the unlabeled source data; thus the only possible methodologies are self-supervised learning and data augmentation. Unfortunately, conventional geometric augmentation always applies the same global transformations to all samples in pixel format, without considering the diversity of each part within a sample. Moreover, to the best of our knowledge, there is no effective self-supervised learning method for few-shot learning. To this end, this paper integrates the idea of self-supervised learning into data augmentation. We propose a novel data augmentation approach, named Orc-Bert Augmentor and pre-trained by self-supervised learning, for few-shot oracle character recognition. Specifically, Orc-Bert Augmentor leverages a self-supervised BERT model pre-trained on large unlabeled Chinese character datasets to generate sample-wise augmented samples. Given a masked input in vector format, Orc-Bert Augmentor can recover it and then output a pixel format image as augmented data. Different mask proportions bring diverse reconstructed outputs. Concatenated with Gaussian noise, the model further performs point-wise displacement to improve diversity. Experimentally, we collect two large-scale datasets of oracle characters and other ancient Chinese characters for few-shot oracle character recognition and Orc-Bert Augmentor pre-training. Extensive experiments on few-shot learning demonstrate the effectiveness of our Orc-Bert Augmentor in improving the performance of various networks on few-shot oracle character recognition. | Deep Learning for Computer Vision | Applications of Computer Vision, Vision for X; Datasets and Performance Analysis | https://github.com/wenhui-han/Oracle-50K.git | oracle character recognition; few-shot learning; data augmentation; self-supervised learning | | | | 
60 | 938 | Oral | 1-B #09 | 3-D #11 | Generic Image Segmentation in Fully Convolutional Networks by Superpixel Merging Map | Jin-Yu Huang (National Taiwan University); Jian-Jiun Ding (National Taiwan University)* | Recently, the Fully Convolutional Network (FCN) has been adopted in image segmentation. However, existing FCN-based segmentation algorithms were designed for semantic segmentation. Before learning-based algorithms were developed, many advanced generic segmentation algorithms were superpixel-based. However, due to the irregular shape and size of superpixels, it is hard to apply deep learning to superpixel-based image segmentation directly. In this paper, we combined the merits of the FCN and superpixels and proposed a highly accurate and extremely fast generic image segmentation algorithm. We treated image segmentation as multiple superpixel-merging decision problems and determined whether the boundary between two adjacent superpixels should be kept. In other words, if the boundary of two adjacent superpixels should be deleted, then the two superpixels will be merged. The network applies the colors, the edge map, and the superpixel information to make decisions about merging superpixels. By solving all the superpixel-merging subproblems with just one forward pass, the FCN speeds up the whole segmentation process by a wide margin while gaining higher accuracy. Simulations show that the proposed algorithm has favorable runtime while achieving highly accurate segmentation results. It outperforms state-of-the-art image segmentation methods, including feature-based and learning-based methods, in all metrics. | Segmentation and Grouping | Deep Learning for Computer Vision; Robot Vision | https://drive.google.com/drive/folders/1NcEsdGh7OkuyTJk9Kx_U4N33f‐BIglRP?usp= sharing | image segmentation; superpixel; generic image segmentation; deep learning; edge information | | | | 
61 | 940 | Oral | 2-D #11 | 3-B #11 | EvolGAN: Evolutionary Generative Adversarial Networks | Baptiste Roziere (Facebook AI Research); Fabien Teytaud (Univ. Littoral Cote d'Opale); Vlad Hosu (University of Konstanz); Hanhe Lin (University of Konstanz); Jeremy Rapin (Facebook AI Research); Mariia Zameshina (Inria); Olivier Teytaud (Facebook)* | We propose to use a quality estimator and evolutionary methods to search the latent space of generative adversarial networks trained on small, difficult datasets, or both. The new method leads to the generation of significantly higher quality images while preserving the original generator’s diversity. Human raters preferred an image from the new version with frequency 83.7% for Cats, 74% for FashionGen, 70.4% for Horses, and 69.2% for Artworks - minor improvements for the already excellent GANs for faces. This approach applies to any quality scorer and GAN generator. | Generative models for computer vision | Optimization Methods | GAN; Image Generation; Generative Networks; Evolutionary Methods; IQA; Image Quality Assessment | | | | | 
62 | 962 | Oral | 2-C #09 | 3-A #11 | AFN: Attentional Feedback Network based 3D Terrain Super-Resolution | Ashish Kubade (International Institute Of Information Technology, Hyderabad)*; Diptiben Patel (IIIT Hyderabad); Avinash Sharma (CVIT, IIIT-Hyderabad); K. S. Rajan (IIIT Hyderabad) | Terrain, representing features of an earth surface, plays a crucial role in many applications such as simulations, route planning, analysis of surface dynamics, computer graphics-based games, entertainment, films, to name a few. With recent advancements in digital technology, these applications demand the presence of high-resolution details in the terrain. In this paper, we propose a novel fully convolutional neural network based super-resolution architecture to increase the resolution of a low-resolution Digital Elevation Model (LRDEM) with the help of information extracted from the corresponding aerial image as a complementary modality. We perform the super-resolution of the LRDEM using an attention-based feedback mechanism named ‘Attentional Feedback Network’ (AFN), which selectively fuses the information from the LRDEM and the aerial image to enhance and infuse the high-frequency features and to produce the terrain realistically. We compare the proposed architecture with existing state-of-the-art DEM super-resolution methods and show that it outperforms them, enhancing the resolution of the input LRDEM accurately and realistically. | 3D Computer Vision | Deep Learning for Computer Vision; RGBD and Depth Image Processing | https://github.com/ashj9/AFN | Super-resolution, Digital Elevation Models, Feedback Network, Attentional Feedback, GIS, AFN | | | | 
63 | 981 | Oral | 2-D #12 | 3-B #12 | L2R GAN: LiDAR-to-Radar Translation | LeiChen Wang (Daimler AG)*; Bastian Goldluecke (University of Konstanz); Carsten Anklam (Daimler AG) | The lack of annotated public radar datasets causes difficulties for research in environmental perception from radar observations. In this paper, we propose a novel neural network based framework which we call L2R GAN to generate the radar spectrum of natural scenes from a given LiDAR point cloud. We adapt ideas from existing image-to-image translation GAN frameworks, which we investigate as a baseline for translating radar spectrum images from a given LiDAR bird’s eye view (BEV). However, for our application, we identify several shortcomings of existing approaches. As a remedy, we learn radar data generation with an occupancy-grid-mask as a guidance, and further design a set of local region generators and discriminator networks. This allows our L2R GAN to combine the advantages of global image features and local region detail, and not only learn the cross-modal relations between LiDAR and radar at large scale, but also refine details at small scale. Qualitative and quantitative comparisons show that L2R GAN outperforms previous GAN architectures with respect to details by a large margin. An L2R-GAN-based GUI also allows users to define and generate radar data of special emergency scenarios to test corresponding ADAS applications such as Pedestrian Collision Warning (PCW). | Robot Vision | Applications of Computer Vision, Vision for X; Computational Photography, Sensing, and Display; Deep Learning for Computer Vision | GAN; deep learning; sensor fusion; autonomous vehicle | | | | | 
64 | 994 | Oral | 1-B #10 | 1-D #12 | Webly Supervised Semantic Embeddings for Large Scale Zero-Shot Learning | Yannick Le Cacheux (CEA LIST)*; Adrian Popescu (CEA LIST); Herve Le Borgne (CEA LIST) | Zero-shot learning (ZSL) makes object recognition in images possible in the absence of visual training data for a part of the classes from a dataset. When the number of classes is large, classes are usually represented by semantic class prototypes learned automatically from unannotated text collections. This typically leads to much lower performances than with manually designed semantic prototypes such as attributes. While most ZSL works focus on the visual aspect and reuse standard semantic prototypes learned from generic text collections, we focus on the problem of semantic class prototype design for large scale ZSL. More specifically, we investigate the use of noisy textual metadata associated to photos as text collections, as we hypothesize they are likely to provide more plausible semantic embeddings for visual classes if exploited appropriately. We thus make use of a source-based filtering strategy to improve the robustness of semantic prototypes. Evaluation on the large scale ImageNet dataset shows a significant improvement in ZSL performances over two strong baselines, and over usual semantic embeddings used in previous works. We show that this improvement is obtained for several embedding methods, leading to state-of-the-art results when one uses automatically created visual and text features. | Datasets and Performance Analysis | Big Data, Large Scale Methods; Deep Learning for Computer Vision | https://github.com/yannick-lc/semantic-embeddings-zsl | large scale zero-shot learning; webly supervised learning; semantic class prototypes | | | | 
65 | 8 | Poster | 2-D #01 | 3-B #01 | Self-Guided Multiple Instance Learning for Weakly Supervised Thoracic Disease Classification and Localization in Chest Radiographs | Constantin Seibold (Karlsruhe Institute of Technology)*; Jens Kleesiek (German Cancer Research Center); Heinz-Peter Schlemmer (German Cancer Research Center); Rainer Stiefelhagen (Karlsruhe Institute of Technology) | Due to the high complexity of medical images and the scarcity of trained personnel, most large-scale radiological datasets lack fine-grained annotations and are often only described on the image level. These shortcomings hinder the deployment of automated diagnosis systems, which require human-interpretable justification for their decision process. In this paper, we address the problem of weakly supervised identification and localization of abnormalities in chest radiographs in a multiple-instance learning setting. To that end, we introduce a novel loss function for training convolutional neural networks that increases localization confidence and assists the overall disease identification. The loss leverages both image- and patch-level predictions to generate auxiliary supervision and enables specific training at patch-level. Rather than forming strictly binary targets from the predictions as done in previous loss formulations, we create targets in a more customized manner. This way, the loss accounts for possible misclassification of less certain instances. We show that the supervision provided within the proposed learning scheme leads to better performance and more precise predictions on prevalent datasets for multiple-instance learning as well as on the NIH ChestX-Ray14 benchmark for disease recognition than previously used losses. | Biomedical Image Analysis | Deep Learning for Computer Vision | https://github.com/ConstantinSeibold/SGL | Multiple Instance Learning; Chest X-Ray disease classification; Localization; Medical Image Analysis | | | | 
66 | 10 | Poster | 1-A #01 | 3-C #01 | RealSmileNet: A Deep End-To-End Network for Spontaneous and Posed Smile Recognition | Yan Yang (Australian National University)*; Md Zakir Hossain (The Australian National University ); Tom Gedeon (The Australian National University); Shafin Rahman (North South University) | Smiles play a vital role in the understanding of social interactions within different communities, and reveal the physical state of mind of people in both real and deceptive ways. Several methods have been proposed to recognize spontaneous and posed smiles. All follow a feature-engineering based pipeline requiring costly pre-processing steps such as manual annotation of face landmarks, tracking, segmentation of smile phases, and hand-crafted features. The resulting computation is expensive, and strongly dependent on pre-processing steps. We investigate an end-to-end deep learning model to address these problems, the first end-to-end model for spontaneous and posed smile recognition. Our fully automated model is fast and learns the feature extraction processes by training a series of convolution and ConvLSTM layers from scratch. Our experiments on four datasets demonstrate the robustness and generalization of the proposed model by achieving state-of-the-art performance. | Deep Learning for Computer Vision | Face, Pose, Action, and Gesture | deep learning; spontaneous smile recognition; | | | | | 
67 | 18 | Poster | 1-C #01 | 2-A #01 | Anatomy and Geometry Constrained One-Stage Framework for 3D Human Pose Estimation | Xin Cao (Shanghai JiaoTong University); Xu Zhao (Shanghai Jiao Tong University)* | Although significant progress has been achieved in monocular 3D human pose estimation, the correlation between body parts and cross-view geometry consistency have not been well studied. In this work, to fully explore the priors on body structure and view-relationship for 3D human pose estimation, we propose an anatomy and geometry constrained one-stage framework. First of all, we define a kinematic structure model in a deep learning framework which represents the joint positions in a tree-structure model. Then we propose bone-length and bone-symmetry losses based on the anatomy prior, to encode the body structure information. To further explore the cross-view geometry information, we introduce a novel training mechanism for multi-view consistency constraints, which effectively reduces unnatural and implausible estimation results. The proposed approach achieves state-of-the-art results on both Human3.6M and MPI-INF-3DHP data sets. | 3D Computer Vision | Face, Pose, Action, and Gesture | https://github.com/sjtu-cx/AG_Pose | 3D pose estimation;anatomy prior;multi-view consistency constraint | | | | 
68 | 19 | Poster | 1-C #02 | 3-A #01 | Imbalance Robust Softmax for Deep Embedding Learning | Hao Zhu (Australian National University)*; Yang Yuan (AnyVision); Guosheng Hu (AnyVision); Xiang Wu (Reconova); Neil Robertson (Queen's University Belfast) | Deep embedding learning is expected to learn a metric space in which features have smaller maximal intra-class distance than minimal inter-class distance. In recent years, one research focus has been to solve the open-set problem by discriminative deep embedding learning in the field of face recognition (FR) and person re-identification (re-ID). Apart from the open-set problem, we find that imbalanced training data is another main factor causing the performance degradation of FR and re-ID, and data imbalance widely exists in real applications. However, very little research explores why and how data imbalance influences the performance of FR and re-ID. In this work, we deeply investigate data imbalance from the perspective of neural network optimisation and feature distribution. We find that one main reason for the performance degradation caused by data imbalance is that the weights (from the penultimate fully-connected layer) are far from their class centers in feature space. Based on this investigation, we propose a unified framework, Imbalance-Robust Softmax (IR-Softmax), which can simultaneously solve the open-set problem and reduce the influence of data imbalance. IR-Softmax can generalise to any softmax and its variants (which are discriminative for the open-set problem) by directly setting the weights as their class centers, naturally solving the data imbalance problem. In this work, we explicitly re-formulate two discriminative softmax variants (A-Softmax and AM-Softmax) under the framework of IR-Softmax. We conduct extensive experiments on FR databases (LFW, MegaFace) and re-ID databases (Market-1501, Duke), and IR-Softmax outperforms many state-of-the-art methods. | Face, Pose, Action, and Gesture | Deep Learning for Computer Vision | Face Recognition; Person Re-identification; Imbalance Data; Metric Learning; Softmax | | | | | 
69 | 20 | Poster | 1-A #02 | 3-C #02 | IAFA: Instance-Aware Feature Aggregation for 3D Object Detection from a Single Image | Dingfu Zhou (Baidu)*; Xibin Song (Baidu); Yuchao Dai (Northwestern Polytechnical University); Junbo Yin (Beijing Institute of Technology); Feixiang Lu (Baidu); Miao Liao (Baidu); Jin Fang (Baidu ); Liangjun Zhang (Baidu) | 3D object detection from a single image is an important task in Autonomous Driving (AD), where various approaches have been proposed. However, the task is intrinsically ambiguous and challenging as single image depth estimation is already an ill-posed problem. In this paper, we propose an instance-aware approach to aggregate useful information for improving the accuracy of 3D object detection with the following contributions. First, an instance-aware feature aggregation (IAFA) module is proposed to collect local and global features for 3D bounding boxes regression. Second, we empirically find that the spatial attention module can be well learned by taking coarse-level instance annotations as a supervision signal. The proposed module has significantly boosted the performance of the baseline method on both 3D detection and 2D bird's-eye-view vehicle detection across all three categories. Third, our proposed method outperforms all single image-based approaches (even those methods trained with depth as auxiliary input) and achieves state-of-the-art 3D detection performance on the KITTI benchmark. | Deep Learning for Computer Vision | 3D Computer Vision; Applications of Computer Vision, Vision for X | 3D Object Detection; Single Frame; Autonomous Driving | | | | | 
70 | 27 | Poster | 1-A #03 | 3-C #03 | Synthetic-to-Real Unsupervised Domain Adaptation for Scene Text Detection in the Wild | weijia wu (Zhejiang University)*; Ning Lu (Tencent Cloud Product Department); Enze Xie (The University of Hong Kong); Yuxing Wang (Zhejiang University); Wenwen Yu (Xuzhou Medical University); Cheng Yang (Zhejiang University); HONG ZHOU (Zhejiang University) | Deep learning-based scene text detection can achieve preferable performance when powered with sufficient labeled training data. However, manual labeling is time-consuming and laborious. In extreme cases, the corresponding annotated data are unavailable. Exploiting synthetic data is a very promising solution, except for the domain distribution mismatch between synthetic datasets and real datasets. To address the severe domain distribution mismatch, we propose a synthetic-to-real domain adaptation method for scene text detection, which transfers knowledge from synthetic data (source domain) to real data (target domain). In this paper, a text self-training (TST) method and adversarial text instance alignment (ATA) for domain adaptive scene text detection are introduced. ATA helps the network learn domain-invariant features by training a domain classifier in an adversarial manner. TST diminishes the adverse effects of false positives (FPs) and false negatives (FNs) from inaccurate pseudo-labels. The two components have positive effects on improving the performance of scene text detectors when adapting from synthetic-to-real scenes. We evaluate the proposed method by transferring from SynthText and VISD to ICDAR2015 and ICDAR2013. The results demonstrate the effectiveness of the proposed method, with up to 10% improvement, which has important exploration significance for domain adaptive scene text detection. | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | Deep Learning for Computer Vision | https://github.com/weijiawu/SyntoReal_STD | scene text detection; domain adaptation; self training; stroke width transform; feature alignment | | | | 
71 | 30 | Poster | 2-D #02 | 3-B #02 | Play Fair: Frame Contributions in Video Models | Will Price (University of Bristol)*; Dima Damen (University of Bristol) | In this paper, we introduce an attribution method for explaining action recognition models. Such models fuse information from multiple frames within a video, through score aggregation or relational reasoning. We break down a model’s class score into the sum of contributions from each frame, fairly. Our method adapts an axiomatic solution to fair reward distribution in cooperative games, known as the Shapley value, for elements in a variable-length sequence, which we call the Element Shapley Value (ESV). Critically, we propose a tractable approximation of ESV that scales linearly with the number of frames in the sequence.We employ ESV to explain two action recognition models (TRN and TSN) on the fine-grained dataset Something-Something. We offer detailed analysis of supporting/distracting frames, and the relationships of ESVs to the frame’s position, class prediction, and sequence length. We compare ESV to naive baselines and two commonly used attribution methods: Grad-CAM and Integrated-Gradients. | Video Analysis and Event Recognition | Datasets and Performance Analysis | https://github.com/willprice/play-fair/ | explainability; video understanding; shapley values; frame attribution; action recognition | ||||
72 | 31 | Poster | 1-A #04 | 3-B #03 | Speech2Video Synthesis with 3D Skeleton Regularization and Expressive Body Poses | Miao Liao (Baidu)*; Sibo Zhang (Baidu); Peng Wang (Baidu USA LLC.); Hao Zhu (Nanjing University); Xinxin Zuo (University of Kentucky); Ruigang Yang (University of Kentucky, USA) | In this paper, we propose a novel approach to convert a given speech audio to a photo-realistic speaking video of a specific person, where the output video has synchronized, realistic, and expressively rich body dynamics. We achieve this by first generating 3D skeleton movements from the audio sequence using a recurrent neural network (RNN), and then synthesizing the output video via a conditional generative adversarial network (GAN). To make the skeleton movement realistic and expressive, we embed the knowledge of an articulated 3D human skeleton and a learned dictionary of personal speech iconic gestures into the generation process in both the learning and testing pipelines. The former prevents the generation of unreasonable body distortion, while the latter helps our model quickly learn meaningful body movement through a few recorded videos. To produce photo-realistic and high-resolution video with motion details, we propose to insert part attention mechanisms into the conditional GAN, where each detailed part, e.g., the head and hands, is automatically zoomed in to have its own discriminator. To validate our approach, we collect a dataset with 20 high-quality videos from 1 male and 1 female model reading various documents under different topics. Compared with previous SoTA pipelines handling similar tasks, our approach achieves better results in a user study. | Face, Pose, Action, and Gesture | Generative models for computer vision | https://github.com/sibozhang/Speech2Video | Body Pose; Image Synthesis; Face; Gesture; Vision and Language; Audio; GAN; Video Synthesis; Human Simulation; 3D | ||||
73 | 36 | Poster | 1-B #01 | 3-D #01 | Transforming Multi-Concept Attention into Video Summarization | Yen-Ting Liu (National Taiwan University)*; Yu-Jhe Li (Carnegie Mellon University); Yu-Chiang Frank Wang (National Taiwan University) | Video summarization is a challenging task in computer vision, which aims at identifying highlight frames or shots over a lengthy video input. In this paper, we propose a novel attention-based framework for video summarization with complex video data. Unlike previous works which only apply the attention mechanism to the correspondence between frames, our multi-concept video self-attention (MC-VSA) model is presented to identify informative regions across temporal and concept video features, jointly exploiting context diversity over time and space for summarization purposes. Together with the consistency between video and summary enforced in our framework, our model can be applied to both labeled and unlabeled data, making our method preferable for real-world applications. Extensive experiments on two benchmarks demonstrate the effectiveness of our model both quantitatively and qualitatively, and confirm its superiority over the state of the art. | Video Analysis and Event Recognition | Deep Learning for Computer Vision | | video summarization; self-attention; video recognition ||||
74 | 39 | Poster | 3-A #02 | 3-C #04 | 3D Guided Weakly Supervised Semantic Segmentation | Weixuan Sun (Australian National University, Data61)*; Jing Zhang (Australian National University); Nick Barnes (ANU) | Pixel-wise clean annotation is necessary for fully-supervised semantic segmentation, but it is laborious and expensive to obtain. In this paper, we propose a weakly supervised 2D semantic segmentation model by incorporating sparse bounding box labels with available 3D information, which is much easier to obtain with advanced sensors. We manually labeled a subset of the 2D-3D Semantics (2D-3D-S) dataset with bounding boxes, and introduce our 2D-3D inference module to generate accurate pixel-wise segment proposal masks. Guided by 3D information, we first generate a point cloud of objects and calculate an objectness probability score for each point. Then we project the point cloud with objectness probabilities back to 2D images, followed by a refinement step to obtain segment proposals, which are treated as pseudo labels to train a semantic segmentation network. Our method works in a recursive manner to gradually refine the above-mentioned segment proposals. Extensive experimental results on the 2D-3D-S dataset show that the proposed method can generate accurate segment proposals when bounding box labels are available on only a small subset of training images. Performance comparison with recent state-of-the-art methods further illustrates the effectiveness of our method. | Segmentation and Grouping | | | semantic segmentation; weak supervision; 3D guidance ||||
75 | 41 | Poster | 2-B #01 | 3-D #02 | Visual Tracking by TridentAlign and Context Embedding | Janghoon Choi (Seoul National University)*; Junseok Kwon (Chung-Ang Univ., Korea); Kyoung Mu Lee (Seoul National University) | Recent advances in Siamese network-based visual tracking methods have enabled high performance on numerous tracking benchmarks. However, extensive scale variations of the target object and distractor objects with similar categories have consistently posed challenges in visual tracking. To address these persisting issues, we propose novel TridentAlign and context embedding modules for Siamese network-based visual tracking methods. The TridentAlign module facilitates adaptability to extensive scale variations and large deformations of the target, where it pools the feature representation of the target object into multiple spatial dimensions to form a feature pyramid, which is then utilized in the region proposal stage. Meanwhile, the context embedding module aims to discriminate the target from distractor objects by accounting for the global context information among objects. The context embedding module extracts and embeds the global context information of a given frame into a local feature representation such that the information can be utilized in the final classification stage. Experimental results obtained on multiple benchmark datasets show that the performance of the proposed tracker is comparable to that of state-of-the-art trackers, while the proposed tracker runs at real-time speed. | Motion and Tracking | Deep Learning for Computer Vision | https://github.com/JanghoonChoi/TACT | visual tracking; object tracking; tracking; | ||||
76 | 50 | Poster | 1-C #03 | 3-A #03 | Frequency Attention Network: Blind Noise Removal for Real Images | Hongcheng Mo (Shanghai Jiao Tong University); Jianfei Jiang (Shanghai Jiao Tong University); Qin Wang (Shanghai Jiao Tong University)*; Dong Yin (Fullhan); Pengyu Dong (Fullhan); Jingjun Tian (Fullhan) | With outstanding feature extraction capabilities, deep convolutional neural networks (CNNs) have achieved extraordinary improvements in image denoising tasks. However, because of the difference in statistical characteristics between signal-dependent noise and signal-independent noise, it is hard to model real noise for training, and blind real image denoising remains an important and challenging problem. In this work, we propose a method for blind image denoising that combines frequency domain analysis and an attention mechanism, named the frequency attention network (FAN). We adopt the wavelet transform to convert images from the spatial domain to the frequency domain, whose sparser features allow us to utilize spectrum information and structure information. For the denoising task, the objective of the neural network is to estimate the optimal solution of the wavelet coefficients of the clean image through its nonlinear characteristics, which gives FAN good interpretability. Meanwhile, spatial and channel attention mechanisms are employed to enhance feature maps at different scales for capturing contextual information. Extensive experiments on a synthetic noise dataset and two real-world noise benchmarks indicate the superiority of our method over competing methods across different noise types in blind image denoising. | Low-level Vision, Image Processing | Deep Learning for Computer Vision | https://github.com/momo1689/FAN | image denoising; wavelet; attention mechanism; UNet; CNN | ||||
77 | 73 | Poster | 1-B #02 | 3-D #03 | Exploiting Transferable Knowledge for Fairness-aware Image Classification | Sunhee Hwang (Yonsei University)*; Sungho Park (Yonsei University); Pilhyeon Lee (Yonsei University); Seogkyu Jeon (Yonsei University); Dohyung Kim (Yonsei University); Hyeran Byun (Yonsei University) | Recent studies have revealed the importance of fairness in machine learning and computer vision systems, in accordance with concerns about the unintended social discrimination produced by such systems. In this work, we aim to tackle the fairness-aware image classification problem, whose goal is to classify a target attribute (e.g., attractiveness) in a fair manner regarding protected attributes (e.g., gender, age, race). To achieve this, existing methods mainly rely on protected attribute labels for training, which are costly and sometimes unavailable in real-world scenarios. To alleviate this restriction and enlarge the scalability of fair models, we introduce a new framework where a fair classification model can be trained on datasets without protected attribute labels (i.e., target datasets) by exploiting knowledge from pre-built benchmarks (i.e., source datasets). Specifically, when training a target attribute encoder, we encourage its representations to be independent of the features from the pre-trained encoder on a source dataset. Moreover, we design a Group-wise Fair loss to minimize the gap in error rates between different protected attribute groups. To the best of our knowledge, this work is the first attempt to train a fairness-aware image classification model on a target dataset without protected attribute annotations. To verify the effectiveness of our approach, we conduct experiments on the CelebA and UTK datasets with two settings: the conventional and the transfer settings. In both settings, our model shows the fairest results when compared to the existing methods. | Deep Learning for Computer Vision | | | fairness in computer vision; equality of opportunity; face attribute classification ||||
78 | 75 | Poster | 3-B #04 | 3-D #04 | Adversarial Semi-Supervised Multi-Domain Tracking | Kourosh Meshgi (RIKEN AIP)*; Maryam Sadat Mirzaei (Riken AIP / Kyoto University) | Neural networks for multi-domain learning empower an effective combination of information from different domains by sharing and co-learning the parameters. In visual tracking, the emerging features in shared layers of a multi-domain tracker, trained on various sequences, are crucial for tracking the target in unseen videos. Yet, in a fully shared architecture, some of the emerging features are useful only in a specific domain, reducing the generalization of the learned feature representation. We propose a semi-supervised learning scheme to separate domain-invariant and domain-specific features using adversarial learning, to encourage mutual exclusion between them, and to leverage self-supervised learning for enhancing the shared features using the unlabeled reservoir. By employing these features and training dedicated layers for each sequence, we build a tracker that performs exceptionally well on different types of videos. | Motion and Tracking | Deep Learning for Computer Vision; Video Analysis and Event Recognition | http://ishiilab.jp/member/meshgi-k/asmt.html | visual tracking; multi-domain learning; adversarial training; self-supervised representation learning; | ||||
79 | 76 | Poster | 1-A #05 | 3-B #05 | DeepVoxels++: Enhancing the Fidelity of Novel View Synthesis from 3D Voxel Embeddings | Tong He (UCLA)*; John Collomosse (Adobe Research); Hailin Jin (Adobe Research); Stefano Soatto (UCLA) | We present a novel view synthesis method based upon latent voxel embeddings of an object, which encode both shape and appearance information and are learned without explicit 3D occupancy supervision. Our method uses an encoder-decoder architecture to learn such deep volumetric representations from a set of images taken at multiple viewpoints. Compared with DeepVoxels, our DeepVoxels++ applies a series of enhancements: a) a patch-based image feature extraction and neural rendering scheme that learns local shape and texture patterns, and enables neural rendering at high resolution; b) learned view-dependent feature transformation kernels to explicitly model perspective transformations induced by viewpoint changes; c) a recurrent-concurrent aggregation technique to alleviate the single-view update bias of the recurrent learning process for the voxel embeddings. Combined with d) a simple yet effective implementation trick of sufficient frustum-representation sampling, we achieve improved visual quality over prior deep voxel-based methods (33% SSIM error reduction and 22% PSNR improvement) on 360-degree novel-view synthesis benchmarks. | 3D Computer Vision | Generative models for computer vision | https://tonghehehe.com/deepvoxelspp | novel-view synthesis; latent voxel embeddings; neural rendering; local patterns | ||||
80 | 82 | Poster | 1-B #03 | 1-D #01 | CPTNet: Cascade Pose Transform Network for Single Image Talking Head Animation | Jiale Zhang (Huazhong University of Science and Technology); Ke Xian (Huazhong University of Science and Technology); Chengxin Liu (Huazhong University of Science and Technology)*; Yinpeng Chen (Huazhong University of Science and Technology); Zhiguo Cao (Huazhong Univ. of Sci.&Tech.); Weicai Zhong (Huawei CBG Consumer Cloud Service Big Data Platform Dept.) | We study the problem of talking head animation from a single image. Most of the existing methods focus on generating talking heads for humans. However, little attention has been paid to the creation of talking head anime. In this paper, our goal is to synthesize vivid talking heads from a single anime image. To this end, we propose a cascade pose transform network, termed CPTNet, that consists of a face pose transform network and a head pose transform network. Specifically, we introduce a mask generator to animate facial expressions (e.g., closing eyes and opening the mouth) and a grid generator for head movement animation, followed by a fusion module to generate talking heads. In order to handle large motion and obtain more accurate results, we design a pose vector decomposition and cascaded refinement strategy. In addition, we create an anime talking head dataset, which includes various anime characters and poses, to train our model. Extensive experiments on our dataset demonstrate that our model outperforms other methods, generating more accurate and vivid talking heads from a single anime image. | Generative models for computer vision | | | talking head animation; pose transform; pose decomposition; cascade refinement ||||
81 | 85 | Poster | 2-B #02 | 3-D #05 | Multi-View Consistency Loss for Improved Single-Image 3D Reconstruction of Clothed People | Akin Caliskan (Center for Vision Speech and Signal Processing - University of Surrey)*; Armin Mustafa (University of Surrey); Evren Imre (Vicon); Adrian Hilton (University of Surrey) | We present a novel method to improve the accuracy of the 3D reconstruction of clothed human shape from a single image. Recent work has introduced volumetric, implicit and model-based shape learning frameworks for reconstruction of objects and people from one or more images. However, the accuracy and completeness of reconstruction of clothed people are limited due to the large variation in shape resulting from clothing, hair, body size, pose and camera viewpoint. This paper introduces two advances to overcome this limitation: firstly, a new synthetic dataset of realistic clothed people, 3DVH; and secondly, a novel multiple-view loss function for training of monocular volumetric shape estimation, which is demonstrated to significantly improve generalisation and reconstruction accuracy. The 3DVH dataset of realistic clothed 3D human models rendered with diverse natural backgrounds is demonstrated to allow transfer to reconstruction from real images of people. Comprehensive comparative performance evaluation on both synthetic and real images of people demonstrates that the proposed method significantly outperforms the previous state-of-the-art learning-based single-image 3D human shape estimation approaches, achieving clear improvements in reconstruction accuracy, completeness, and quality. An ablation study shows that this is due to both the proposed multiple-view training and the new 3DVH dataset. The code and the dataset can be found at the project website: https://akincaliskan3d.github.io/MV3DH/. | 3D Computer Vision | Datasets and Performance Analysis | https://akincaliskan3d.github.io/MV3DH/ | 3D Reconstruction; Human Modelling; 3D Human Reconstruction; Dataset | ||||
82 | 100 | Poster | 1-D #02 | 2-B #03 | CLASS: Cross-Level Attention and Supervision for Salient Objects Detection | Lv Tang (Nanjing University)*; Bo Li (Nanjing University) | Salient object detection (SOD) is a fundamental computer vision task. Recently, with the revival of deep neural networks, SOD has made great progress. However, there still exist two thorny issues that cannot be well addressed by existing methods: indistinguishable regions and complex structures. To address these two issues, in this paper we propose a novel deep network for accurate SOD, named CLASS. First, in order to leverage the different advantages of low-level and high-level features, we propose a novel non-local cross-level attention (CLA), which can capture long-range feature dependencies to enhance the distinction of complete salient objects. Second, a novel cross-level supervision (CLS) is designed to learn complementary context for complex structures through pixel-level, region-level and object-level supervision. Then the fine structures and boundaries of salient objects can be well restored. In experiments, with the proposed CLA and CLS, our CLASS net consistently outperforms 13 state-of-the-art methods on five datasets. | Deep Learning for Computer Vision | Recognition: Feature Detection, Indexing, Matching, and Shape Representation; Segmentation and Grouping | https://github.com/luckybird1994/classnet | Salient Object Detection; Cross-level Attention; Cross-level Supervision; Deep Learning | ||||
83 | 102 | Poster | 1-A #06 | 1-B #04 | MLIFeat: Multi-level information fusion based deep local features | Yuyang Zhang (Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences); Jinge Wang (Megvii); Shibiao Xu (Institute of Automation, Chinese Academy of Sciences)*; Xiao Liu (Megvii Inc); Xiaopeng Zhang (Institute of Automation, Chinese Academy of Sciences) | Accurate image keypoint detection and description are of central importance in a wide range of applications. Although various studies have been proposed to address these challenging tasks, they are far from optimal. In this paper, we devise a model named MLIFeat with two novel light-weight modules for multi-level information fusion based deep local feature learning, to cope with both image keypoint detection and description. On the one hand, the image keypoints are robustly detected by our Feature Shuffle Module (FSM), which can efficiently utilize the multi-level convolutional feature maps with marginal computing cost. On the other hand, the corresponding feature descriptors are generated by our well-designed Feature Blend Module (FBM), which can collect and extract the most useful information from the multi-level convolutional feature vectors. To study our MLIFeat and other state-of-the-art methods in depth, we have conducted thorough experiments, including image matching on HPatches and FM-Bench, and visual localization on Aachen-Day-Night, which verify the robustness and effectiveness of our proposed model. Code at: https://github.com/yyangzh/MLIFeat | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | Low-level Vision, Image Processing | https://github.com/yyangzh/MLIFeat | Deep Local Features; Keypoints Detection and Description; Image Matching; Visual Localization | ||||
84 | 106 | Poster | 2-B #04 | 3-D #06 | Multiple Exemplars-based Hallucination for Face Super-resolution and Editing | Kaili Wang (KU Leuven, UAntwerpen)*; Jose Oramas (UAntwerp, imec-IDLab); Tinne Tuytelaars (KU Leuven) | Given a really low-resolution input image of a face (say 16×16 or 8×8 pixels), the goal of this paper is to reconstruct a high-resolution version thereof. This, by itself, is an ill-posed problem, as the high-frequency information is missing in the low-resolution input and needs to be hallucinated, based on prior knowledge about the image content. Rather than relying on a generic face prior, in this paper, we explore the use of a set of exemplars, i.e. other high-resolution images of the same person. These guide the neural network as we condition the output on them. Multiple exemplars work better than a single one. To combine the information from multiple exemplars effectively, we introduce a pixel-wise weight generation module. Besides standard face super-resolution, our method allows us to perform subtle face editing simply by replacing the exemplars with another set with different facial features. A user study is conducted and shows the super-resolved images can hardly be distinguished from real images on the CelebA dataset. A qualitative comparison indicates our model outperforms methods proposed in the literature on the CelebA and WebFace data. | Face, Pose, Action, and Gesture | Applications of Computer Vision, Vision for X; Deep Learning for Computer Vision; Generative models for computer vision; Low-level Vision, Image Processing | https://github.com/shadowwkl/MIL-Face-SR | face super-resolution; multiple exemplars; face editing | ||||
85 | 109 | Poster | 2-A #02 | 2-C #01 | End-to-end Model-based Gait Recognition | Xiang Li (Nanjing University of Science and Technology)*; Yasushi Makihara (Osaka University, Japan); Chi Xu (Nanjing University of Science and Technology); Yasushi Yagi (Osaka University); Shiqi Yu (Southern University of Science and Technology, China); Mingwu Ren (Nanjing University of Science and Technology) | Most existing gait recognition approaches adopt a two-step procedure: a preprocessing step to extract silhouettes or skeletons followed by recognition. In this paper, we propose an end-to-end model-based gait recognition method. Specifically, we employ a skinned multi-person linear (SMPL) model for human modeling, and estimate its parameters using a pre-trained human mesh recovery (HMR) network. As the pre-trained HMR is not recognition-oriented, we fine-tune it in an end-to-end gait recognition framework. To cope with differences between gait datasets and those used for pre-training the HMR, we introduce a reconstruction loss between the silhouette masks in the gait datasets and the rendered silhouettes from the estimated SMPL model produced by a differentiable renderer. This enables us to adapt the HMR to the gait dataset without supervision from ground-truth joint locations. Experimental results with the OU-MVLP and CASIA-B datasets demonstrate the state-of-the-art performance of the proposed method for both gait identification and verification scenarios, a direct consequence of the explicitly disentangled pose and shape features produced by the proposed end-to-end model-based framework. | Biometrics | | | model-based gait recognition; end-to-end; explicit disentanglement; shape and pose feature; human mesh recovery; SMPL ||||
86 | 123 | Poster | 1-C #04 | 3-A #04 | Learning End-to-End Action Interaction by Paired-Embedding Data Augmentation | Ziyang Song (Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University)*; Zejian Yuan (Xi'an Jiaotong University); Chong Zhang (Tencent Robotics X); Wanchao Chi (Tencent Robotics X); Yonggen Ling (Tencent); Shenghao Zhang (Tencent) | In recognition-based action interaction, robots' responses to human actions are often pre-designed according to recognized categories and are thus stiff. In this paper, we specify a new Interactive Action Translation (IAT) task, which aims to learn end-to-end action interaction from unlabeled interactive pairs, removing explicit action recognition. To enable learning on small-scale data, we propose a Paired-Embedding (PE) method for effective and reliable data augmentation. Specifically, our method first utilizes paired relationships to cluster individual actions in an embedding space. Then two actions originally paired can be replaced with other actions in their respective neighborhoods, assembling into new pairs. An Act2Act network based on a conditional GAN follows to learn from the augmented data. Besides, IAT-test and IAT-train scores are specifically proposed for evaluating methods on our task. Experimental results on two datasets show the impressive effectiveness and broad application prospects of our method. | Applications of Computer Vision, Vision for X | Deep Learning for Computer Vision; Face, Pose, Action, and Gesture; Generative models for computer vision | | human-robot interaction; action generation; data augmentation; clustering ||||
87 | 139 | Poster | 1-A #07 | 2-C #02 | Quantum Robust Fitting | Tat-Jun Chin (University of Adelaide); David Suter (Edith Cowan University); Shin-Fang Ch'ng (The University of Adelaide)*; James Quach (The University of Adelaide) | Many computer vision applications need to recover structure from imperfect measurements of the real world. The task is often solved by robustly fitting a geometric model onto noisy and outlier-contaminated data. However, recent theoretical analyses indicate that many commonly used formulations of robust fitting in computer vision are not amenable to tractable solution and approximation. In this paper, we explore the usage of quantum computers for robust fitting. To do so, we examine and establish the practical usefulness of a robust fitting formulation inspired by the analysis of monotone Boolean functions. We then investigate a quantum algorithm to solve the formulation and analyse the computational speed-up possible over the classical algorithm. Our work thus proposes one of the first quantum treatments of robust fitting for computer vision. | 3D Computer Vision | | | robust fitting; quantum computing ||||
88 | 141 | Poster | 1-D #03 | 2-B #05 | Tracking-by-Trackers with a Distilled and Reinforced Model | Matteo Dunnhofer (University of Udine)*; Niki Martinel (University of Udine); Christian Micheloni (University of Udine, Italy) | Visual object tracking has generally been tackled by reasoning independently about fast processing algorithms, accurate online adaptation methods, and fusion of trackers. In this paper, we unify such goals by proposing a novel tracking methodology that takes advantage of other visual trackers, offline and online. A compact student model is trained via the marriage of knowledge distillation and reinforcement learning. The first allows the transfer and compression of tracking knowledge from other trackers. The second enables the learning of evaluation measures that are then exploited online. After learning, the student can ultimately be used to build (i) a very fast single-shot tracker, (ii) a tracker with a simple and effective online adaptation mechanism, and (iii) a tracker that performs fusion of other trackers. Extensive validation shows that the proposed algorithms compete with real-time state-of-the-art trackers. | Motion and Tracking | Deep Learning for Computer Vision; Video Analysis and Event Recognition | https://github.com/dontfollowmeimcrazy/vot-kd-rl | visual tracking; tracking-by-trackers; knowledge distillation; reinforcement learning; student-teacher; | ||||
89 | 150 | Poster | 1-A #08 | 1-C #05 | Do We Need Sound for Sound Source Localization? | Takashi Oya (Waseda University)*; Shohei Iwase (Waseda University); Ryota Natsume (Waseda University); Takahiro Itazuri (Waseda University); Shugo Yamaguchi (Waseda University); Shigeo Morishima (Waseda Research Institute for Science and Engineering) | In sound source localization, which uses both visual and aural information, it remains unclear how much the image and sound modalities each contribute to the result, i.e., do we need both image and sound for sound source localization? To address this question, we develop an unsupervised learning system that solves sound source localization by decomposing the task into two steps: (i) "potential sound source localization", a step that localizes possible sound sources using only visual information; and (ii) "object selection", a step that identifies which objects are actually sounding using aural information. Our overall system achieves state-of-the-art performance in sound source localization, and more importantly, we find that despite the constraint on available information, the results of (i) achieve similar performance. From this observation and further experiments, we show that visual information is dominant in "sound" source localization when evaluated with the currently adopted benchmark dataset. Moreover, we show that the majority of sound-producing objects within the samples in this dataset can be inherently identified using only visual information, and thus that the dataset is inadequate to evaluate a system's capability to leverage aural information. As an alternative, we present an evaluation protocol that enforces both visual and aural information to be leveraged, and verify this property through several experiments. | Applications of Computer Vision, Vision for X | Deep Learning for Computer Vision | | cross-modal learning; sound source localization; unsupervised learning; self-supervised learning; audio-visual learning ||||
90 | 151 | Poster | 1-D #04 | 2-B #06 | Adaptive Spatio-Temporal Regularized Correlation Filters for UAV-based Tracking | Libin Xu (Shandong University of Technology); Qilei Li (Sichuan University); Jun Jiang (Southwest Petroleum University; Sichuan University of Science & Engineering); Guofeng Zou (Shandong University of Technology); Zheng Liu (University of British Columbia); Mingliang Gao (Shandong University of Technology)* | Advances in visual tracking have provided unmanned aerial vehicles (UAVs) with intriguing capabilities for various practical applications. With promising performance and efficiency, discriminative correlation filter (DCF)-based trackers have drawn great attention and undergone remarkable progress. However, the boundary effect and filter degradation remain two challenging problems. In this work, we propose a novel Adaptive Spatio-Temporal Regularized Correlation Filter (ASTR-CF) model to address these two problems. The ASTR-CF can optimize the spatial regularization weight and the temporal regularization weight simultaneously. Meanwhile, the proposed model can be effectively optimized based on the alternating direction method of multipliers (ADMM), where each subproblem has a closed-form solution. Experimental results on the DTB70 and UAV123@10fps benchmarks demonstrate the superiority of our method over state-of-the-art trackers in terms of both accuracy and computational speed. | Motion and Tracking | | https://github.com/mlgao?tab=projects | UAV tracking; correlation filter; spatio-temporal regularization ||||
91 | 167 | Poster | 2-A #03 | 3-C #05 | Learning Global Pose Features in Graph Convolutional Networks for 3D Human Pose Estimation | Kenkun Liu (University of Illinois at Chicago); Zhiming Zou (University of Illinois at Chicago); Wei Tang (University of Illinois at Chicago)* | As the human body skeleton can be represented as a sparse graph, it is natural to exploit graph convolutional networks (GCNs) to model the articulated body structure for 3D human pose estimation (HPE). However, a vanilla graph convolutional layer, the building block of a GCN, only models the local relationships between each body joint and its neighbors on the skeleton graph. Some global attributes, e.g., the action of the person, can be critical to 3D HPE, especially in the case of occlusion or depth ambiguity. To address this issue, this paper introduces a new 3D HPE framework by learning global pose features in GCNs. Specifically, we add a global node to the graph and connect it to all the body joint nodes. On one hand, global features are updated by aggregating all body joint features to model the global attributes. On the other hand, the feature update of each body joint depends not only on its neighbors but also on the global node. Furthermore, we propose a heterogeneous multi-task learning framework to learn the local and global features. While each local node regresses the 3D coordinates of the corresponding body joint, we force the global node to classify an action category or learn a low-dimensional pose embedding. Experimental results demonstrate the effectiveness of the proposed approach. | 3D Computer Vision | Deep Learning for Computer Vision | | Graph Convolutional Networks; Global Pose Features; 3D Human Pose Estimation ||||
92 | 168 | Poster | 1-C #06 | 3-A #05 | Horizontal Flipping Assisted Disentangled Feature Learning for Semi-Supervised Person Re-Identification | Gehan Hao (University of Electronic Science and Technology of China); Yang Yang (Institute of Automation, Chinese Academy of Sciences); Xue Zhou (University of Electronic Science and Technology of China)*; Guanan Wang (CASIA); Zhen Lei (NLPR, CASIA, China) | In this paper, we propose to learn a powerful Re-ID model by using less labeled data together with lots of unlabeled data, i.e., semi-supervised Re-ID. Such learning enables the Re-ID model to be more generalizable and scalable to real-world scenes. Specifically, we design a two-stream encoder-decoder-based structure with shared modules and parameters. For the encoder module, we take the original person image with its horizontal mirror image as a pair of inputs and encode deep features with identity and structural information properly disentangled. Then different combinations of the disentangled features are used to reconstruct images in the decoder module. In addition to the commonly used constraints from identity consistency and image reconstruction consistency for loss function definition, we design a novel loss function enforcing consistent transformation constraints on disentangled features. It is free of labels, but can be applied to both the supervised and unsupervised learning branches in our model. Extensive results on four Re-ID datasets demonstrate that, with the labeled data reduced by 5/6, our method achieves the best performance on Market-1501 and CUHK03, and comparable accuracy on DukeMTMC-reID and MSMT17. | Deep Learning for Computer Vision | Recognition: Feature Detection, Indexing, Matching, and Shape Representation | | person re-identification; semi-supervised; feature disentangled learning; horizontal flipping; self-supervised ||||
93 | 169 | Poster | 2-A #04 | 3-B #06 | Unpaired Multimodal Facial Expression Recognition | Bin Xia (University of Science and Technology of China); Shangfei Wang (University of Science and Technology of China)* | Current works on multimodal facial expression recognition typically require paired visible and thermal facial images. Although visible cameras are readily available in our daily life, thermal cameras are expensive and less prevalent. It is costly to collect a large quantity of synchronous visible and thermal facial images. To tackle this paired training data bottleneck, we propose an unpaired multimodal facial expression recognition method, which makes full use of the massive number of unpaired visible and thermal images by utilizing thermal images to construct better image representations and classifiers for visible images during training. Specifically, two deep neural networks are trained from visible and thermal images to learn image representations and expression classifiers for two modalities. Then, an adversarial strategy is adopted to force statistical similarity between the learned visible and thermal representations, and to minimize the distribution mismatch between the predictions of the visible and thermal images. Through adversarial learning, the proposed method leverages thermal images to construct better image representations and classifiers for visible images during training, without the requirement of paired data. A decoder network is built upon the visible hidden features in order to preserve some inherent features of the visible view. We also take the variability of the different images’ transferability into account via adaptive classification loss. During testing, only visible images are required and the visible network is used. Thus, the proposed method is appropriate for real-world scenarios, since thermal imaging is rare in these instances. Experiments on two benchmark multimodal expression databases and three visible facial expression databases demonstrate the superiority of the proposed method compared to state-of-the-art methods. | Face, Pose, Action, and Gesture | | | Facial Expression Recognition; Privileged Information ||||
94 | 188 | Poster | 1-C #07 | 3-A #06 | Dense Dual-Path Network for Real-time Semantic Segmentation | Xinneng Yang (Tongji University)*; Yan Wu (Tongji University); Junqiao Zhao (Tongji University); Feilin Liu (Tongji University) | Semantic segmentation has achieved remarkable results, but with high computational cost and a large number of parameters. However, real-world applications require efficient inference speed on embedded devices. Most previous works address the challenge by reducing the depth, width and layer capacity of the network, which leads to poor performance. In this paper, we introduce a novel Dense Dual-Path Network (DDPNet) for real-time semantic segmentation under resource constraints. We design a light-weight and powerful backbone with dense connectivity to facilitate feature reuse throughout the whole network, together with the proposed Dual-Path module (DPM) to sufficiently aggregate multi-scale contexts. Meanwhile, a simple and effective framework is built with a skip architecture utilizing the high-resolution feature maps to refine the segmentation output and an upsampling module leveraging context information from the feature maps to refine the heatmaps. The proposed DDPNet shows a clear advantage in balancing accuracy and speed. Specifically, on the Cityscapes test set, DDPNet achieves 75.3% mIoU at 52.6 FPS for an input of 1024 x 2048 resolution on a single GTX 1080Ti card. Compared with other state-of-the-art methods, DDPNet achieves significantly better accuracy with comparable speed and fewer parameters. | Deep Learning for Computer Vision | Segmentation and Grouping | | Real-time Semantic Segmentation; Dense Dual-Path Network; Light-weight Network; Densely Connected Convolutional Network ||||
95 | 189 | Poster | 2-B #07 | 3-D #07 | Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization | Yan-Bo Lin (National Taiwan University)*; Yu-Chiang Frank Wang (National Taiwan University) | Audio-visual event localization requires one to identify the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep learning framework of cross-modality co-attention for video event localization. Our proposed audiovisual transformer (AV-transformer) is able to exploit intra- and inter-frame visual information, with audio features jointly observed to perform co-attention over the above three modalities. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model is able to produce instance-level attention, which identifies image regions at the instance level that are associated with the sound/event of interest. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model. | Applications of Computer Vision, Vision for X | Deep Learning for Computer Vision; Video Analysis and Event Recognition | | Audio Video Features; Dual Modality; Event Localization ||||
96 | 190 | Poster | 2-A #05 | 2-C #03 | Hyperparameter-Free Out-of-Distribution Detection Using Cosine Similarity | Engkarat Techapanurak (Tohoku University)*; Masanori Suganuma (RIKEN AIP / Tohoku University); Takayuki Okatani (Tohoku University/RIKEN AIP) | The ability to detect out-of-distribution (OOD) samples is vital to securing the reliability of deep neural networks in real-world applications. Considering the nature of OOD samples, detection methods should not have hyperparameters that need to be tuned depending on incoming OOD samples. However, most of the recently proposed methods do not meet this requirement, leading to compromised performance in real-world applications. In this paper, we propose a simple, computationally efficient, and hyperparameter-free method that uses cosine similarity. Although recent studies show its effectiveness for metric learning, it remains uncertain whether cosine similarity also works well for OOD detection and, if so, why. We provide an intuitive explanation of why cosine similarity works better than the standard methods that use the maximum of softmax outputs or logits. Besides, there are several differences in the design of output layers, which are essential to achieve the best performance. We show through experiments that our method outperforms the existing methods on the evaluation test recently proposed by Shafaei et al., which takes the above issue of hyperparameter dependency into account; it achieves at least comparable performance to the state of the art on the conventional test, where all methods but ours are allowed to use explicit OOD samples for determining hyperparameters. | Deep Learning for Computer Vision | | https://github.com/engkarat/cosine-ood-detector | out-of-distribution detection; safety in AI; uncertainty; metric learning; anomaly detection ||||
97 | 191 | Poster | 1-D #05 | 3-B #07 | Towards Fast and Robust Adversarial Training for Image Classification | Erh-Chung Chen (National Tsing Hua University)*; Che-Rung Lee (National Tsing Hua University) | Adversarial training, which augments the training data with adversarial examples, is one of the most effective methods to defend against adversarial attacks. However, its robustness degrades for complex models, and producing strong adversarial examples is a time-consuming task. In this paper, we propose methods to improve the robustness and efficiency of adversarial training. First, we utilize a reconstructor to enforce the classifier to learn the important features under perturbations. Second, we employ an enhanced FGSM to generate adversarial examples effectively; it can detect overfitting and stop training earlier without extra cost. Experiments are conducted on MNIST and CIFAR-10 to validate the effectiveness of our methods. We also compare our algorithm with state-of-the-art defense methods. The results show that our algorithm is 4-5 times faster than the previously fastest training method. For CIFAR-10, our method achieves above 46% robust accuracy, which is better than most other methods. | Deep Learning for Computer Vision | Statistical Methods and Learning | | adversarial attack; adversarial defense ||||
98 | 199 | Poster | 2-A #06 | 3-D #08 | Utilizing Transfer Learning and a Customized Loss Function for Optic Disc Segmentation from Retinal Images | Abdullah Sarhan (University of Calgary)*; Ali Al-Khaz'Aly (University of Calgary); Adam Gorner (University of Calgary); Andrew Swift (University of Calgary); Jon Rokne (University of Calgary); Reda Alhajj (University of Calgary); Andrew Crichton (University of Calgary) | Accurate segmentation of the optic disc from a retinal image is vital to extracting retinal features that may be highly correlated with retinal conditions such as glaucoma. In this paper, we propose a deep-learning based approach capable of segmenting the optic disc given a high-precision retinal fundus image. Our approach utilizes a UNET-based model with a VGG16 encoder trained on the ImageNet dataset. This study can be distinguished from other studies in the customization made for the VGG16 model, the diversity of the datasets adopted, the duration of disc segmentation, the loss function utilized, and the number of parameters required to train our model. Our approach was tested on seven publicly available datasets augmented by a dataset from a private clinic that was annotated by two Doctors of Optometry through a web portal built for this purpose. We achieved an accuracy of 99.78% and a Dice coefficient of 94.73% for a disc segmentation from a retinal image in 0.03 seconds. The results obtained from comprehensive experiments demonstrate the robustness of our approach to disc segmentation of retinal images obtained from different sources. | Segmentation and Grouping | Biomedical Image Analysis; Datasets and Performance Analysis; Deep Learning for Computer Vision | https://github.com/AbdullahSarhan/ACCVDiscSegmentation | object segmentation; retinal disc; balanced loss functions; VGG16; retinal images; glaucoma; transfer learning; augmentation; | ||||
99 | 201 | Poster | 1-B #05 | 1-D #06 | Modular Graph Attention Network for Complex Visual Relational Reasoning | Yihan Zheng (South China University of Technology); Zhiquan Wen (South China University of Technology); Mingkui Tan (South China University of Technology)*; Runhao Zeng (South China University of Technology); Qi Chen (South China University of Technology); Yaowei Wang (PengCheng Laboratory); Qi Wu (University of Adelaide) | Visual Relational Reasoning is crucial for many vision-and-language based tasks, such as Visual Question Answering and Vision Language Navigation. In this paper, we consider reasoning on the complex referring expression comprehension (c-REF) task, which seeks to localise the target objects in an image guided by complex queries. Such queries often contain complex logic and thus impose two key challenges for reasoning: (i) It can be very difficult to comprehend the query since it often refers to multiple objects and describes complex relationships among them. (ii) It is non-trivial to reason among multiple objects guided by the query and localise the target correctly. To address these challenges, we propose a novel Modular Graph Attention Network (MGA-Net). Specifically, to comprehend the long queries, we devise a language attention network to decompose them into four types: basic attributes, absolute location, visual relationship and relative locations, which mimics the human language understanding mechanism. Moreover, to capture the complex logic in a query, we construct a relational graph to represent the visual objects and their relationships, and propose a multi-step reasoning method to progressively understand the complex logic. Extensive experiments on the CLEVR-Ref+, GQA and CLEVR-CoGenT datasets demonstrate the superior reasoning performance of our MGA-Net. | Applications of Computer Vision, Vision for X | Deep Learning for Computer Vision | https://github.com/wzq12345/MGA-Net | Visual Relational Reasoning; Complex Relationship; Graph Attention Network | ||||
100 | 202 | Poster | 1-A #09 | 2-C #04 | Second Order enhanced Multi-glimpse Attention in Visual Question Answering | Qiang Sun (Fudan University)*; Binghui Xie (Fudan University); Yanwei Fu (Fudan University) | Visual Question Answering (VQA) is formulated as predicting the answer given an image and question pair. A successful VQA model relies on information from both the visual and textual modalities. Previous endeavours in VQA have focused on attention mechanisms and multi-modal fusion strategies; for example, most models to date fuse the multi-modal features through cross-modal interactions modeled by implicit neural networks. To better explore and exploit the information of different modalities, the idea of second-order interactions between modalities, which is prevalent in recommendation systems, is re-purposed for VQA to efficiently and explicitly model the second-order interaction on both the visual and textual features, learned in a shared embedding space. To implement this idea, we propose a novel Second Order enhanced Multi-glimpse Attention model (SOMA), where each glimpse denotes an attention map. SOMA adopts multi-glimpse attention to focus on different contents in the image. By projecting the multi-glimpse outputs and the question feature into a shared embedding space, an explicit second-order feature is constructed to model both the intra-modality and cross-modality interactions of features. Furthermore, we advocate a semantic deformation method as data augmentation to generate more training examples for Visual Question Answering. Experimental results on VQA v2.0 and VQA-CP v2.0 demonstrate the effectiveness of our method. Extensive ablation studies are conducted to evaluate the components of the proposed model. | Deep Learning for Computer Vision | | | visual question answering; multi-glimpse attention; second order