Visual Intelligence and Learning Laboratory (VILAB)

We are a research group at the Swiss Federal Institute of Technology (EPFL)'s School of Computer and Communication Sciences. Our research focuses broadly on Computer Vision, Machine Learning, and Perception-for-Robotics.

Group Members



Highlighted Recent Projects

Robustness via Cross-Domain Ensembles
T. Yeo*, O.F. Kar*, A. Zamir
ICCV, 2021. [Oral]
[project website] [code] [PDF]

We present a method for making neural network predictions robust to shifts from the training data distribution. The proposed method is based on making predictions via a diverse set of cues (called 'middle domains') and ensembling them into one strong prediction. The premise of the idea is that predictions made via different cues respond differently to a distribution shift, hence one should be able to merge them into one robust final prediction. We perform the merging in a straightforward but principled manner based on the uncertainty associated with each prediction. The evaluations are performed on multiple tasks and datasets (Taskonomy, Replica, ImageNet, CIFAR) under a wide range of adversarial and non-adversarial distribution shifts, and they demonstrate that the proposed method is considerably more robust than its standard learning counterpart, conventional deep ensembles, and several other baselines.
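The uncertainty-based merging can be sketched with simple inverse-variance weighting; this is a minimal stand-in, not the paper's exact formulation, and the prediction values and variances below are purely illustrative:

```python
import numpy as np

def merge_predictions(preds, variances):
    """Merge per-cue predictions into one, weighting each by its
    inverse variance so that less certain cues contribute less."""
    preds = np.asarray(preds, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)  # inverse-variance weights
    w /= w.sum()                                  # normalize to sum to 1
    return float(np.sum(w * preds))

# Three cues predict the same quantity; the third is very uncertain,
# so the merged result stays close to the two confident cues.
merged = merge_predictions([1.0, 1.2, 5.0], [0.1, 0.1, 10.0])
```

The confident cues (variance 0.1) dominate, so the outlier prediction of 5.0 barely moves the merged estimate.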

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans
A. Eftekhar*, A. Sax*, R. Bachmann, J. Malik, A. Zamir
ICCV, 2021.
[project website] [code] [PDF]

Robust Learning Through Cross-Task Consistency
A. Zamir*, A. Sax*, T. Yeo, O. Kar, N. Cheerla, R. Suri, J. Cao, J. Malik, L. Guibas
CVPR, 2020. [Best Paper Award Nominee]
[project website] [live demo] [code] [PDF] [slides]

Visual perception entails solving a wide set of tasks, e.g., object detection, depth estimation, etc. The predictions made for multiple tasks from the same image are not independent, and therefore, are expected to be 'consistent'. We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency. The proposed formulation is based on inference-path invariance over a graph of arbitrary tasks. We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs. This framework also leads to an informative unsupervised quantity, called Consistency Energy, based on measuring the intrinsic consistency of the system. Consistency Energy correlates well with the supervised error (r=0.67), thus it can be employed as an unsupervised confidence metric as well as for detection of out-of-distribution inputs (ROC-AUC=0.95). The evaluations are performed on multiple datasets, including Taskonomy, Replica, CocoDoom, and ApolloScape, and they benchmark cross-task consistency versus various baselines including conventional multi-task learning, cycle consistency, and analytical consistency.
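The inference-path invariance idea can be sketched with toy predictors standing in for the learned networks: the direct path from an image to task 2 should agree with the indirect path that goes through task 1. All function names and values below are illustrative:

```python
import numpy as np

def consistency_loss(x, f_xy1, f_xy2, f_y1y2):
    """Inference-path invariance: the direct prediction x -> y2 should
    agree with the composed path x -> y1 -> y2."""
    direct = f_xy2(x)
    via_y1 = f_y1y2(f_xy1(x))
    return float(np.mean((direct - via_y1) ** 2))

# Toy stand-ins for learned task predictors (real ones are neural nets).
x = np.array([1.0, 2.0, 3.0])
f_xy1 = lambda x: 2.0 * x             # image -> task 1
f_y1y2 = lambda y1: y1 + 1.0          # task 1 -> task 2
f_xy2_good = lambda x: 2.0 * x + 1.0  # direct image -> task 2, consistent

loss_consistent = consistency_loss(x, f_xy1, f_xy2_good, f_y1y2)  # 0.0
```

In training, a penalty like this is added to the supervised losses so the direct predictor is pushed to agree with predictions reached via other tasks.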

Which Tasks Should Be Learned Together in Multi-task Learning?
T. Standley, A. Zamir, D. Chen, L. Guibas, J. Malik, S. Savarese
ICML, 2020.
[project website] [slides] [code] [PDF]

Many computer vision applications require solving multiple tasks in real-time. A neural network can be trained to solve multiple tasks simultaneously using multi-task learning. This can save computation at inference time, as only a single network needs to be evaluated. Unfortunately, this often leads to inferior overall performance as task objectives can compete, which consequently poses the question: which tasks should and should not be learned together in one network when employing multi-task learning? We study task cooperation and competition in several different learning settings and propose a framework for assigning tasks to a few neural networks such that cooperating tasks are computed by the same neural network, while competing tasks are computed by different networks. Our framework offers a time-accuracy trade-off and can achieve higher accuracy at lower inference cost than both a single large multi-task network and many independent single-task networks.
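The assignment idea can be sketched as a small search over task partitions, assuming a measured performance cost for each candidate group of tasks. The task names and costs below are made up for illustration; a real system would measure them by actually training the candidate networks:

```python
from itertools import combinations

def choose_grouping(tasks, cost):
    """Toy task grouping: enumerate every way to split the tasks into
    groups (one shared network per group) and return the split with the
    lowest total measured cost."""
    def partitions(items):
        if not items:
            yield []
            return
        first, rest = items[0], items[1:]
        for r in range(len(rest) + 1):
            for combo in combinations(rest, r):
                group = frozenset((first,) + combo)
                remaining = [t for t in rest if t not in combo]
                for p in partitions(remaining):
                    yield [group] + p
    return min(partitions(list(tasks)),
               key=lambda p: sum(cost[g] for g in p))

# Illustrative, made-up costs: depth and normals cooperate when trained
# together, while edges competes with both.
cost = {
    frozenset({"depth"}): 1.0,
    frozenset({"normals"}): 1.0,
    frozenset({"edges"}): 1.0,
    frozenset({"depth", "normals"}): 1.5,
    frozenset({"depth", "edges"}): 2.5,
    frozenset({"normals", "edges"}): 2.5,
    frozenset({"depth", "normals", "edges"}): 3.5,
}
grouping = choose_grouping(["depth", "normals", "edges"], cost)
```

With these costs, the search pairs depth with normals in one network and gives edges its own, which is cheaper than either one big multi-task network or three single-task networks. Exhaustive enumeration only works for a handful of tasks, which is why the trade-off is interesting in practice.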

Side-tuning: Network Adaptation via Additive Side Networks
J. Zhang, A. Sax, A. Zamir, L. Guibas, J. Malik
ECCV, 2020. [Spotlight]
[project website] [code] [PDF]

When training a neural network for a desired task, one may prefer to adapt a pre-trained network rather than start with a randomly initialized one -- for example, when there is not enough training data, in lifelong learning where the system must learn a new task while retaining previously learned ones, or when one wishes to encode priors in the network via preset weights. The most commonly employed approaches for network adaptation are fine-tuning and using the pre-trained network as a fixed feature extractor.

In this paper we propose a straightforward alternative: Side-Tuning. Side-tuning adapts a pre-trained network by training a lightweight "side" network that is fused with the (unchanged) pre-trained network using a simple additive process. This simple method works as well as or better than existing solutions while it resolves some of the basic issues with fine-tuning, fixed features, and several other common baselines. In particular, side-tuning is less prone to overfitting when little training data is available, yields better results than using a fixed feature extractor, and does not suffer from catastrophic forgetting in lifelong learning. We demonstrate the performance of side-tuning under a diverse set of scenarios, including lifelong learning (iCIFAR, Taskonomy), reinforcement learning, imitation learning (visual navigation in Habitat), NLP question-answering (SQuAD v2), and single-task transfer learning (Taskonomy), with consistently promising results.
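The additive fusion at the heart of side-tuning can be sketched in a few lines. This is a minimal sketch: the class name, the scalar "side network", and the blending scheme below are illustrative, not the paper's exact formulation:

```python
import numpy as np

class SideTuned:
    """Minimal additive side-tuning sketch: a frozen pre-trained base
    model plus a small trainable side model, fused by simple addition."""
    def __init__(self, base_fn, side_weight, alpha=0.5):
        self.base_fn = base_fn          # pre-trained network, kept frozen
        self.side_weight = side_weight  # only the side network is trained
        self.alpha = alpha              # blend between base and side outputs

    def side_fn(self, x):
        # A deliberately tiny "side" network: one scalar weight.
        return self.side_weight * x

    def __call__(self, x):
        # Additive fusion: the base network's weights are never touched.
        return (1 - self.alpha) * self.base_fn(x) + self.alpha * self.side_fn(x)

model = SideTuned(base_fn=lambda x: x + 1.0, side_weight=2.0, alpha=0.5)
out = model(np.array([1.0, 2.0]))
```

Because only the side network's parameters receive gradients, the pre-trained weights cannot be overwritten, which is why the scheme avoids catastrophic forgetting in lifelong learning.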

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation
B. Chen, S. Sax, L. Pinto, F. Lewis, I. Armeni, S. Savarese, A. Zamir, J. Malik
CoRL, 2020.
[project website] [code] [PDF]

Vision-based robotics often factors the control loop into separate components for perception and control. Conventional perception components usually extract hand-engineered features from the visual input that are then used by the control component in an explicit manner. In contrast, recent advances in deep RL make it possible to learn these features end-to-end during training, but the final result is often brittle, fails unexpectedly under minuscule visual shifts, and comes with a high sample complexity cost.

In this work, we study the effects of using mid-level visual representations asynchronously trained for traditional computer vision objectives as a generic and easy-to-decode perceptual state in an end-to-end RL framework. We show that the invariances provided by the mid-level representations aid generalization, improve sample complexity, and lead to a higher final performance. Compared to alternative approaches for incorporating invariances, such as domain randomization, using asynchronously trained mid-level representations scales better to harder problems and larger domain shifts, and consequently, successfully trains policies for tasks where domain randomization or learning-from-scratch failed. Our experimental findings are reported on manipulation and navigation tasks using real robots as well as simulations.
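The setup can be sketched as a frozen perception component feeding a small trainable policy head. Everything below is an illustrative stand-in: a real system would use a network pre-trained for, e.g., depth or surface normals as the encoder, and train the policy with RL:

```python
import numpy as np

def midlevel_encoder(image):
    """Frozen stand-in for a pre-trained mid-level vision network;
    here it just averages over the channel axis."""
    return image.mean(axis=-1)

# Only this policy head would be trained with RL; the encoder stays fixed.
policy_weights = np.full(4, 0.25)

def policy(image):
    features = midlevel_encoder(image)   # easy-to-decode perceptual state
    return float(policy_weights @ features)

obs = np.arange(12.0).reshape(4, 3)      # a tiny fake 4x3 "image"
score = policy(obs)
```

Keeping the encoder frozen is what makes the features' invariances available to the policy for free, without the sample cost of learning perception end-to-end from reward.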


Prospective Members

  • Prospective PhD Students: We are always looking for highly motivated and talented students. However, PhD admissions at EPFL are centralized and highly competitive. Please apply directly to the PhD program in computer science. The admission system allows you to indicate the professors/labs you're interested in.
  • Prospective Postdocs: Please contact us directly with your CV.
  • Prospective BS/MS interns and project students: If you're an EPFL student, send us an email with your CV. If you are not and you are eligible for the Summer@EPFL internship program, please apply here. Otherwise, feel free to send us an email with your CV.