Computer Vision

Towards transparent Language and Vision Models

Revealing how models work

Ravi Shekhar

Publications | ravi.shekhar [at] (Email)


Deep neural networks have made remarkable progress in many AI tasks, such as image recognition and image captioning, and have obtained state-of-the-art results. However, it is often unclear why these models produce a given output, and when they fail they do so without any explanation. I am interested in automatically investigating how these models work. Specifically, I am interested in understanding models that combine natural language and computer vision, and in providing explanations of their outcomes.

Virtual Crowds: Modeling, Recording and Validating

Niccolò Bisagno

Publications | niccolo.bisagno [at] (Email) | Website


Crowd analysis supports security and safety (e.g. preventing dangerous situations), people-flow management, marketing and business intelligence. Many analysis algorithms have been developed for problems that arise in crowded environments, such as tracking, people counting and flow segmentation. However, existing crowd datasets for testing and benchmarking have limitations in video quality, dataset size, content, crowd density and annotation quality. Relying on simulators would provide a unified framework to test and validate analysis algorithms. The simulation framework deals with three key aspects: motion modeling, sensor modeling and validation.

Continuous object recognition in videos

Luca Erculiani

Publications | luca.erculiani [at] (Email)


Over the last few years, the adoption of Deep Convolutional Networks has led to massive improvements in object detection and classification in images and videos. Despite their impressive accuracy, state-of-the-art models do not reach comparable performance on tasks such as handling new, unseen classes, differentiating between instances of the same class, and efficiently leveraging new information to improve performance. We are working on techniques to overcome these limitations and to build models that can learn continuously and improve their performance over time.

Unsupervised Depth Estimation

Andrea Pilzer

Publications | andrea.pilzer [at] (Email)


Depth estimation is an interesting and challenging task. We propose a new method based on GANs and CRFs that learns to predict depth from still images in an unsupervised setting, using stereo images to train the network.

Joint graph learning and video segmentation via multiple cues and topology calibration

Mihai-Marian Puscas

Publications | mihaimarian.puscas [at] (Email) | Website


Video segmentation has become an important and active research area with a large diversity of proposed approaches. Graph-based methods, which achieve top performance on recent benchmarks, usually focus on either obtaining a precise similarity graph or designing efficient graph-cutting strategies. However, these two components are often handled in two separate steps, so the resulting similarity graph may not be optimal for segmentation, which can lead to suboptimal results. In this paper, we propose a novel framework, joint graph learning and video segmentation (JGLVS), which learns the similarity graph and the video segmentation simultaneously.

Deformable GANs for Pose-based Human Image Generation

Aliaksandr Siarohin

Publications | aliaksandr.siarohin [at] (Email) | Website


In this paper we address the problem of generating person images conditioned on a given pose. Specifically, given an image of a person and a target pose, we synthesize a new image of that person in the novel pose. In order to deal with pixel-to-pixel misalignments caused by the pose differences, we introduce deformable skip connections in the generator of our Generative Adversarial Network. Moreover, a nearest-neighbour loss is proposed instead of the common L1 and L2 losses in order to match the details of the generated image with the target image.
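The idea behind a nearest-neighbour loss can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes grayscale images stored as nested lists, compares raw pixel values rather than learned features, and the function name `nearest_neighbour_loss` and the `radius` parameter are hypothetical. Each generated pixel is compared with every target pixel in a small window around the same location, and only the smallest difference counts, so small misalignments are not penalised the way they are under a plain L1 loss.

```python
def nearest_neighbour_loss(generated, target, radius=1):
    """Mean, over all pixels, of the smallest absolute difference between a
    generated pixel and any target pixel within `radius` of the same location.

    Simplifying assumptions (not from the source): grayscale images given as
    lists of rows, raw pixel values instead of feature patches.
    """
    height, width = len(target), len(target[0])
    if len(generated) != height or any(len(row) != width for row in generated):
        raise ValueError("generated and target must have the same shape")

    total = 0.0
    for y in range(height):
        for x in range(width):
            best = float("inf")
            # Search the local window, clipped at the image borders.
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    best = min(best, abs(generated[y][x] - target[ny][nx]))
            total += best
    return total / (height * width)
```

With `radius=0` the function reduces to the mean L1 distance, which makes the difference easy to see: an image shifted by one pixel has a non-zero L1 loss but a zero nearest-neighbour loss for `radius=1`.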

Multi-modal deep learning architectures for video content understanding

Swathikiran Sudhakaran

Publications | s.sudhakaran [at] (Email)


Deep learning architectures for computer vision applications typically require a huge amount of manually annotated data to be applied effectively to a given task. An enormous and ever-growing number of images and videos with additional information such as tags, captions and comments is available on the internet. Adopting a multi-modal approach that integrates this textual data with the visual information present in the images can enable machines to learn a richer representation of the visual content. The objective of this PhD study is to develop deep learning architectures for vision understanding by leveraging the textual data that accompanies images and videos.

Diverse and Realistic Image-to-Image Translation

Hao Tang

Publications | hao.tang [at] (Email)


My name is Hao Tang. I am currently in the first year of my Ph.D. in the Department of Information Engineering and Computer Science, University of Trento, and a member of the Multimedia and Human Understanding Group (MHUG) under the supervision of Prof. Nicu Sebe. My research focuses on computer vision, machine learning, deep learning and reinforcement learning.

Learning multi-scale structured deep predictions for monocular depth estimation

Dan Xu

Publications | dan.xu [at] (Email) | Website


We address the problem of monocular depth estimation from a single still image. We propose a deep model that fuses complementary information derived from multiple CNN side outputs. Unlike previous methods that use concatenation or weighted-average schemes, the integration is obtained by means of continuous Conditional Random Fields (CRFs). By designing a novel CNN implementation of mean-field updates for continuous CRFs, we show that both proposed models can be regarded as sequential deep networks and that training can be performed end-to-end. We establish new state-of-the-art results on three publicly available datasets: NYUD-V2, Make3D and KITTI.
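As a rough illustration of what a mean-field update for a continuous CRF looks like, the sketch below uses a generic Gaussian model rather than the multi-scale model of the paper. It assumes an energy of the form E(y) = sum_i (y_i - o_i)^2 + sum_{i<j} w_ij (y_i - y_j)^2, where o_i are per-pixel observations (for example CNN depth predictions) and w_ij are non-negative symmetric pairwise weights. Under that assumption the mean of each factor is updated as mu_i = (o_i + sum_j w_ij mu_j) / (1 + sum_j w_ij). The function name `mean_field_means` and its parameters are hypothetical.

```python
def mean_field_means(observations, weights, iterations=100):
    """Parallel mean-field updates for the means of a Gaussian CRF.

    Assumed energy (an illustration, not the model from the source):
        E(y) = sum_i (y_i - o_i)^2 + sum_{i<j} w_ij * (y_i - y_j)^2
    `weights` is a symmetric n x n list of lists with non-negative entries;
    diagonal entries are ignored.
    """
    n = len(observations)
    means = list(observations)  # start from the observations
    for _ in range(iterations):
        # Every new mean is computed from the previous iteration's means,
        # which is what lets each iteration be expressed as a network layer.
        means = [
            (observations[i]
             + sum(weights[i][j] * means[j] for j in range(n) if j != i))
            / (1.0 + sum(weights[i][j] for j in range(n) if j != i))
            for i in range(n)
        ]
    return means
```

Because each iteration is a fixed sequence of weighted sums and divisions, a fixed number of iterations can be unrolled into a stack of differentiable layers, which is the sense in which such updates can be trained end-to-end together with the CNN that produces the observations.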