Towards Transparent Language and Vision Models

Revealing the inner workings of models

Ravi Shekhar

Deep Learning, Computer Vision, Natural Language Processing

Deep neural networks have made remarkable progress on many AI tasks, such as image recognition and image captioning, obtaining state-of-the-art results. However, it is often unclear why these models produce a certain output, and when they fail they do so without any explanation. I am interested in automatically investigating how these models work. Specifically, I am interested in understanding models that combine natural language and computer vision, and in providing explanations for their outputs.
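One simple, model-agnostic way to probe what a vision model relies on is occlusion sensitivity: mask image regions one at a time and measure how the prediction changes. The sketch below is purely illustrative and is not this project's method; the `score_fn` and the patch size are toy assumptions.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=4):
    """Occlusion sensitivity: zero out each patch in turn and record
    how much the model's score drops; regions whose removal hurts
    the score most are the ones the model relied on."""
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # mask one patch
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat

# Toy "model": the score is the mean intensity of the top-left quadrant,
# so only patches overlapping that quadrant should matter.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
score = lambda x: x[:8, :8].mean()
heat = occlusion_map(img, score)
```

Patches outside the top-left quadrant leave the toy score unchanged, so the resulting heatmap is non-zero only where the "model" actually looks.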




Video Understanding: Large-Scale Action Recognition

Ionut Cosmin Duta

Action Recognition, Video Classification, Feature Encoding

Understanding video content is a demanding problem in computer vision and multimedia, which has received sustained attention from the research community due to its many potential applications, such as automatic video analysis, video indexing and retrieval, video surveillance, and virtual reality. In this project, my research activities concentrate on human action recognition in videos, contributing to the community with research works focused in particular on descriptor extraction and feature encoding for building an efficient and effective video representation. Broadly, the main goal of this project is to take a step towards teaching computers to understand videos the way humans do.
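As an illustration of the feature-encoding step, the sketch below aggregates local video descriptors into a bag-of-visual-words histogram over a fixed codebook (a simpler relative of encodings such as VLAD and Fisher vectors). The sizes and data are toy assumptions, not the project's actual pipeline.

```python
import numpy as np

def bow_encode(descriptors, codebook):
    """Encode a set of local descriptors as an L1-normalized
    bag-of-visual-words histogram over a fixed codebook."""
    # Squared distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)  # hard-assign each descriptor
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy data: 100 random 8-D descriptors and a 16-word codebook
# (a real codebook would be learned, e.g. with k-means).
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(100, 8))
codebook = rng.normal(size=(16, 8))
video_repr = bow_encode(descriptors, codebook)
```

The fixed-length `video_repr` can then be fed to any standard classifier, regardless of how many descriptors the video produced.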





Restoring Sight to Blind People in Indoor Environments with Smart Technologies

Salim Malek

indoor navigation, object recognition, deep learning, machine learning

The proposed prototype comprises two complementary units: (i) a guidance system and (ii) a recognition system. The former works online and is responsible for guiding the blind person through the indoor environment from his/her current location to the desired destination, while avoiding both static and moving obstacles. By contrast, the latter works on demand. The whole prototype is based on computer vision and machine learning techniques.




Understanding High Level Attributes of Visual Data

Andrea Pilzer

Beyond object recognition and object localization in images, the computer vision community has recently focused on abstract attributes of images such as safety perception, memorability, style, and virality. These studies recognised that such abstract image attributes are localised in specific parts of the image. In this study we focus on image virality, the property of an image to spread quickly and become popular on social networks. The goal is to identify salient regions of the image and to classify viral versus non-viral images.




Unsupervised tube extraction using transductive learning and dense trajectories

Mihai-Marian Puscas

Deep Nets, Unsupervised and Semi-Supervised Learning

Supervised methods have reached the highest performance in core areas of computer vision such as object detection, but their use is limited by the amount of training data that can be effectively annotated. The goal of this research is to reduce or even eliminate the amount of human supervision these methods need, transforming them into weakly supervised or even unsupervised systems.




Multi-modal deep learning architectures for video content understanding

Swathikiran Sudhakaran

computer vision, deep learning, activity recognition, multimedia content understanding

Deep learning architectures for computer vision typically require huge amounts of manually annotated data to be applied effectively to a given task. At the same time, an enormous and ever-growing number of images and videos with additional information such as tags, captions, and comments is available on the internet. Adopting a multi-modal approach that integrates this textual data with the visual information present in the images can enable machines to learn a richer representation of the visual content. These metadata can also provide a form of weak supervision for feature learning. In this way, the most important limitation faced by most existing deep learning architectures for vision, the requirement for huge quantities of labeled data, can be addressed. The objective of this PhD study is to develop deep learning architectures for visual understanding that leverage the textual data available alongside images and videos.
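To make the weak-supervision idea concrete, one common recipe (not necessarily the one used in this project) is to turn an item's tags into a multi-hot target over a tag vocabulary and train the visual model with a multi-label loss. Everything below, including the vocabulary and the example scores, is a toy assumption.

```python
import numpy as np

# Hypothetical tag vocabulary; in practice it would be mined from
# the metadata accompanying web images and videos.
VOCAB = ["dog", "beach", "sunset", "car", "person"]

def tags_to_target(tags):
    """Turn free-form tags into a multi-hot weak-supervision target."""
    return np.array([1.0 if w in tags else 0.0 for w in VOCAB])

def bce_loss(logits, target):
    """Multi-label binary cross-entropy against the weak tag labels."""
    p = 1.0 / (1.0 + np.exp(-logits))  # per-tag probabilities
    eps = 1e-9
    return -np.mean(target * np.log(p + eps)
                    + (1 - target) * np.log(1 - p + eps))

target = tags_to_target({"dog", "beach"})
logits = np.array([3.0, 2.5, -2.0, -3.0, -1.0])  # model scores per tag
loss = bce_loss(logits, target)
```

The tags are noisy labels, so this loss gives only a weak training signal, but it requires no manual annotation at all.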




Recurrent Face Alignment

Wei Wang

Face Alignment

The mainstream direction in face alignment is now dominated by cascaded regression methods. These methods start from an image with an initial shape and build a set of shape increments by computing features with respect to the current shape estimate. These shape increments move the initial shape to the desired location. Despite their advantages, cascaded methods share two major limitations: (i) shape increments are learned separately from each other in a cascaded manner, and (ii) the use of standard generic computer vision features such as SIFT and HOG does not allow these methods to learn problem-specific features. In this work, we propose a novel Recurrent Convolutional Face Alignment method that overcomes these limitations. We frame the standard cascaded alignment problem as a recurrent process and learn all shape increments jointly, using a recurrent neural network with gated recurrent units. Importantly, by combining a convolutional neural network with a recurrent one, we avoid the hand-crafted features widely adopted in the literature, allowing the model to learn task-specific features. Moreover, the convolutional and recurrent neural networks are learned jointly. Experimental evaluation shows that the proposed method outperforms the state-of-the-art methods, and further supports the importance of learning a single end-to-end model for face alignment.
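The recurrent formulation can be sketched as follows: a single GRU-style cell with shared weights consumes features extracted at the current shape estimate and emits a shape increment at each step. This is a minimal numpy illustration of the idea, not the paper's architecture; in the actual method the features come from a jointly trained convolutional network and all weights are learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RecurrentAligner:
    """One GRU-style cell, reused at every refinement step, mapping
    (features at current shape, hidden state) -> shape increment.
    Weights are random here; the real model learns them end-to-end."""
    def __init__(self, feat_dim, hid_dim, n_points, seed=0):
        rng = np.random.default_rng(seed)
        d = feat_dim + hid_dim
        self.hid_dim = hid_dim
        self.Wz = rng.normal(scale=0.1, size=(d, hid_dim))  # update gate
        self.Wr = rng.normal(scale=0.1, size=(d, hid_dim))  # reset gate
        self.Wh = rng.normal(scale=0.1, size=(d, hid_dim))  # candidate state
        self.Wo = rng.normal(scale=0.1, size=(hid_dim, 2 * n_points))

    def step(self, feats, h):
        x = np.concatenate([feats, h])
        z = sigmoid(x @ self.Wz)  # how much to update the state
        r = sigmoid(x @ self.Wr)  # how much past state to expose
        h_cand = np.tanh(np.concatenate([feats, r * h]) @ self.Wh)
        h_new = (1 - z) * h + z * h_cand
        return h_new, h_new @ self.Wo  # new state, shape increment

def align(model, extract_feats, shape0, n_steps=4):
    """Apply the same cell repeatedly: all shape increments come from
    one recurrent model rather than from separate cascade stages."""
    h = np.zeros(model.hid_dim)
    shape = shape0.copy()
    for _ in range(n_steps):
        h, delta = model.step(extract_feats(shape), h)
        shape = shape + delta.reshape(shape.shape)
    return shape

# Toy usage: 5 landmarks, stand-in feature extractor.
model = RecurrentAligner(feat_dim=10, hid_dim=8, n_points=5)
extract = lambda s: np.tanh(s.ravel())  # placeholder for CNN features
final_shape = align(model, extract, np.zeros((5, 2)))
```

Because the same cell is unrolled over steps, training it jointly corresponds to learning all cascade stages at once, which is the key difference from classical cascaded regression.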




Deep representation learning for image and video understanding

Dan Xu

Visual detection and reconstruction; CNN; graph model

I am currently a third-year Ph.D. student in the Department of Information Engineering and Computer Science at the University of Trento, and a member of the Multimedia and Human Understanding Group (MHUG) under the supervision of Prof. Nicu Sebe. Previously, I was a research assistant in the Department of Mechanical and Automation Engineering at The Chinese University of Hong Kong. My research focuses on computer vision, multimedia, and machine learning. Specifically, I am interested in deep learning and its applications to a variety of topics such as video activity analysis, visual detection, reconstruction, and segmentation.