Summary
Research interests
Education / Qualifications
Professional memberships
Employment history
Athena Research Center
Inria Rennes-Bretagne Atlantique
National and Kapodistrian University of Athens (NKUA)
National Technical University of Athens (NTUA)
National Technical University of Athens (NTUA)
Teaching / training activities
National and Kapodistrian University of Athens
National and Kapodistrian University of Athens, Greece
National and Kapodistrian University of Athens
In this two-day seminar I focused on parts of the Deep Learning for Vision course, in particular visual representation, learning, convolution, differentiation, optimization, and network architectures.
University of Rennes 1, University of Southern Brittany (UBS), ENS Rennes, National Institute of Applied Sciences (INSA), CentraleSupélec Rennes, France
I am responsible for the course. This course studies learning visual representations for common computer vision tasks including matching, retrieval, classification, and object detection. The course discusses well-known methods from low-level description to intermediate representation, and their dependence on the end task. It then studies a data-driven approach where the entire pipeline is optimized jointly in a supervised fashion, according to a task-dependent objective. Deep learning models are studied in detail and interpreted in connection to conventional models. The focus of the course is on recent, state-of-the-art methods and large-scale applications. The course syllabus includes visual representation, local features and spatial matching, codebooks and kernels, learning, differentiation, convolution, optimization, network architectures, object detection, and retrieval.
National Technical University of Athens, Greece
I assist in lectures and conduct a weekly laboratory, using Matlab. The laboratory is graded independently and counts for 50% of the course. The syllabus includes image sampling and quantization, two-dimensional transforms, image filtering, edge detection, enhancement and restoration, image and video coding and compression, and JPEG and MPEG standards. The laboratory includes additional topics, in particular Hough transform, corner and local feature detection, descriptor extraction, image retrieval, template matching and motion analysis.
National Technical University of Athens, Greece
I assist in lectures and prepare exercise material. The course syllabus includes signal and system properties, convolution, correlation, sampling, quantization, Fourier series, discrete and continuous time Fourier transforms, Laplace and Z transforms, time and frequency analysis of linear, time-invariant systems, stability, and discrete Fourier transform.
National Technical University of Athens, Greece
National Technical University of Athens, Greece
Hellanion University of Portsmouth, Athens, Greece
Hellanion University of Portsmouth, Athens, Greece
Awards
Invited talks
AriadNext, Rennes, France
In this talk we focus on metric learning and instance-level image retrieval for large-scale visual localization.
In the first part of the talk, we introduce a new "asymmetric testing" task, where an image database is represented by a large network, while queries are captured by mobile devices and represented by a lightweight network. Rather than re-indexing the database, it is preferable to adapt different smaller models to different end-user devices. Inspired by this task, we introduce "asymmetric metric learning", a novel paradigm of using asymmetric representations at training, in a teacher-student setup.
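As a rough illustration of this asymmetric training idea, the sketch below (in PyTorch) trains a lightweight student encoder against a frozen large teacher so that student query embeddings stay compatible with precomputed teacher database embeddings; the encoder names and the regression-style cosine objective are illustrative assumptions, not the exact loss of the paper.

    import torch
    import torch.nn.functional as F

    def asymmetric_embedding_loss(student, teacher, images):
        """Student (lightweight, trainable) embeds queries; teacher (large, frozen)
        embeds the database. The student is trained so its embeddings match the
        teacher's, so the database never needs re-indexing."""
        with torch.no_grad():
            t = F.normalize(teacher(images), dim=1)   # teacher (database-side) embeddings
        s = F.normalize(student(images), dim=1)       # student (query-side) embeddings
        return (1.0 - (s * t).sum(dim=1)).mean()      # cosine regression toward the teacher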
In the second part, we aim to bridge the gap between metric learning and classification in terms of data augmentation. We improve representations using mixup, which is a powerful data augmentation approach interpolating two or more examples and corresponding target labels at a time. We develop a generalized formulation that encompasses existing metric learning loss functions and modify it to accommodate mixup, introducing Metric Mix, or Metrix.
In the third part, we study different forms of attention according to the interaction of elements of the feature tensor (local and global) and the dimensions where it is applied (spatial and channel). We present the global-local attention module (GLAM), which is attached at the end of a backbone network and incorporates all four forms of attention: local and global, spatial and channel.
ICCV 2021, Virtual
In this talk we focus on metric learning and instance-level image retrieval for large-scale visual localization.
In the first part of the talk, we introduce a new "asymmetric testing" task, where an image database is represented by a large network, while queries are captured by mobile devices and represented by a lightweight network. Rather than re-indexing the database, it is preferable to adapt different smaller models to different end-user devices. Inspired by this task, we introduce "asymmetric metric learning", a novel paradigm of using asymmetric representations at training, in a teacher-student setup.
In the second part, we aim to bridge the gap between metric learning and classification in terms of data augmentation. We improve representations using mixup, which is a powerful data augmentation approach interpolating two or more examples and corresponding target labels at a time. We develop a generalized formulation that encompasses existing metric learning loss functions and modify it to accommodate mixup, introducing Metric Mix, or Metrix.
In the third part, we study different forms of attention according to the interaction of elements of the feature tensor (local and global) and the dimensions where it is applied (spatial and channel). We present the global-local attention module (GLAM), which is attached at the end of a backbone network and incorporates all four forms of attention: local and global, spatial and channel.
Inria Rennes-Bretagne Atlantique, France
In this talk we discuss graph-based methods for image retrieval on manifolds, unsupervised representation learning and semi-supervised learning. We begin with a classic method for ranking data on manifolds that is adapted to descriptors of overlapping image regions for image retrieval, implemented by a sparse linear system solver. We show that this is equivalent to linear graph filtering, smoothing in particular, of a sparse signal in the frequency domain. We then apply this methodology for unsupervised network fine-tuning for retrieval, where positive and negative examples are found by disagreements between Euclidean and manifold similarities. Finally, we revisit classic graph-based transductive methods for semi-supervised learning and introduce an inductive framework, using label propagation to train a deep neural network.
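As a rough sketch of this ranking-as-a-sparse-linear-system view (assuming a precomputed kNN adjacency matrix W and a query indicator vector y; a simplification, not the exact implementation):

    import numpy as np
    from scipy.sparse import csr_matrix, identity, diags
    from scipy.sparse.linalg import spsolve

    def manifold_ranking(W, y, alpha=0.99):
        """Rank dataset items with respect to a query by solving the sparse linear
        system (I - alpha * S) x = y, where S is the symmetrically normalized kNN
        adjacency and y is a query indicator vector."""
        W = csr_matrix(W)
        d = np.asarray(W.sum(axis=1)).ravel()
        D = diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        S = D @ W @ D                                          # normalized adjacency
        x = spsolve(identity(W.shape[0], format="csr") - alpha * S, y)
        return np.argsort(-x)                                  # items by decreasing manifold similarity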
In this talk we discuss methods for visual representation learning from raw input data targeting image retrieval, that is, ranking according to visual similarity. The focus is on methods that do not require human annotation for supervision. We begin with background work on feature pooling from convolutional neural network activations, metric learning and algorithmic supervision. We then discuss a number of recent methods for efficient ranking on manifolds and their use for representation learning. These methods are based on a nearest neighbor graph of a given dataset and can be cast as linear graph filtering. Efficient solutions are found in the time or frequency domain, using simple ideas from numerical linear algebra. When applied to representation learning, these methods allow for improving the representation by just observing data.
Visual instance recognition has undergone a spectacular improvement by fine-tuning convolutional networks on properly designed image matching tasks. However, retrieving small objects is a common failure case that requires representing an image with several regions rather than a global descriptor. In this work, we discuss a principled query expansion mechanism on descriptors of overlapping image regions, based on a nearest neighbor graph constructed offline. We introduce a new way of handling unseen queries online, without adjusting the precomputed data. We rank images through a sparse linear system solver, yielding practical query times well below one second.
While this result is encouraging, it also shows that the learned representations still lie on manifolds in a high dimensional space, and that exploring the manifolds online remains expensive. We therefore introduce an explicit embedding, reducing manifold search to Euclidean search followed by dot product similarity search. We show this is equivalent to linear graph filtering of a sparse signal in the frequency domain, and we introduce a scalable offline computation of an approximate Fourier basis of the graph. This reproduces or improves the results of online query expansion, at query times comparable to standard similarity search.
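A minimal sketch of this frequency-domain view, under simplified assumptions: the top eigenvectors of the normalized adjacency play the role of an approximate Fourier basis, and a low-pass transfer function turns manifold search into plain dot products between embedded vertices.

    import numpy as np
    from scipy.sparse import csr_matrix, diags
    from scipy.sparse.linalg import eigsh

    def spectral_embedding(W, rank=100, alpha=0.99):
        """Embed graph vertices using the top eigenvectors of the normalized
        adjacency S (an approximate graph Fourier basis), so that the low-pass
        filter (I - alpha * S)^(-1) is approximated by dot products."""
        W = csr_matrix(W)
        d = np.asarray(W.sum(axis=1)).ravel()
        D = diags(1.0 / np.sqrt(np.maximum(d, 1e-12)))
        S = D @ W @ D
        lam, U = eigsh(S, k=rank, which="LA")      # largest eigenvalues / eigenvectors
        h = 1.0 / (1.0 - alpha * lam)              # low-pass transfer function
        return U * np.sqrt(h)                      # row i: embedding of vertex i

    # manifold similarity between vertices i and j is then approximately emb[i] @ emb[j]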
Center for Machine Perception (CMP)
Czech Technical University (CVUT), Prague, Czech Republic
In this talk we discuss the role of approximate nearest neighbor search in large-scale clustering, and applications in vision. We begin with a number of binary codes and product quantization extensions, highlighting the relation to nonlinear dimensionality reduction. We touch upon the deep connection between the two problems, involving distance maps in arbitrary dimensions. We then revisit recent advances in approximate k-means variants, and present a new one (IQ-means) that borrows their best ingredients. Combined with powerful deep learned representations, this method achieves clustering of a 100 million image collection on a single machine in less than one hour, while dynamically determining the number of clusters.
Code for IQ-means is available on GitHub.
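For illustration only, here is a generic sketch of the underlying idea (not the released IQ-means code): Lloyd's k-means where the exhaustive E-step is replaced by a nearest-centroid search against an index; swapping the exact KD-tree below for an approximate index is what makes such variants fast at web scale.

    import numpy as np
    from scipy.spatial import cKDTree

    def fast_kmeans(X, k, iters=10, seed=0):
        """Lloyd's k-means where the E-step is a nearest-centroid search against an
        index over the centroids rather than an exhaustive distance computation."""
        rng = np.random.default_rng(seed)
        C = X[rng.choice(len(X), size=k, replace=False)].copy()
        assign = np.zeros(len(X), dtype=int)
        for _ in range(iters):
            assign = cKDTree(C).query(X, k=1)[1]      # E-step: index-based assignment
            for j in range(k):                        # M-step: recompute centroids
                members = X[assign == j]
                if len(members):
                    C[j] = members.mean(axis=0)
        return C, assign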
Foundation for Research and Technology-Hellas (FORTH), Heraklion, Greece
In this talk we present a panoramic view of the geometry underpinning a number of vision problems, ranging from early vision to unsupervised mining in large image collections and beyond. Interplaying between continuous and discrete representations, geometry appears in different forms of duality, embeddings, and manifolds. We begin with planar shape decomposition as studied in psychophysics to model either occlusion or parts of recognition. Focusing on distance maps and the medial axis representation, we then generalize to natural images towards perceptual edge grouping and equivariant local feature detection.
Adopting sets of local features and descriptors as an image representation, we then shift to visual instance search and recognition. We discuss a form of flexible spatial matching as mode seeking in the transformation space, a number of embeddings and match kernels in the descriptor space, and feature selection or aggregation in both. Acknowledging that the problem often boils down to nearest neighbor search in high-dimensional spaces, we consider a number of binary codes and product quantization extensions, highlighting the relation to nonlinear dimensionality reduction. Finally, we touch upon the deep connection between nearest neighbor search and clustering. In doing so, we revisit distance maps and medial representations, now in arbitrary dimensions.
University of Pennsylvania, US
In this talk we present a panoramic view of the geometry underpinning a number of vision problems, ranging from early vision to unsupervised mining in large image collections and beyond. Interplaying between continuous and discrete representations, geometry appears in different forms of duality, embeddings, and manifolds. We begin with planar shape decomposition as studied in psychophysics to model either occlusion or parts of recognition. Focusing on distance maps and the medial axis representation, we then generalize to natural images towards perceptual edge grouping and equivariant local feature detection.
Adopting sets of local features and descriptors as an image representation, we then shift to visual instance search and recognition. We discuss a form of flexible spatial matching as mode seeking in the transformation space, a number of embeddings and match kernels in the descriptor space, and feature selection or aggregation in both. Acknowledging that the problem often boils down to nearest neighbor search in high-dimensional spaces, we consider a number of binary codes and product quantization extensions, highlighting the relation to nonlinear dimensionality reduction. Finally, we touch upon the deep connection between nearest neighbor search and clustering. In doing so, we revisit distance maps and medial representations, now in arbitrary dimensions.
National and Kapodistrian University of Athens (NKUA), Greece
Hashing is a popular solution to approximate nearest neighbor search, and appears in two variants: indexing data items in hash tables, or representing items by short binary codes and using these compact representations to approximate distances. We focus on the second approach, and more specifically on methods that learn codes from the data distribution.
We then present methods based on vector quantization, which are a natural generalization. In particular, we discuss exhaustive and non-exhaustive variants of product quantization including recent optimizations, as well as additive quantization. Finally, we explore the opposite direction, that of using nearest neighbor search to speed up vector quantization itself.
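A minimal product quantization sketch under simplified assumptions (m sub-spaces, 256 centroids each, asymmetric distances via lookup tables); the dimension is assumed divisible by m and the training details are illustrative.

    import numpy as np
    from scipy.cluster.vq import kmeans2

    def pq_train(X, m=8, ksub=256, seed=0):
        """Split d-dimensional vectors into m sub-vectors and learn a small
        codebook (ksub centroids) independently in each sub-space."""
        d = X.shape[1] // m
        return [kmeans2(X[:, i*d:(i+1)*d].astype(np.float64), ksub, minit="++", seed=seed)[0]
                for i in range(m)]

    def pq_encode(X, codebooks):
        """Encode each vector as m one-byte centroid indices."""
        d = X.shape[1] // len(codebooks)
        codes = [np.argmin(((X[:, i*d:(i+1)*d, None] - cb.T[None]) ** 2).sum(axis=1), axis=1)
                 for i, cb in enumerate(codebooks)]
        return np.stack(codes, axis=1).astype(np.uint8)

    def pq_search(q, codes, codebooks):
        """Asymmetric distance computation: compare the uncompressed query against
        compressed codes through per-sub-space lookup tables, then rank the database."""
        d = len(q) // len(codebooks)
        tables = [((cb - q[i*d:(i+1)*d]) ** 2).sum(axis=1) for i, cb in enumerate(codebooks)]
        dist = sum(t[codes[:, i]] for i, t in enumerate(tables))
        return np.argsort(dist)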
TEXMEX Research Team
Inria Rennes-Bretagne Atlantique, France
The first part of this talk considers a family of metrics to compare images based on their local descriptors. It encompasses the VLAD descriptor and matching techniques such as Hamming embedding. Bridging these approaches yields a match kernel that takes the best of existing techniques, combining an aggregation procedure with a selective match kernel.
Since image search using either local or global descriptors boils down to approximate nearest neighbor search, the second part of this talk considers this problem, focusing on vector quantization methods. A recent method is presented whereby residuals over a coarse quantizer are used to locally optimize an individual product quantizer per cell. Non-exhaustive search strategies are discussed, including an inverted multi-index.
University of Bordeaux, France
Methods based on local features have been very successful in visual search, especially when the objective is to identify near-identical objects or scenes under occlusion and varying viewpoint or lighting conditions. After a brief introduction to such methods, including bag-of-words models, sub-linear indexing and spatial matching, this talk focuses on recent research results related to local feature detection and the role of geometry, as well as a number of applications.
In particular, we present methods based on image gradient and distance maps that are able to detect blob-like regions of arbitrary scale and shape, and their application to image matching. We then investigate the potential of embedding the spatial matching process within the index, so that it becomes sub-linear as well. We also report on an accelerated spatial matching method for re-ranking, that allows flexible matching of multiple surfaces.
We then move to the more difficult problem of organizing large photo collections, and examine the use of sub-linear indexing in a mining process. Photos are automatically grouped wherever they depict the same scene; this structure is then exploited to increase the recall of the retrieval process. Working on community collections of geo-tagged photos depicting urban scenery, this approach is applied to automatic location and landmark recognition from a single photo. We also present our online application, VIRaL. Finally, we present our C++ template library ivl, which is used as infrastructure in our implementations.
Ph.D. students / co-supervision
Representations lie at the heart of artificial intelligence, enabling machines to perceive, interpret and interact with the world. Visual representations, extracted from images or videos, enable tasks such as image classification, image retrieval, and object detection. Visual-textual representations, bridging the gap between the visual and linguistic domains, enable tasks like image captioning, visual question answering, and cross-modal retrieval. The ability to learn and manipulate these representations is paramount for advancing the state-of-the-art in computer vision and beyond. In this dissertation, we investigate novel methods for learning both visual (unimodal) and visual-textual (multimodal) representations, focusing mainly on applications in deep metric learning, image classification, and composed image retrieval. We address the challenges of learning representations from both data-centric and model-centric perspectives, aiming to unlock new capabilities for visual understanding and interaction.
In visual representation learning, we first focus on data and introduce Metrix, a deep metric learning method utilizing mixup for data augmentation. Metrix addresses the challenge of interpolating both examples and target labels, overcoming the non-additive nature of traditional metric learning loss functions. By generalizing existing loss functions to incorporate mixup, Metrix enhances learning and explores new embedding space regions. We introduce a novel metric, utilization, to measure this exploration. Experiments on four benchmark datasets, including various mixup settings, show that Metrix significantly outperforms state-of-the-art methods, improving robustness and generalization. This work exemplifies our aim to advance visual representation learning through innovative data augmentation.
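As a rough, simplified illustration of mixing both examples and targets in a metric learning objective (the actual Metrix formulation generalizes several loss functions; the binary-cross-entropy form and the temperature below are assumptions made for the sketch):

    import torch
    import torch.nn.functional as F

    def mixup_metric_loss(anchors, positives, negatives, lam, tau=0.1):
        """Mix a positive with a negative example in the embedding space and use the
        mixing coefficient lam as an interpolated (soft) target for the anchor's
        similarity to the mixed example."""
        mixed = F.normalize(lam * positives + (1 - lam) * negatives, dim=1)
        a = F.normalize(anchors, dim=1)
        logits = (a * mixed).sum(dim=1) / tau          # scaled cosine similarity
        target = torch.full_like(logits, lam)          # lam "positive-ness" as soft label
        return F.binary_cross_entropy_with_logits(logits, target)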
Next, we shift our focus to the model architecture, introducing SimPool, a simple attention-based pooling method at the end of the network, designed to replace the default one in both convolutional neural networks (CNNs) and vision transformers (ViTs). We develop a generic pooling framework and formulate existing pooling methods as its instantiations, allowing us to analyze, compare and discuss their properties. Through this, we finally derive SimPool, which improves performance in supervised and self-supervised settings on standard benchmarks and downstream tasks. SimPool generates high-quality attention maps that accurately delineate object boundaries, significantly enhancing object localization and robustness to background changes. It improves object discovery metrics and performs efficiently, even when removing ViT blocks, thus optimizing the balance between performance and model complexity. This work exemplifies our aim to advance visual representation learning through an innovative model architecture component.
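A minimal sketch of simple attention-based pooling in the spirit of what is described above, where the global average vector queries the spatial features and the attention-weighted sum replaces default pooling; the projections and scaling are illustrative assumptions, not the exact SimPool design.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionPool(nn.Module):
        """Pool a token/feature map by attending from its global average vector."""
        def __init__(self, dim):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)

        def forward(self, x):                                    # x: (B, N, D) tokens
            q = self.q(x.mean(dim=1, keepdim=True))              # (B, 1, D) query from GAP
            k = self.k(x)                                        # (B, N, D) keys
            attn = F.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)  # (B, 1, N)
            return (attn @ x).squeeze(1)                         # (B, D) pooled representation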
Transitioning to visual-textual representations, we introduce FreeDom, a training-free method for zero-shot composed image retrieval in open-world domain conversion. FreeDom leverages the descriptive power of a frozen vision-language model (VLM) and employs textual inversion, enabling flexible image and text query composition. Unlike traditional methods that invert query images to the continuous latent space of tokens, FreeDom’s inversion into the discrete input space of text is pivotal for its success. Experiments on four benchmark domain conversion datasets, including three newly introduced by us, demonstrate its superior performance. Additionally, FreeDom performs on par with the best methods in generic composed image retrieval. This work exemplifies our aim to advance multimodal representation learning through innovative discrete-space textual inversion.
Expanding on visual-textual representations, we now focus on their applications in remote sensing to introduce a novel task: remote sensing composed image retrieval (RSCIR). This task aims to provide a more expressive and flexible search capability within the remote sensing domain. We explore and qualitatively evaluate the unique challenges and capabilities this task introduces. Users can now pair a query image with a query text specifying modifications related to color, shape, size, texture, density, context, quantity, or the presence of certain classes. To quantitatively assess this, we establish a benchmark, PatternCom, and an evaluation protocol focusing on shape, color, density, and quantity modifications. Our method, WeiCom, operates training-free by utilizing a frozen vision-language model and incorporates a modality control parameter for generating more image- or text-oriented results based on the specific search needs. This work exemplifies our aim to advance multimodal representation learning by introducing a flexible method that showcases the potential of this novel task in a new domain.
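The modality control idea can be sketched as a convex combination of image-based and text-based similarity scores; the weighting form below is a simplification, with lam standing in for the control parameter.

    import numpy as np

    def composed_scores(sim_image, sim_text, lam=0.5):
        """Combine image-to-image and text-to-image similarity scores with a
        modality control parameter lam in [0, 1]: lam close to 0 favors visually
        similar results, lam close to 1 favors results matching the text
        modification."""
        return (1 - lam) * sim_image + lam * sim_text

    # example ranking of database embeddings db_emb against a composed query
    # scores = composed_scores(db_emb @ image_emb, db_emb @ text_emb, lam=0.7)
    # ranking = np.argsort(-scores)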
Computer vision capabilities have improved considerably over the past decade. Better utilization of hardware enabled computers to process more images faster, ushering in the dawn of deep learning. Moreover, over this timespan, model architectures such as convolutional neural networks and transformers have been introduced, enabling computer vision applications to conduct more complex tasks. In particular, image recognition models are now capable of identifying and recognizing elements in an image, even under challenging conditions. These factors have contributed to the introduction of these models into society.
With the permeation of deep learning technologies within society, a new requirement has emerged for these methodologies. Since they now interact with and directly affect human lives, it is mandatory to understand how they function and to provide explanations. To address these questions, a new research field has emerged: interpretability and explainable AI.
In this thesis, our goal is to understand and further develop interpretability models for state-of-the-art image recognition models. We introduce and briefly explain some of the most relevant high-performance image recognition models, covering both convolutional neural networks and transformers. We then review current interpretability approaches designed to provide explanations, as well as their evaluation protocols. We make observations on these methods and evaluation protocols, highlighting their difficulties and suggesting ideas to address their limitations. In the following chapters we present our contributions.
Opti-CAM. Our first contribution builds upon the reasoning of class activation mappings. In particular, this proposal optimizes the weighting coefficients required to compute a saliency map, generating a representation that maximizes the class-specific probability. This saliency map performs best across interpretability metrics on multiple datasets. Moreover, it highlights that context is relevant to describing a prediction. Additionally, we introduce a novel metric to complement interpretability evaluation, addressing shortcomings in the existing procedure.
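A simplified sketch of the underlying idea, optimizing the weights that combine feature maps into a saliency map so that the masked image maximizes the class score; the softmax/sigmoid normalization and masking choices here are assumptions, not the exact Opti-CAM objective.

    import torch
    import torch.nn.functional as F

    def optimize_saliency(model, feature_maps, image, target, steps=100, lr=0.1):
        """Optimize per-channel weights that combine feature maps into a saliency
        map, so that the image masked by that map maximizes the target class score.
        feature_maps: (1, C, h, w) activations; image: (1, 3, H, W)."""
        w = torch.zeros(feature_maps.shape[1], requires_grad=True)
        opt = torch.optim.Adam([w], lr=lr)
        for _ in range(steps):
            weights = F.softmax(w, dim=0)[None, :, None, None]
            sal = torch.sigmoid((weights * feature_maps).sum(dim=1, keepdim=True))
            sal = F.interpolate(sal, size=image.shape[-2:], mode="bilinear",
                                align_corners=False)
            loss = -model(image * sal)[:, target].mean()   # maximize the class score
            opt.zero_grad()
            loss.backward()
            opt.step()
        return sal.detach()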
Cross Attention Stream. Our second contribution is an addition to current image recognition models that enhances interpretability measurements. Inspired by novel high-performing models such as transformers, we construct a stream that computes the interaction of an abstract class representation with the deep features of convolutional neural networks. This representation is ultimately used to perform classification. Our stream shows improvements in quantitative evaluation and preserves recognition performance across different models.
Gradient Denoising. Our final contribution presents a novel training paradigm for deep neural networks, which denoises the gradient information of deep models in the input space. The guided backpropagation representation of the input image is used to regularize models during the training phase. As a result, our trained models show improvements in interpretability evaluation. We apply our paradigm to small architectures in a constrained setting, paving the way for future development on large-scale datasets and more complex models.
The primary goal in computer vision is to enable machines to extract meaningful information from visual data, such as images and videos, and leverage this information to perform a wide range of tasks. To this end, substantial research has focused on developing deep learning models capable of encoding comprehensive and robust visual representations. A prominent strategy in this context involves pretraining models on large-scale datasets, such as ImageNet, to learn representations that can exhibit cross-task applicability and facilitate the successful handling of diverse downstream tasks with minimal effort.
To facilitate learning on these large-scale datasets and encode good representations, complex data augmentation strategies have been used. However, these augmentations can be limited in their scope, either being hand-crafted and lacking diversity, or generating images that appear unnatural. Moreover, the focus of these augmentation techniques has primarily been on the ImageNet dataset and its downstream tasks, limiting their applicability to a broader range of computer vision problems.
In this thesis, we aim to tackle these limitations by exploring different approaches to enhance the efficiency and effectiveness in representation learning. The common thread across the works presented is the use of interpolation-based techniques, such as mixup, to generate diverse and informative training examples beyond the original dataset.
In the first work, we are motivated by the idea of deformation as a natural way of interpolating images rather than using a convex combination. We show that geometrically aligning the two images in the feature space allows for more natural interpolation that retains the geometry of one image and the texture of the other, connecting it to style transfer.
Drawing from these observations, we explore the combination of mixup and deep metric learning. We develop a generalized formulation that accommodates mixup in metric learning, leading to improved representations that explore areas of the embedding space beyond the training classes.
Building on these insights, we revisit the original motivation of mixup and generate a larger number of interpolated examples beyond the mini-batch size by interpolating in the embedding space. This approach allows us to sample on the entire convex hull of the mini-batch, rather than just along linear segments between pairs of examples.
Finally, we investigate the potential of using natural augmentations of objects from videos. We introduce a "Walking Tours" dataset of first-person egocentric videos, which capture a diverse range of objects and actions in natural scene transitions. We then propose a novel self-supervised pretraining method called DoRA, which detects and tracks objects in video frames, deriving multiple views from the tracks and using them in a self-supervised manner.
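A small sketch of interpolating in the embedding space over the whole mini-batch rather than along pairs: Dirichlet-distributed weights place synthetic examples anywhere in the convex hull of the batch, with labels mixed by the same weights (an illustration of the idea, not the exact method).

    import torch

    def convex_hull_mixup(embeddings, onehot_labels, n_samples, alpha=1.0):
        """Draw Dirichlet mixing weights over all items of a mini-batch to sample
        synthetic points in the convex hull of the batch embeddings, mixing the
        one-hot labels with the same weights."""
        b = embeddings.shape[0]
        w = torch.distributions.Dirichlet(torch.full((b,), alpha)).sample((n_samples,))
        return w @ embeddings, w @ onehot_labels.float()   # (S, D) points, (S, C) soft labels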
Video content has significantly increased in volume and diversity in the digital era, and this expansion has highlighted the necessity for advanced video understanding technologies that transform vast volumes of unstructured data into practical insights by learning from data. Driven by this necessity, this thesis explores semantically understanding videos, leveraging multiple perceptual modes similar to human cognitive processes and efficient learning with limited supervision similar to human learning capabilities. Multimodal semantic video understanding synthesizes visual, audio, and textual data to analyze and interpret video content, facilitating comprehension of underlying semantics and context.
This thesis specifically focuses on video question answering as one of the main video understanding tasks. Our first contribution addresses long-range video question answering, which involves answering questions about long videos, such as TV show episodes. These questions require an understanding of extended video content. While recent approaches rely on human-generated external sources, we propose processing the raw data to generate video summaries.
Our following contribution explores zero-shot and few-shot video question answering, aiming to enhance efficient learning from limited data. We leverage the knowledge of existing large-scale models while addressing the challenges of adapting pre-trained models to limited data, such as overfitting, catastrophic forgetting, and bridging the cross-modal gap between vision and language. We introduce a parameter-efficient method that combines multimodal prompt learning with a transformer-based mapping network, while keeping the pre-trained vision and language models frozen. We demonstrate that these contributions significantly enhance the capabilities of multimodal video question answering systems, especially where human-annotated labeled data is limited or unavailable.
Deep neural networks can be used to create highly accurate image classification models. One prerequisite is to have access to large-scale datasets for training. In the context of few-shot learning, the training set is limited to few images, so training from scratch is not feasible. Instead, a first training stage leverages a distinct set of abundant data to learn generic knowledge that can be transferred to few-shot tasks. A large part of the literature focuses on meta-learning, which consists in training strategies that apply to small sets of images.
In this thesis, we rather follow a simpler approach. First, a task-independent representation function is learned on abundant data by solving a distinct task such as multi-class classification on a set of base classes. Then, the learned representation is combined with new data of novel classes to solve the few-shot task. In both stages, we introduce solutions that aim at leveraging available data as much as possible.
In particular, for representation learning, we propose dense classification training, which for the first time studies local activations in the domain of few-shot learning. We also adapt the representation function to render it task-dependent, with a second learning phase using the few-shot set. The risk of overfitting during adaptation makes it a challenging task, which is not frequently performed outside of meta-learning. We propose two solutions for adaptation. By implanting, learning is limited to a few parameters; alternatively, by few-step adaptation, learning is limited to a few gradient updates.
Additionally, we study alternative few-shot learning settings, in which access to data is modified. In transductive learning, multiple images need to be classified at the same time. In this context, we propose local propagation, a method that uses similarities between local representations of images to propagate class information, effectively leveraging the distribution of the extra unlabeled data. We also introduce few-shot few-shot learning, a new setting where little or no in-domain data is accessible for representation learning. In this context, we take advantage of a representation obtained from a classifier that is pre-trained on a large-scale dataset of a different domain, which can still be adapted to the domain if data is available.
In few-shot learning, because data is so scarce, we show that selecting relevant regions with an attention mechanism is important. In local propagation, this prevents propagation based on similarities of background regions. In few-shot few-shot learning, it helps compensate for the domain gap. We propose two simple solutions that successfully fulfill this role. Finally, we apply our knowledge of few-shot learning to the specific problem of classifying aerial images.
This thesis is about adversarial attacks and defenses in deep learning. We propose to improve the performance of adversarial attacks in terms of speed, distortion and invisibility.
We contribute a definition of invisibility in terms of smoothness, which we integrate into the optimization of adversarial example generation. We achieve smooth adversarial perturbations of lower distortion. To improve the efficiency of generating adversarial examples, we introduce an optimization algorithm called the Boundary Projection (BP) attack. BP first follows the gradient of the classification loss until it finds an adversarial solution. It then searches along the class boundary to minimize the distortion. BP succeeds in generating adversarial examples with low distortion very efficiently.
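A heavily simplified sketch of the two-stage idea, for a single-image batch: follow the loss gradient until the example becomes adversarial, then take small steps back toward the original image, keeping only those that remain adversarial; the actual BP attack walks along the class boundary with a more careful geometric update.

    import torch
    import torch.nn.functional as F

    def boundary_attack_sketch(model, x, label, steps_in=20, steps_out=30, step=0.01):
        """Stage 1: follow the classification-loss gradient until the prediction
        flips. Stage 2: move back toward the original image, accepting only steps
        that leave the example adversarial, thereby reducing distortion."""
        adv = x.clone()
        for _ in range(steps_in):                          # stage 1: reach the boundary
            adv.requires_grad_(True)
            loss = F.cross_entropy(model(adv), label)
            g, = torch.autograd.grad(loss, adv)
            adv = (adv + step * g.sign()).clamp(0, 1).detach()
            if model(adv).argmax(dim=1).item() != label.item():
                break
        for _ in range(steps_out):                         # stage 2: reduce distortion
            cand = (adv + step * (x - adv)).clamp(0, 1)
            if model(cand).argmax(dim=1).item() != label.item():
                adv = cand                                 # still adversarial: accept
        return adv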
We also study defenses against adversarial examples. We apply quantization on local patches of both images and intermediate layer features, using different kinds of codebooks that are either fixed or learned from training data. Experiments show that such patch replacement is efficient and robust against adversarial attacks, while it requires no network training.
Neural network representations have proved to be relevant for many computer vision tasks such as image classification, object detection, segmentation or instance-level image retrieval, but require a large number of labeled data. In this thesis, we propose solutions to extract the most information with the least supervision.
Focusing on the classification task first, we examine active learning in the context of deep learning and show that combining it with semi-supervised and unsupervised learning greatly boosts results. We then investigate the image retrieval task, and in particular we exploit the spatial localization information available "for free" in CNN feature maps. We first propose to represent an image by a collection of affine local features detected within activation maps, which are memory-efficient and robust enough to perform spatial matching. Then, extracting information from feature maps, we discover objects of interest in images of a dataset and gather their representations in a nearest neighbor graph. Using the centrality measure on the graph, we construct a saliency map per image, which focuses on repeating objects and allows us to compute a global representation, suppressing background clutter.
New applications that exploit the huge data volume in community photo collections are emerging every day, and visual image search is therefore becoming increasingly important. In this thesis we propose clustering- and nearest neighbor-based improvements for visual image search. Clustering is performed either in feature space or in image space, i.e. in high-dimensional vector spaces or metric spaces, respectively.
We first introduce a clustering method that combines the flexibility of Gaussian mixtures with the scaling properties needed to construct visual vocabularies for image retrieval. It is a variant of expectation-maximization that can converge rapidly while dynamically estimating the number of components. We employ approximate nearest neighbor search to speed-up the E-step and exploit its iterative nature to make search incremental, boosting both speed and precision. We achieve superior performance in large scale retrieval, being as fast as the best known approximate k-means algorithm.
We then present our locally optimized product quantization scheme, an approximate nearest neighbor search method that locally optimizes product quantizers per cell, after clustering the data in the original space. When combined with a multi-index, its performance is unprecedented and sets the new state of the art on a billion-scale dataset. At the same time, our approach enjoys query times in the order of a few milliseconds, and it becomes comparable in terms of speed even to hashing approaches.
We next focus on large community photo collections. Most applications for such collections focus on popular subsets, e.g. images containing landmarks or associated with Wikipedia articles. In this thesis we are concerned with the problem of accurately finding the location where a photo is taken without needing any metadata, that is, solely by its visual content. We also recognize landmarks where applicable, automatically linking them to Wikipedia. We show that the time is right for automating the geo-tagging process, and we show how this can work at large scale. In doing so, we exploit redundancy of content in popular locations, but unlike most existing solutions, we do not restrict to landmarks. In other words, we can compactly represent the visual content of all thousands of images depicting e.g. the Parthenon and still retrieve any single, isolated, non-landmark image like a house or graffiti on a wall.
Starting from an existing, geo-tagged dataset, we cluster images into sets of different views of the same scene. This is a very efficient, scalable, and fully automated mining process. We then align all views in a set to one reference image and construct a 2D scene map. Our indexing scheme operates directly on scene maps. We evaluate our solution on a challenging one million urban image dataset and provide public access to our service through our online application, VIRaL.
The thesis concludes with two chapters. The first is a summary of other approaches for visual search and applications, like geometry indexing, logo detection and clothing recognition, while the second presents conclusions and possible future directions.
Low-level image analysis offers an intermediate image representation that is used by high-level computer vision algorithms (e.g. object detection and recognition, image and video retrieval, image matching). Local features extracted as regions of interest, or spatio-temporal interest points extracted from videos, combined with local descriptors, as well as global descriptors, offer a compact representation of visual information. Despite the fact that many local feature detectors have been proposed recently, this field of research is still open to new methods, as new and more complex application fields are introduced. Recently, the interest of the computer vision community has focused on deep neural networks, based on results in image classification tasks.
We propose a new local feature detector, based on geometric constructions. In particular, we propose using α-shapes to describe the shape of a set of points sampled on an image. Given the point set, α-shapes describe image objects at different scales and with different levels of detail. For image sampling, we propose two different approaches: sampling on image edges and sampling using error diffusion. For sampling image edges, we propose a method that exploits the local affine shape in order to adapt sampling density, as well as a baseline method that uses fixed-density sampling. We also propose sampling using error diffusion on two different functions of image intensity. The first is based on first-order derivatives of image intensity (gradient strength), while the second is based on second-order derivatives (Hessian response).
We use different triangulations of the samples and different α-shapes, and propose the anisotropically weighted α-shapes that exploit the local shape of each simplex of the triangulation. For selecting regions of interest, we propose different importance measures for the connected components of α-shapes. We qualitatively and quantitatively evaluate the proposed local feature extraction algorithm, under all proposed variations for each algorithm step. Our detector extracts a relatively small number of features from image regions that correspond to highly repeatable object parts. Its performance exceeds the state-of-the-art in most cases.
We also propose an efficient method for describing video clips, using deep neural networks. We segment videos into shots, using a novel method that exploits a global "objectness" measure. For describing video frames, we exploit neural network feature maps, and then aggregate the responses to create a single descriptor for the video shot. We evaluate the proposed method on a surgical video retrieval experiment, where other methods based on local features are outperformed.
A wide range of properties and assumptions determine the most appropriate spatial matching model for an application, e.g. recognition, detection, registration, or large scale image retrieval. Most notably, these include discriminative power, geometric invariance, rigidity constraints, mapping constraints, assumptions made on the underlying features or descriptors and, of course, computational complexity.
We present a new approach to image indexing and retrieval, which integrates appearance with global image geometry in the indexing process, while enjoying robustness against viewpoint change, photometric variations, occlusion, and background clutter. We exploit the shape parameters of local features to estimate image alignment via a single correspondence. Then, for each feature, we construct a sparse spatial map of all remaining features, encoding their normalized position and appearance, typically vector quantized to a visual word. An image is represented by a collection of such feature maps, and RANSAC-like matching is reduced to a number of set intersections. We use min-wise independent permutations and derive a similarity measure for feature map collections. In addition to random selection, we have further exploited multiple view matching for feature selection. This allows us to scale geometry indexing up to 1M images. We then exploit sparseness to build an inverted file whereby the retrieval process is sub-linear in the total number of images, ideally linear in the number of relevant ones.
We further present a very simple model inspired by Hough voting in the transformation space, where votes arise from single feature correspondences. A relaxed matching process allows for multiple matching surfaces or non-rigid objects under one-to-one mapping, yet is linear in the number of correspondences. We apply it to geometry re-ranking in a search engine, yielding superior performance with the same space requirements but a dramatic speed-up compared to the state of the art.
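For illustration, a toy sketch of single-correspondence Hough voting: each tentative correspondence, through its local feature geometry, implies a 4-dof similarity transform, and votes are accumulated in a coarse histogram over translation, log-scale and rotation; the bin widths and the input format (precomputed transform parameters per correspondence) are assumptions.

    import numpy as np
    from collections import defaultdict

    def hough_votes(transforms, bin_widths=(16.0, 16.0, 0.5, np.pi / 8)):
        """transforms: iterable of (tx, ty, log_scale, rotation) tuples, one per
        tentative correspondence. Returns the indices of correspondences that fall
        in the strongest mode of the coarse transform-space histogram."""
        acc = defaultdict(list)
        for i, t in enumerate(transforms):
            key = tuple(int(round(v / w)) for v, w in zip(t, bin_widths))
            acc[key].append(i)
        return max(acc.values(), key=len)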
We further extend and use our relaxed spatial matching for self-matching and symmetry detection. We assume that features participating in symmetric and repeating structures have a higher probability of being matched between different views of the same object. Information from geometric self-matching and from matching the image with its mirrored counterpart is used for feature selection in single images.
In contrast to the previous methods that we discussed or proposed, which all use only visual word information to perform feature matching, we further exploit the Hamming Embedding (HE) technique, which additionally uses descriptor information. HE represents each feature with a visual word and a binary signature, which allows more precise feature matching. We develop a novel query expansion strategy that is aligned with the HE representation. We manage to improve performance even without geometric matching, in contrast to previous query expansion methods, while keeping query times low. We finally show that combining our scheme with geometric matching can further boost performance and outperform state-of-the-art methods.
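A small sketch of the Hamming Embedding matching rule mentioned above: two features match only if they fall in the same visual word and their binary signatures are within a Hamming threshold (signatures assumed packed as 64-bit integers; the threshold value is illustrative).

    import numpy as np

    def he_matches(q_words, q_sigs, db_words, db_sigs, ht=24):
        """Return (query_feature, db_feature) index pairs that share a visual word
        and whose binary signatures differ by at most ht bits."""
        matches = []
        for i, (w, s) in enumerate(zip(q_words, q_sigs)):
            candidates = np.where(db_words == w)[0]        # same visual word
            for j in candidates:
                if bin(int(s) ^ int(db_sigs[j])).count("1") <= ht:
                    matches.append((i, j))
        return matches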
The growth of production and demand for digital audiovisual content during the last few decades has been overwhelming. To fulfill the needs of its users, this multimedia content should be annotated, commented and classified into appropriate semantic classes, in order to facilitate search and access to it. This thesis deals with the analysis of multimedia content and addresses a few of the most important research problems in the field of multimedia analysis, namely image classification, image region classification and detection of concepts in images. To achieve this, certain techniques that exploit the knowledge of a domain, directly and indirectly, are proposed and evaluated. This knowledge is encoded either in the form of appropriate ontologies, or by modeling the context of the images and their regions, or by applying machine learning techniques. The thesis emphasizes the use of the bag-of-words model to describe the visual features of images. Finally, the techniques applied to high-level concept detection in images are extended in order to be applied to the problems of image retrieval and video summarization.
The main research area of this thesis, in a broad sense, is the integration of knowledge technologies into the analysis and description of multimedia. Knowledge technologies can aid computer vision tasks towards an improved understanding of visual content, by exploiting a priori knowledge in algorithms for semantic image and video analysis. More specifically, we examine the problem of image and video segmentation and propose novel techniques for the detection, extraction, recognition and tracking of objects, based on semantic and visual criteria. We propose a semantic segmentation approach, which enhances region growing algorithms with semantic characteristics, in order to deal with problems that arise from the shortcoming of describing semantic entities by visual characteristics exclusively. Moreover, we propose a structured knowledge framework, which we call visual logics, based on description logics and their fuzzy extensions, to link visual data with concepts that form the vocabulary of a domain. We use a set of axioms and a reasoning engine to infer possible semantic interpretations of parts or the whole of an image.
Although human vision appears to be easy and unconscious, there exist complex neural mechanisms in the primary visual cortex that form the pre-attentive component of the Human Visual System (HVS) and lead to visual awareness. The HVS has the ability to fixate quickly on the most informative (salient) regions of a scene and thereby reduce the inherent visual uncertainty. Considerable research has been carried out into the attention mechanisms of the HVS, and computational visual attention (VA) models have been developed and employed for common computer vision problems. Most of these models simulate the bottom-up mechanism of the HVS, and their major goal is to filter out redundant visual information and detect or enhance the most salient parts of the input. This dissertation studies and expands the field of computational visual attention methods, proposes novel models for both spatial (images) and spatio-temporal (video sequences) analysis, and evaluates them both qualitatively and quantitatively in a variety of relevant applications.
The main research objective of this thesis is to tackle issues related to multimedia content processing, search and retrieval, through the prism of context, as the latter is expressed in the fields of knowledge adaptation and information access. More specifically, the main research motivation comes from two major research fields: (i) multimedia content personalization and (ii) multimedia content analysis based on visual context. The thesis tackles issues such as data mining, thematic categorization of multimedia documents, multimedia personalization, retrieval and ranking of personalized multimedia documents, knowledge-assisted analysis optimization through visual context exploitation, mid-level visual analysis and context utilization, and contextual image classification problems. Towards this direction, it presents research results and indicative applications, in order to facilitate the proposed interpretation.
Uncertainty has gradually attained acceptance and a very distinct role in scientific thought as well as in the scientific view of the world. As far as intelligent knowledge based systems are concerned, uncertainty is present at all levels of their operation and its role is determinant of their effectiveness. In this thesis we propose a series of solutions to uncertainty related problems. In their turn, these solutions provide for further thought and progress in a series of directions.
In the first part of the thesis, which is also the lengthiest, the emphasis is on semantics. In this framework, the important problems to consider are those of modeling real-world concepts, thus constructing a formal knowledge base, and of exploiting the information contained in this knowledge base in practical applications, given its size. In this direction, chapter 2 proposes the utilization of fuzzy relations for the representation of knowledge and explains how this knowledge can be used in order to automatically extract the context. Chapters 3 and 4 focus on the size of this knowledge and provide computational models for its efficient handling. Chapters 5 and 6 deal with the intelligent utilization of such knowledge in the framework of information retrieval.
In the second part of the thesis we move on to a level between concepts and numeric data. Thus, chapter 7 explains how we can use high level linguistic information in order to handle uncertain low level numerical data. Focus is both on the uncertainty within the low level data and on the flexibility required in order for the high level information to provide for an adequate description of the real world.
In the third and last part of the thesis we work solely with numerical data. Chapters 8 and 9 deal with the automated analysis of data for the generation of neural models that are able to map the structure of the data, while chapter 10 moves on to the processing of these models in order to automatically extract higher level information from the available numerical data.
Chapter 11 summarizes conclusions drawn from this thesis and refers to directions of possible further work that come out of this work.
Diploma Thesis/M.Sc. students / co-supervision
The rapid evolution of complex deep learning models has yielded remarkable achievements in several applications, underlining their proficiency in recognizing vital data patterns. However, this success came at the cost of transparency and interpretability. These intricate architectures are considered black boxes, since their low-level, non-linear computations are too many to analyze. Much like with the human brain, we understand individual neuron interactions but struggle to comprehend how the model combines information to form higher-level concepts. This gap in knowledge gave rise to Explainable AI (XAI).
This field of AI is still in its early stages, and a systematic sequence of questions for comprehending the functioning of a model has yet to be formed. However, the exploration began with a fundamental question: "what factors influence the prediction of a model for a given input", leading to the development of attribution methods. These methods assign importance scores to each input feature. The proliferation of such methods necessitated the creation of evaluation metrics to measure their effectiveness. Yet, such metrics have limitations, leading to the development of sets of axioms and criteria that robust methods were considered essential to satisfy. While facing similar limitations, these provided a more conceptually sound approach to robust attributions.
In response to these challenges, this thesis explores the concept of Zero Information. This concept aims to conceal all information contained in parts of an image for a particular model, revealing their contribution to the prediction of the model. We develop a new approach to the problem, by designing criteria related to information concealment. First, we design an algorithm for hiding information from the whole image. By translating criteria into loss functions, the algorithm finds the most influential input features that lead to the drop of the model's confidence and use them to define an attribution method. Then, we hide information from parts of the image, by extending the criteria to capture feature interactions. An optimization algorithm is designed to meet these criteria, while leveraging generative models for reconstructing the hidden parts with natural fill. The method is tested across multiple metrics, showing strong performance against other techniques. It can then be exploited by different attribution methods and evaluation metrics based on information concealment, yielding better results.
Overall, this thesis attempts to develop a methodology towards robust filling techniques for Zero Information. It does not give definite answers to the problem, since it is constrained by its complexity. Instead, it paves the way for better criteria to be developed in the future, to answer a fundamental question of XAI and unlock the power of different methods.
In computer vision, image captioning is a challenging task whose goal is to bridge the gap between visual content and natural language understanding. Image captioning, as the name suggests, is the process in which a descriptive caption is automatically generated from an image. Another challenging task is visual question answering, where a user can ask questions about an image and receive meaningful answers. In recent years, there has been a lot of effort in the research community to improve both processes by introducing different architectures and methods. Image captioning and visual question answering are two closely related vision-language tasks; however, they are usually treated individually.
In this thesis, we follow a lightweight approach for image captioning and make a thorough investigation of all its components. We then extend that method to handle visual question answering tasks. Finally, we introduce a unified model, which is trained via multitask learning on both image captioning and visual question answering. This single model can handle both tasks at inference, achieving competitive performance in both. Surprisingly, although multitask learning often leads to inferior performance on the individual tasks, in our case it even improves performance.
In another direction, we employ the power of diffusion generative models to boost the performance of our image captioning and visual question answering models. Using diffusion, we generate images from the existing captions of each training set to create new, synthetic datasets. By controlling each generated image to be similar to the existing one corresponding to the caption, we verify that the synthetic datasets can help improve the performance of captioning, as well as of visual question answering in the presence of multitask learning.
In recent years, the rapid development of deep neural networks (DNNs) has led to remarkable performance in many computer vision tasks. The increasing complexity of the models, the computational power, the amount of available data and the supervision during the training process are the main causes behind this success. As an alternative to supervised representation learning, self-supervised methods are becoming popular in dispensing with the need for carefully labelled datasets.
Undoubtedly, the more complex the models get, the greater the need for understanding their predictions. The primary objective of this thesis is to interpret both supervised and self-supervised models, using either convolutional neural networks or vision transformers as a backbone. Variations of visualization methods are used, based on class activation maps (CAM) and attention mechanisms. Given an input image, these methods provide us with a saliency map that is used to interpret the network prediction. This map indicates the regions of the image that the model pays the most attention to.
We evaluate these methods qualitatively and quantitatively. We further propose new alternative or complementary visualization methods, which show where important information can be hidden inside the network and how to reveal it. These new methods further improve the quantitative results. Our study highlights the importance of interpretability, shows some common properties and differences in the way supervised and self-supervised models make their predictions and provides valuable information on both the models and the visualization methods.
Thanks to the knowledge we gain from the interpretability study, we further investigate self-supervised learning, in particular masked image modeling (MIM). Here, we identify the regions of an image that are most important to hide from a student network and define a more challenging MIM-based self-supervised pretext task. Based on this, we propose new masking strategies that achieve higher k-NN and linear probing scores and accelerate learning on downstream tasks.
Considering the computational efficiency challenge these methods face, we conduct experiments with different dataset scales and numbers of training epochs and show their impact on the scores. We further explain visually the influence of each masking strategy and dataset scale by using interpretability methods during the learning and evaluation process. Finally, we introduce a new loss function based on contrastive learning and achieve improvements over the baseline when used with different masking strategies.
Metric learning is an important paradigm for a variety of problems in machine learning and computer vision. It has been successfully employed for fine-grained classification, retrieval, face recognition, person re-identification and few-shot learning, among other tasks. Metric learning is an approach based on a distance metric that aims to determine similarities or dissimilarities between samples. The goal is to reduce the distance between similar samples and, at the same time, to increase the distance between dissimilar ones. It is therefore crucial that the distance measure be learnable, so that it can adapt to data from different domains.
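As a concrete instance of this objective, the sketch below shows a standard pairwise contrastive loss in PyTorch: similar pairs are pulled together and dissimilar pairs are pushed beyond a margin. Names and the margin value are illustrative assumptions, not the implementation used in the work.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(x1, x2, same, margin=0.5):
    """x1, x2: (N, d) L2-normalized embeddings; same: (N,) 1 if similar pair, 0 otherwise."""
    d = F.pairwise_distance(x1, x2)                    # Euclidean distance per pair
    pos = same * d.pow(2)                              # pull similar pairs together
    neg = (1 - same) * F.relu(margin - d).pow(2)       # push dissimilar pairs beyond the margin
    return (pos + neg).mean()

emb1 = F.normalize(torch.randn(8, 128), dim=1)
emb2 = F.normalize(torch.randn(8, 128), dim=1)
labels = torch.randint(0, 2, (8,)).float()
print(contrastive_loss(emb1, emb2, labels))
```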
Training a convolutional neural network to distinguish similar from dissimilar images requires some kind of supervision. In the era of big data, with human-annotated data in limited supply, deep learning methods have recently been adapted to work without supervision. Self-supervised methods can be considered a special form of unsupervised learning with a supervised objective, where supervision is induced by pretext tasks rather than by predetermined prior knowledge. Unlike a completely unsupervised setting, self-supervised learning uses information from the dataset itself to generate pseudo-labels.
In this work, we consider a number of self-supervised metric learning methods that use different sample mining techniques as well as loss functions. We investigate their effectiveness both with networks pre-trained on ImageNet and with networks initialized from scratch. The evaluation is performed on four benchmark metric learning and retrieval datasets. It appears that soft loss functions that exploit contextual similarities between samples outperform hard ones that use pairwise similarities. Furthermore, we find that augmented versions of the original images can be used as positive pairs to bootstrap the self-supervised training process.
Deep neural networks have become the de facto model for computer vision applications. Their success is partially attributable to their scalability, i.e., the empirical observation that training them on larger datasets produces better performance. Deep networks often achieve their strong performance through supervised learning, which requires a labeled dataset. The performance benefit conferred by the use of a larger dataset can therefore come at a significant cost, since labeling data often requires human labor. This cost can be extraordinary when labeling must be done by an expert.
A powerful approach for training models on a large amount of data without requiring a large amount of labels is semi-supervised learning (SSL). SSL mitigates the requirement for labeled data by providing a means of leveraging unlabeled data. Since unlabeled data can often be obtained with minimal human labor, any performance boost conferred by SSL often comes with low cost. This has led to a plethora of SSL methods that are designed for deep networks.
In this thesis, we propose two methods that combine successful ideas in problems related to our task at hand. In particular, we propose CleanMatch and WeightMatch, two new semi-supervised learning methods that unify dominant approaches and address their limitations. CleanMatch consists of two stages: (1) iterative selection of the most confident pseudo-labels provided by a combination of consistency regularization and pseudo-labeling, following FixMatch; and (2) augmentation of the labeled set with the selected examples of the first stage and semi-supervised training, using FixMatch on the augmented dataset. WeightMatch estimates a weight reflecting the confidence of each labeled example, forcing the model to rely more on the confident ones during training.
Our methods achieve state-of-the-art performance on multiple datasets. They achieve substantial accuracy improvements on label-scarce versions of CIFAR-10, SVHN and CIFAR-100.
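For context, here is a minimal sketch of the FixMatch-style unlabeled-data objective referenced above: pseudo-labels from weakly augmented images supervise strongly augmented views, but only when the model is confident. Function and variable names are placeholders, not the CleanMatch or WeightMatch implementations.

```python
import torch
import torch.nn.functional as F

def unlabeled_loss(model, weak_batch, strong_batch, threshold=0.95):
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=1)     # predictions on weakly augmented views
        conf, pseudo = probs.max(dim=1)                 # confidence and pseudo-label
        mask = (conf >= threshold).float()              # keep confident examples only
    logits_strong = model(strong_batch)
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (mask * loss).mean()
```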
The objective of metric learning is to learn a distance metric that decreases the distance between similar objects and increases the distance between dissimilar ones. Similarity and dissimilarity can be subjective and thus some supervision is needed to define them. Learning a distance metric can be useful for many tasks, such as classification, retrieval and clustering. The classification and retrieval tasks can be reduced to class-level and instance-level nearest neighbor tasks respectively, while the clustering task can be made easier given the similarity matrix.
Before deep learning, metric learning approaches were based either on linear transformations using the Mahalanobis and/or Euclidean distance, or on non-linear transformations using kernel-based methods. Both, however, had limitations. Linear transformations had a limited ability to capture non-linear feature structure. Non-linear transformations performed better, but often suffered from overfitting. Both were also limited in their ability to process raw data and thus required feature engineering.
With the remarkable success of convolutional neural networks, deep metric learning was introduced. Neural networks are discriminatively trained to learn a non-linear mapping from raw input data to a lower-dimensional embedding. This is usually done in a supervised way, where embeddings with the same class label are pushed closer together and embeddings with different class labels are pulled apart. The training process involves minimizing a loss function having exactly these properties. The advantage of deep metric learning is that it jointly extracts the features and learns the embedding.
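A minimal triplet-loss training step illustrates this "push same class closer, pull different classes apart" objective; the backbone, embedding size and hyperparameters below are placeholders rather than the settings used in the study.

```python
import torch
import torch.nn as nn
import torchvision.models as models

backbone = models.resnet50(weights=None)
backbone.fc = nn.Linear(backbone.fc.in_features, 128)   # 128-d embedding head
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(backbone.parameters(), lr=1e-4)

anchor, positive, negative = (torch.rand(4, 3, 224, 224) for _ in range(3))
loss = criterion(backbone(anchor), backbone(positive), backbone(negative))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```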
The contribution of this work is threefold. First, we conduct extensive experiments using the most commonly used architectures (GoogLeNet, BNInception, ResNet50) on the most commonly used datasets (CUB200-2011, CARS196, Stanford Online Products), using 10 different loss functions (Contrastive, Triplet, LiftedStructure, NPair, ProxyNCA, ArcFace, Margin, MultiSimilarity, SoftTriple, ProxyAnchor) and four different embedding sizes (64, 128, 512, 1024). We perform an ablation study and draw important conclusions from the results, revealing significant flaws in the evaluation and little true progress over time. Second, we propose a new setup for training using a fixed validation set. We conduct experiments using this split, as well as 10-fold cross validation. Our setup strikes a good balance between computational complexity and retrieval quality. Finally, we design, implement and experiment with a new loss function that is on a par with the state of the art.
The goal of neural architecture search is to automatically find the optimal network architecture, that is, the optimal succession of layers and how they are connected to each other to solve a particular learning task (for instance, image classification). This is a combinatorial optimization problem and finding the optimal set of connections over all possible combinations is intractable.
We address this problem by combining ideas from different fields. We define a "fully dense" network in which every layer has access to all previous ones, so that the network contains all possible connections between layers. By allowing unnecessary connections to be skipped, this super-network contains all possible subsequences of layers. We then select the most important connections using a pruning criterion that removes the unnecessary ones. This way, the optimal architecture can be found within the super-network.
However, the number of connections in the super-network is quadratic in the number of layers, so training it is not practical when it is very deep. To alleviate this problem, we devise a greedy algorithm that grows the super-network a few layers at a time, training it and pruning its connections at each iteration. This way, the number of connections at each iteration is linear in the network depth.
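The toy sketch below is only an illustration of the fully-dense idea, under the assumption that each layer consumes a learnable gated mixture of all previous outputs and that low-weight connections can be pruned; it is not the thesis implementation.

```python
import torch
import torch.nn as nn

class FullyDense(nn.Module):
    def __init__(self, num_layers=4, width=64):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(num_layers)]
        )
        # alpha[i][j] gates the connection from the output of layer j to layer i
        self.alpha = nn.ParameterList(
            [nn.Parameter(torch.ones(i + 1)) for i in range(num_layers)]
        )

    def forward(self, x):
        outputs = [x]
        for i, layer in enumerate(self.layers):
            gates = torch.softmax(self.alpha[i], dim=0)
            mixed = sum(g * o for g, o in zip(gates, outputs))  # mixture of all previous outputs
            outputs.append(layer(mixed))
        return outputs[-1]

    def prune(self, threshold=0.1):
        # disable connections whose gate value falls below the threshold
        with torch.no_grad():
            for a in self.alpha:
                a[torch.softmax(a, dim=0) < threshold] = float("-inf")

net = FullyDense()
print(net(torch.randn(2, 64)).shape)
```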
Our goal in this work has been to investigate new methods for object detection that combine existing tools and algorithms, namely Convolutional Neural Networks (CNN) and the Hough transform. CNNs are a standard tool in deep learning for image classification, and are increasingly used for object detection as well. The Hough transform is a family of algorithms that use voting to predict where potential objects could be. We start with some background to help understand this report, and discuss related work. Then, we present our work step by step. Starting with a standard neural network architecture, we gradually add new functions and layers that bring us closer to our goal. We first retrain parts of the network to fit our new training data. Then, we change the network so that it considers sub-parts of images rather than whole images. Next, we add layers to limit false positives and duplicates during detection. Finally, we create an end-to-end method to train the network. Since this is still ongoing work, we discuss methods that can be tested in the future.
Object proposals are a relatively new problem that appeared due to the complexity of modern object detectors and their high execution time. The purpose of object proposal algorithms is the high-speed, class-agnostic detection of all objects in an image. The proposals are then passed to the object detectors so that they avoid the exhaustive search of the image performed by the sliding-window approach. This way, the time needed to detect objects is drastically reduced, which enables detectors to use more complex and effective algorithms. Modern object detectors use object proposals.
In this thesis we present the most prominent modern methods for the extraction of object proposals and we propose a new method, Segment Boxes. This method segments the image and, using the resulting segments, scores windows inside the image based on the likelihood that they contain objects. We try to incorporate good ideas from other methods as well as some of our own to achieve the best results, and we end up with several variants of our method.
We compare these different approaches, and the best ones are compared with state-of-the-art methods, using the appropriate metrics, on images from the PASCAL VOC 2007 and ImageNet 2013 datasets. We then use our proposals with a modern object detector based on deep learning and convolutional neural networks, Fast R-CNN, and again compare our results with those of other methods, this time on the problem of object detection. Our results are competitive with those of state-of-the-art methods, and in some cases even exceed them, while achieving low execution time (one of our approaches runs in 0.3 seconds per image). Our goal is to examine the potential of segmentation for the problem of object proposals.
This thesis addresses the problem of content-based large-scale image retrieval (CBIR). We study the algorithms and methods used to produce compact image vector representations, primarily the VLAD image representation. We present the main algorithm used to produce the VLAD vector and known methods to improve it. Finally, we examine novel image representation vectors based on VLAD and a normalization scheme that can be used for further improvement.
We propose a new method for the vector representation of an image, based on the aggregation of image descriptors while exploiting their spatial information. Our method, Spatial Pyramid with Vectors of Locally Aggregated Descriptors (SP-VLAD), was designed for the problem of image retrieval in order to achieve high accuracy and efficiency with low memory requirements. SP-VLAD builds on the ideas of two other methods, Spatial Pyramid Matching (SPM) and the Vector of Locally Aggregated Descriptors (VLAD); specifically, it combines the idea of the spatial pyramid with VLAD descriptor vectors. SP-VLAD achieves high accuracy and significantly outperforms the SPM and VLAD methods. The promising results led us to apply our method to the problem of image classification as well, exceeding the classification rate of the SPM and VLAD methods on all datasets used. These results were achieved with low memory usage after reducing the dimension of the descriptor vectors to 64 or 128 dimensions with PCA. We also created the Flowers 15 dataset for the needs of our research, in order to test the SP-VLAD method on images of flowering plants.
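A minimal VLAD aggregation sketch in NumPy is given below: local descriptors are assigned to the nearest codebook centroid and per-centroid residuals are accumulated and normalized. The spatial-pyramid extension of SP-VLAD would, roughly, concatenate such vectors computed over image sub-regions; codebook size and dimensions here are arbitrary.

```python
import numpy as np

def vlad(descriptors, centroids):
    """descriptors: (N, d) local descriptors; centroids: (K, d) visual words."""
    K, d = centroids.shape
    assign = np.argmin(
        ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1), axis=1
    )
    v = np.zeros((K, d))
    for k in range(K):
        if np.any(assign == k):
            v[k] = (descriptors[assign == k] - centroids[k]).sum(axis=0)  # residuals
    v = np.sign(v) * np.sqrt(np.abs(v))                    # power-law (SSR) normalization
    return (v / (np.linalg.norm(v) + 1e-12)).ravel()       # L2-normalized K*d vector

desc = np.random.rand(500, 128)                            # e.g. SIFT descriptors
words = np.random.rand(64, 128)                            # codebook of 64 visual words
print(vlad(desc, words).shape)                             # (8192,)
```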
License plate recognition is an integral part of intelligent traffic control systems, with ever more applications. The goal of this diploma thesis is the creation of a prototype system for license plate detection and recognition using local image descriptors. The most important methods to date for both parts of the problem are described. Problems that arose during the implementation of the method are presented, as well as the chosen solutions. Furthermore, the need to create a global dataset for license plate recognition is noted, since without it there cannot be a reliable comparison among the suggested solutions and selection of the best one. For the available dataset of Greek license plates, the detection rate was 94% and the recognition rate 32.3%, using a free platform.
Augmented Reality has been a growing area in recent years within the field of Virtual Reality. An Augmented Reality system supplements the real world with virtual graphical models, creating the illusion of coexistence between the real and virtual worlds. This process requires accurate camera pose estimation based on computer vision methods. In the framework of this thesis, we study camera pose estimation methods based on the detection of local features in successive frames, without prior knowledge of the environment. Furthermore, we present an implementation of an Augmented Reality application. Epipolar constraints are used for the camera pose estimation.
In the framework of this Diploma thesis we introduce a new image categorization method, which integrates spatial matching and indexing in the classification process. Spatial matching is based on Hough pyramid matching (HPM); indexing is based on an inverted file structure as in image retrieval; and classification is carried out with a multi-class support vector machine (SVM) classifier. We use HPM as an image similarity measure and we show that under reasonable assumptions it is a Mercer kernel. We do so by explicitly expressing it as an inner product in a high dimensional space where images lie given an appropriate quantized representation of their local features and descriptors. We then use this kernel for SVM training instead of a linear kernel, which is a typical choice under the bag of words (BoW) model. It is the first time that a kernel function takes spatial configuration into account while being invariant to translation, scale and rotation. In most cases, artificial perturbations are the only way to achieve geometric invariance, with an exponential increase of training time.
We train one binary SVM classifier for each category, following a one-versus-the-rest strategy, and then combine the individual classifiers into one multi-class classifier. Compared to a nearest-neighbor classifier using e.g. image retrieval methods, we exploit the sparse representation of SVMs: at classification time, the query image is matched via HPM against the chosen support vectors only. However, matching need not be exhaustive. Support vectors are indexed into an inverted file, and HPM may be applied only to a small subset that is top-ranking according to any scalar similarity measure, e.g. based on BoW. The method therefore easily applies to large-scale classification, while training for unseen classes does not require re-training for existing ones.
Due to the nature of local features and their use in invariant matching, the method is most appropriate for specific object recognition. We apply it to landmark recognition, conducting experiments on our own dataset, constructed from the World cities dataset via a semi-automatic process that combines visual and geographical clustering. We compare to a baseline classifier using a BoW representation and achieve more than a twofold increase in accuracy on experiments of up to 68 landmarks.
In the framework of this thesis, we present new image segmentation techniques based on a weighted medial axis decomposition procedure. Starting from the image gradient or a gray-scale contour map, we first compute a weighted distance map and its weighted medial axis by a linear-time process. Then, applying the same distance propagation from the medial axis backwards, we dually obtain an initial image partition and a graph representing image structure. This is equivalent to applying the watershed transform on the weighted distance map, hence it is both topological and contrast-weighted. However, it is more efficient, because we first decompose the medial axis and then use our linear-time process to propagate on the remaining image surface. Using a disjoint-set data structure, we then merge adjacent regions according to different criteria.
Several criteria were examined and tested. First, we use the medial axis saddle point height to express similarity between adjacent regions and merge accordingly. A second, distinct direction we follow is to merge adjacent regions according to how fragmented they are. Last but not least, we use an ultrametric contour map representation to implement hierarchical segmentation. As inter-region ultrametric dissimilarities, we use the mean boundary strength on the common boundary between adjacent regions and inter-region fragmentation. All the above techniques are evaluated on the Berkeley Segmentation Dataset and compared with state-of-the-art algorithms. Without learning, we achieve performance near the state of the art with very practical running times.
In this diploma thesis we investigate large-scale image retrieval. We describe the stages of image retrieval, with emphasis on visual vocabulary construction. Moreover, we discuss the problems that arise due to the quantization of descriptors and introduce several techniques that alleviate them. More specifically, one of these techniques introduces the use of synonym visual words. To discover synonym visual words, we must construct sets of matching image patches, called feature tracks. For this purpose, we develop a novel technique for constructing feature tracks. Given a collection of geo-tagged images, we cluster them according to (a) their locations and (b) their visual features. We thus obtain view clusters: clusters of images that depict the same scene. Matching features are discovered through geometric verification between the images in a cluster and the reference image (cluster center). Given the feature tracks, we can find matching visual words. Finally, we test and evaluate the performance of this technique through retrieval experiments on the Oxford Buildings dataset.
The large amount of visual information and the easy access to available data through the Internet have created the need for efficient description of image content. Many techniques for fast image retrieval have been proposed in the literature, but in recent years the use of local features has come to maturity because of their efficiency.
In this thesis, the best-known local detector and descriptor methods are first studied. Next, an integrated system of local invariant features is implemented using already tested techniques, and different methods are combined to compose a powerful tool for image analysis. An experimental evaluation follows, carried out on a standard benchmark set of images under various photometric and geometric transformations. All previously analyzed methods are compared via objective criteria: repeatability score, detector accuracy (localization), matching score and descriptor performance.
The efficiency of local features is demonstrated in an image retrieval system using large image databases. The experimental procedure provides a quantitative comparison of the aforementioned techniques. The main image retrieval mechanism is related to text retrieval methods: a visual vocabulary is created and a model vector is constructed for each image, which represents its semantic content. Image retrieval is then performed through vector similarity measures. Different visual vocabularies (generated by various local feature methods) are compared with respect to image retrieval evaluation criteria, such as precision, recall and mean Average Precision (mAP).
Over recent years, the amount of digital images available online has increased rapidly. These huge multimedia collections contain diverse data and cover almost every aspect of life in terms of visual and semantic content. Proper indexing and analysis of such data are essential in order to retrieve useful visual information. Searching through image libraries has become an everyday process, in the same way as Google's text-based search. In this diploma thesis, techniques for content-based image retrieval are presented and evaluated, and a web-based image search platform is created. Various techniques are applied, using either global or local features, such as the MPEG-7 and SURF descriptors, extracted locally from points of interest or segmented regions. A bag-of-words model is used for indexing and geometric constraints are also taken into account. These techniques are evaluated on many common datasets, in order to test the universality of their use, towards a web-scale image retrieval system.
Object detection in images is a field of image analysis that has been researched intensively during the past few years. In this diploma thesis we present a complete object detection method created by Viola and Jones in 2001. Haar-like features are used to describe images, while the classification of the candidate regions of an image is performed by a cascade of classifiers created by the AdaBoost algorithm in order to increase detection speed. Using this method, we trained several detectors for interior parts of a car, as well as its exterior, with sample images from the LabelMe dataset. We present and explain the choices made in training every detector. The results of the evaluation of every detector are presented in precision-recall and receiver operating characteristic (ROC) diagrams. We also draw conclusions on how to achieve the best results with this method. Finally, we created a program for semi-automatic annotation of images, which detects objects in images using the presented method.
The growth of audiovisual multimedia content during the last few years has created the need for automatic feature extraction and description of this content. With the use of various descriptors, including those defined by the MPEG-7 standard, its low-level information is captured. In this diploma thesis, MPEG-7 visual descriptors are examined and a descriptor extraction application is developed based on the MPEG-7 eXperimentation Model (XM). This application is evaluated in order to verify its alignment with the XM. The application is then used within a high-level detection approach. A region-based technique is applied and a visual thesaurus is constructed to formalize knowledge. Neural-network detectors are trained in order to detect high-level concepts. Moreover, the utility of the well-known Latent Semantic Analysis technique is investigated. The dataset of the TRECVID benchmark has been used for testing these techniques. Finally, a car exterior/interior classification problem is also tackled. Extensive experimental results are presented for each of the aforementioned problems.
Object detection in images or image sequences has been an important field of research in the past few years. This report studies a method for the detection of people in still images under unconstrained scene conditions, such as complex background and uncontrolled lighting. A feature vector is used which adopts Histograms of Oriented Gradients (HOG) computed on a dense grid over the image. Classification is achieved via a linear SVM. The method is studied and evaluated through two implementations, which differ in the choice of HOG features. Moreover, the sensitivity of each implementation to the basic parameters of the method is evaluated by altering their values. From the evaluations and observations, it is concluded that HOG features capture the characteristics of the human form well and give reliable results for the detection of people in still images.
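As a small sketch of the HOG + linear SVM pipeline described above, the example below extracts HOG features from fixed-size windows with scikit-image and trains a linear SVM with scikit-learn; the window size and HOG parameters follow common defaults and are not necessarily those of the implementations studied in the report.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(windows):
    """windows: list of 128x64 grayscale arrays (the usual person-detection window)."""
    return np.array([
        hog(w, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))
        for w in windows
    ])

# stand-in data: positive (person) and negative (background) windows
pos = [np.random.rand(128, 64) for _ in range(20)]
neg = [np.random.rand(128, 64) for _ in range(20)]
X = np.vstack([hog_features(pos), hog_features(neg)])
y = np.array([1] * 20 + [0] * 20)

clf = LinearSVC(C=0.01).fit(X, y)
score = clf.decision_function(hog_features([np.random.rand(128, 64)]))
print(score)   # this score would be thresholded when scanning windows over a test image
```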
Recent years have seen a rapid increase in the size of digital image collections. Recent research has focused on the efficient processing, searching and retrieval of similar images from a database. In this diploma thesis we study the representation and retrieval of similar images based on the curvature scale space (CSS) method in the presence of affine transformations. More specifically, we examine the robustness of the method under affine transformations and compare its performance with that of other alternative methods. In order to achieve invariance, we used curve normalization based on affine length parametrization and evaluated the effectiveness of this approach.
At the experimental level, a database of curves (contours) of different categories of shapes was constructed. Initially, by applying random affine transforms, we created a number of affine-transformed versions of the above curves. We then used the CSS method to represent and retrieve similar shape curves from the database. The resulting matching cost is a measure of curve similarity and indicates the effectiveness of the method. Finally, we study the application of an alternative normalization method instead of the affine-length-based normalization, which appears to improve the effectiveness of the CSS method.
In recent years, the growing amount of available video data has created the need for automatic analysis, synopsis and extraction of information from videos. Every video sequence consists of a number of shots, each containing temporally associated frames, while contiguous shots are connected to each other by some type of transition at their boundaries. The first step in any kind of video analysis is the detection of these boundaries, followed by a temporal segmentation of the video, while the next and more important step relates to the synopsis or summarization of the video. This is achieved by selecting a certain number of characteristic frames (key-frames) from each shot, so that the content of the shot is represented in a short yet meaningful way.
In this diploma thesis, a video summarization system is constructed, encapsulating shot change detection, feature extraction and key-frame selection. The system combines methods working directly in the MPEG compressed domain and automatically locates the shot changes of a video. Features are then extracted from each frame of the sequence, and their values define a point in feature space; the entire video sequence is therefore represented by a trajectory in this multidimensional space. Using mathematical methods that define the curvature of a curve, the characteristic points of the curve are determined in areas where locally extreme behavior is observed. The system automatically extracts key-frames via the points of local maxima and minima of curvature, irrespective of the video content.
In this work, two different ways of computing curvature are compared, as well as two other existing methods that use characteristic points of a curve. The experiments concern the effectiveness of these methods in terms of video key-frame selection. We attempt to improve the performance of these methods by applying a computational model of Visual Attention, in order to extract features from the salient regions of the image. Finally, through the results of these experiments, we aim to indicate the capabilities, limitations and possibilities for improving and expanding this video summarization system.
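The rough sketch below illustrates the trajectory-curvature idea for key-frame selection: per-frame features form a curve in feature space, curvature is approximated by finite differences, and frames at local curvature extrema become key-frame candidates. It is only an illustration with random placeholder features; the thesis system works in the MPEG compressed domain.

```python
import numpy as np
from scipy.signal import argrelextrema

feats = np.random.rand(300, 32)                 # one feature vector per frame (placeholder)
d1 = np.gradient(feats, axis=0)                 # first derivative along time
d2 = np.gradient(d1, axis=0)                    # second derivative along time
num = np.sqrt(np.maximum(
    (d1 * d1).sum(1) * (d2 * d2).sum(1) - ((d1 * d2).sum(1)) ** 2, 0.0))
curvature = num / ((d1 * d1).sum(1) ** 1.5 + 1e-8)   # generalized curvature of the trajectory

maxima = argrelextrema(curvature, np.greater, order=5)[0]
minima = argrelextrema(curvature, np.less, order=5)[0]
keyframes = np.sort(np.concatenate([maxima, minima]))  # candidate key-frame indices
print(keyframes[:10])
```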
Research grants [European]
GRAPES aims at considerably advancing the state of the art in Mathematics, Computer-Aided Design, and Machine Learning in order to promote game changing approaches for generating, optimizing, and learning 3D shapes, along with a multisectoral training for young researchers. Recent advances in the above domains have solved numerous tasks concerning multimedia and 2D data. However, automation of 3D geometry processing and analysis lags severely behind, despite their importance in science, technology and everyday life, and the well-understood underlying mathematical principles. The CAD industry, although well established for more than 20 years, urgently requires advanced methods and tools for addressing new challenges.
The scientific goal of GRAPES is to bridge this gap based on a multidisciplinary consortium composed of leaders in their respective fields. Top-notch research is also instrumental in forming the new generation of European scientists and engineers. Their disciplines span the spectrum from Computational Mathematics, Numerical Analysis, and Algorithm Design, up to Geometric Modeling, Shape Optimization, and Deep Learning. This allows the 15 PhD candidates to follow either a theoretical or an applied track and to gain knowledge from both research and innovation through a nexus of inter-sectoral secondments and Network-wide workshops.
Horizontally, our results lead to open-source, prototype implementations, software integrated into commercial libraries as well as open benchmark datasets. These are indispensable for dissemination and training but also to promote innovation and technology transfer. Innovation relies on the active participation of SMEs, either as a beneficiary hosting an ESR or as associate partners hosting secondments. Concrete applications include simulation and fabrication, hydrodynamics and marine design, manufacturing and 3D printing, retrieval and mining, reconstruction and visualization, urban planning and autonomous driving.
The main objective of WeKnowIt is to develop novel techniques for exploiting multiple layers of intelligence from user-contributed content, which together constitute Collective Intelligence, a form of intelligence that emerges from the collaboration and competition among many individuals and that seemingly has a mind of its own. To this end, input from various sources is analyzed and combined: digital content items and contextual information (Media Intelligence), massive user feedback (Mass Intelligence), and users' social interaction (Social Intelligence), so as to benefit end users (Personal Intelligence) and organizations (Organizational Intelligence).
The automatic generation of Collective Intelligence constitutes a departure from traditional methods for information sharing, since for example, semantic analysis has to fuse information from both the content itself and the social context, while at the same time the social dynamics have to be taken into account. Such intelligence provides added-value to the available content and renders existing procedures and workflows more efficient.
Public administrations represent the largest information-bound professional communities; among them, the judicial sector is one of the largest, where the needs for cooperation are critical, creating an exceedingly large improvement potential through the adoption of novel content management techniques and the development of new solutions for its specific needs in retrieval and semantic analysis. This potential is even larger considering the growing transnational cooperation among several national law systems, highlighting the need to adapt the technological profiles of new member states.
In this context, JUMAS envisages an advanced knowledge management system able to extract semantics from multimedia data. JUMAS is tailored to managing situations where multiple cameras and audio sources are used to record assemblies and to reconstructing debate sequences for future consultation.
The main objective of IMAGINATION is to bring digital cultural and scientific resources closer to their users by making user interaction image-based and context-aware. Our ultimate aim is to enable users to navigate through digital cultural and scientific resources via their images. IMAGINATION will provide a novel image-based access method to digital cultural and scientific resources and will reduce complexity by providing an intuitive navigation method.
IMAGINATION will facilitate an interactive and creative experience providing an intuitive navigation through images and parts of images. To do so IMAGINATION will combine, apply and improve existing techniques to provide a new way of navigation through cultural heritage multimedia archives. It will exploit the context of resources stored in its knowledge space by combining text-mining, image segmentation and image recognition algorithms. This combination will cause a synergy effect and will result in semi-automatically generated, high-level semantic metadata.
IMAGINATION focuses on indexing, retrieving and exploring non-textual complex objects and will apply knowledge technologies and visualization techniques for improved navigation and access to multimedia collections. Comprehensive tool support (including an ontology editor and a semi-automated image annotation tool) will be provided, together with an easy-to-use web-based interface which visualizes the contextualized content stored in the IMAGINATION knowledge space.
A major outcome of the project will be the new and intuitive approach of navigating through images, together with a set of technologies and tools to support the annotation of images by manual, semi-automatic and automatic techniques.
BOEMIE will pave the way towards automation of the process of knowledge acquisition from multimedia content, by introducing the notion of evolving multimedia ontologies, which will be used for the extraction of information from multimedia content in networked sources, both public and proprietary. BOEMIE advocates a synergistic approach that combines multimedia extraction and ontology evolution in a bootstrapping process involving, on the one hand, the continuous extraction of semantic information from multimedia content in order to populate and enrich the ontologies and, on the other hand, the deployment of these ontologies to enhance the robustness of the extraction system. The ambitious scope of the BOEMIE project and the proven specialized competence of the carefully composed project consortium ensure that the project will achieve the significant advancement of the state of the art needed to successfully merge the component technologies.
Multimedia Semantic Syndication for Enhanced News Services (MESH) will apply multimedia analysis and reasoning tools, network agents and content management techniques to extract, compare and combine meaning from multiple multimedia sources, and produce advanced personalized multimedia summaries, deeply linked among them and to the original sources to provide end users with an easy-to-use "multimedia mesh" concept, with enhanced navigation aids. A step further will empower users with the means to reuse available content by offering media enrichment and semantic mixing of both personal and network content, as well as automatic creation from semantic descriptions.
Encompassing all the system, dynamic usage management will be included to facilitate agreement between content chain players (content providers, service providers and users). In a sentence, the project will create multimedia content brokers acting on behalf of users to acquire, process, create and present multimedia information personalized (to user) and adapted (to usage environment). These functions will be fully exhibited in the application area of news, by creation of a platform that will unify news organizations through the online retrieval, editing, authoring and publishing of news items.
X-Media addresses the issue of knowledge management in complex distributed environments. It will study, develop and implement large-scale methodologies and techniques for knowledge management able to support sharing and reuse of knowledge that is distributed across different media (images, documents and data) and repositories (databases, knowledge bases, document repositories, etc.) or that is inaccessible to current systems, which cannot capture the knowledge implicit across media. All the developed methodologies aim at seamlessly integrating with current work practices. Usability will be a major concern, together with ease of customization for new applications.
Technologies will be able to support knowledge workers in an effective way, (i) hiding the complexity of the underlying search/retrieval process, (ii) resulting in a natural access to knowledge, (iii) allowing interoperability between heterogeneous information resources and (iv) including heterogeneity of data type (data, image, texts). The expected impact on organizations is to dramatically improve access to, sharing of and use of information by humans as well as by and between machines. Expected benefits are a dramatic reduction of management costs and increasing feasibility of complex knowledge management tasks.
K-Space is a network of leading research teams from academia and industry conducting integrative research and dissemination activities in semantic inference for automatic and semi-automatic annotation and retrieval of multimedia content. K-Space exploits the complementary expertise of the project partners, enables resource optimization and fosters innovative research in the field. The aim of K-Space research is to narrow the gap between low-level content descriptions that can be computed automatically by a machine and the richness and subjectivity of semantics in high-level human interpretations of audiovisual media: the Semantic Gap. Specifically, K-Space integrative research focuses on three areas:
(1) Content-based multimedia analysis: Tools and methodologies for low-level signal processing, object segmentation, audio/speech processing and text analysis, and audiovisual content structuring and description.
(2) Knowledge extraction: Building of a multimedia ontology infrastructure, knowledge acquisition from multimedia content, knowledge-assisted multimedia analysis, context based multimedia mining and intelligent exploitation of user relevance feedback.
(3) Semantic multimedia: Knowledge representation for multimedia, distributed semantic management of multimedia data, semantics-based interaction with multimedia and multimodal media analysis.
An objective of the Network is to implement an open and expandable framework for collaborative research based on a common reference system.
Due to the convergence of several strands of scientific and technological progress we are witnessing the emergence of unprecedented opportunities for the creation of a knowledge driven society. Indeed, databases are accruing large amounts of complex multimedia documents, networks allow fast and almost ubiquitous access to an abundance of resources and processors have the computational power to perform sophisticated and demanding algorithms. However, progress is hampered by the sheer amount and diversity of the available data. As a consequence, access can only be efficient if based directly on content and semantics, the extraction and indexing of which is only feasible if achieved automatically.
Given the above, we feel that there is both a need and an opportunity to systematically incorporate machine learning into an integrated approach to multimedia data mining. Indeed, enriching multimedia databases with additional layers of automatically generated semantic metadata as well as with artificial intelligence to reason about these (meta)data, is the only conceivable way that we will be able to mine for complex content, and it is at this level that MUSCLE will focus its main effort. Realizing this vision will require breakthrough progress to alleviate a number of key bottlenecks along the path from data to understanding.
Long term market viability of multimedia services requires significant improvements to the tools, functionality, and systems to support target users. aceMedia seeks to overcome the barriers to market success which include user difficulties in finding desired content, limitations in the tools available to manage personal and purchased content, and high costs to commercial content owners for multimedia content processing and distribution, by creation of means to generate semantic-based, context and user aware content, able to adapt itself to user preferences and environments.
aceMedia will build a system to extract and exploit meaning inherent to the content in order to automate annotation and to add functionality that makes it easier for all users to create, communicate, find, consume and re-use content. aceMedia targets knowledge discovery and embedded self-adaptability to enable content to be self organizing, self annotating, self associating; more readily searched (faster, more relevant results); and adaptable to user requirements (self reformatting).
aceMedia introduces the novel concept of the Autonomous Content Entity (ACE), which has three layers: content, its associated metadata, and an intelligence layer consisting of distributed functions that enable the content to instantiate itself according to its context (e.g. network, user terminal, user preferences). The ACE may be created by a commercial content provider, to enable personalized self-announcement and automatic content collections, or may be created in a personal content system in order to make summaries of personal content, or automatically create personal albums of linked content.
The ACE concept will be verified by two user focused application prototypes, enabled for both home network and mobile communication environments. This enables the aceMedia partners to evaluate the technical feasibility and user acceptance of the ACE concept, with a view to market exploitation after the end of the project.
MIRROR aims to create a collection of components and tools for a distributed knowledge management system that will support physical and social interactions. It also aims to establish a Europe-wide community of practice for learning and innovation in the area of natural science museums, by developing a novel learning methodology and by implementing state-of-the-art tools, techniques and systems.
The overall objective of FAETHON project is to develop an integrated information system that offers enhanced search and retrieval capabilities to users of digital audiovisual (a/v) archives. This novel system will exploit the advances in handling a/v content and related metadata, as introduced by MPEG-4 and worked out by MPEG-7, to offer advanced access services characterized by the tri-fold "semantic phrasing of the request (query)", "unified handling" and "personalized response".
From a technical point of view, the proposed system will play the role of an intermediate access server residing between the end users and multiple heterogeneous audiovisual archives organized according to new MPEG standards. Various types of interfacing modules will be designed and implemented to support smooth communication between the intermediate server and the a/v archives. The major final product will be an integrated software system consisting of the two subsystems, semantic unification and personalization, together with two types of interfaces: those between the system and the individual a/v archives, and those between the system and the end users.
Systematic principles for integrating symbolic and subsymbolic processing will be developed in the project. Key aims are to ensure that the resulting hybrid system retains the desirable properties of both processing levels. On the one side, the signal processing abilities, robustness and learning capability of neural networks should be preserved. On the other side, advantage should be taken of the ability of rule-based systems to exploit high-level knowledge and existing algorithms and to explain (to a user) why conclusions were reached in particular cases. The methodologies to be developed in the project will be tested in a challenging application related to human-computer interaction, namely recognition of emotion based on both voice and visual cues. Low-level features will be extracted from signals using neural networks, and the subsequent formulation of rules will provide a conceptual framework essential for emotion analysis.
This EU-funded project aims to establish a pilot European multimedia network and organization with the purpose of motivating and encouraging school pupils to take up a career in technology and related businesses and industries, and to assist with the learning of languages within the context of learning about technological subjects in school at both primary and secondary level. The network will take the structural form of a group of European universities with skills in the development and use of multimedia materials and experience in teacher training, working together with teachers and pupils in schools to provide a unified range of user-configurable, multi-lingual, multimedia courses on topics in advanced technology for primary and secondary schools. Course materials will be developed in at least 6 topics, with all materials available in English, German, and Greek.
The project will be of 2 years duration starting in early 1998. ISDN links will be established between the 3 participating universities and from each of the universities to its group of schools (14 schools in total). Telecommunications service providers will establish these connections and be involved as partners in the project to assist with technical developments as well as with commercial aspects for exploitation of results. The project will form the pilot study for the creation of a self-funding European educational multimedia network and organization for the teaching and learning of advanced technology which has the capability of assisting language teaching. In the 2 years following the project (2000-2002) it is planned to provide access to MODULATES materials for over 1,000 schools throughout the EU.
M.CUBE is an Esprit project co-financed by the European Commission. It promotes the production of European multimedia and supports producers, companies, publishers and other cultural and technological bodies interested in applying multimedia technology to the cultural field. The aim of the project is to create a multimedia support network, based in four Mediterranean regions, dedicated to fostering the development of European multimedia applications in the area of culture and the arts.
The main objectives of M.CUBE are to (i) improve the competitiveness of the European multimedia industry, enhancing the quality of its products, improving the business capabilities of the enterprises, and attracting funds for new developments, (ii) create a critical mass to penetrate the consumer market at international level, putting together many small developers to address the market through big distributors, and (iii) exploit the enormous European potential in terms of cultural contents. Supplying multimedia support services to cultural users is a strategic task for M.CUBE. Museums, galleries, collections, public administrations, etc. are pilot users and clients of advanced technological solutions, based on the application of multimedia to culture and arts.
The M.CUBE contribution is to: First, contribute to the development of a whole editorial line, defining technical and artistic contents, market targets and channels, certification process, methodological recommendations, policy and programmes. Second, create a network of manufacturers, through setting up local associations and the development of international exchanges. Third, develop business opportunities with the final aim of stimulating new markets. Fourth, provide a large set of services through intermediation and direct delivery, technological and manufacturing services, training, access to multimedia repositories, certification, consulting activities, information desk, etc.
The project revolved around information extraction techniques from video sequences in order to detect and minimize data unnecessary to transmit. Our contribution was mainly in the fields of multiresolution techniques and hierarchical representation of images and in the use of ROIs (Regions of Interest) for efficient coding and compression of images.
Research grants [French]
Visual scene analysis and understanding research has moved from low-level concepts (e.g. object detection and segmentation) to high-level semantic understanding, such as visual interestingness, memorability and image/video captioning. At the same time, there is progress beyond unimodal processing towards solving multimodal problems. Visual question answering (VQA) and visual dialog have attracted great attention as means for AI systems to interact with humans in conversational language about visual content.
VQA is often treated as a classification problem: Using deep visual and language representations, a classifier predicts a probability distribution over a fixed list of frequent answers. Recent methods use attention models on both image and question. At the same time, memory networks are used for scene understanding and question answering over a long period of time.
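As a minimal illustration of this classification formulation, the sketch below fuses precomputed visual and question features and scores a fixed list of frequent answers; feature extractors are omitted and all dimensions are placeholders, not the models developed in the project.

```python
import torch
import torch.nn as nn

class SimpleVQA(nn.Module):
    def __init__(self, img_dim=2048, q_dim=768, hidden=1024, num_answers=3000):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)
        self.q_proj = nn.Linear(q_dim, hidden)
        self.classifier = nn.Sequential(nn.ReLU(), nn.Linear(hidden, num_answers))

    def forward(self, img_feat, q_feat):
        fused = self.img_proj(img_feat) * self.q_proj(q_feat)   # elementwise fusion
        return self.classifier(fused)                           # logits over frequent answers

model = SimpleVQA()
logits = model(torch.randn(2, 2048), torch.randn(2, 768))
answer_ids = logits.softmax(dim=-1).argmax(dim=-1)              # predicted answer indices
```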
This is a CIFRE PhD thesis project with InterDigital, aiming at designing novel question answering techniques based on deep learning to improve living experience at home. In particular, it will investigate moving from image understanding towards long-term multimodal-based context understanding. This may allow answering questions based on what has happened in the past.
We propose to address the problem of ambiguities in visual and textual content by learning and then combining their representations. As a use case, we propose to solve a new scientific task, namely Multimedia Question Answering, using three different sources of information. The task consists in answering a textual question associated with visual data, relying on an external knowledge base containing millions of unique entities. Each entity is represented by textual and visual content as well as links to other entities. In practice, we focus on four types of entities, namely persons, organisations (e.g. companies, NGOs, intergovernmental organizations), geographical points of interest (e.g. touristic places, remarkable buildings) and objects (e.g. commercial products). Achieving this objective requires progress on the disambiguation of each modality with respect to the others and to the knowledge base. We also propose to merge the representations into a common tri-modal space.
An important problem will be to determine the content to associate with an entity in order to adequately represent it, depending on its type (person, object, organisation, place). Since an entity can be associated with several vectors, each originating from a different modality, the challenge is to define a representation that is compact (for performance) while still expressive enough to reflect the potential links of the entity to other entities. The project has a potential economic impact in the field of data intelligence, including applications in marketing, security, tourism and cultural heritage. If successful, the output of the MEERQAT project could directly contribute to improving chatbots. During the project, the direct output will be mainly academic, consisting of scientific articles with the corresponding material to reproduce experiments. We also plan to release a new benchmark for the proposed task, in the context of an international evaluation campaign.
The proposed project lies in the field of computer vision, pattern recognition and machine learning. We particularly study two image recognition problems: image classification and image retrieval. Like machine learning as a whole, computer vision has witnessed a fundamental change with the recent re-popularization of Deep Neural Networks (DNN). Despite the success of DNNs, several limitations remain to be investigated.
(1) Complex recognition problems such as fine grained classification (highly similar categories e.g. bird species, airplane/car models, etc.) show that state of the art DNNs are still improved by better objective functions and more discriminative intermediate representations.
(2) Despite progress in using less annotated data, DNN can hardly cope with learning from few examples.
(3) DNNs have so many parameters and complex structures that it is extremely hard to understand what happens in every layer in producing the final decision.
This project aims to address these limitations. In particular, we will work towards building networks capable of solving fine-grained visual recognition tasks. We will improve the capabilities of networks to learn from few to no data, building highly discriminative representations that can address complex recognition problems. Following that, we will provide insight on how such models take their decisions.
For several years, Bibliothèque nationale de France (BnF) has been pursuing its policy of enriching its Gallica digital library with specialized collections, the main feature of which is to create batches of still images of printed or handwritten text. Thus, important batches of books, manuscripts, newspapers and magazines, photographs, maps, stamps, coins, sound recordings and scores have been added to Gallica. The impossibility of manually indexing all iconographic collections by unit encourages the adoption of an image indexing solution based on the automatic or semi-automatic analysis of their visual content. The objective is to specialize an image classification tool in heritage corpora in order to add semantics to the image analysis process. The challenge is to make the richness of these collections accessible to as many people as possible (general public and professionals), through intuitive and ergonomic tools and interfaces.
In this context, the objective of the project is to investigate the automatic classification of images in Gallica according to their visual content, and to produce annotations that could enrich existing metadata. The experience of Linkmedia in visual indexing and deep learning makes it possible to consider an experiment on the particular contents of the BnF. The project consists of the following.
(1) Identify an experimental iconographic corpus from Gallica digital collections.
(2) Build a prototype visual classification engine on this corpus.
(3) Evaluate the performance of the prototype by controlled evaluation using ground truth and by the Gallica Studio community.
Learning from few training samples is a topic that enjoys great scientific and industrial interest. In fact, deep learning approaches developed and advanced in recent years have typically relied on huge amounts of data. More recently, given the impressive performance of deep models in various large-scale tasks, the scientific community has started exploring the feasibility of these powerful techniques for other tasks with reduced amounts of available data. There are plenty of cases where access to high volumes of data is potentially difficult or expensive, or where the number of available training samples is intrinsically low. For such cases, the learning strategy of these multi-million-parameter architectures needs to be rethought in order to allow the networks to squeeze the maximum amount of information out of the few available samples.
This is a CIFRE PhD thesis project with Safran Tech, aiming to study architectures and learning methods most suitable for object recognition from few samples and to validate these approaches on multiple recognition tasks and use-cases in aerial imagery. In particular, use cases include (1) target objects being small in the image; (2) recognizing objects of the same class; (3) recognizing instances of the same object. The operational context of the recognition tasks requires the ability to recognize objects from a small image corpus with possibly large variations in illumination, orientation and context of the object of interest.
The ability of our mobile devices to process visual information is currently not limited by their camera or computing power but by the network. Many mobile apps suffer from long latency due to data transmitted over the network for visual search. MobilAI aims to provide fast visual recognition on mobile devices, offering quality user experience whatever the network conditions. The idea is to transfer efficient deep learning solutions for image classification and retrieval onto embedded platforms such as smart phones. The intention is to use such solutions in B2B and B2C application contexts, for instance recognizing products and ordering online, accessing information about artifacts in exhibitions, or identifying identity documents. In all cases, visual recognition is performed on the device, with minimal or no access to the network.
Research grants [Greek]
IS-Helleana aims to employ modern Semantic Web technologies to develop an integrated system for unified access, management, search and interactive presentation of the Greek audiovisual inventory. The system will enable audiovisual content providers to display their content in a single interoperable way, within a generalized, semantically rich presentation of the Greek audiovisual inventory, both in Greece and internationally. It will also enable users to effectively search for content and participate in the online interactive services provided by audiovisual content providers. The basic tool to achieve this goal is an online platform for semantic integration, management, enrichment and visualization of the audiovisual inventory.
The documentation and analysis of Byzantine art is an important component of the overall effort to preserve cultural heritage and contributes to learning about and comprehending one's historical path. Efficient publishing of the multi-dimensional and multifaceted information necessary for the complete documentation of artworks should draw on a good organization of the data.
Eikonognosia is a research project funded by the Greek General Secretariat of Research and Technology (GSRT) that aims to efficiently organize and publish detailed information about icons on the World Wide Web. Information derived from the analysis conducted in the Art Diagnosis Center of Ormylia Foundation is taken as a case study.
Eikonognosia provides the means for organizing detailed and multidimensional information about Byzantine icons in a way that is compatible with international standards (CIDOC-CRM, ISO 21127:2006) and allows for easy retrieval of data with advanced semantic web technologies. The ultimate goal of Eikonognosia is to foster the cultural heritage community by providing an integrated framework that facilitates the organization, retrieval and presentation of data from the cultural heritage domain.
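As a purely illustrative sketch of what CIDOC-CRM-compatible organization and semantic retrieval can look like in practice, the snippet below describes an icon as RDF triples and queries it with SPARQL using rdflib; the identifiers, class choice and title are hypothetical and not Eikonognosia's actual data model.

```python
# Illustrative sketch: describe an icon with a CIDOC-CRM class in RDF
# and retrieve it with SPARQL via rdflib (identifiers, class choice
# and title are hypothetical).
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

CRM = Namespace("http://www.cidoc-crm.org/cidoc-crm/")
EX = Namespace("http://example.org/icons/")

g = Graph()
g.bind("crm", CRM)

icon = EX["icon-001"]
g.add((icon, RDF.type, CRM["E22_Man-Made_Object"]))
g.add((icon, RDFS.label, Literal("Icon of the Virgin Hodegetria")))

# Retrieve all objects of that class together with their labels.
results = g.query("""
    PREFIX crm: <http://www.cidoc-crm.org/cidoc-crm/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?obj ?label WHERE {
        ?obj a crm:E22_Man-Made_Object ;
             rdfs:label ?label .
    }
""")
for obj, label in results:
    print(obj, label)
```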
The objective of the project is the development of innovative techniques for the representation, analysis and extraction of semantic information for the manipulation of multimedia content, with emphasis on their application to television news bulletins. Research and development will concentrate on techniques that enable the automatic analysis of multimedia content and the extraction of knowledge, resulting in the (semi-)automatic creation of metadata, as well as support for smart, semantic search services. The final output of the project will be a system for the analysis of television news bulletins, which will provide intelligent search functionalities in archives of digital television material.
The main scope of the proposed framework is the development of innovative techniques for the analysis, extraction of semantic information and management of multimedia content in general and multimedia documents in particular. The proposed system, called Visual Asset, consists of a number of distinct subsystems that undertake the different stages of processing, analysis, storage and access to the content of multimedia documents (text, images, video, audio and 3D representations), summarized in the following:
(1) Image and video processing subsystem, aiming at the automatic extraction of low-level characteristics (color, texture, shape, motion, speed, etc.) and of characteristic parts of objects (object segmentation, e.g. segmentation of persons in a video sequence). These results will be used in the subsystem integrating visual and textual information via ontologies, as well as in the search and retrieval subsystem based on visual information. A minimal feature-extraction sketch follows the list.
(2) Integration of visual and textual information subsystem. The object of this subsystem is the representation of knowledge using ontologies, the analysis of multimedia content based on visual and textual information, and the production of metadata with a common representation for both textual and visual information.
(3) Modeling and logical analysis of documents subsystem, aiming at automatic categorization and the creation of effective search and retrieval applications, providing advanced functionalities for organizing and managing large volumes of documents. The subsystem will integrate the results of visual and textual information and will use the notion of context in a document in order to fulfill the tasks of automatic categorization, logical analysis (table of contents, automatic recognition of chapters, titles, notes, references to images and video, etc.) and efficient search and retrieval.
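Item (1) is described only at a high level; as a minimal illustration of the kind of low-level visual description involved, the sketch below computes an HSV colour histogram per video frame with OpenCV. The file name and bin counts are assumptions.

```python
# Illustrative low-level colour feature extraction: one HSV colour
# histogram per video frame, of the kind a visual indexing subsystem
# could store for retrieval (file name and bin counts are assumed).
import cv2
import numpy as np

def frame_color_histograms(video_path, bins=(8, 8, 8)):
    descriptors = []
    capture = cv2.VideoCapture(video_path)
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1, 2], None, list(bins),
                            [0, 180, 0, 256, 0, 256])
        # L1-normalize so descriptors are comparable across frame sizes.
        descriptors.append(hist.flatten() / max(hist.sum(), 1.0))
    capture.release()
    return np.array(descriptors)

histograms = frame_color_histograms("news_bulletin.mp4")
```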
The network, consisting of six Greek academic institutions and two SMEs, will organize a series of training activities in order to introduce knowledge technologies for multimedia content to Greek research and commercial institutions.
PANORAMA aims at the development of a system for efficient search and mining of audiovisual data from large distributed multimedia databases over several types of networks (Internet, intranets, etc.). The main objectives of the project are the interoperability of databases and the availability of software products and services on networks for open multimedia access. Digitization of archive assets should provide all possible information (video, images, sound, text, etc.) for wide public and professional use.
The main subjects for implementation are content-based retrieval from multimedia databases, as well as the search for and extraction of characteristic scenes from video data for insertion in synthetic environments. The definition of audiovisual objects should be adopted within the framework of the MPEG-4 and MPEG-7 standards in order to focus the project on the application of emerging technologies. Intellectual property rights and copyright, preservation and security of the information are points that receive specific attention, using innovative protection methods such as watermarking for video data authentication and several encryption techniques.
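The retrieval component in such systems ultimately reduces to ranking database items by descriptor similarity to a query. The sketch below only illustrates that step with cosine similarity over hypothetical precomputed descriptors; it is not PANORAMA's actual indexing scheme.

```python
# Illustrative content-based retrieval step: rank database items by
# cosine similarity of their descriptors to a query descriptor
# (descriptors are assumed to be precomputed by some feature extractor).
import numpy as np

def retrieve(query_desc, database_descs, top_k=5):
    q = query_desc / np.linalg.norm(query_desc)
    db = database_descs / np.linalg.norm(database_descs, axis=1, keepdims=True)
    similarities = db @ q                      # cosine similarity per item
    ranking = np.argsort(-similarities)[:top_k]
    return ranking, similarities[ranking]

# Toy example: 1000 database items with 512-dimensional descriptors.
database = np.random.randn(1000, 512)
query = np.random.randn(512)
indices, scores = retrieve(query, database)
```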
The objective of the project is the design and prototype implementation of a Video on Demand (VoD) and Near Video on Demand (NVoD) system based on MPEG-2 digital video/audio technology. IVML and INFOLAB are exploiting their long experience in digital video applications and related services to bring to the market the final product - hardware and software - called "InTV". This product serves hotels and hospitals, providing them with the ability to offer VoD services to their clients on a pay-per-view basis.
InTV uses state of the art technologies such as high quality MPEG-2 video and is built on mature and evolving platforms including the Oracle Video Server. Movies are stored in MPEG-2 format in the video server's hard disks, and video streams are transmitted to the subscribers' rooms using a local area network. End-users are able to watch movies on a terminal (monitor or TV set) connected to a set-top-box and send messages from the set-top box back to the video server, using an infrared remote control.
The system provides both VoD and NVoD channels, including broadcast TV channels, and supports interactive movie selection as well as standard VCR controls (play/stop, pause, fast forward/rewind, etc.) for the VoD service. The integrated InTV system offers a variety of features that make it attractive for mid- and large-sized hotel-type enterprises, including a scalable architecture, easy installation and maintenance, interactive client service, pre-scheduled movies and flexible charging schemes. Optionally, it integrates other advanced features and online services, such as connection to the Internet, web browsing, e-mail, etc.
This project aims at the improvement of IVML through the acquisition of modern equipment, in order to provide better cooperation with other Greek partners in matters of editing, analysis and synthesis of images and video sequences, as well as transmission, storage and retrieval in multimedia environments. This includes certification according to the ISO 9001 standard.
The aim of the project was the design and operation of an integrated PACS system for the leading Greek hospital for the treatment of cardiovascular diseases. The PACS has the ability to transmit and store radiology images, ultrasound still images, ultrasound video of the heart, gamma-camera images and angiography video.
Conference organization
25th ACM Multimedia Conference
25th European Signal Processing Conference
Information Society Technologies Event 2006
15th International Conference on Knowledge Engineering and Knowledge Management
International Conference on Visual Information Engineering
International Conference on Artificial Neural Networks
International Conference on Artificial Neural Networks
2nd European Semantic Web Conference
ACM International Conference on Image and Video Retrieval
Academic juries
Conservatoire National des arts et Métiers (Cnam)
Inria Grenoble, Naver Labs Europe
IMT Atlantique Bretagne-Pays de la Loire
École Normale Supérieure, Inria, Valeo.ai
National and Kapodistrian University of Athens (NKUA)
Paris-Sorbonne Université
National and Kapodistrian University of Athens (NKUA)
Paris-Sorbonne Université
Université de Bordeaux
Universidad Autónoma de Madrid
Advising activities
Evaluator activities
State Education Development Agency Latvia
European Commission, DG Information Society and Media, Unit E2