# Summary

Dr. Yannis Avrithis is a research scientist in the LinkMedia team of Inria Rennes-Bretagne Atlantique, carrying out research on computer vision and machine learning. Before that he was at the National and Kapodistrian University of Athens (NKUA) and at the National Technical University of Athens (NTUA), Greece, where he led the Image and Video Analysis (IVA) team. He holds PhD and Diploma degrees from NTUA and an MSc degree from Imperial College, University of London, UK. He has been involved in 16 European, 3 French and 10 Greek research projects, has co-supervised 12 Ph.D. theses and 16 Diploma theses, and has published 3 theses, 3 edited volumes, 28 articles in journals, 112 in conferences and workshops, 8 book chapters and 21 technical reports in the above fields. He has contributed to the organization of 23 conferences and workshops, and is a reviewer for 15 scientific journals and 15 conferences. He is a member of IEEE, ACM, EURASIP, and the Technical Chamber of Greece.

# Research interests

Visual feature detection, representation of visual appearance and geometry, image matching and registration, image indexing and retrieval, clustering, nearest neighbor search, visual representation learning, unsupervised and semi-supervised learning, object detection and recognition, scene classification, image/video segmentation and tracking, and video summarization.

# Employment history

2016-
Research Scientist
Inria Rennes-Bretagne Atlantique
Campus de Beaulieu, 35042 Rennes Cedex, France
2015
Research Scientist
Laboratory of Algebraic and Geometric Algorithms
National and Kapodistrian University of Athens
Panepistimioupolis, 157 84 Ilissia, Greece
2014-
Odd Concepts Inc.
147-2 Suite 1201 Michelan 147 Bldg, Gangnam-gu, Seoul, Korea
2001-2014
Senior Researcher
Image, Video and Multimedia Systems Laboratory
National Technical University of Athens
9 Iroon Polytechniou Str., 157 73 Zographou, Greece
1999-2001
Project Manager, Research & Development
Syntax IT, Inc.
218 Mesogion Ave., 155 61 Holargos, Athens, Greece
1997-1999
Software Engineer
Infolab Multimedia Ltd
320 Mesogion Ave., 155 62 Holargos, Athens, Greece
1995-1997
Image, Video and Multimedia Systems Laboratory
National Technical University of Athens
9 Iroon Polytechniou Str., 157 73 Athens, Greece

# Teaching / training activities

2019-
Visiting Professor Computer Vision NKUA
Interdisciplinary postgraduate program
Information Technologies in Medicine and Biology (ITMB)
National and Kapodistrian University of Athens Greece
2018
Deep Learning for Computer Vision NKUA
Interdisciplinary postgraduate program
Information Technologies in Medicine and Biology (ITMB)
National and Kapodistrian University of Athens
Seminar Athens, Greece May 10-11, 2018

In this two-day seminar I focused on parts of the Deep Learning for Vision course, in particular visual representation, learning, convolution, differentiation, optimization, and network architectures.

2017-
Interdepartmental master
Research in Computer Science (SIF)
University of Rennes 1, University of Southern Brittany (UBS), ENS Rennes, National Institute of Applied Sciences (INSA), CentraleSupélec Rennes, France

I am responsible for this course. It studies learning visual representations for common computer vision tasks including matching, retrieval, classification, and object detection. The course discusses well-known methods from low-level description to intermediate representation, and their dependence on the end task. It then studies a data-driven approach where the entire pipeline is optimized jointly in a supervised fashion, according to a task-dependent objective. Deep learning models are studied in detail and interpreted in connection to conventional models. The focus of the course is on recent, state-of-the-art methods and large-scale applications. The course syllabus includes visual representation, local features and spatial matching, codebooks and kernels, learning, differentiation, convolution, optimization, network architectures, object detection, and retrieval.

2007-2014
School of Electrical and Computer Engineering
National Technical University of Athens Greece

I assist in lectures and conduct a weekly laboratory, using Matlab. The laboratory is graded independently and counts for 50% of the course. The syllabus includes image sampling and quantization, two-dimensional transforms, image filtering, edge detection, enhancement and restoration, image and video coding and compression, and JPEG and MPEG standards. The laboratory includes additional topics, in particular Hough transform, corner and local feature detection, descriptor extraction, image retrieval, template matching and motion analysis.

2007
Organizing Committee member SSMS 2007
Multimedia Semantics: Analysis, Annotation, Retrieval and Applications
Summer School Glasgow, UK July 15-21, 2007
2006
Organizing Committee member SSMS 2006
Multimedia Semantics: Analysis, Annotation, Retrieval and Applications
Summer School Chalkidiki, Greece September 4-8, 2006
2005-2014
School of Electrical and Computer Engineering
National Technical University of Athens Greece

I assist in lectures and prepare exercise material. The course syllabus includes signal and system properties, convolution, correlation, sampling, quantization, Fourier series, discrete and continuous time Fourier transforms, Laplace and Z transforms, time and frequency analysis of linear, time-invariant systems, stability, and discrete Fourier transform.

2003-2005
Seminar Organizer MultiMine
Multimedia Knowledge Discovery and Management
Research and Technology Training Network Athens, Greece
1996-1999
1996-1998
1998-2000
Lecturer Pascal programming language
Department of Computer Science
Hellanion University of Portsmouth Athens, Greece
1998-2000
Lecturer C programming language
Department of Computer Science
Hellanion University of Portsmouth Athens, Greece

# Invited talks

Mar 2019
Inria Rennes-Bretagne Atlantique France

In this talk we discuss graph-based methods for image retrieval on manifolds, unsupervised representation learning and semi-supervised learning. We begin with a classic method for ranking data on manifolds that is adapted to descriptors of overlapping image regions for image retrieval, implemented by a sparse linear system solver. We show that this is equivalent to linear graph filtering, smoothing in particular, of a sparse signal in the frequency domain. We then apply this methodology for unsupervised network fine-tuning for retrieval, where positive and negative examples are found by disagreements between Euclidean and manifold similarities. Finally, we revisit classic graph-based transductive methods for semi-supervised learning and introduce an inductive framework, using label propagation to train a deep neural network.
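
The ranking step mentioned above can be illustrated with a minimal sketch, assuming the classic formulation in which scores `f` solve `(I - alpha * S) f = y`, where `S` is the symmetrically normalized adjacency matrix of the graph and `y` is a sparse indicator of the query. The function names, the toy graph, the value `alpha = 0.9` and the use of a plain conjugate gradient solver are illustrative assumptions, not the implementation presented in the talk.

```python
import math

def add_edge(W, i, j, weight):
    """Insert an undirected weighted edge into a list-of-dicts adjacency."""
    W[i][j] = weight
    W[j][i] = weight

def normalize(W):
    """Return S = D^{-1/2} W D^{-1/2}; assumes every node has positive degree."""
    deg = [sum(row.values()) for row in W]
    return [{j: w / math.sqrt(deg[i] * deg[j]) for j, w in row.items()}
            for i, row in enumerate(W)]

def apply_system(S, alpha, x):
    """Compute (I - alpha * S) x, touching only the nonzero entries of S."""
    return [x[i] - alpha * sum(w * x[j] for j, w in S[i].items())
            for i in range(len(x))]

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def rank_on_manifold(W, query, alpha=0.9, tol=1e-10, max_iter=1000):
    """Solve (I - alpha * S) f = y by conjugate gradient, y = indicator of query.

    The system matrix is symmetric positive definite for 0 < alpha < 1,
    which is what makes conjugate gradient applicable here.
    """
    n = len(W)
    S = normalize(W)
    y = [0.0] * n
    y[query] = 1.0
    f = [0.0] * n
    r = list(y)          # residual y - A f for the initial guess f = 0
    p = list(r)          # search direction
    rs = dot(r, r)
    for _ in range(max_iter):
        if math.sqrt(rs) < tol:
            break
        Ap = apply_system(S, alpha, p)
        step = rs / dot(p, Ap)
        f = [fi + step * pi for fi, pi in zip(f, p)]
        r = [ri - step * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return f

# Toy graph (hypothetical): two triangles joined by one weak edge.
W = [dict() for _ in range(6)]
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    add_edge(W, i, j, 1.0)
add_edge(W, 2, 3, 0.1)

scores = rank_on_manifold(W, query=0)
ranking = sorted(range(len(W)), key=lambda i: -scores[i])
print(ranking)
```

In this toy example the nodes in the query's own triangle receive higher scores than those reached only through the weak edge, which is the behaviour that ranking on a manifold is meant to capture.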

Mar 2018
Safran Tech Paris, France

In this talk we discuss methods for visual representation learning from raw input data targeting image retrieval, that is, ranking according to visual similarity. The focus is on methods that do not require human annotation for supervision. We begin with background work on feature pooling from convolutional neural network activations, metric learning and algorithmic supervision. We then discuss a number of recent methods for efficient ranking on manifolds and their use for representation learning. These methods are based on a nearest neighbor graph of a given dataset and can be cast as linear graph filtering. Efficient solutions are found in the time or frequency domain, using simple ideas from numerical linear algebra. When applied to representation learning, these methods allow for improving the representation by just observing data.

May 2017
Facebook Research Menlo Park, CA, US

Visual instance recognition has undergone a spectacular improvement by fine-tuning convolutional networks on properly designed image matching tasks. However, retrieving small objects is a common failure case that requires representing an image with several regions rather than a global descriptor. In this work, we discuss a principled query expansion mechanism on descriptors of overlapping image regions, based on a nearest neighbor graph constructed offline. We introduce a new way of handling unseen queries online, without adjusting the precomputed data. We rank images through a sparse linear system solver, yielding practical query times well below one second.

While this result is encouraging, it also shows that the learned representations still lie on manifolds in a high dimensional space, and that exploring the manifolds online remains expensive. We therefore introduce an explicit embedding, reducing manifold search to Euclidean search followed by dot product similarity search. We show this is equivalent to linear graph filtering of a sparse signal in the frequency domain, and we introduce a scalable offline computation of an approximate Fourier basis of the graph. This reproduces or improves the results of online query expansion, at query times comparable to standard similarity search.

Mar 2016
In 38th Pattern Recognition and Computer Vision Colloquium
Center for Machine Perception (CMP)
Czech Technical University (CVUT) Prague, Czech Republic

In this talk we discuss the role of approximate nearest neighbor search in large scale clustering, and applications in vision. We begin with a number of binary codes and product quantization extensions, highlighting the relation to nonlinear dimensionality reduction. We touch upon the deep connection between the two problems, involving distance maps in arbitrary dimensions. We then revisit recent advances in approximate k-means variants, and present a new one (IQ-means) that borrows their best ingredients. Combined with powerful deep learned representations, this method achieves clustering of a 100 million image collection on a single machine in less than one hour, while dynamically determining the number of clusters.

Code for IQ-means is available on GitHub.

Jan 2016
Institute of Computer Science (ICS)
Foundation for Research and Technology-Hellas (FORTH) Heraklion, Greece

In this talk we present a panoramic view of the geometry underpinning a number of vision problems, ranging from early vision to unsupervised mining in large image collections and beyond. Interplaying between continuous and discrete representations, geometry appears in different forms of duality, embeddings, and manifolds. We begin with planar shape decomposition as studied in psychophysics to model either occlusion or parts of recognition. Focusing on distance maps and the medial axis representation, we then generalize to natural images towards perceptual edge grouping and equivariant local feature detection.

Adopting sets of local features and descriptors as an image representation, we then shift to visual instance search and recognition. We discuss a form of flexible spatial matching as mode seeking in the transformation space, a number of embeddings and match kernels in the descriptor space, and feature selection or aggregation in both. Acknowledging that the problem often boils down to nearest neighbor search in high-dimensional spaces, we consider a number of binary codes and product quantization extensions, highlighting the relation to nonlinear dimensionality reduction. Finally, we touch upon the deep connection between nearest neighbor search and clustering. In doing so, we revisit distance maps and medial representations, now in arbitrary dimensions.

Jun 2015
General Robotics, Automation, Sensing & Perception (GRASP) Laboratory
University of Pennsylvania US

In this talk we present a panoramic view of the geometry underpinning a number of vision problems, ranging from early vision to unsupervised mining in large image collections and beyond. Interplaying between continuous and discrete representations, geometry appears in different forms of duality, embeddings, and manifolds. We begin with planar shape decomposition as studied in psychophysics to model either occlusion or parts of recognition. Focusing on distance maps and the medial axis representation, we then generalize to natural images towards perceptual edge grouping and equivariant local feature detection.

Adopting sets of local features and descriptors as an image representation, we then shift to visual instance search and recognition. We discuss a form of flexible spatial matching as mode seeking in the transformation space, a number of embeddings and match kernels in the descriptor space, and feature selection or aggregation in both. Acknowledging that the problem often boils down to nearest neighbor search in high-dimensional spaces, we consider a number of binary codes and product quantization extensions, highlighting the relation to nonlinear dimensionality reduction. Finally, we touch upon the deep connection between nearest neighbor search and clustering. In doing so, we revisit distance maps and medial representations, now in arbitrary dimensions.

Mar 2015
Laboratory of Algebraic and Geometric Algorithms (ΕρΓΑ)
National and Kapodistrian University of Athens (NKUA) Greece

Hashing is a popular solution to approximate nearest neighbor search, and appears in two variants: indexing data items in hash tables, or representing items by short binary codes and using these compact representations to approximate distances. We focus on the second approach, and more specifically on methods that learn codes from the data distribution.

We then present methods based on vector quantization, which are a natural generalization. In particular, we discuss exhaustive and non-exhaustive variants of product quantization including recent optimizations, as well as additive quantization. Finally, we explore the opposite direction, that of using nearest neighbor search to speed up vector quantization itself.

Oct 2014
In $\mu$-Workshop on Computer Vision
TEXMEX Research Team
Inria Rennes-Bretagne Atlantique France

The first part of this talk considers a family of metrics to compare images based on their local descriptors. It encompasses the VLAD descriptor and matching techniques such as Hamming embedding. Making the bridge between these approaches yields a match kernel that takes the best of existing techniques by combining an aggregation procedure with a selective match kernel.

Since image search using either local or global descriptors boils down to approximate nearest neighbor search, the second part of this talk considers this problem, focusing on vector quantization methods. A recent method is presented whereby residuals over a coarse quantizer are used to locally optimize an individual product quantizer per cell. Non-exhaustive search strategies are discussed, including an inverted multi-index.

Mar 2011
Laboratoire Bordelais de Recherche en Informatique (LaBRI)
University of Bordeaux France

Methods based on local features have been very successful in visual search, especially when the objective is to identify near-identical objects or scenes under occlusion and varying viewpoint or lighting conditions. After a brief introduction to such methods, including bag-of-words models, sub-linear indexing and spatial matching, this talk focuses on recent research results related to local feature detection and the role of geometry, as well as a number of applications.

In particular, we present methods based on image gradient and distance maps that are able to detect blob-like regions of arbitrary scale and shape, and their application to image matching. We then investigate the potential of embedding the spatial matching process within the index, so that it becomes sub-linear as well. We also report on an accelerated spatial matching method for re-ranking that allows flexible matching of multiple surfaces.

We then move to the more difficult problem of organizing large photo collections, and examine the use of sub-linear indexing in a mining process. Photos are automatically grouped wherever they depict the same scene; this structure is then exploited to increase the recall of the retrieval process. Working on community collections of geo-tagged photos depicting urban scenery, this approach is applied to automatic location and landmark recognition from a single photo. We also present our online application, VIRaL. Finally, we present our C++ template library ivl, which is used as infrastructure in our implementations.

Sep 2006
Semantic multimedia content analysis
In Workshop on Knowledge in Multimedia Content
IST Event 2006 Helsinki, Finland

# Ph.D. students / co-supervision

2018-
Yann Lifchitz
Few-shot learning for object recognition in aerial images
2017-
Hanwei Zhang
2016-
Invariance and supervision in visual learning
2013-
Nikos Papanelopoulos
Visual sketch retrieval
2009-2014
Yannis Kalantidis
now Research Scientist at Facebook Research, US

New applications that exploit the huge data volume in community photo collections are emerging every day and visual image search is therefore becoming increasingly important. In this thesis we propose clustering- and nearest neighbor-based improvements for visual image search. Clustering is either performed on feature space or on image space, i.e. on high-dimensional vector spaces or metric spaces, respectively.

We first introduce a clustering method that combines the flexibility of Gaussian mixtures with the scaling properties needed to construct visual vocabularies for image retrieval. It is a variant of expectation-maximization that can converge rapidly while dynamically estimating the number of components. We employ approximate nearest neighbor search to speed up the E-step and exploit its iterative nature to make search incremental, boosting both speed and precision. We achieve superior performance in large scale retrieval, being as fast as the best known approximate k-means algorithm.

We then present our locally optimized product quantization scheme, an approximate nearest neighbor search method that locally optimizes product quantizers per cell, after clustering the data in the original space. When combined with a multi-index, its performance is unprecedented and sets the new state of the art on a billion-scale dataset. At the same time, our approach enjoys query times on the order of a few milliseconds, and it becomes comparable in terms of speed even to hashing approaches.

We next focus on large community photo collections. Most applications for such collections focus on popular subsets, e.g. images containing landmarks or associated with Wikipedia articles. In this thesis we are concerned with the problem of accurately finding the location where a photo is taken without needing any metadata, that is, solely by its visual content. We also recognize landmarks where applicable, automatically linking them to Wikipedia. We show that the time is right for automating the geo-tagging process, and we show how this can work at large scale. In doing so, we do exploit redundancy of content in popular locations--but unlike most existing solutions, we do not restrict to landmarks. In other words, we can compactly represent the visual content of all thousands of images depicting e.g. the Parthenon and still retrieve any single, isolated, non-landmark image like a house or graffiti on a wall.

Starting from an existing, geo-tagged dataset, we cluster images into sets of different views of the same scene. This is a very efficient, scalable, and fully automated mining process. We then align all views in a set to one reference image and construct a 2D scene map. Our indexing scheme operates directly on scene maps. We evaluate our solution on a challenging one million urban image dataset and provide public access to our service through our online application, VIRaL.

The thesis concludes with two chapters. The first is a summary of other approaches for visual search and applications, like geometry indexing, logo detection and clothing recognition, while the second presents conclusions and possible future directions.

2008-2016
Christos Varytimidis
now Senior Machine Learning Engineer at Workable, Greece

Low-level image analysis offers an intermediate image representation that is used by high-level computer vision algorithms (e.g. object detection and recognition, image and video retrieval, image matching). Local features extracted as regions of interest, or spatio-temporal interest points extracted from videos, combined with local descriptors, as well as global descriptors, offer a compact representation of visual information. Despite the fact that many local feature detectors have been proposed recently, this field of research is still open to new methods, as new and more complex application fields are introduced. Recently, the interest of the computer vision community has focused on deep neural networks, based on results in image classification tasks.

We propose a new local feature detector, based on geometric constructions. In particular, we propose using α-shapes to describe the shape of a set of points sampled on an image. Given the point set, α-shapes describe image objects at different scales and with different levels of detail. For image sampling, we propose two different approaches: sampling on image edges and sampling using error diffusion. For sampling image edges, we propose a method that exploits the local affine shape in order to adapt sampling density, as well as a baseline method that uses fixed density sampling. We also propose sampling using error diffusion on two different functions of image intensity. The first one is based on first-order derivatives of image intensity (gradient strength), while the second one is based on second-order derivatives (Hessian response).

We use different triangulations of the samples and different α-shapes, and propose the anisotropically weighted α-shapes that exploit the local shape of each simplex of the triangulation. For selecting regions of interest, we propose different importance measures for the connected components of α-shapes. We qualitatively and quantitatively evaluate the proposed local feature extraction algorithm, under all proposed variations for each algorithm step. Our detector extracts a relatively small number of features from image regions that correspond to highly repeatable object parts. Its performance exceeds the state-of-the-art in most cases.

We also propose an efficient method for describing video clips, using deep neural networks. We segment videos into shots, using a novel method that exploits a global "objectness" measure. For describing video frames, we exploit neural network feature maps, and then aggregate the responses to create a single descriptor for the video shot. We evaluate the proposed method on a surgical video retrieval experiment, where other methods based on local features are outperformed.

2008-2013
Giorgos Tolias
now Postdoc Researcher at Center for Machine Perception, Czech Republic

A wide range of properties and assumptions determine the most appropriate spatial matching model for an application, e.g. recognition, detection, registration, or large scale image retrieval. Most notably, these include discriminative power, geometric invariance, rigidity constraints, mapping constraints, assumptions made on the underlying features or descriptors and, of course, computational complexity.

We present a new approach to image indexing and retrieval, which integrates appearance with global image geometry in the indexing process, while enjoying robustness against viewpoint change, photometric variations, occlusion, and background clutter. We exploit shape parameters of local features to estimate image alignment via a single correspondence. Then, for each feature, we construct a sparse spatial map of all remaining features, encoding their normalized position and appearance, typically vector quantized to a visual word. An image is represented by a collection of such feature maps and RANSAC-like matching is reduced to a number of set intersections. We use min-wise independent permutations and derive a similarity measure for feature map collections. In addition to random selection, we have further exploited multiple view matching for feature selection. This allows us to scale geometry indexing up to 1M images. We then exploit sparseness to build an inverted file whereby the retrieval process is sub-linear in the total number of images, ideally linear in the number of relevant ones.
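
As a rough illustration of min-wise independent permutations, the following minimal sketch estimates the overlap (Jaccard similarity) of two sets of visual word ids from short signatures. It is a generic min-hash example under stated assumptions: the vocabulary size, the number of permutations, the seed and all function names are hypothetical, and the similarity measure for feature map collections derived in the thesis is not reproduced here.

```python
import random

def make_permutations(vocab_size, num_perms, seed=0):
    """Draw explicit random permutations of the visual word ids."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_perms):
        perm = list(range(vocab_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash_signature(words, perms):
    """For each permutation, keep the smallest permuted id over the set."""
    return [min(perm[w] for w in words) for perm in perms]

def estimate_similarity(sig_a, sig_b):
    """Fraction of agreeing entries; an estimate of |A & B| / |A | B|."""
    agree = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return agree / len(sig_a)

# Hypothetical example: two sets of visual words sharing a third of their union.
perms = make_permutations(vocab_size=1000, num_perms=500)
set_a = set(range(0, 60))
set_b = set(range(30, 90))
sig_a = minhash_signature(set_a, perms)
sig_b = minhash_signature(set_b, perms)

exact = len(set_a & set_b) / len(set_a | set_b)
print(exact, estimate_similarity(sig_a, sig_b))
```

Two sets are compared through their fixed-length signatures alone, so set intersection is replaced by element-wise comparison; more permutations give a tighter estimate at the cost of longer signatures.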

We further present a very simple model inspired by Hough voting in the transformation space, where votes arise from single feature correspondences. A relaxed matching process allows for multiple matching surfaces or non-rigid objects under one-to-one mapping, yet is linear in the number of correspondences. We apply it to geometry re-ranking in a search engine, yielding superior performance with the same space requirements but a dramatic speed-up compared to the state of the art.

We further extend and use our relaxed spatial matching for self-matching and symmetry detection. We assume that features participating in symmetric and repeating structures have higher probability to be matched between different views of the same object. Information from geometric self-matching and matching of the image with its mirrored counterpart is used for feature selection of single images.

In contrast to the previous methods that we discussed or proposed, which all use only visual word information to perform feature matching, we further exploit the Hamming Embedding (HE) technique, which additionally uses descriptor information. HE equips each feature with a visual word and a binary signature, which allows more precise feature matching. We develop a novel query expansion strategy which is aligned with the HE representation. In contrast to previous query expansion methods, we improve performance even without geometry matching, while keeping query times low. We finally show that combining our scheme with geometry matching can further boost performance and outperform state-of-the-art methods.

2005-2009
Evaggelos Spyrou
now Assistant Professor at University of Thessaly, Greece

The growth of production and demand for digital audiovisual content during the last few decades has been overwhelming. To fulfill the needs of its users, this multimedia content should be annotated, commented and classified into appropriate semantic classes, in order to facilitate search and access to it. This thesis deals with the analysis of multimedia content and addresses a few of the most important research problems in the field of multimedia analysis. More specifically, it addresses problems such as image classification, image region classification and detection of concepts in images. To achieve this, certain techniques that exploit directly and indirectly the knowledge of a domain are proposed and evaluated. This knowledge is encoded either in the form of appropriate ontologies, or by modeling the context of the images and their regions, or by applying machine learning techniques. The thesis emphasizes the use of the bag-of-words model in order to describe the visual features of images. Finally, the techniques applied in high-level concept detection in images are extended in order to be applied to the problems of image retrieval and video summarization.

2004-2009
now Machine Learning Engineer at Tadaweb, Luxembourg

The main research area of this thesis, in a broad sense, is the integration of knowledge technologies into the analysis and description of multimedia. Knowledge technologies can aid computer vision tasks towards the improvement of the understanding of visual content, by exploiting a priori knowledge in algorithms of semantic image and video analysis. More specifically, we examine the problem of image and video segmentation and we propose novel techniques for the detection, extraction, recognition and tracking of objects, based on semantic and visual criteria. We propose a semantic segmentation approach, which enhances region growing algorithms with semantic characteristics, in order to deal with problems that arise from the shortcomings of describing semantic entities by visual characteristics exclusively. Moreover, we propose a structured knowledge framework, which we call visual logics, based on description logics and their fuzzy extensions, to link visual data with concepts that form the vocabulary of a domain. We use a set of axioms and a reasoning engine to infer possible semantic interpretations of parts or the whole of an image.

2003-2008
Konstantinos Rapantzikos
now VP of Data Science at Workable, Greece

Although human vision appears to be easy and unconscious, there exist complex neural mechanisms in the primary visual cortex that form the pre-attentive component of the Human Visual System (HVS) and lead to visual awareness. Considerable research has been carried out into the attention mechanisms of the HVS, and computational models have been developed and applied to common computer vision problems. Most of the models simulate the bottom-up mechanism of the HVS and their major goal is to filter out redundant visual information and detect/enhance the most salient parts of the input. The HVS has the ability to fixate quickly on the most informative (salient) regions of a scene and therefore reduce the inherent visual uncertainty. Computational visual attention (VA) schemes have been proposed to account for this important characteristic of the HVS. The dissertation studies and expands the field of computational visual attention methods, proposes novel models for both spatial (images) and spatio-temporal (video sequences) analysis, and evaluates them both qualitatively and quantitatively in a variety of relevant applications.

2003-2008
Phivos Mylonas
now Associate Professor at Ionian University, Greece

The main research objective of this thesis is to tackle issues related to multimedia content processing, search and retrieval, through the prism of context, as the latter is expressed in the fields of knowledge adaptation and information access. More specifically, the main research motivation came from two major research fields: (i) multimedia content personalization and (ii) multimedia content analysis based on visual context. The thesis tackles issues such as data mining, thematic categorization of multimedia documents, multimedia personalization, retrieval and ranking of personalized multimedia documents, knowledge-assisted analysis optimization through visual context exploitation, mid-level visual analysis and context utilization, and contextual image classification problems. In this direction, it presents research results and indicative applications, in order to facilitate the proposed interpretation.

2001-2005
Manolis Wallace
now Assistant Professor at University of Peloponnese, Greece

Uncertainty has gradually attained acceptance and a very distinct role in scientific thought as well as in the scientific view of the world. As far as intelligent knowledge based systems are concerned, uncertainty is present at all levels of their operation and its role determines their effectiveness. In this thesis we propose a series of solutions to uncertainty related problems. In their turn, these solutions provide for further thought and progress in a series of directions.

In the first part of the thesis, which is also the lengthiest, the emphasis is on the semantics. In this framework, the important problems to consider are those of modeling real world concepts thus constructing a formal knowledge base and of exploiting the information contained in this knowledge base in practical applications, given its size. In this direction, chapter 2 proposes the utilization of fuzzy relations for the representation of knowledge and explains how this knowledge can be used in order to automatically extract the context. Chapters 3 and 4 focus on the size of this knowledge and provide computational models for its efficient handling. Chapters 5 and 6 deal with the intelligent utilization of such knowledge in the framework of information retrieval.

In the second part of the thesis we move on to a level between concepts and numeric data. Thus, chapter 7 explains how we can use high level linguistic information in order to handle uncertain low level numerical data. Focus is both on the uncertainty within the low level data and on the flexibility required in order for the high level information to provide for an adequate description of the real world.

In the third and last part of the thesis we work solely with numerical data. Chapters 8 and 9 deal with the automated analysis of data for the generation of neural models that are able to map the structure of the data, while chapter 10 moves on to the processing of these models in order to automatically extract higher level information from the available numerical data.

Chapter 11 summarizes conclusions drawn from this thesis and refers to directions of possible further work that come out of this work.

# Diploma Thesis/M.Sc. students / co-supervision

2015-2016
now PhD Student at ÉTS Montréal, Canada

Our goal in this work has been to investigate new methods for object detection combining already existing tools and algorithms: Convolutional Neural Networks (CNN) and the Hough transform. CNNs are a standard tool in deep learning for image classification, and are increasingly used for object detection as well. The Hough transform is a category of algorithms that uses votes to predict where potential objects could be. We start with some background to help understand this report, and discuss related work. Then, we present our work, step by step. Starting with a standard neural network architecture, we gradually add new functions and layers that bring us closer to our goal. We first retrain parts of the network to fit our new training data. Then, we change the network so that it considers subparts of images rather than images as a whole. Next, we add layers to limit false positives and duplicates during detection. Finally, we create an end-to-end method to train the network. Since this is still ongoing work, we discuss methods that can be tested in the future.

2014-2015
Giorgos Mitsis
now PhD Student at National Technical University of Athens, Greece

Object proposals is a relatively new problem which appeared due to the complexity of modern object detectors and their high execution time. The purpose of object proposal algorithms is the high speed class-agnostic detection of all objects in the image. The proposals are then passed to the object detectors so that they avoid the exhaustive search of the image using the sliding window approach. This way, the time needed to detect objects is drastically reduced which enables them to use more complex and effective algorithms. Modern object detectors use object proposals.

In our thesis we present most modern methods for the extraction of object proposals and we propose a new method, Segment Boxes. This method uses segmentation of the image and by using the resulting segments we score windows inside the image based on the possibility that they contain objects. We try to encapsulate good ideas of other methods as well as some of our own to achieve best results, so we end up with several variants of our method.

We compare those different approaches and the best ones are compared with the state-of-the-art methods, using the appropriate metrics, on images from the PASCAL VOC07 and ImageNet2013 datasets. We then use our proposals with a modern object detector which uses deep learning and convolutional neural networks, Fast R-CNN, and we again compare our results with those of other methods, this time on the problem of object detection. Our results are competitive with those of the state-of-the-art methods, and in some cases they even exceed them, while achieving low execution time (one of our approaches runs in 0.3 seconds per image). Our goal is to examine the potential of segmentation on the problem of object proposals.

2014-2015
Vasileios Chatzipanos
now Software Engineer at Crunchr, The Netherlands

This thesis addresses the problem of content based large scale image retrieval (CBIR). We study the algorithms and methods that are being used to produce compact image vector representations and primarily the VLAD image representation. We showcase the main algorithm used to produce the VLAD vector and known methods to improve it. Finally, we examine novel image representation vectors based on VLAD and a normalization scheme that can be used for further improvement.
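As a rough illustration of the VLAD representation mentioned above, the following minimal sketch aggregates local descriptors by summing their residuals to the nearest codebook centroid. The function name `vlad`, the `alpha` parameter, and the power-law plus L2 normalization are assumptions chosen for illustration (a commonly used variant); they are not taken from the thesis, whose own normalization scheme is not specified here.

```python
import math


def vlad(descriptors, centroids, alpha=0.5):
    """Aggregate local descriptors into a single VLAD vector (illustrative sketch).

    descriptors: list of d-dimensional local descriptors (lists of floats)
    centroids:   list of k d-dimensional codebook centroids
    alpha:       exponent of the signed power-law normalization (assumed default)
    Returns a flat list of length k * d.
    """
    k, d = len(centroids), len(centroids[0])
    residuals = [[0.0] * d for _ in range(k)]
    for x in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared L2 distance).
        nearest = min(
            range(k),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])),
        )
        # Accumulate the residual x - c for that centroid.
        for j in range(d):
            residuals[nearest][j] += x[j] - centroids[nearest][j]
    # Concatenate the per-centroid residual sums into one vector.
    flat = [v for row in residuals for v in row]
    # Signed power-law normalization followed by global L2 normalization.
    flat = [math.copysign(abs(v) ** alpha, v) for v in flat]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat
```

With k centroids and d-dimensional descriptors the output has k * d dimensions, independent of the number of local descriptors in the image.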

2013-2014

We propose a new method for the vector representation of an image which is based on aggregation of image descriptors while exploiting their spatial information. Our method, Spatial Pyramid with Vectors of Locally Aggregated Descriptors (SP-VLAD), was designed for the problem of image retrieval in order to achieve high accuracy and efficiency, with low memory requirements. SP-VLAD is based on the ideas of two other methods, Spatial Pyramid Matching (SPM) and Vector of Locally Aggregated Descriptors (VLAD). Specifically, it combines the idea of the spatial pyramid with the VLAD descriptor vectors. The SP-VLAD method achieves high accuracy and significantly outperforms the SPM and VLAD methods. The promising results led us to apply our method to the problem of image classification as well, exceeding the classification rate of the SPM and VLAD methods on all databases used. These results were achieved with low memory usage after reducing the dimension of the descriptor vectors with PCA, which resulted in vectors of 64 or 128 dimensions. We also created the Flowers 15 dataset for the needs of our research, in order to be able to test the SP-VLAD method on images of flowering plants.

2013-2014
Georgios Perakis
now Key Account Manager at inos Automation, Germany

License plate recognition is an integral part of intelligent traffic control systems, with ever more applications. The goal of this diploma thesis is the creation of a prototype system for license plate detection and recognition with the use of local descriptors of the image. The most important methods so far for both parts of the problem are described. Problems that arose during the implementation of the method are presented, as well as the chosen solutions against them. Furthermore, the need to create a global data set for license plate recognition is noted, since without it there cannot be a reliable comparison among the suggested solutions and selection of the best. For the data set of Greek license plates that was available, the detection rate was 94% and the recognition rate 32.3%, with the use of a free platform.

2011-2012
Christos Arvanitis
now DevOps Engineer at Impact Tech, Cyprus

Augmented Reality has been a growing area within the field of Virtual Reality in recent years. An Augmented Reality system supplements the real world with virtual graphical models, creating the illusion of coexistence between the real and the virtual world. This process requires accurate camera pose estimation based on computer vision methods. In the framework of this thesis, we aim to study camera pose estimation methods based on the detection of local features in successive frames, without prior knowledge of the environment. Furthermore, we present an implementation of an Augmented Reality application. Epipolar constraints are used for the camera pose estimation.

2011-2012
Agni Delvinioti
now Data Scientist at Ferrovie dello Stato Italiane, Italy

In the framework of this Diploma thesis we introduce a new image categorization method, which integrates spatial matching and indexing in the classification process. Spatial matching is based on Hough pyramid matching (HPM); indexing is based on an inverted file structure as in image retrieval; and classification is carried out with a multi-class support vector machine (SVM) classifier. We use HPM as an image similarity measure and we show that under reasonable assumptions it is a Mercer kernel. We do so by explicitly expressing it as an inner product in a high dimensional space where images lie given an appropriate quantized representation of their local features and descriptors. We then use this kernel for SVM training instead of a linear kernel, which is a typical choice under the bag of words (BoW) model. It is the first time that a kernel function takes spatial configuration into account while being invariant to translation, scale and rotation. In most cases, artificial perturbations are the only way to achieve geometric invariance, with an exponential increase of training time.

We train one binary SVM classifier for each category following a one-versus-the-rest strategy and then combine individual classifiers into one multi-class classifier. Compared to a nearest-neighbor classifier using e.g. image retrieval methods, we exploit the sparse representation of SVMs: at classification time, the query image is matched via HPM against the chosen support vectors only. However, matching need not be exhaustive. Support vectors are indexed into an inverted file, and HPM may be applied only to a small subset that is top-ranking according to any scalar similarity measure, e.g. based on BoW. The method therefore easily applies to large scale classification, while training for unseen classes does not require re-training for existing ones.

Due to the nature of local features and their use in invariant matching, the method is most appropriate for specific object recognition. We apply it to landmark recognition, conducting experiments on our own dataset, constructed from the World cities dataset via a semi-automatic process that combines visual and geographical clustering. We compare to a baseline classifier using a BoW representation and achieve more than a twofold increase in accuracy on experiments of up to 68 landmarks.

2011-2012

In the framework of this thesis, we present new image segmentation techniques based on a weighted medial axis decomposition procedure. Starting from image gradient or gray-scale contour map, we first compute a weighted distance map and its weighted medial axis by a linear-time process. Now, applying the same distance propagation from the medial axis backwards, we dually obtain an initial image partition and a graph representing image structure. This is equivalent to applying watershed transform on the weighted distance map, hence is both topological and contrast-weighted. However, it is more efficient because we first decompose the medial axis and then use our linear-time process to propagate on the remaining image surface. Using a disjoint-set data structure, we then merge adjacent regions according to different criteria.

Several criteria were examined and tested. First, we use medial axis saddle point height to express similarity between adjacent regions and merge correspondingly. A second distinct direction we follow is to merge adjacent regions according to how fragmented they are. Last but not least, we use ultrametric contour map representation to implement hierarchical segmentation. As inter-region ultrametric dissimilarities, we use mean boundary strength on the common boundary between adjacent regions and inter-region fragmentation. All the above mentioned techniques are evaluated using the Berkeley Segmentation Dataset and compared with some state of the art algorithms. Without learning, we achieve performance near the state of the art with very practical running times.
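The region merging step with a disjoint-set data structure can be sketched as follows. This is a simplified illustration under stated assumptions: the names `DisjointSet` and `merge_regions` are hypothetical, and merging is driven by a fixed threshold on a precomputed dissimilarity between adjacent regions, which stands in for the criteria of the thesis (saddle point height, fragmentation, ultrametric dissimilarities) rather than reproducing them.

```python
class DisjointSet:
    """Union-find with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            # Path halving: point x to its grandparent while walking up.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        # Attach the smaller tree under the larger one.
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def merge_regions(num_regions, edges, threshold):
    """Merge adjacent regions whose dissimilarity is at most `threshold`.

    edges: iterable of (dissimilarity, region_a, region_b) for adjacent regions.
    Returns one representative label per region.
    """
    ds = DisjointSet(num_regions)
    # Process region pairs from most to least similar.
    for weight, a, b in sorted(edges):
        if weight > threshold:
            break
        ds.union(a, b)
    return [ds.find(r) for r in range(num_regions)]
```

For example, `merge_regions(4, [(0.1, 0, 1), (0.9, 1, 2), (0.2, 2, 3)], 0.5)` groups regions 0 and 1 together and regions 2 and 3 together, keeping the two groups apart.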

2011-2012
Charalampos Moustafelos
now Consultant at d-fine, Switzerland

In this diploma thesis we investigate large scale image retrieval. We describe the stages of image retrieval, with emphasis on visual vocabulary construction. Moreover, we discuss the problems that arise due to quantization of the descriptors and introduce several techniques that alleviate them. More specifically, one of these techniques introduces the use of synonym visual words. In order to discover synonym visual words, we construct sets of matching image patches, called feature tracks. For this purpose, we develop a novel technique for constructing feature tracks. Given a collection of geo-tagged images, we cluster these images according to (a) their locations and (b) their visual features. Hence we obtain view clusters: clusters with images that depict the same scene. Matching features are discovered through geometric verification between the images in the cluster and the reference image (center of the cluster). Given the feature tracks, we can find matching visual words. Finally, we test and evaluate the performance of this technique through retrieval experiments on the Oxford Buildings dataset.

2008-2009
George Koumoulos

The large amount of visual information and the easy access to available data through the Internet have led to an emerging need for efficient description of image content. Many techniques for fast image retrieval have been proposed in the literature, but in recent years the use of local features has come to maturity because of their efficiency.

In this thesis, the best-known methods for local detectors and descriptors are first studied. Next, an integrated system of local invariant features is implemented, with the use of already tested techniques, and different methods are combined in order to compose a powerful tool for image analysis. The experimental evaluation follows, which is done over a standard set (benchmark) of images under various transformations (photometric and geometric). All previously analyzed methods are compared via objective criteria: repeatability score, accuracy of detectors (localization), matching score and performance of descriptors.

The efficiency of local features is demonstrated in an image retrieval system with the use of large image databases. The experimental procedure provides a quantitative comparison of the aforementioned techniques. The main image retrieval mechanism is related to text retrieval methods: a visual vocabulary is created and a model vector is constructed for each image, which represents its semantic content. The image retrieval procedure is done through vector similarity measures. Different visual vocabularies (generated by various local feature methods) are compared with respect to image retrieval evaluation criteria, like precision, recall and mean Average Precision (mAP).

2007-2008
Yannis Kalantidis
now Research Scientist at Facebook Research, US

Over the recent years, the amount of digital images available online has increased rapidly. These huge multimedia collections contain diverse data and cover almost every aspect of life in terms of visual and semantic content. Proper indexing and analysis of such data is an essential process, in order to be able to retrieve its useful visual information. Searching through image libraries has become an everyday process, the same way as Google text-based search. In this diploma thesis, techniques for content-based image retrieval are presented and evaluated, and a web-based image search platform is created. Various techniques are applied, using either global or local features, such as the MPEG-7 and SURF descriptors, extracted locally, from points of interest or segmented regions. A bag-of-words model is used for indexing and geometric constraints are also taken into account. These techniques are evaluated over many common datasets, in order to test the universality of their use, towards a web-scale image retrieval system.

2007-2008
Christos Varytimidis
now Senior Machine Learning Engineer at Workable, Greece

Object detection in images is a field of image analysis that has been researched intensively during the past few years. In this diploma thesis we present a complete object detection method which was created by Viola and Jones in 2001. Haar-like features are used to describe images, while the classification of the candidate regions of an image is performed by a cascade of classifiers created by the AdaBoost algorithm in order to increase detection speed. By using this method, we trained several detectors for interior parts of a car, as well as its exterior, with sample images from the LabelMe dataset. We show and explain the choices that were made in every detector training. The results of the evaluation of every detector are presented in precision-recall and receiver operating characteristic (ROC) diagrams. We also present some conclusions on how to achieve the best results with this method. In this diploma thesis we also created a program for semi-automatic annotation of images, which detects objects in images using the presented method.

2006-2007
Giorgos Tolias
now Postdoc Researcher at Center for Machine Perception, Czech Republic

The growth of audiovisual multimedia content during the last few years has created the need for automatic feature extraction and description of this content. With the use of various descriptors, including those defined by the MPEG-7 standard, its low level information is captured. In this diploma thesis MPEG-7 visual descriptors are examined and a descriptor extraction application is developed based on the MPEG-7 eXperimentation Model (XM). This application is evaluated in order to verify its alignment to the XM. This application is then used within a high-level detection approach. A region-based technique is applied and a visual thesaurus is constructed to formalize knowledge. Neural-network detectors are trained in order to detect high-level concepts. Moreover, the utility of the well known Latent Semantic Analysis technique is investigated. The dataset of the TRECVID benchmark has been used for testing these techniques. Finally, a car exterior/interior classification problem is also tackled. Extensive experimental results are presented for each of the aforementioned problems.

2006-2007
Antonis Makrimallis

Object detection in images or sequences of images has been an important field of research in the past few years. This report studies a method for the detection of people in still images, with unconstrained scene conditions, such as complex background and uncontrolled lighting. A feature vector is used, which adopts Histograms of Oriented Gradients (HOGs) in a dense grid on the image. The classification is achieved via a linear SVM. The method is studied and evaluated through two implementations, which differ in the choice of HOG features. Moreover, the sensitivity of each implementation to the basic parameters of the method is evaluated by altering the values of these parameters. From the evaluations and observations, it is concluded that the use of HOGs captures the characteristics of the human form well and gives reliable results for the detection of people in still images.

2005-2006
Vicky Giannekou
now Software Engineer at Intralot, Greece

Recent years have seen a rapid increase of the size of digital image collections. Recent research has focused on the efficient processing, searching and retrieval of similar images from a database. In this diploma thesis we study the representation and retrieval of similar images based on the curvature scale space (CSS) method in the presence of affine transformations. More specifically, we examine the robustness of the method under affine transformations and compare its performance with the performance of other alternative methods. In order to achieve invariance, we used curve normalization based on affine length parametrization and evaluated the effectiveness of this application.

At experimental level, a database of curves (contours) of different categories of shapes has been constructed. Initially, by applying random affine transforms, we created a number of affine-transformed versions of the above curves. We then used the CSS method to represent and retrieve similar curves of shapes from the database. The resulting matching cost is a measure of comparison of curve similarity and indicates the effectiveness of the method. Finally, we study the application of an alternative normalization method instead of the affine length based normalization, which appears to improve the effectiveness of the CSS method.

2004-2005
George Koumoulos

Recently, the growing volume of available video data has led to an emerging need for automatic analysis, synopsis and extraction of information from videos. Every video sequence consists of a number of shots, each of them containing temporally associated frames, while contiguous shots are connected to each other with some type of transition at their boundaries. The first step for any kind of video analysis seems to be the detection of these boundaries followed by a temporal segmentation of video, while the next and more important step is related to the synopsis or the summarization of video. This is achieved by selecting a certain number of characteristic frames (key-frames) from each shot, so that the content of the shot is represented in a short and also meaningful way.

In this diploma thesis, a video summarization system is being constructed, encapsulating shot change detection, feature extraction and key-frame selection. This system combines methods working directly on MPEG compressed domain and automatically locates shot changes of video. Features from each frame of the sequence are then extracted and their values produce a point in the feature space. Therefore the entire video sequence is represented by a trajectory in this multidimensional space. According to mathematical methods defining the curvature of a curve, the characteristic points of the curve are determined, in areas where locally extreme behavior is observed. The system automatically extracts the key-frames via points of local maxima and minima of curvature, irrespective of the video content.

In this work, two different ways of computing curvature are compared, as well as two other existing methods that use characteristic points of a curve. Experiments concern the effectiveness of these methods in terms of video key-frame selection. We attempt to improve the performance of these methods by applying the computational model of Visual Attention, in order to extract features from the salient regions of the image. Finally, through the results of these experiments we aim to indicate capabilities, disadvantages and possibilities for improving and expanding this video summarization system.
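The idea of selecting key-frames at curvature extrema of the feature trajectory can be illustrated with the minimal sketch below. It is an assumption-laden simplification: derivatives are approximated by central finite differences, only local maxima of curvature are kept (the thesis also uses minima), and the function names `curvature` and `key_frames` are hypothetical. It does not correspond to either of the two curvature computations compared in the thesis.

```python
import math


def curvature(points):
    """Discrete curvature along a trajectory in an n-dimensional feature space.

    points: list of per-frame feature vectors (lists of floats).
    Returns one curvature value per frame; the two endpoints are set to 0.
    """
    kappa = [0.0] * len(points)
    for t in range(1, len(points) - 1):
        prev, cur, nxt = points[t - 1], points[t], points[t + 1]
        # Central finite differences for the first and second derivatives.
        v = [(n - p) / 2.0 for p, n in zip(prev, nxt)]
        a = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        vv = sum(x * x for x in v)
        aa = sum(x * x for x in a)
        va = sum(x * y for x, y in zip(v, a))
        if vv > 0:
            # kappa = sqrt(|v|^2 |a|^2 - (v . a)^2) / |v|^3, valid in any dimension.
            kappa[t] = math.sqrt(max(vv * aa - va * va, 0.0)) / vv ** 1.5
    return kappa


def key_frames(points):
    """Return indices of frames where curvature has a strict local maximum on the left."""
    k = curvature(points)
    return [
        t
        for t in range(1, len(k) - 1)
        if k[t] > k[t - 1] and k[t] >= k[t + 1]
    ]
```

On a trajectory that moves in a straight line and then turns sharply, the only frame selected is the one at the turn, where the content changes direction in feature space.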

# Research grants [European]

2020-2022
Investigator GRAPES
H2020-860843 / MSCA-ITN

GRAPES aims at considerably advancing the state of the art in Mathematics, Computer-Aided Design, and Machine Learning in order to promote game changing approaches for generating, optimizing, and learning 3D shapes, along with a multisectoral training for young researchers. Recent advances in the above domains have solved numerous tasks concerning multimedia and 2D data. However, automation of 3D geometry processing and analysis lags severely behind, despite their importance in science, technology and everyday life, and the well-understood underlying mathematical principles. The CAD industry, although well established for more than 20 years, urgently requires advanced methods and tools for addressing new challenges.

The scientific goal of GRAPES is to bridge this gap based on a multidisciplinary consortium composed of leaders in their respective fields. Top-notch research is also instrumental in forming the new generation of European scientists and engineers. Their disciplines span the spectrum from Computational Mathematics, Numerical Analysis, and Algorithm Design, up to Geometric Modeling, Shape Optimization, and Deep Learning. This allows the 15 PhD candidates to follow either a theoretical or an applied track and to gain knowledge from both research and innovation through a nexus of inter-sectoral secondments and Network-wide workshops.

Horizontally, our results lead to open-source, prototype implementations, software integrated into commercial libraries as well as open benchmark datasets. These are indispensable for dissemination and training but also to promote innovation and technology transfer. Innovation relies on the active participation of SMEs, either as a beneficiary hosting an ESR or as associate partners hosting secondments. Concrete applications include simulation and fabrication, hydrodynamics and marine design, manufacturing and 3D printing, retrieval and mining, reconstruction and visualization, urban planning and autonomous driving.

2008-2011
Principal Investigator and Coordinator WeKnowIt
FP7-215453 / IP

The main objective of WeKnowIt is to develop novel techniques for exploiting multiple layers of intelligence from user-contributed content, which together constitute Collective Intelligence, a form of intelligence that emerges from the collaboration and competition among many individuals, and that seemingly has a mind of its own. To this end, input from various sources is analyzed and combined: from digital content items and contextual information (Media Intelligence), massive user feedback (Mass Intelligence), and users' social interaction (Social Intelligence) so as to benefit end-users (Personal Intelligence) and organizations (Organizational Intelligence).

The automatic generation of Collective Intelligence constitutes a departure from traditional methods for information sharing, since for example, semantic analysis has to fuse information from both the content itself and the social context, while at the same time the social dynamics have to be taken into account. Such intelligence provides added-value to the available content and renders existing procedures and workflows more efficient.

2008-2010
Principal Investigator JUMAS
FP7-214306 / IP

Public administrations represent the largest information bound professional communities: among them the judicial sector is one of the largest, where the needs of cooperation are critical, creating an exceedingly large improvement potential through adoption of novel content management techniques and development of new solutions for its specific needs of retrieval and semantic analysis. This potential is even larger considering the growing transnational cooperation also among several national law systems, highlighting the need to adapt the technological profiles of new member states.

In this context, JUMAS envisages an advanced knowledge management system able to extract semantics from multimedia data. JUMAS is tailored to managing situations where multiple cameras and audio sources are used to record assemblies and to reconstructing debate sequences for future consultation.

2006-2009
Principal Investigator Imagination
FP6-034626 / STREP

The main objective of IMAGINATION is to bring digital cultural and scientific resources closer to their users, by making user interaction image-based and context-aware. Our ultimate aim is to enable users to navigate through digital cultural and scientific resources through their images. IMAGINATION will provide a novel image-based access method to digital cultural and scientific resources. It will reduce complexity by providing an intuitive navigation method.

IMAGINATION will facilitate an interactive and creative experience providing an intuitive navigation through images and parts of images. To do so IMAGINATION will combine, apply and improve existing techniques to provide a new way of navigation through cultural heritage multimedia archives. It will exploit the context of resources stored in its knowledge space by combining text-mining, image segmentation and image recognition algorithms. This combination will cause a synergy effect and will result in semi-automatically generated, high-level semantic metadata.

The focus of IMAGINATION is on indexing, retrieving and exploring non-textual complex objects and will apply knowledge technologies and visualization techniques for improved navigation and access to multimedia collections. Comprehensive tool support (including an ontology editor and a semi-automated image annotation tool) will be provided, together with an easy-to-use web-based interface which visualizes the contextualized content stored in the IMAGINATION knowledge space.

A major outcome of the project will be the new and intuitive approach of navigation through images and a set of technologies and tools to support the annotation of images by manual, semi-automatic and automatic techniques.

2006-2009
Principal Investigator BOEMIE
FP6-027538 / STREP

BOEMIE will pave the way towards automation of the process of knowledge acquisition from multimedia content, by introducing the notion of evolving multimedia ontologies, which will be used for the extraction of information from multimedia content in networked sources, both public and proprietary. BOEMIE advocates a synergistic approach that combines multimedia extraction and ontology evolution in a bootstrapping process involving, on the one hand, the continuous extraction of semantic information from multimedia content in order to populate and enrich the ontologies and, on the other hand, the deployment of these ontologies to enhance the robustness of the extraction system. The ambitious scope of the BOEMIE project and the proven specialized competence of the carefully composed project consortium ensure that the project will achieve the significant advancement of the state of the art needed to successfully merge the component technologies.

2006-2009
Principal Investigator MESH
FP6-027685 / IP

Multimedia Semantic Syndication for Enhanced News Services (MESH) will apply multimedia analysis and reasoning tools, network agents and content management techniques to extract, compare and combine meaning from multiple multimedia sources, and produce advanced personalized multimedia summaries, deeply linked among them and to the original sources to provide end users with an easy-to-use "multimedia mesh" concept, with enhanced navigation aids. A step further will empower users with the means to reuse available content by offering media enrichment and semantic mixing of both personal and network content, as well as automatic creation from semantic descriptions.

Encompassing the whole system, dynamic usage management will be included to facilitate agreement between content chain players (content providers, service providers and users). In a sentence, the project will create multimedia content brokers acting on behalf of users to acquire, process, create and present multimedia information personalized (to user) and adapted (to usage environment). These functions will be fully exhibited in the application area of news, by creation of a platform that will unify news organizations through the online retrieval, editing, authoring and publishing of news items.

2006-2010
Principal Investigator X-Media
FP6-26978 / IP

X-Media addresses the issue of knowledge management in complex distributed environments. It will study, develop and implement large scale methodologies and techniques for knowledge management able to support sharing and reuse of knowledge that is distributed in different media (images, documents and data) and repositories (data bases, knowledge bases, document repositories, etc.) or that is inaccessible for current systems, which cannot capture the knowledge implicit across media. All the developed methodologies aim at seamlessly integrating with current work practices. Usability will be a major concern together with ease of customization for new applications.

Technologies will be able to support knowledge workers in an effective way, (i) hiding the complexity of the underlying search/retrieval process, (ii) resulting in a natural access to knowledge, (iii) allowing interoperability between heterogeneous information resources and (iv) including heterogeneity of data type (data, image, texts). The expected impact on organizations is to dramatically improve access to, sharing of and use of information by humans as well as by and between machines. Expected benefits are a dramatic reduction of management costs and increasing feasibility of complex knowledge management tasks.

2006-2008
Principal Investigator K-Space
FP6-027026 / NoE

K-Space is a network of leading research teams from academia and industry conducting integrative research and dissemination activities in semantic inference for automatic and semi-automatic annotation and retrieval of multimedia content. K-Space exploits the complementary expertise of project partners, enables resource optimization and fosters innovative research in the field. The aim of K-Space research is to narrow the gap between low-level content descriptions that can be computed automatically by a machine and the richness and subjectivity of semantics in high-level human interpretations of audiovisual media: The Semantic Gap. Specifically, K-Space integrative research focuses on three areas:

(1) Content-based multimedia analysis: Tools and methodologies for low-level signal processing, object segmentation, audio/speech processing and text analysis, and audiovisual content structuring and description.

(2) Knowledge extraction: Building of a multimedia ontology infrastructure, knowledge acquisition from multimedia content, knowledge-assisted multimedia analysis, context based multimedia mining and intelligent exploitation of user relevance feedback.

(3) Semantic multimedia: Knowledge representation for multimedia, distributed semantic management of multimedia data, semantics-based interaction with multimedia and multimodal media analysis.

An objective of the Network is to implement an open and expandable framework for collaborative research based on a common reference system.

2004-2007
Principal Investigator MUSCLE
FP6-507752 / NoE

Due to the convergence of several strands of scientific and technological progress we are witnessing the emergence of unprecedented opportunities for the creation of a knowledge driven society. Indeed, databases are accruing large amounts of complex multimedia documents, networks allow fast and almost ubiquitous access to an abundance of resources and processors have the computational power to perform sophisticated and demanding algorithms. However, progress is hampered by the sheer amount and diversity of the available data. As a consequence, access can only be efficient if based directly on content and semantics, the extraction and indexing of which is only feasible if achieved automatically.

Given the above, we feel that there is both a need and an opportunity to systematically incorporate machine learning into an integrated approach to multimedia data mining. Indeed, enriching multimedia databases with additional layers of automatically generated semantic metadata as well as with artificial intelligence to reason about these (meta)data, is the only conceivable way that we will be able to mine for complex content, and it is at this level that MUSCLE will focus its main effort. Realizing this vision will require breakthrough progress to alleviate a number of key bottlenecks along the path from data to understanding.

2004-2007
Principal Investigator aceMedia
FP6-001765 / IP

Long term market viability of multimedia services requires significant improvements to the tools, functionality, and systems to support target users. aceMedia seeks to overcome the barriers to market success which include user difficulties in finding desired content, limitations in the tools available to manage personal and purchased content, and high costs to commercial content owners for multimedia content processing and distribution, by creation of means to generate semantic-based, context and user aware content, able to adapt itself to user preferences and environments.

aceMedia will build a system to extract and exploit meaning inherent to the content in order to automate annotation and to add functionality that makes it easier for all users to create, communicate, find, consume and re-use content. aceMedia targets knowledge discovery and embedded self-adaptability to enable content to be self organizing, self annotating, self associating; more readily searched (faster, more relevant results); and adaptable to user requirements (self reformatting).

aceMedia introduces the novel concept of the Autonomous Content Entity (ACE), which has three layers: content, its associated metadata, and an intelligence layer consisting of distributed functions that enable the content to instantiate itself according to its context (e.g. network, user terminal, user preferences). The ACE may be created by a commercial content provider, to enable personalized self-announcement and automatic content collections, or may be created in a personal content system in order to make summaries of personal content, or automatically create personal albums of linked content.

The ACE concept will be verified by two user focused application prototypes, enabled for both home network and mobile communication environments. This enables the aceMedia partners to evaluate the technical feasibility and user acceptance of the ACE concept, with a view to market exploitation after the end of the project.

2002-2004
Principal Investigator and Coordinator MIRROR
IST-2001-32504

MIRROR aims to create a collection of components and tools for a distributed knowledge management system that will support physical and social interactions. MIRROR aims to establish a Europe-wide community of practice for learning and innovation in the area of natural sciences museums by developing a novel learning methodology and by implementing state-of-the-art tools, techniques and systems.

2001-2003
Principal Investigator and Technical Leader FAETHON
IST-1999-20502

The overall objective of FAETHON project is to develop an integrated information system that offers enhanced search and retrieval capabilities to users of digital audiovisual (a/v) archives. This novel system will exploit the advances in handling a/v content and related metadata, as introduced by MPEG-4 and worked out by MPEG-7, to offer advanced access services characterized by the tri-fold "semantic phrasing of the request (query)", "unified handling" and "personalized response".

From a technical point of view, the proposed system will play the role of an intermediate access server residing between the end users and multiple heterogeneous audiovisual archives organized according to new MPEG standards. Various types of interfacing modules will be designed and implemented to support smooth communication of the intermediate server with the a/v archives. The major final product will be an integrated software system consisting of two subsystems, semantic unification and personalization, together with two types of interfaces: those between the system and the individual a/v archives, and those between the system and the end users.

1997-2001
Researcher PHYSTA
Training and Mobility of Research

Systematic principles for integrating symbolic and subsymbolic processing will be developed in the project. Key aims are to ensure that the resulting total hybrid system retains desirable properties of both processing levels. On the one side the signal processing abilities, robustness and learning capability of neural networks should be preserved. On the other side advantage should be taken of the ability of rule-based systems to exploit high level knowledge and existing algorithms and to explain (to a user) why conclusions were reached in particular cases. The methodologies to be developed in the project will be tested in a challenging application related to human computer interaction, which is recognition of emotion based on both voice and visual cues. Low level features will be extracted from signals using neural networks and subsequent formulation of rules will provide a conceptual framework, substantial for emotion analysis.

1998-1999
Researcher MODULATES

This EU-funded project aims to establish a pilot European multimedia network and organization with the purpose of motivating and encouraging school pupils to take up a career in technology and related businesses and industries, and to assist with the learning of languages within the context of learning about technological subjects in school at both primary and secondary level. The network will take the structural form of a group of European universities with skills in the development and use of multimedia materials and experience in teacher training working together with teachers and pupils in schools to provide a unified range of user configurable, multi-lingual, multimedia courses on topics in advanced technology for primary and secondary schools. Course materials will be developed in at least 6 topics with all materials available in English, German, and Greek.

The project will be of 2 years duration starting in early 1998. ISDN links will be established between the 3 participating universities and from each of the universities to its group of schools (14 schools in total). Telecommunications service providers will establish these connections and be involved as partners in the project to assist with technical developments as well as with commercial aspects for exploitation of results. The project will form the pilot study for the creation of a self-funding European educational multimedia network and organization for the teaching and learning of advanced technology which has the capability of assisting language teaching. In the 2 years following the project (2000-2002) it is planned to provide access to MODULATES materials for over 1,000 schools throughout the EU.

1996-1999
Researcher MCUBE
ESPRIT 22266 Multimedia Support Centers

M.CUBE is an Esprit project co-financed by the European Commission. It promotes the production of European multimedia and supports producers, companies, publishers and other cultural and technological bodies interested in applying multimedia technology to the cultural field. The aim of the project is to create a multimedia support network, based on four Mediterranean regions, dedicated to fostering the development of European multimedia applications in the area of culture and arts.

The main objectives of M.CUBE are to (i) improve the competitiveness of the European multimedia industry, enhancing the quality of its products, improving the business capabilities of the enterprises, and attracting funds for new developments, (ii) create a critical mass to penetrate the consumer market at international level, putting together many small developers to address the market through big distributors, and (iii) exploit the enormous European potential in terms of cultural contents. Supplying multimedia support services to cultural users is a strategic task for M.CUBE. Museums, galleries, collections, public administrations, etc. are pilot users and clients of advanced technological solutions, based on the application of multimedia to culture and arts.

The M.CUBE contribution is to: First, contribute to the development of a whole editorial line, defining technical and artistic contents, market targets and channels, certification process, methodological recommendations, policy and programmes. Second, create a network of manufacturers, through setting up local associations and the development of international exchanges. Third, develop business opportunities with the final aim of stimulating new markets. Fourth, provide a large set of services through intermediation and direct delivery, technological and manufacturing services, training, access to multimedia repositories, certification, consulting activities, information desk, etc.

1993-1996
Human Capital and Mobility

The project revolved around information extraction techniques from video sequences in order to detect and minimize data unnecessary to transmit. Our contribution was mainly in the fields of multiresolution techniques and hierarchical representation of images and in the use of ROIs (Regions of Interest) for efficient coding and compression of images.

# Research grants [French]

2018-2019
Investigator BnF
MIC/BnF

For several years, Bibliothèque nationale de France (BnF) has been pursuing its policy of enriching its Gallica digital library with specialized collections, the main feature of which is to create batches of still images of printed or handwritten text. Thus, important batches of books, manuscripts, newspapers and magazines, photographs, maps, stamps, coins, sound recordings and scores have been added to Gallica. The impossibility of manually indexing all iconographic collections by unit encourages the adoption of an image indexing solution based on the automatic or semi-automatic analysis of their visual content. The objective is to specialize an image classification tool in heritage corpora in order to add semantics to the image analysis process. The challenge is to make the richness of these collections accessible to as many people as possible (general public and professionals), through intuitive and ergonomic tools and interfaces.

In this context, the objective of the project is to investigate automatic classification of images in Gallica according to their visual content, and to produce annotations that could enrich existing metadata. The experience of LinkMedia in visual indexing and deep learning makes it possible to consider an experiment on the particular contents of the BnF. The project consists of the following. (1) Identify an experimental iconographic corpus from Gallica digital collections. (2) Build a prototype visual classification engine on this corpus. (3) Evaluate the performance of the prototype by controlled evaluation using ground truth and by the Gallica Studio community.

2018-2021
Principal Investigator
CIFRE-2017-1744

Learning from few training samples is a topic that enjoys great scientific and industrial interest. In fact, deep learning approaches developed and advanced in recent years have typically relied on huge amounts of data. Most recently, given the impressive performance of deep models in various large-scale tasks, the scientific community has started exploring the feasibility of these powerful techniques for other tasks with reduced amounts of available data. There are plenty of cases where access to high volumes of data is potentially difficult or expensive or where the number of available training samples is intrinsically low. For such cases, the learning strategy of these multi-million-parameter architectures needs to be rethought in order to allow the networks to squeeze out the maximum amount of information from the few available samples.

This is a CIFRE PhD thesis project aiming to study architectures and learning methods most suitable for object recognition from few samples and to validate these approaches on multiple recognition tasks and use-cases in aerial imagery. In particular, use cases include (1) target objects being small in the image; (2) recognizing objects of the same class; (3) recognizing instances of the same object. The operational context of the recognition tasks requires the ability to recognize objects from a small image corpus with possibly large variations in illumination, orientation and context of the object of interest.

2018-2020
Principal Investigator MobilAI
Images & Réseaux AAP-PME-2017

The ability of our mobile devices to process visual information is currently not limited by their camera or computing power but by the network. Many mobile apps suffer from long latency due to data transmitted over the network for visual search. MobilAI aims to provide fast visual recognition on mobile devices, offering quality user experience whatever the network conditions. The idea is to transfer efficient deep learning solutions for image classification and retrieval onto embedded platforms such as smart phones. The intention is to use such solutions in B2B and B2C application contexts, for instance recognizing products and ordering online, accessing information about artifacts in exhibitions, or identifying identity documents. In all cases, visual recognition is performed on the device, with minimal or no access to the network.

# Research grants [Greek]

2011-2014
Investigator IS-Helleana
EPAN II 09SYN-72-922

IS-Helleana aims to implement modern Semantic Web technologies for the development of an integrated system for unified access, management, search and interactive presentation of the Greek audiovisual inventory. The system will enable audiovisual content providers to display their content in a single interoperable way, within the context of a generalized, semantically rich display of the Greek audiovisual inventory, both in Greece and internationally, and users to effectively search for content and participate in the online interactive services provided by audiovisual content providers. The basic tool to achieve this goal is an online platform for semantic integration, management, enrichment and visualization of the audiovisual inventory.

2006-2008
Investigator Eikonognosia
Image, Sound and Language Processing

The documentation and analysis of Byzantine Art is an important component of the overall effort to maintain cultural heritage and contributes to learning and comprehending one's historical path. Efficient publishing of the multi-dimensional and multifaceted information that is necessary for the complete documentation of artworks should draw on a good organization of the data.

Eikonognosia is a research project funded by the Greek General Secretariat of Research and Technology (GSRT) that aims to efficiently organize and publish detailed information about icons in the World Wide Web. Information derived from the analysis conducted in the Art Diagnosis Center of Ormylia Foundation is taken as a case study.

Eikonognosia provides the means for organizing detailed and multidimensional information about Byzantine icons in a way that is compatible with international standards (CIDOC-CRM - ISO 21127:2006) and allows for easy retrieval of data with advanced semantic web technologies. The ultimate goal for Eikonognosia is to foster the cultural heritage community by providing an integrated framework that helps to facilitate organization, retrieval and presentation of data from the cultural heritage domain.

2006-2008
Principal Investigator OntoMedia
Semantic Multimedia Content Analysis Using Knowledge Technologies
PENED
2006-2007
Principal Investigator DELTIO
Image, Sound and Language Processing

The objective of the project is the development of innovative techniques for the representation, analysis and extraction of semantic information for the manipulation of multimedia content, with emphasis on their application to television news bulletins. Research and development will concentrate on techniques that enable the automatic analysis of multimedia content and the extraction of knowledge, resulting in the (semi-)automatic creation of metadata, as well as provide support for smart, semantic search services. The final output of the project will be a system for the analysis of television news bulletins, which will provide intelligent search functionalities in archives of digital television material.

2004-2007
Principal Investigator VisualAsset
EPAN-03DSBEPRO-44

The main scope of the proposed framework is the development of innovative techniques for the analysis, extraction of semantic information and management of multimedia content in general and multimedia documents in particular. The proposed system, called Visual Asset, will consist of a number of distinct subsystems that will undertake the different stages of processing, analysis, storage and access to content provided by multimedia documents (texts, images, video, audio and 3D representations), summarized in the following:

(1) Image and video processing subsystem, aiming at automatic extraction of low level characteristics (color, texture, form, movement, speed, etc.) and extraction of characteristic parts of objects (object segmentation), e.g. segmentation of persons in a video sequence. These results will be used in the system integrating visual and textual information via the use of ontologies, but also in the effective search and retrieval system, based on visual information.

(2) Integration of visual and textual information subsystem. The object of this subsystem will be the representation of knowledge with the use of ontologies, the analysis of multimedia content based on visual and textual information, and the production of metadata with a common representation for both textual and visual information.

(3) Modeling and logical analysis of documents subsystem, aiming at automatic categorization and creation of effective search and retrieval applications, providing advanced functionalities in organization and management of big document volumes. The subsystem will integrate visual and textual information results and will use the notion of context in a document in order to fulfill the tasks of automatic categorization, logical analysis (table of contents, automatic recognition of chapters, titles, notes, reports in images, video, etc) and efficient search and retrieval.

2003-2005
Principal Investigator MultiMine
EPAN

The network, consisting of six Greek academic institutions and two SMEs, will organize a series of training activities in order to introduce knowledge technologies for multimedia content in Greek research and commercial institutions.

1999-2001
Researcher PANORAMA
EPET II - EKBAN

PANORAMA aims at the development of a system for efficient search and mining of audiovisual data from large distributed multimedia databases through several types of networks (Internet, intranets, etc.). The main objectives of the project are the interoperability of databases and the availability of software products and services on networks for open multimedia access. Digitization of archive assets should provide all possible information (video, images, sound, texts, etc.) for wide public and professional use.

The main subjects for implementation are content-based retrieval from multimedia databases as well as search and extraction of characteristic scenes from video data for insertion in synthetic environments. Definition of audiovisual objects should be adopted within the framework of MPEG-4 and MPEG-7 standards in order to focus the project on the application of emerging technologies. Intellectual property rights and copyrights, preservation and security of the information are points that receive specific attention, using innovative methods for protection such as watermarking for video data authentication and several encryption techniques.

1997-2000
Researcher VoD
SYN-96

The objective of the project is the design and prototype implementation of a Video on Demand (VoD) and Near Video on Demand (NVoD) system based on MPEG-2 digital video/audio technology. IVML and INFOLAB are exploiting their long experience in digital video applications and related services to bring to the market the final product - hardware and software - called "InTV". This product serves hotels and hospitals, providing them with the ability to offer VoD services to their clients on a pay-per-view basis.

InTV uses state of the art technologies such as high quality MPEG-2 video and is built on mature and evolving platforms including the Oracle Video Server. Movies are stored in MPEG-2 format in the video server's hard disks, and video streams are transmitted to the subscribers' rooms using a local area network. End-users are able to watch movies on a terminal (monitor or TV set) connected to a set-top-box and send messages from the set-top box back to the video server, using an infrared remote control.

The system provides both VoD and NVoD channels, including broadcast TV channels, and supports interactive movie selection as well as standard VCR controls (play/stop, pause, fast forward/rewind etc.) for the VoD service. The integrated InTV system offers a variety of features that make it attractive for mid and large sized hotel type enterprises, including scalable architecture, easy installation and maintenance, interactive client service, pre-scheduled movies and flexible charging schemes. Optionally, it integrates other advanced features and on-line services, such as connection to Internet, web browsing, e-mail, etc.

1997-1999
Researcher IVML
EPET II Service Providing Laboratories

This project aims at the improvement of the IVML with the acquisition of modern equipment in order to provide better cooperation with other Greek partners in matters of editing, analysis and synthesis of images and video sequences as well as transmission, storage and retrieval in multimedia environments. This includes certification along the ISO 9001 standard.

1995-1997
Researcher NIKA
EKBAN-504

The aim of the project was the design and operation of an integrated PACS system for the best Greek Hospital for the treatment of Cardiovascular Diseases. The PACS has the ability to transmit and store radiology images, ultrasound still images, ultrasound video images of the heart, gamma-camera images and angiography video.

# Professional activities

2019
Workshops Chair ACM-MM 2019
27th ACM Multimedia Conference
Nice, France Oct 21-25, 2019
2017
Area Chair ACM-MM 2017
Multimedia Search and Recommendation
25th ACM Multimedia Conference
Mountain View, CA, USA Oct 23-27, 2017
2017
Area Chair EUSIPCO 2017
Image, Video, and Multimedia Processing
25th European Signal Processing Conference
Kos, Greece Aug 28-Sep 2, 2017
2009
Program Chair CIVR 2009
ACM International Conference on Image and Video Retrieval
Santorini, Greece Jul 8-10, 2009
2009
Program Chair MMM 2009
15th International Multimedia Modeling Conference
Sophia-Antipolis, France Jan 7-9, 2009
2006
Organizer SMAR/VIE 2006
Special Session on Semantic Multimedia Analysis for Annotation and Retrieval
International Conference on Visual Information Engineering
Bangalore, India Sep 26-28, 2006
2006
Local Organizing Committee member ICANN 2006
International Conference on Artificial Neural Networks
Athens, Greece Sep 10-14, 2006
2005
Co-Organizer IMS/ICANN 2005
Special Session on Intelligent Multimedia and Semantics
International Conference on Artificial Neural Networks
Warsaw, Poland Sep 11-15, 2005
2005
Local Arrangements MMSW/ESWC 2005
Workshop on Multimedia and the Semantic Web
2nd European Semantic Web Conference
Heraklion, Greece May 29-Jun 1, 2005
2004
Session Chair CIVR 2004
EU Project Session
ACM International Conference on Image and Video Retrieval
Dublin, Ireland Jul 21-23, 2004
2003
Programme Co-Chair WIAAC 2003
Athens, Greece Apr 3, 2003
2001
Liaison with Industry, Local Organization Committee VLBV 2001
International Workshop on Very Low Bitrate Video Coding
Athens, Greece Oct 11-12, 2001
1997
Local Arrangements WANN 1997
International Workshop on Applications and Perspectives of Neural Network Technology in Greece and the Broader Balkan Area
Athens, Greece Feb 21, 1997
