Making sense of hundreds of photos, or several hours of video clips, could get easier thanks to automatic semantic annotation.
Ever arrived home from a trip with several hundred photos, or several hours of video clips? Unfortunately, for anyone who owns a digital camera or video recorder, this chore is all too familiar. Annotating and managing personal digital content can be a huge burden. Current digital cameras support location and time tagging, but adding meaningful labels, known as 'semantic tagging', must still be performed manually, which is time-consuming.
Consumers have a wide range of multimedia devices in their home networks - set-top boxes, PVRs, home media hubs and PCs - for the storage and display of their digital content. A further chore for the user is to locate content across a multitude of personal devices.
Sharing of photos is also hugely popular - as witnessed by the growth of websites such as Flickr. But which pictures should be shared with whom? Someone arriving home from holiday may want to share pictures with friends and family. But pictures from a wild party may be circulated only to a select group of friends. This requires manual sorting of photos.
Low-level descriptors such as colour and texture features are readily machine-computable and provide the capability to search for images by visual similarity. However, humans often prefer to search for content using familiar, everyday terms in natural language (referred to here as semantic queries).
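To make the first point concrete, here is a minimal sketch of a machine-computable colour descriptor and a visual-similarity measure. The coarse four-bins-per-channel RGB quantisation and the histogram-intersection metric are simplified stand-ins chosen for the example, not aceMedia's actual descriptors:

```python
# Illustrative sketch: a coarse RGB colour histogram as a low-level
# descriptor, compared by histogram intersection. The 4-bin-per-channel
# quantisation and the metric are simplified stand-ins, not aceMedia's
# actual descriptors.

def colour_histogram(pixels, bins_per_channel=4):
    """Normalised colour histogram from an iterable of (r, g, b) in 0-255."""
    step = 256 // bins_per_channel
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[idx] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def similarity(h1, h2):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint."""
    return sum(min(a, b) for a, b in zip(h1, h2))

# Two mostly-blue "sea" images rank as more similar to each other
# than either does to a mostly-green "grass" image.
sea_a = colour_histogram([(10, 40, 200)] * 90 + [(200, 200, 200)] * 10)
sea_b = colour_histogram([(20, 50, 210)] * 80 + [(190, 190, 200)] * 20)
grass = colour_histogram([(30, 180, 40)] * 100)
```

A similarity search simply ranks stored images by this score against the query image, which is exactly what such descriptors support well, and what semantic queries go beyond.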
The aceMedia project, part-funded under the EC's IST programme, took up the challenge of bridging the so-called semantic gap between low-level visual descriptors and high-level semantics. aceMedia developed an end-to-end content management system based on the requirements of consumers and professional content providers. Once images and video can be automatically annotated, even to a limited extent, intelligent applications can be built to help users manage their content.
Ontologies are a key ingredient in bridging the semantic gap. An ontology is a shared, formal description of the concepts, entities and relations of a real-world domain, such as beach holidays or motor sports. Ontologies are specified using standard languages such as RDFS or OWL, which differ in the features they support. Populating a domain ontology with the instances and relationships that refer to the content a user is actually interested in requires iterative development with a domain expert. Annotating content according to a specified ontology yields metadata ('tags') that are consistent and can therefore be used reliably by downstream applications such as search.
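The payoff of consistent, ontology-based tags can be shown in miniature. Real systems express the hierarchy in RDFS or OWL and use a description-logic reasoner; the concept names and the tiny subclass table below are invented for the example:

```python
# Miniature domain ontology as a subclass hierarchy. A real system
# would express this in RDFS/OWL and query it with a reasoner; the
# concepts here are invented for illustration.

SUBCLASS_OF = {
    "Beach": "NaturalScene",
    "Sea": "WaterBody",
    "Lake": "WaterBody",
    "WaterBody": "NaturalScene",
    "NaturalScene": "Scene",
}

def ancestors(concept):
    """All superclasses of a concept, walking up the hierarchy."""
    out = []
    while concept in SUBCLASS_OF:
        concept = SUBCLASS_OF[concept]
        out.append(concept)
    return out

def matches(tag, query_concept):
    """A tag satisfies a query if it names the concept or a subclass of it."""
    return tag == query_concept or query_concept in ancestors(tag)

# An image tagged "Sea" is retrieved by a broader query for "WaterBody".
```

Because every annotation is drawn from the same hierarchy, a query for a general concept automatically retrieves content tagged with any of its specialisations.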
The first step in the image annotation process is to compute a set of low-level visual descriptors based on colour, texture, edge and shape information. Segmentation of the image into homogeneous regions is performed, based on the low-level descriptions.
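The segmentation step can be sketched with a deliberately simple region-growing pass. Real systems group pixels by the richer colour and texture descriptors mentioned above; this toy version uses a single grey level and a threshold, purely to show the idea of grouping similar, connected pixels into homogeneous regions:

```python
# Toy segmentation sketch: flood fill that groups 4-connected pixels
# whose grey level is close to the region's seed. The single-channel
# threshold stands in for the real colour/texture homogeneity tests.

def segment(image, tol=0.2):
    """image: 2-D list of grey levels in [0, 1]. Returns a 2-D list of region ids."""
    h, w = len(image), len(image[0])
    labels = [[-1] * w for _ in range(h)]
    next_id = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] != -1:
                continue  # pixel already belongs to a region
            seed = image[sy][sx]
            labels[sy][sx] = next_id
            stack = [(sy, sx)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and labels[ny][nx] == -1
                            and abs(image[ny][nx] - seed) <= tol):
                        labels[ny][nx] = next_id
                        stack.append((ny, nx))
            next_id += 1
    return labels
```

On a tiny image with a dark left block and a bright right column, the two areas come out as two distinct regions, which are then the units passed on to region classification.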
Scene classification uses a machine learning-based classifier (Support Vector Machine) to classify the images into different scene types (e.g. indoor versus outdoor). Training of the classifier is performed on a set of training images using both positive and negative examples. Knowledge-assisted segment classification then labels the regions in the image by discovering the instances of the ontology concepts (e.g. beach, sky, sea) in the content.
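The training setup, positive and negative examples yielding a decision boundary, can be illustrated without any library dependency. aceMedia used a Support Vector Machine; the perceptron below is a simpler stand-in for the same supervised scheme, and the two features (overall brightness, sky fraction) are invented for the sketch:

```python
# Toy binary scene classifier trained from positive and negative
# examples. aceMedia used an SVM; this perceptron is a dependency-free
# stand-in for the same train-then-classify idea. The two features
# (brightness, sky fraction) are invented for the example.

def train_perceptron(samples, epochs=50):
    """samples: list of (feature_vector, label) with label +1 or -1."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in samples:
            # Misclassified (or on the boundary): nudge weights towards x.
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def classify(w, b, x):
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "outdoor" if score > 0 else "indoor"

training = [
    ([0.9, 0.8], +1),  # bright, lots of sky -> outdoor (positive example)
    ([0.8, 0.7], +1),
    ([0.3, 0.1], -1),  # dim, little sky -> indoor (negative example)
    ([0.4, 0.0], -1),
]
w, b = train_perceptron(training)
```

An SVM improves on this by maximising the margin between the two classes and handling non-linear boundaries via kernels, but the workflow, label training examples, fit, then classify new images, is the same.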
A key feature mentioned by users searching for images is the presence of people; in user studies carried out within aceMedia, users were interested in queries such as finding 'images with many people on a beach'. The person detector finds standing or walking people by scanning the image with a detection rectangle at multiple scales, computing image gradient information to detect the human profile.
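The multi-scale scanning loop itself is straightforward to sketch. The window size, stride and, above all, the scoring function below are placeholders: a real detector scores each window with learned gradient features (in the spirit of HOG descriptors), not this toy brightness test:

```python
# Sketch of a multi-scale sliding-window scan. The scoring function is
# a toy placeholder; a real person detector scores each window with
# learned gradient features rather than average brightness.

def scan(image, window=(4, 8), scales=(1, 2), stride=2, threshold=0.5):
    """image: 2-D list of grey levels in [0, 1]. Returns hit rectangles
    as (left, top, width, height)."""
    h, w = len(image), len(image[0])
    hits = []
    for s in scales:
        ww, wh = window[0] * s, window[1] * s  # window width, height
        for top in range(0, h - wh + 1, stride):
            for left in range(0, w - ww + 1, stride):
                patch = [row[left:left + ww] for row in image[top:top + wh]]
                score = sum(sum(r) for r in patch) / (ww * wh)  # toy score
                if score > threshold:
                    hits.append((left, top, ww, wh))
    return hits

# A bright 4x8 block in the top-left corner of an otherwise dark image:
demo = [[1.0] * 4 + [0.0] * 6 for _ in range(8)] + [[0.0] * 10 for _ in range(2)]
hits = scan(demo)
```

Scanning at several scales is what lets one trained window size detect people both near the camera and far from it; overlapping hits are typically merged in a post-processing step.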
For short- or medium-range shots of people, the location of faces in images can be determined by detecting facial features. Variations in pose, facial expression and illumination have been taken into account to increase the robustness of the detector. Face recognition requires the user to manually label the first few occurrences of a person's face; all subsequent occurrences are then labelled automatically.
In addition to the automatically extracted image data, the user may also want to add their own annotations. This can be done by entering free text which is processed using natural language processing (NLP) tools.
The individual annotation processes are liable to errors: 'sea' may be incorrectly labelled as 'sky', for example, because in some lighting conditions their visual characteristics are similar. A reasoning process that uses domain knowledge, such as spatial relationships between objects, is therefore applied to remove inconsistent annotations and to infer new ones. The reasoning operates on the outputs of all the automatic annotation processes as well as on annotations entered by the user.
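A single spatial rule is enough to show the flavour of this consistency checking. The rule below (sky lies above sea) is invented for the example; in practice such constraints are derived from the domain ontology, and many rules are evaluated together:

```python
# Sketch of rule-based consistency checking over region labels. The
# one rule here (sky lies above sea) is invented for the example; real
# systems derive many such constraints from the domain ontology.

def check_consistency(regions):
    """regions: list of (label, top_y), with smaller top_y nearer the
    top of the image. Returns regions violating the sky-above-sea rule."""
    sea_tops = [y for lbl, y in regions if lbl == "sea"]
    violations = []
    for lbl, y in regions:
        # A region labelled "sky" that sits below every "sea" region
        # is suspect and flagged for relabelling.
        if lbl == "sky" and sea_tops and y > max(sea_tops):
            violations.append((lbl, y))
    return violations

# Two sky regions, one correctly above the sea and one (probably a
# misclassified reflection) below it:
regions = [("sky", 0), ("sea", 50), ("sky", 90)]
```

The flagged region would then be relabelled, for instance to the most plausible alternative the classifiers proposed, rather than simply discarded.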
The final output of the annotation system is a coherent set of global and region-based labels, consistent with the chosen ontology.
By automatically extracting semantically significant key frames from video, for example by detection of scene changes, many of these automatic annotation techniques can also be applied to video. Additionally, the use of motion information can increase the robustness of processes such as person detection. In videos, it is possible to identify not only objects in a scene but also events such as a player serving in a tennis match.
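Scene-change detection, the basis of the key-frame extraction mentioned above, can be sketched by comparing colour histograms of consecutive frames and opening a new shot when the difference jumps. The frames below are pre-computed two-bin histograms and the threshold is an illustrative value:

```python
# Sketch of key-frame selection by scene-change detection: a new shot
# starts where consecutive frame histograms differ sharply. Frames are
# given as pre-computed histograms; the threshold is illustrative.

def key_frames(histograms, threshold=0.5):
    """Return indices of frames that open a new shot."""
    keys = [0] if histograms else []
    for i in range(1, len(histograms)):
        # L1 distance halved, so identical frames score 0 and
        # completely disjoint histograms score 1.
        diff = sum(abs(a - b) for a, b in zip(histograms[i - 1], histograms[i])) / 2
        if diff > threshold:
            keys.append(i)
    return keys

frames = [
    [1.0, 0.0],  # shot 1: frames dominated by the first colour bin
    [0.9, 0.1],
    [0.1, 0.9],  # abrupt cut: distribution flips
    [0.0, 1.0],
]
```

Each selected index becomes a key frame, to which the still-image annotation pipeline (scene and region classification, person and face detection) is then applied.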
Currently image 'tagging' (labelling with keywords) is very popular on photo sharing websites. However, inconsistencies in labelling mean that many relevant images may not be found in a keyword search. For example, an image labelled with 'beach' will not be located by a keyword search for 'seaside'. aceMedia has integrated full NLP for both textual annotations entered by the user and for processing user queries. This provides for expansions of searches to allow for synonyms, and enables the user to carry out complex searches.
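The beach/seaside example corresponds to a simple expansion step at query time. A deployed system would draw synonyms from a lexical resource such as WordNet; the small table and toy index below are hand-made for the example:

```python
# Sketch of synonym-based query expansion. A real system would draw
# synonyms from a lexical resource such as WordNet; this table and the
# toy index are hand-made for the example.

SYNONYMS = {
    "beach": {"seaside", "seashore", "shore"},
    "seaside": {"beach", "seashore", "shore"},
}

def expand_query(terms):
    """Expand each query term with its known synonyms."""
    expanded = set()
    for term in terms:
        expanded.add(term)
        expanded |= SYNONYMS.get(term, set())
    return expanded

def search(index, terms):
    """index: tag -> set of image ids. OR-match over the expanded query."""
    hits = set()
    for term in expand_query(terms):
        hits |= index.get(term, set())
    return hits

index = {"beach": {"img1"}, "mountain": {"img2"}}
```

With expansion in place, a query for 'seaside' now retrieves the image tagged only 'beach', which a plain keyword match would have missed.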
Home Wi-Fi networks are becoming increasingly ubiquitous, enabling content to be easily transferred across devices. Indeed, many mobile phones feature fully-fledged digital cameras as well as wireless connectivity. The UPnP (Universal Plug and Play) standard allows consumer electronic devices to automatically detect each other on the same wireless network. DLNA (Digital Living Network Alliance) provides an evolving set of standardised mechanisms for sharing content across set-top boxes, mobile phones and PCs.
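UPnP's automatic detection rests on SSDP: a device multicasts an M-SEARCH request to 239.255.255.250 port 1900 and collects the unicast replies. Only the construction of that request is sketched here; actually sending it requires a UDP socket and a live network:

```python
# Sketch of the UPnP discovery step: build an SSDP M-SEARCH request,
# which would be multicast to 239.255.255.250:1900 over UDP. Only the
# message construction is shown; sending it needs network access.

def build_msearch(search_target="ssdp:all", mx=2):
    """Build an SSDP M-SEARCH request as raw bytes."""
    lines = [
        "M-SEARCH * HTTP/1.1",
        "HOST: 239.255.255.250:1900",
        'MAN: "ssdp:discover"',
        "MX: %d" % mx,             # max seconds a device may wait to reply
        "ST: %s" % search_target,  # search target, e.g. ssdp:all for everything
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Responding devices reply with their description URL, from which a control point learns what services (media server, renderer and so on) each device offers.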
As content migrates across different devices, it is important that all the related metadata, as well as any additional information required for its usage, remains available. For this, aceMedia developed the concept of the Autonomous Content Entity (ACE). An ACE encapsulates the content, its metadata and an intelligence layer. The intelligence layer consists of distributed functions that enable the content to instantiate itself according to its context: for example, video content being displayed on a set-top box will require transcoding before it can be viewed on the user's mobile phone.
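The ACE idea can be sketched as content travelling together with its metadata plus an adaptation step keyed to the target device. The device profiles and the width-only "transcoding" stub are invented for the example; a real intelligence layer covers codecs, bit rates and usage rights as well:

```python
# Sketch of the ACE concept: content bundled with its metadata and an
# adaptation step for the target device. The profiles and width-only
# "transcoding" are invented stand-ins for the real intelligence layer.

DEVICE_PROFILES = {
    "set_top_box": {"max_width": 1920},
    "mobile_phone": {"max_width": 320},
}

class ACE:
    def __init__(self, content_width, metadata):
        self.content_width = content_width
        self.metadata = metadata  # semantic annotations travel with the content

    def instantiate_for(self, device):
        """Adapt the content (here: downscale its width) to fit the device."""
        limit = DEVICE_PROFILES[device]["max_width"]
        width = min(self.content_width, limit)
        return {"width": width, "metadata": self.metadata}
```

The key property is that the metadata is never stranded on the originating device: whichever instantiation is produced, the annotations, and hence searchability, travel with it.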
Where domain-specific knowledge has been created and structured according to defined ontologies, automatic semantic annotation of image and video content is already a reality. By leveraging domain knowledge and reasoning techniques, these mechanisms can be made increasingly robust. Digital cameras from several manufacturers already support face detection, although primarily to improve the performance of auto-focus and auto-exposure.
Automatic semantic annotation enables intelligent automated content management, such as self-organisation and self-governance. These capabilities were very well received by consumers and professionals, and provide a strong indicator of the future direction of multimedia content management.
The authors are with Motorola Labs Applications Research Centre in Basingstoke, UK. Adrian Matellanes and Simon Waddington were the aceMedia project managers.
The authors acknowledge the contributions of Paola Hobson, Jonathan Teh, Alexander Baxevanis, Patricia Charlton (Motorola) and Esko Dijk, Freddy Snijder (Philips), as well as many aceMedia colleagues from Philips, Queen Mary University of London, Fraunhofer FIT, Telefónica I+D, France Telecom, Universidad Autónoma de Madrid, Alinari, Belgavox, INRIA, University of Koblenz-Landau, CERTH Informatics and Telematics Institute (ITI), and Dublin City University (DCU).
This research was supported by the EC IST FP6 project aceMedia (www.acemedia.org) under contract FP6-001765.