Sven Eberhardt: Project Profile

Current Research / Engineering Projects (2016)

Since I started in the amazing Serre Lab in late 2015, I've had the pleasure of applying my deep learning and computer vision knowledge to a number of exciting and incredibly diverse engineering and research projects across various disciplines! Here are some examples:

How Deep is Visual Processing in the Human Brain?

As computer vision gets better and better, deep convolutional models get deeper and deeper, and their performance surpasses human vision in many areas (He et al., 2015). Although hierarchical models of vision were once inspired by the primate visual cortex (Riesenhuber, 1999), do these models still relate to rapid visual processing in the human brain? In a large-scale human psychophysics study run on Mechanical Turk, we found out: yes, but only up to a certain depth. [NIPS 2016 Paper]
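The gist of the analysis can be sketched as follows: for each layer depth, correlate per-image model decision confidence with per-image human accuracy, then look for the depth where the correlation peaks. Everything below is synthetic toy data, not the actual study pipeline:

```python
import numpy as np

def correlation_by_depth(layer_confidences, human_accuracy):
    """layer_confidences: (n_layers, n_images) model confidence per layer.
    human_accuracy: (n_images,) fraction of subjects correct per image.
    Returns the Pearson correlation with human accuracy for each layer."""
    h = human_accuracy - human_accuracy.mean()
    corrs = []
    for conf in layer_confidences:
        c = conf - conf.mean()
        corrs.append((c * h).sum() / np.sqrt((c ** 2).sum() * (h ** 2).sum()))
    return np.array(corrs)

rng = np.random.default_rng(0)
human = rng.uniform(0.5, 1.0, size=200)          # per-image human accuracy
# Simulated layers: human-likeness rises with depth, then falls again.
layers = np.stack([w * human + rng.normal(0, 0.1, 200)
                   for w in (0.2, 0.6, 1.0, 0.4)])
corrs = correlation_by_depth(layers, human)
print("best-matching layer:", int(np.argmax(corrs)))
```

With the simulated data, the correlation peaks at the layer whose confidence tracks human accuracy most closely; in the real study the same curve was traced over the depth of actual CNNs.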

Clicktionary: Exploring the Atoms of Object Recognition

Where do humans look when presented with an image, where do they look to make a decision, and where should they look if we model human vision with a deep neural network? These three maps are not necessarily related. We find that not only is salience-based attention not identical to decision-based attention, but there is also a large discrepancy between the feature importance maps of humans and deep neural networks. We built a web app called Clicktionary in which humans select discriminative image regions and a backend CNN derives an object label decision based on this attention map. Built in NodeJS+Bootstrap; CNN backend in Flask. [Paper submitted]
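The core mechanic can be sketched like this: only the player-selected image regions are revealed to the classifier; everything else is grayed out. Function names and the 8x8 toy image below are invented for illustration; the real backend runs a CNN behind Flask:

```python
import numpy as np

def reveal_top_regions(image, attention, fraction=0.25, fill=0.5):
    """Keep the `fraction` most-attended pixels; gray out everything else."""
    thresh = np.quantile(attention, 1.0 - fraction)
    mask = attention >= thresh
    out = np.full_like(image, fill)
    out[mask] = image[mask]
    return out, mask

rng = np.random.default_rng(1)
image = rng.uniform(0, 1, size=(8, 8))       # stand-in for a photo
attention = rng.uniform(0, 1, size=(8, 8))   # stand-in for player clicks
masked, mask = reveal_top_regions(image, attention)
print("revealed pixels:", int(mask.sum()))   # 16 of 64 at fraction=0.25
```

Feeding such progressively revealed images to a classifier is one way to measure which regions actually drive its decision, mirroring the human attention maps collected in the game.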


Interactive Deep Dreaming

Ever since Google released its Deep Dream method and its various spin-offs appeared, I've been working on a real-time, interactive dreaming implementation that allows a human to draw cooperatively with a computer. The system is a web-based canvas that connects to a GPU-driven back-end that detects changes made by the human, guesses what the human wanted to draw, and then iteratively augments the human drawing with stored images in the deep model.
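At its heart, dreaming is iterative gradient ascent on a model activation with respect to the canvas. The toy sketch below uses a single linear filter as the "model" so the gradient is analytic; the real system backpropagates through a deep network:

```python
import numpy as np

def dream_step(canvas, filt, lr=0.1):
    """One gradient-ascent step amplifying the filter response.
    For a linear response r = sum(canvas * filt), the gradient with
    respect to the canvas is simply the filter itself."""
    grad = filt
    canvas = canvas + lr * grad / (np.abs(grad).max() + 1e-8)
    return np.clip(canvas, 0.0, 1.0)

rng = np.random.default_rng(2)
canvas = rng.uniform(0, 1, size=(16, 16))           # the "human drawing"
filt = np.zeros((16, 16))
filt[4:12, 4:12] = 1.0                              # toy feature detector
before = float((canvas * filt).sum())
for _ in range(10):                                 # backend augments iteratively
    canvas = dream_step(canvas, filt)
after = float((canvas * filt).sum())
print(after > before)                               # the response grows
```

The interactive twist is simply that between gradient steps, the human keeps editing the same canvas, so the loop alternates between human strokes and machine amplification.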

A Deep Shape Model

Although it's not quite clear what exactly current deep convolutional vision models encode, we found that most image categorization models are image-pattern-based and have no real notion of the 3D structure of the scene shown to the network. In this project, we leverage massive 3D image rendering via an automated Blender pipeline to feed automatically generated renderings into a convolutional network along with their ground-truth 3D structure. We have developed a unified network (SvenNet) that combines 3D structure and object identity in a single network on a shared feature representation. Using this network, we are able to estimate 3D shape information on actual real-world scenes.
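The shared-representation idea can be sketched as one trunk feeding two heads, one for object identity and one for 3D structure. The weights below are random and all sizes are made up; this shows only the wiring, not the trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(3)
in_dim, feat_dim, n_classes, side = 256, 64, 10, 28
W_shared = rng.normal(0, 0.1, (in_dim, feat_dim))        # shared trunk
W_cls = rng.normal(0, 0.1, (feat_dim, n_classes))        # identity head
W_depth = rng.normal(0, 0.1, (feat_dim, side * side))    # 3D-structure head

def forward(x):
    shared = relu(x @ W_shared)     # one feature representation...
    logits = shared @ W_cls         # ...feeds the object-identity head
    depth = (shared @ W_depth).reshape(side, side)  # ...and the depth head
    return logits, depth

logits, depth = forward(rng.normal(size=(in_dim,)))
print(logits.shape, depth.shape)    # (10,) (28, 28)
```

Because both heads pull on the same shared features during training, the representation is forced to carry both pattern and shape information, which is what lets the trained network predict 3D structure on real photographs.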


Bird Tracking in the Wild

Tracking birds of different species in the wild from low-temporal-resolution camera streams, with moving backgrounds, moving cameras, and distractors flying through the image, is a challenging task! Existing motion-based background subtraction models fail when the scene contains moving elements or environmental changes.
We collected about 100,000 outdoor bird images and control frames from live cameras to train a combined motion- and image-based deep convolutional network that detects frame-perfect bird trajectories in various settings, including occlusion by moving grass and branches. The final model combines automatic background subtraction, a deep convolutional network, and some Kalman-like filtering hacks to achieve a versatile bird tracker that reaches 97% tracking accuracy.
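The "Kalman-like filtering" part can be illustrated with a standard constant-velocity Kalman filter that smooths noisy per-frame detections into a trajectory. All parameters below are illustrative, not the tracker's actual settings:

```python
import numpy as np

class ConstantVelocityKF:
    """Tracks state [px, py, vx, vy]; observes noisy (px, py) detections."""
    def __init__(self, q=1e-2, r=1.0):
        self.x = np.zeros(4)                   # state estimate
        self.P = np.eye(4) * 10.0              # state covariance
        self.F = np.eye(4)                     # constant-velocity dynamics
        self.F[0, 2] = self.F[1, 3] = 1.0
        self.H = np.zeros((2, 4))              # we only observe position
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * q                 # process noise
        self.R = np.eye(2) * r                 # measurement noise

    def step(self, z):
        # predict
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        # update with detection z = (px, py)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2].copy()

rng = np.random.default_rng(4)
truth = [(2.0 * t, 1.0 * t) for t in range(20)]    # bird flying on a line
noisy = [(x + rng.normal(0, 0.5), y + rng.normal(0, 0.5)) for x, y in truth]
kf = ConstantVelocityKF()
smoothed = [kf.step(z) for z in noisy]
err_raw = np.mean([np.hypot(a[0] - b[0], a[1] - b[1])
                   for a, b in zip(noisy, truth)])
err_kf = np.mean([np.hypot(a[0] - b[0], a[1] - b[1])
                  for a, b in zip(smoothed[5:], truth[5:])])
print(err_kf < err_raw)   # after a short transient, smoothing wins
```

In the actual tracker, the measurements come from the CNN detector rather than synthetic noise, and the filter additionally bridges frames where the bird is briefly occluded by grass or branches.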

Deep Leaf Recognition

The diversity and richness of leaves in the different floras of the world is both remarkable and beautiful. Between size and shape, primary and secondary veins, and blade and trunk structure, we found evidence for a strong phylogenetic signal that allows taxonomic classification into families and orders from leaf structure alone by training a deep vision architecture. While even experts disagree on the precise ordering of families, the deep model reliably predicts the families of unknown leaves. Using a deep transfer learning technique, we were even able to transfer family classification into the fossil domain to make assessments of extinct botanical species. [GSA Abstract]
We're now expanding our method through a collaboration with several Northeastern herbaria, allowing us to train our network architecture on millions of annotated specimens from all over the world. We're also extending the deep network to a Siamese model, which is able to predict genetic distance between species and build its taxonomic representation from both visual and available genetic data.
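A Siamese model runs one shared embedding function on both inputs and trains the distance between the embeddings to match a target, here genetic distance. The sketch below uses an untrained linear embedding purely to show the structure; the real model is a deep network:

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.normal(0, 0.1, (128, 16))   # ONE weight matrix shared by both twins

def embed(x):
    return x @ W                     # identical embedding for each branch

def siamese_loss(x1, x2, genetic_distance):
    """Penalize mismatch between embedding distance and genetic distance."""
    d = np.linalg.norm(embed(x1) - embed(x2))
    return (d - genetic_distance) ** 2

x1, x2 = rng.normal(size=(2, 128))   # two leaf feature vectors
loss_pair = siamese_loss(x1, x2, genetic_distance=5.0)
loss_self = siamese_loss(x1, x1, genetic_distance=0.0)
print(loss_self)   # identical inputs with zero target distance -> 0.0
```

Because both branches share weights, gradients from every pair update the same embedding, which is what lets distances in embedding space come to mirror genetic distances.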


OpenClonk

When I find the time (which is not much these days), I like to contribute to an open-source game called OpenClonk. The game is written in C++ and is cross-platform for Windows, Linux and Mac. I have contributed since I was a kid, for about 15 years, and I'm the top contributor by commit count ;-), so I feel it deserves a mention here. [project homepage] [github contributions]


MIT Hacking Medicine Grand Hack

We won 1st place in "Aging in Place" at the MIT Grand Hack 2016 with our AlzEYEmers project. The hack is a portable camera that feeds its stream into a deep net to warn of common household dangers for Alzheimer's patients. I built the computer vision backend based on a simple VGG16 with some fine-tuning. [visitor blog post]

TVNext hackathon

We won 2nd place at the TVnext hackathon with Assistive TV. I generated captions using NeuralTalk, and we hooked it up to a TV stream input so viewers could press a button to get an on-demand scene description of what was currently happening on TV.

Jupyter Forwarder

For our non-technical students, I wrote a Jupyter notebook forwarder that handles secured login, automatically schedules notebook sessions as jobs on our cluster, tunnels connections, and forwards their current homework assignments. Flask/Paramiko/SLURM plus lots of bash script magic.
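Roughly, the forwarder has to assemble two things per student: an sbatch job that launches the notebook on a compute node, and an SSH tunnel that forwards the notebook port back. The sketch below only builds the command strings; every name, host, and port is invented, and the real service drives the SSH leg through Paramiko:

```python
import shlex

def sbatch_command(user, port, time_limit="04:00:00"):
    """Build the sbatch call that schedules one student's notebook server."""
    wrap = f"jupyter notebook --no-browser --ip=0.0.0.0 --port={port}"
    return ["sbatch", f"--job-name=nb-{user}", f"--time={time_limit}",
            f"--wrap={wrap}"]

def tunnel_command(user, node, port, gateway="cluster.example.edu"):
    """Build the ssh call that forwards the compute node's notebook port."""
    return ["ssh", "-N", "-L", f"{port}:{node}:{port}", f"{user}@{gateway}"]

cmd = sbatch_command("student42", 8890)
tun = tunnel_command("student42", "gpu-node-07", 8890)
print(shlex.join(cmd))
print(shlex.join(tun))
```

The web frontend then only needs to proxy the forwarded local port to the student's browser session after login.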

Past Projects

Brain Explained!

The human brain is a truly miraculous device. We perceive our surroundings through vision, audition and haptics as a consistent world, seemingly effortlessly. When we think about the world we live in, we reason in terms of objects, spaces or affordances not necessarily tied to a modality. But how exactly do we reach a coherent representation just from sensor input coming from our eyes, ears and skin? How do we learn to understand, reason about, and act upon our surrounding world?

My research focus lies on mechanisms of visual processing in the human brain. My work experience ranges from data recording on humans through psychophysics and fMRI, through modeling with both flat and hierarchical (deep) models, to applications using bio-inspired models for pattern recognition tasks. Lately, I've been mostly into modeling human vision using the latest craze in Deep Learning. Some key research results are described below.

Vision-based localization

How can we localize ourselves when only one-shot visual data is available, without prior knowledge (the "lost robot problem")? Classic approaches often try to identify and localize landmarks to derive hypothetical locations. However, we found that localization works surprisingly well on global texture statistics alone, ignoring any specific objects [3][9]. This result is stable across many datasets and at different scales (indoor room localization; outdoor localization at world, country and city scale; and virtual world localization [to be published]).
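The flavor of "localization from summary statistics alone" can be shown with a toy descriptor: a global gradient-orientation histogram per place, matched by nearest neighbor, with no objects or landmarks involved. The data below is synthetic, and this is far simpler than the actual descriptor in [4]:

```python
import numpy as np

def texture_descriptor(img, bins=8):
    """Global gradient-orientation histogram: a summary statistic with
    no notion of objects or landmarks."""
    gy, gx = np.gradient(img.astype(float))
    ang = np.arctan2(gy, gx)
    hist, _ = np.histogram(ang, bins=bins, range=(-np.pi, np.pi))
    return hist / hist.sum()

def localize(query_desc, place_descriptors):
    """Nearest neighbor over the summary statistics."""
    dists = [np.abs(query_desc - d).sum() for d in place_descriptors]
    return int(np.argmin(dists))

rng = np.random.default_rng(6)
# Two synthetic "places" with different dominant texture orientations.
horiz = np.tile(np.sin(np.linspace(0, 8 * np.pi, 32)), (32, 1)).T
vert = horiz.T
places = [texture_descriptor(horiz), texture_descriptor(vert)]
query = texture_descriptor(vert + rng.normal(0, 0.05, vert.shape))
print("localized to place:", localize(query, places))   # place 1
```

Even with noise added, the query's orientation statistics stay closest to the correct place, which is the basic effect the studies above exploit at much larger scale.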

Through experiments with human subjects, we found that primate peripheral vision may be tuned to solve tasks like localization based on these low-level summary-statistic features [Journal of Vision, S.I. on periphery, to appear 2016].

We have developed a novel descriptor that dynamically splits textons into large image regions to perform the localization task [4]. We tested the descriptor on an existing dataset (INDECS) and two novel datasets, one based on massively downloaded Google StreetView data and one based on the automatic capture of 10,000 images from a virtual-world computer game (Skyrim).

We also tested the applicability of image descriptors for localization compared to other tasks, and found that localization is distinct from most common vision-based machine learning tasks such as object detection and scene categorization: due to orthogonal invariance requirements, it differs even in the lowest-level features it employs. We propose an integrated, multi-pathway model that mimics specialization in the ventral stream of the human visual system to account for the tasks' different feature affordances [6].

Common models of the human visual system assume a strictly one-dimensional hierarchy of increasingly complex features that eventually yields general-purpose semantic features on which specialized classifiers may run. Our results challenge this theory and suggest that an optimal visual system should provide specialized, task-dependent units early in the processing hierarchy [7].

Unfortunately, psychophysics experiments to find task-dependent correlates to our models in human performance were not conclusive.

Cross-modal Deep Learning

We present a novel learning approach of "making sense of the senses". The idea is that a deep learning model takes natural video and audio input and tries to build a representation that maximizes correlations between the processed senses. In this way, audio signals train the video processing system, and visual features train the auditory processing system.

While other deep learning mechanisms require large, carefully labeled datasets relevant to the task, our approach can learn in an unsupervised manner from non-annotated natural scene videos. The big advantage over other unsupervised methods such as RBMs is that correlations between the senses usually lie on a higher, semantic level.
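The linear analogue of this correlation-maximizing objective is canonical correlation analysis: find projections of video and audio features whose outputs correlate maximally. The sketch below computes the top canonical correlation for synthetic paired features driven by a shared latent cause; the deep model replaces the linear projections with learned networks:

```python
import numpy as np

def top_canonical_correlation(X, Y, eps=1e-6):
    """X: (n, dx) video features, Y: (n, dy) audio features (paired rows).
    Whitens both sides; the singular values of the whitened cross-covariance
    are the canonical correlations."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    Cxx = X.T @ X / len(X) + eps * np.eye(X.shape[1])
    Cyy = Y.T @ Y / len(Y) + eps * np.eye(Y.shape[1])
    Cxy = X.T @ Y / len(X)
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy))
    s = np.linalg.svd(Wx @ Cxy @ Wy.T, compute_uv=False)
    return float(s[0])

rng = np.random.default_rng(7)
latent = rng.normal(size=(500, 1))             # shared cause (e.g. object size)
video = latent @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
audio = latent @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(500, 4))
paired = top_canonical_correlation(video, audio)
shuffled = top_canonical_correlation(video, audio[rng.permutation(500)])
print(paired > 0.9, shuffled < 0.5)   # pairing carries the signal
```

When the audio rows are shuffled, the shared cause is destroyed and the correlation collapses, which is exactly why the pairing of natural video and audio streams acts as a free training signal.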

We've shown that by training between auditory and visual modalities, we can produce receptive field properties that encode object size both in the visual and in the auditory domain. The trained model is able to predict audio signals from video input when only one modality is available. When both modalities are shown, the model can explain human psychophysics results of multi-modal thresholds when detecting object and sound movement. [to be published]

We are currently extending this model to proprioception using IMU data and human POV cameras. We are also working on training between submodalities within vision such as color, luminance, movement and stereo vision. Another project in progress is to cross-train visual inputs from fMRI data.

Hierarchical vision models

Based on results of the HMAX model, we looked at the effects of modifying a number of parameters on performance and on unsupervised learning results.

We varied a nonlinearity step within the feature processing hierarchy between a MIN and a MAX function and found that, when training with Slow Feature Analysis on natural videos, cells wired with a MAX function would connect similar patterns across different locations (translation invariance), while the MIN function would cause different features to be pooled within the same location (feature specificity). Unfortunately, the effect was limited to low stages and could not be found at higher levels of the hierarchy.[11]
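The MAX/MIN contrast can be illustrated on a toy response matrix: MAX over locations keeps a feature's response wherever it appears (translation invariance), while MIN across features responds only where all features co-occur (feature specificity). The numbers are made up for illustration:

```python
import numpy as np

# responses[feature, location]: two detectors firing at different
# locations; both co-occur only at location 1.
responses = np.array([
    [0.9, 0.8, 0.0],
    [0.0, 0.8, 0.9],
])

max_pool = responses.max(axis=1)   # per feature, pooled over locations
min_pool = responses.min(axis=0)   # per location, pooled over features

print(max_pool)  # strong response for both features, wherever they fired
print(min_pool)  # nonzero only where BOTH features co-occur (location 1)
```

MAX discards where a pattern appeared, yielding invariance; MIN acts as a conjunction detector, yielding specificity, which matches the pooling behaviors observed in the SFA experiments above.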

We found that while adding submodalities such as color to the processing hierarchy added little performance value on common tasks such as object recognition[13], it greatly improved diagnosticity for a localization task[10]. We also tested a naive stereo matching approach in which disparity features were calculated and fused into the processing pipeline. Although not useful for 3D object recognition, the approach was able to predict 3D structure from monocular images after training on binocular images[5].

Prism adaptation on virtual hand

Using a Geomagic Phantom haptic device and a setup using a mirror and a mounted screen, we built a virtual reality setup in which subjects could move a rod and see a virtual display of their own hand in real time. The program could be adjusted to allow manipulations on the virtual hand such as shifting as if the subject was wearing prism glasses. This setup was used to study dual adaptation paradigms in prism adaptation.[8]

Gaze Perimetry

Perimetry is a method to estimate human visual field capabilities and defects. Classic perimetry works by showing small dots at different luminance thresholds and having subjects press a button whenever a dot is seen. Measuring a complete visual field this way is time-consuming and tiring, especially for older subjects. Gaze perimetry is a novel method in which subjects simply follow a dot jumping across the screen, greatly increasing comfort for the tested subject. Method patented by Manfred Fahle.

Peer-reviewed publications


[0] Eberhardt, S.**, Cader, J.** & Serre, T. "How Deep is the Feature Analysis underlying Rapid Visual Categorization?" NIPS, 2016. [Preprint PDF]* [arXiv]

[1] Linsley, D., Eberhardt, S., Gupta, P. & Serre, T. "Clicktionary: A web-based game for exploring the atoms of object recognition" (Paper in submission).

[2] Eberhardt, S., Zetzsche, C. & Schill, K. "Peripheral pooling is tuned to the localization task." To appear in Journal of Vision, S.I. on peripheral vision, 2016. [PDF]*

[3] Eberhardt, S., & Zetzsche, C. "Self-localization on texture statistics." Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014. [PDF]*

[4] Eberhardt, S. (2014). "Indoor place categorization based on adaptive partitioning of texture histograms." ScienceOpen Research. doi:10.14293/S2199-1006.1.SOR-COMPSCI.AT3KLK.v1 [PDF][supplement]*

[5] Eberhardt, S., Fahle, M., & Zetzsche, C. (2013). Stereo features in a hierarchical feed-forward model. In 16. Anwendungsbezogener Workshop zur Erfassung, Modellierung, Verarbeitung und Auswertung von 3D-Daten (p. 86–93).[PDF]*

[6] Eberhardt, S., & Zetzsche, C. (2013). "Low-level global features for vision-based localization." In M. Ragni, M. Raschke, & F. Stolzenburg (Eds.), on Visual and Spatial Cognition (Vol. 1055, pp. 5–12). Koblenz: CEUR Workshop Proceedings.[PDF]*

[7] Eberhardt, S., Kluth, T., Zetzsche, C., & Schill, K. (2012). "From pattern recognition to place identification." In M. Vasardani, S. Winter, K.-F. Richter, J. Krzysztof, & W. Mackaness (Eds.), Spatial cognition, international workshop on place-related knowledge acquisition research (pp. 39–44). Monastery Seeon, Germany: CEUR Workshop Proceedings.[PDF]*

[8] Arévalo, O., Bornschlegl, M. A., Eberhardt, S., Ernst, U., Pawelzik, K., & Fahle, M. (2013). Dynamics of Dual Prism Adaptation: Relating Novel Experimental Results to a Minimalistic Neural Model. PloS one, 8(10), e76601.[Link]


[9] Eberhardt, S., & Zetzsche, C. "Self-localization on texture statistics." Image Processing (ICIP), 2014 IEEE International Conference on. IEEE, 2014. [PDF]

[10] Eberhardt, S., Zetzsche, C., Fahle, M., & Schill, K. (2013). "Merging color and shape in a hierarchical pattern recognition model." In Perception, ECVP (p. 153). [PDF]

[11] Eberhardt, S., Kluth, T., Zetzsche, C., & Schill, K. (2013). "Slow features encode high-level concepts on HMAX outputs." Osnabrück Computational Cognition Alliance Meeting. [PDF]

[12] Kluth, T., Eberhardt, S., Zetzsche, C., & Schill, K. (2012). "Slow features between invariance and selectivity." Osnabrück Computational Cognition Alliance Meeting. [PDF]

[13] Eberhardt, S., Kluth, T., Fahle, M., & Zetzsche, C. (2012). "The role of nonlinearities in hierarchical feed-forward models for pattern recognition." In Perception, ECVP (p. 241).[PDF]


Eberhardt, S. (2010). "Implementation and analysis of a neuromorphic system for invariant object recognition in real time." Diploma thesis, University of Bremen.

Eberhardt, S., Kluth, T., Reineking, T., Zetzsche, C., & Schill, K. (2012). "Models for invariant place recognition." In Proceedings of KogWis (p. 10).

Nguyen, T., Eberhardt, S., Wilf, P., Wing, S. & Serre, T. (2016). "Automated Leaf Analysis with Deep Learning and its Potential for the Fossil Record." [GSA Abstract]

*PDF download is a preprint made available for research purposes only. Redistribution is prohibited. Publisher may hold copyrights on the final version.

**First authors contributed equally.