Selected Publications


Deep Quantitative Susceptibility Mapping by combined Background Field Removal and Dipole Inversion

Stefan Heber, Christian Tinauer, Steffen Bollmann, Stefan Ropele and Christian Langkammer
Proceedings of the 27th Annual Meeting of the ISMRM, 2019

Deep learning based on u-shaped architectures has been successfully used for the dipole inversion that is crucial to Quantitative Susceptibility Mapping (QSM). In the present work we propose a novel deep regression network that stacks two u-shaped networks, so that background field removal and dipole inversion can be performed in a single feed-forward network architecture. Based on learning the theoretical forward model from synthetic data examples, we show a proof of concept for solving the background field problem and the dipole inversion in a single end-to-end trained network using in vivo Magnetic Resonance Imaging (MRI) data.
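For context (this is the textbook dipole model, not a detail quoted from the abstract above): in the Fourier domain the QSM forward problem relates the measured field perturbation to the susceptibility distribution via a pointwise multiplication with the unit dipole kernel,
$$ \Delta B(\mathbf{k}) = B_0 \left( \frac{1}{3} - \frac{k_z^2}{|\mathbf{k}|^2} \right) \chi(\mathbf{k})\,, $$
where $\chi(\mathbf{k})$ denotes the Fourier transform of the susceptibility map and $B_0$ the main magnetic field. Background field removal separates field contributions from sources outside the region of interest, and dipole inversion then recovers $\chi$ from the remaining local field; the stacked network performs both steps in one forward pass.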

@inproceedings{heberismrm2019,
booktitle={Proceedings of the 27th Annual Meeting of the ISMRM},
year={2019},
location={Montreal, QC, Canada},
title={Deep Quantitative Susceptibility Mapping by combined Background Field Removal and Dipole Inversion},
author={Stefan Heber and Christian Tinauer and Steffen Bollmann and Stefan Ropele and Christian Langkammer},
editor={International Society for Magnetic Resonance in Medicine},
}
Illustration of the proposed network architecture.
The overall network structure builds upon u-shaped networks. Those networks involve two symmetric parts, an encoding and a decoding part, highlighted in red and green, respectively. The encoding and decoding parts split up into several levels, which are connected via down- and up-convolutional layers, respectively. Each level consists of three convolutional layers (marked in blue), each followed by a leaky ReLU non-linearity. Note that we only use filter kernels of size $3 \times 3 \times 3$, and each convolutional layer employs $\max\{16, 8 \cdot 2l\}$ filters, where $l$ denotes the current level. Overall, our network stacks two u-shaped sub-networks, where the first is trained to remove the background field and the second performs the dipole inversion.
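A minimal PyTorch sketch of the stacked architecture described above; this is an illustrative reconstruction rather than the authors' implementation, and the number of levels, channel widths and skip-connection details are simplified assumptions:

# Minimal PyTorch sketch of two stacked 3D u-shaped sub-networks
# (background field removal followed by dipole inversion).
# Illustrative reconstruction, NOT the authors' code: the number of
# levels, channel widths and normalization choices are assumptions.
import torch
import torch.nn as nn

def conv_block(ch_in, ch_out):
    # one level: 3x3x3 convolutions, each followed by a leaky ReLU
    return nn.Sequential(
        nn.Conv3d(ch_in, ch_out, 3, padding=1), nn.LeakyReLU(0.1),
        nn.Conv3d(ch_out, ch_out, 3, padding=1), nn.LeakyReLU(0.1),
    )

class UNet3D(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.enc1 = conv_block(1, ch)
        self.down = nn.Conv3d(ch, 2 * ch, 3, stride=2, padding=1)   # down-convolution
        self.enc2 = conv_block(2 * ch, 2 * ch)
        self.up = nn.ConvTranspose3d(2 * ch, ch, 2, stride=2)       # up-convolution
        self.dec1 = conv_block(2 * ch, ch)                          # after skip concatenation
        self.out = nn.Conv3d(ch, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.down(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.out(d1)

class StackedQSMNet(nn.Module):
    """First u-net removes the background field, second performs dipole inversion."""
    def __init__(self):
        super().__init__()
        self.bg_removal = UNet3D()
        self.dipole_inv = UNet3D()

    def forward(self, total_field):
        local_field = self.bg_removal(total_field)
        susceptibility = self.dipole_inv(local_field)
        return local_field, susceptibility

# usage: phi = torch.randn(1, 1, 64, 64, 64); local, chi = StackedQSMNet()(phi)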
Result of the proposed method
Panels (left to right): input, intermediate result (axial), prediction, ground truth.

Illustration of real-world data from in vivo MRI scans. The figure shows axial middle slices from real-world brain data taken from the 2016 QSM reconstruction challenge [1]. We can observe that the network is indeed able to first remove the low-frequency background field and afterwards perform the dipole deconvolution.


[1] C. Langkammer, F. Schweser, K. Shmueli, C. Kames, X. Li, L. Guo, C. Milovic, J. Kim, H. Wei, K. Bredies, S. Buch, Y. Guo, Z. Liu, J. Meineke, A. Rauscher, J. P. Marques, and B. Bilgic, “Quantitative susceptibility mapping: Report from the 2016 reconstruction challenge,” Magnetic Resonance in Medicine, vol. 79, no. 3, pp. 1661–1673, 2018.

Neural EPI-volume Networks for Shape from Light Field

Stefan Heber, Wei Yu and Thomas Pock
IEEE International Conference on Computer Vision (ICCV), 2017

This paper presents a novel deep regression network to extract geometric information from Light Field (LF) data. Our network builds upon u-shaped network architectures. Those networks involve two symmetric parts, an encoding and a decoding part. In the first part the network encodes relevant information from the given input into a set of high-level feature maps. In the second part the generated feature maps are then decoded to the desired output. To predict reliable and robust depth information the proposed network examines 3D subsets of the 4D LF called Epipolar Plane Image (EPI) volumes. An important aspect of our network is the use of 3D convolutional layers, which allow information to be propagated across two spatial dimensions and one directional dimension of the LF. Compared to previous work this allows for an additional spatial regularization, which reduces depth artifacts and simultaneously maintains clear depth discontinuities. Experimental results show that our approach produces high-quality reconstruction results that outperform current state-of-the-art Shape from Light Field (SfLF) techniques. The main advantage of the proposed approach is the ability to provide those high-quality reconstructions at low computation time.
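To make the notion of an EPI volume concrete, the following NumPy sketch slices a 3D subset (one directional and two spatial dimensions) out of a 4D light field; the array layout LF[s, t, y, x] is an assumption for illustration, not the paper's data format:

# Extract an EPI volume from a 4D light field stored as LF[s, t, y, x]:
# (s, t) index the view (directional) dimensions, (y, x) the spatial ones.
# The array layout is an assumption made for this example.
import numpy as np

def epi_volume(lf: np.ndarray, t_fixed: int) -> np.ndarray:
    """Fix one directional coordinate t and keep (s, y, x):
    a 3D EPI volume with one directional and two spatial dimensions."""
    return lf[:, t_fixed, :, :]

lf = np.random.rand(9, 9, 128, 128)      # 9x9 views of 128x128 pixels
vol = epi_volume(lf, t_fixed=4)          # shape (9, 128, 128)
print(vol.shape)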

@INPROCEEDINGS{8237509,
author={S. Heber and W. Yu and T. Pock},
booktitle={2017 IEEE International Conference on Computer Vision (ICCV)},
title={Neural EPI-Volume Networks for Shape from Light Field},
year={2017},
volume={},
number={},
pages={2271-2279},
keywords={decoding;estimation theory;image coding;image reconstruction;neural nets;regression analysis;set theory;reliable depth information;robust depth information;3D convolutional layers;spatial dimensions;directional dimension;depth artifacts;high-quality reconstructions;neural EPI-volume networks;deep regression network;u-shaped network architectures;encoding;high-level feature maps;spatial regularization;depth discontinuities;high-quality reconstruction;decoding;generated feature maps;epipolar plane image volumes;light field techniques;light field data;geometric information extraction;LF data;3D subsets;4D LF;SfLF techniques;Three-dimensional displays;Two dimensional displays;Image reconstruction;Network architecture;Machine learning;Shape},
doi={10.1109/ICCV.2017.247},
ISSN={2380-7504},
month={Oct},}
Qualitative result of the proposed method

The figure shows the center view of the Light Field (LF), the color-coded ground truth, and the result of the proposed method.

Quantitative results for various Shape from Light Field methods averaged over $50$ synthetic Light Fields.

          Wanner [6]   Tao [5]    Heber [1]   Jeon [4]      Heber [2]   Heber [3]   proposed
RMSE      3.91         2.33       2.50        2.49          1.87        0.80        0.83
MAE       2.94         1.06       0.79        0.75          1.13        0.35        0.34
0.5%      22.00        16.32      8.47        9.64          17.96       7.34        7.28
0.2%      35.22        28.48      13.20       16.46         31.61       14.76       14.42
Time      3min 18s     23min 4s   4min 44s    2h 12min 30s  35s         2s          0.8s

The table provides the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), the percentage of pixels with a relative disparity error larger than $0.2\%$ and $0.5\%$, and the computational time of each method.


[1] Stefan Heber and Thomas Pock. Shape from light field meets robust PCA. In Proceedings of the 13th European Conference on Computer Vision, 2014.
[2] Stefan Heber and Thomas Pock. Convolutional Networks for Shape from Light Field. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] S. Heber, W. Yu, and T. Pock. U-shaped networks for shape from light field. In Proc. British Machine Vision Conf., 2016.
[4] H. G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. W. Tai, and I. S. Kweon. Accurate depth map estimation from a lenslet light field camera. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1547–1555, June 2015.
[5] Michael W. Tao, Sunil Hadap, Jitendra Malik, and Ravi Ramamoorthi. Depth from combining defocus and correspondence using light-field cameras. In International Conference on Computer Vision (ICCV), December 2013.
[6] S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4D lightfields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

Machine Learning Applications in Computer Vision

Wei Yu
PhD Thesis, 2017

Machine Learning (ML) is a branch of Artificial Intelligence (AI) that describes methods to automate analytical model building. Related methods provide computers with the ability to learn without the need for explicit programming, i.e. these methods learn from and make predictions on data. Because of their ability to tackle problems involving large-scale data, ML methods can be applied across many industrial applications, such as fraud detection, driving assistance, image classification and image captioning. The first ML algorithm, the so-called perceptron algorithm, was introduced in 1957 for the task of binary classification. Later it was generalized to the multi-layer perceptron, the prototype of the neural network. ML algorithms such as Deep Neural Networks (DNNs) or Support Vector Machines (SVMs) are based on large-scale optimization problems, where the choice of the optimization method has an influence on the training, and consequently on the solution obtained for the corresponding ML task, that should not be underestimated.

The two main topics of this thesis are ML methods and large-scale optimization techniques. Besides giving an overview of theoretical foundations, including topics like optimization algorithms and inverse problems, this thesis reviews several classical and popular ML problems, including dimensionality reduction, multi-task learning, feature selection, SVMs, matrix completion and matrix factorization. We compare a large variety of first-order optimization algorithms on these ML problems, including the Forward-Backward Splitting method (FBS), the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA), the Optimal Subgradient Algorithm (OSGA), and the Primal-Dual Algorithm (PDCP). We also present an online variant of PDCP, which provides faster empirical convergence than PDCP. In the remaining part of the thesis, we apply ML methods to three fundamental Computer Vision (CV) problems: Single-Image Inpainting, Shape from Light Field (SfLF) and Light Field Super-Resolution (LFSR).

Firstly, we present a diffusion model for image inpainting. The model is based on the so-called trained Reaction-Diffusion Model (RDM), which can also be seen as a Recurrent Neural Network (RNN). We train two diffusion models based on the Structural Similarity Measure (SSIM) for two types of inpainting tasks. The first one is a generic model, which aims to recover large regions of scattered pixels or to restore small connected regions in images. The second model is referred to as a specific model, which aims to inpaint the texture of a given image. The inpainting results show that both models can transform corrupted images into visually pleasing images.

Secondly, we use u-shaped Convolutional Neural Networks (CNNs) for the task of depth estimation in the Light Field (LF) setting, which is referred to as Shape from Light Field. We train our network on so-called Epipolar Plane Images (EPIs), rather than the entire LF. As a result, our network is able to provide efficient and accurate reconstructions. We demonstrate the superior performance of our network on synthetic and real-world data. In addition, we present a comprehensive comparison to the state of the art.

Finally, we turn to the problem of LFSR. Here we extend the trained RDM to perform 3D filtering in the LF setting, i.e. the trained models explore more than just a single view. To be efficient, we only recover the high-resolution center view of a so-called EPI volume. By comparing the results of our 3D RDM to the results of a corresponding 2D RDM, we are able to show the benefit of the proposed approach.

U-shaped Networks for Shape from Light Field

Stefan Heber, Wei Yu and Thomas Pock
Proceedings of the British Machine Vision Conference (BMVC), 2016

This paper presents a novel technique for Shape from Light Field (SfLF) that utilizes deep learning strategies. Our model is based on a fully convolutional network that involves two symmetric parts, an encoding and a decoding part, leading to a u-shaped network architecture. By leveraging a recently proposed Light Field (LF) dataset, we are able to effectively train our model in a supervised manner. To process an entire LF we split the LF data into the corresponding Epipolar Plane Image (EPI) representation and predict each EPI separately. This strategy provides good reconstruction results combined with a fast prediction time. In the experimental section we compare our method to the state of the art. The method performs well in terms of depth accuracy and is able to outperform competing methods in terms of prediction time by a large margin.

@inproceedings{BMVC2016_37,
title={U-shaped Networks for Shape from Light Field},
author={Stefan Heber and Wei Yu and Thomas Pock},
year={2016},
month={September},
pages={37.1-37.12},
articleno={37},
numpages={12},
booktitle={Proceedings of the British Machine Vision Conference (BMVC)},
publisher={BMVA Press},
editor={Richard C. Wilson and Edwin R. Hancock and William A. P. Smith},
doi={10.5244/C.30.37},
isbn={1-901725-59-6},
url={https://dx.doi.org/10.5244/C.30.37} }
Illustration of the proposed u-shaped network architecture.
The encoding and decoding parts of the network are highlighted in purple and green, respectively. The pinhole connections are marked in blue.
Qualitative result of the proposed method
The figure shows, from left to right, the center view of the Light Field (LF) and the color-coded result of the proposed method.

Quantitative results for various Shape from Light Field methods averaged over $50$ synthetic Light Fields.

          Wanner [5]   Tao [4]    Heber [1]   Jeon [3]      Heber [2]   proposed
RMSE      3.91         2.33       2.50        2.49          1.87        0.80
MAE       2.94         1.06       0.79        0.75          1.13        0.35
0.5%      22.00        16.32      8.47        9.64          17.96       7.34
0.2%      35.22        28.48      13.20       16.46         31.61       14.76
Time      3min 18s     23min 4s   4min 44s    2h 12min 30s  35s         2s

The table provides the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), the percentage of pixels with a relative disparity error larger than $0.2\%$ and $0.5\%$, and the computational time of each method.


[1] Stefan Heber and Thomas Pock. Shape from light field meets robust PCA. In Proceedings of the 13th European Conference on Computer Vision, 2014.
[2] Stefan Heber and Thomas Pock. Convolutional Networks for Shape from Light Field. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[3] H. G. Jeon, J. Park, G. Choe, J. Park, Y. Bok, Y. W. Tai, and I. S. Kweon. Accurate depth map estimation from a lenslet light field camera. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1547–1555, June 2015.
[4] Michael W. Tao, Sunil Hadap, Jitendra Malik, and Ravi Ramamoorthi. Depth from combining defocus and correspondence using light-field cameras. In International Conference on Computer Vision (ICCV), December 2013.
[5] S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4D lightfields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

Convolutional Networks for Shape From Light Field

Stefan Heber and Thomas Pock
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

Convolutional Neural Networks (CNNs) have recently been successfully applied to various Computer Vision (CV) applications. In this paper we utilize CNNs to predict depth information for given Light Field (LF) data. The proposed method learns an end-to-end mapping between the 4D light field and a representation of the corresponding 4D depth field in terms of 2D hyperplane orientations. The obtained prediction is then further refined in a post-processing step by applying a higher-order regularization. Existing LF datasets are not sufficient for the training scheme tackled in this paper, mainly because the ground truth depth of existing datasets is inaccurate and/or the datasets are limited to a small number of LFs. This made it necessary to generate a new synthetic LF dataset, which is based on the raytracing software POV-Ray. This new dataset provides floating-point accurate ground truth depth fields, and due to a random scene generator the dataset can be scaled as required.

@INPROCEEDINGS{7780776,
author={S. Heber and T. Pock},
booktitle={2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
title={Convolutional Networks for Shape from Light Field},
year={2016},
volume={},
number={},
pages={3746-3754},
keywords={computer vision;image representation;neural nets;ray tracing;shape recognition;shape;light field;convolutional neural networks;CNN;computer vision;depth information;LF data;end-to-end mapping;image representation;2D hyperplane orientations;ray tracing software;POV-Ray;Two dimensional displays;Three-dimensional displays;Shape;Cameras;Data visualization;Tensile stress;Network architecture},
doi={10.1109/CVPR.2016.407},
ISSN={1063-6919},
month={June},}
Illustration of the patch extraction.
The figure to the left illustrates the Light Field (LF) as a direction-major 2D array of 2D arrays, where the coordinate $(p_0, q_0)$ is marked. The corresponding vertical and horizontal patches at that location are shown to the right. Those two patches are the input to the proposed network.
Illustration of the network architecture.
The network consists of five layers, denoted as $L_i$, $i \in \{1,\ldots,5\}$. The first four layers are convolutional layers, followed by one fully connected layer. Each convolutional layer is followed by a Rectified Linear Unit (ReLU) non-linearity. The first and third layers are padded so that the width and height do not change between input and output. The kernel size of the convolutional layers decreases towards deeper layers: we use kernels of size $7 \times 7$ for the first two layers and kernels of size $5 \times 5$ for layers three and four. The number of feature maps increases towards deeper layers, i.e. we use 64 feature maps for the first two layers and double them for the following two layers. Note that there is no pooling involved in the network architecture, and the inputs are two RGB image patches of size $31 \times 11$.
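Read literally, this description fixes the layer shapes; the following PyTorch sketch reproduces them. How the two patches are combined (here simply stacked along the channel axis) and the scalar output are assumptions made for illustration, not details given in the caption.

# Minimal PyTorch sketch of the patch-based network described above.
# Stacking the two 31x11 RGB patches along the channel axis (6 input
# channels) and the scalar output are assumptions for illustration.
import torch
import torch.nn as nn

class PatchNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=7, padding=3),   # L1: padded, size preserved
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=7),             # L2: 31x11 -> 25x5
            nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, padding=2), # L3: padded, size preserved
            nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=5),           # L4: 25x5 -> 21x1
            nn.ReLU(),
        )
        self.fc = nn.Linear(128 * 21 * 1, 1)              # L5: fully connected layer

    def forward(self, vertical_patch, horizontal_patch):
        x = torch.cat([vertical_patch, horizontal_patch], dim=1)  # (N, 6, 31, 11)
        x = self.features(x)
        return self.fc(x.flatten(start_dim=1))

# usage: out = PatchNet()(torch.randn(8, 3, 31, 11), torch.randn(8, 3, 31, 11))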
Qualitative result of the proposed method

The figure shows, from left to right, the center view of the Light Field (LF), the color-coded ground truth, the result of a state-of-the-art Shape from Light Field (SfLF) method [1], followed by the refinement result of the proposed method.


[1] S. Wanner and B. Goldluecke. Globally consistent depth labeling of 4D lightfields. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.

On learning optimized reaction diffusion processes for effective image restoration

Yunjin Chen, Wei Yu and Thomas Pock
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015

For several decades, image restoration has remained an active research topic in low-level computer vision, and hence new approaches are constantly emerging. However, many recently proposed algorithms achieve state-of-the-art performance only at the expense of very high computation time, which clearly limits their practical relevance. In this work, we propose a simple but effective approach with both high computational efficiency and high restoration quality. We extend conventional nonlinear reaction-diffusion models by several parametrized linear filters as well as several parametrized influence functions. We propose to train the parameters of the filters and the influence functions through a loss-based approach. Experiments show that our trained nonlinear reaction-diffusion models largely benefit from the training of the parameters and finally lead to the best reported performance on common test datasets for image restoration. Due to their structural simplicity, our trained models are highly efficient and are also well-suited for parallel computation on GPUs.

@inproceedings{b19a3f50a45847d3ab313be0d4d5ba57,
title = "On learning optimized reaction diffusion processes for effective image restoration",
author = "Yunjin Chen and Wei Yu and Thomas Pock",
year = "2015",
language = "English",
booktitle = "IEEE Conference on Computer Vision and Pattern Recognition (CVPR)", }
Illustration of the structure of the convolutional network of the trained RDM.
Diffusion model in the discrete setting:
$$ \mathbf{u}_{t} = \mathbf{u}_{t-1} - \sum_{i=1}^{n_k} {\mathbf{K}_i^{t}}^\top {\boldsymbol\phi}_i^t(\mathbf{K}_i^t \mathbf{u}_{t-1}) - \psi(\mathbf{u}_{t-1}, \mathcal{F})\,, $$
where $\mathbf{u}_t \in \mathbb{R}^d$ denotes the vectorized representation of an image at stage $t$, $n_k$ denotes the number of filters, $\mathcal{F} \in \mathbb{R}^d$ is the vectorized corrupted input image, and $\mathbf{K}_i^t \in \mathbb{R}^{d \times d}$ is a sparse matrix representing a convolution with the kernel $\mathbf{k}_i^t$ (i.e. $\mathbf{K}_i^t \mathbf{u} = \mathbf{k}_i^t \ast \mathbf{u}$). The reaction term $\psi$ changes for different applications, such as Gaussian denoising or JPEG deblocking. Note that the influence functions ${\boldsymbol\phi}_i^t(\cdot)$ and the filters $\mathbf{K}_i^t$ vary between stages $t$.

During training, the loss is formulated via the similarity between the ground truth and the output $\mathbf{u}_t$, so that the output of each gradient descent step stays close to the unique ground truth; typical similarity measures are PSNR and SSIM. The proposed trained RDM is a specific type of convolutional network, a popular model in the field of computer vision and pattern recognition. It contains two convolutional layers that are coupled by two constraints: first, the kernels of the second convolutional layer are the flipped (rotated) versions of the corresponding kernels of the first layer, corresponding to the transpose ${\mathbf{K}_i^t}^\top$; second, the activation functions of the second convolutional layer are set to identity functions.
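The following NumPy/SciPy sketch spells out one such diffusion stage. The filters and influence functions are untrained placeholders, and the reaction term $\psi(\mathbf{u}, \mathcal{F}) = \lambda(\mathbf{u} - \mathcal{F})$ is the common choice for Gaussian denoising, assumed here for illustration:

# One stage of the trained reaction-diffusion model, sketched with NumPy/SciPy.
# Kernels, influence functions and the reaction weight lam are placeholders;
# in the actual model they are learned per stage.  The data term
# psi(u, f) = lam * (u - f) is assumed here (common choice for denoising).
import numpy as np
from scipy.ndimage import convolve, correlate

def rdm_stage(u, f, kernels, influences, lam=0.1):
    """u, f: 2D images; kernels: list of 2D filters k_i;
    influences: list of pointwise functions phi_i."""
    update = np.zeros_like(u)
    for k, phi in zip(kernels, influences):
        response = convolve(u, k, mode='reflect')                # K_i u
        update += correlate(phi(response), k, mode='reflect')    # K_i^T phi(K_i u)
    return u - update - lam * (u - f)

# usage with toy, untrained parameters:
rng = np.random.default_rng(0)
clean = np.zeros((64, 64)); clean[16:48, 16:48] = 1.0
noisy = clean + 0.1 * rng.standard_normal(clean.shape)
kernels = [rng.standard_normal((5, 5)) * 0.01 for _ in range(4)]
influences = [np.tanh] * 4                                       # placeholder influence functions
u = noisy.copy()
for _ in range(5):                                               # 5 diffusion stages
    u = rdm_stage(u, noisy, kernels, influences)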
One Denoising Example of TRD$_5^{7\times 7}$
Before: noisy input image. After: recovered image.

Average PSNR (dB) on 68 images for image denoising ($\sigma$ denotes the standard deviation of the Gaussian noise)

Method                $\sigma=15$   $\sigma=25$
BM3D                  31.08         28.56
LSSC                  31.27         28.70
EPLL                  31.19         28.68
opt-MRF               31.18         28.66
RTF$_5$               -             28.75
WNNM                  31.37         28.83
CSF$_5^{7\times 7}$   31.24         28.72
TRD$_5^{7\times 7}$   31.42         28.92

Quantitatively, the proposed method outperforms the competing algorithms.

Learning Reaction-Diffusion Models for Image Inpainting

Wei Yu, Stefan Heber and Thomas Pock
German Conference on Pattern Recognition (GCPR), 2015

In this paper we present a trained diffusion model for image inpainting based on the structural similarity measure. The proposed diffusion model uses several parametrized linear filters and influence functions. Those parameters are learned in a loss-based approach, where we first perform a greedy training before conducting a joint training to further improve the inpainting performance. We provide a detailed comparison to state-of-the-art inpainting algorithms based on the TUM-image inpainting database. The experimental results show that the proposed diffusion model is efficient and achieves superior performance. Moreover, we also demonstrate that the proposed method has a texture-preserving property that makes it stand out from previous PDE-based methods.


Supp of "Learning Reaction-Diffusion Models for Image Inpainting"

@inproceedings{82c7e803c7354e299b958c978ae066d9,
title = "Learning Reaction-Diffusion Models for Image Inpainting",
author = "Wei Yu and Stefan Heber and Thomas Pock",
year = "2015",
doi = "10.1007/978-3-319-24947-6_29",
language = "English",
isbn = "978-3-319-24946-9",
volume = "9358",
pages = "356--367",
booktitle = "Pattern Recognition",
publisher = "Springer International Publishing AG",
address = "Switzerland", }
Illustration of the reaction-diffusion network for inpainting.
The main idea of the proposed model is to propagate the known information from outside the inpainting region in a meaningful way in order to reconstruct the values inside the unknown image regions. For this purpose we define the following diffusion model
$$ \mathbf{u}_{t} = \mathbf{u}_{t-1} - \mathbf{m} \cdot \sum_{i=1}^{n_k} {\mathbf{K}_i^{t}}^\top {\boldsymbol\phi}_i^t(\mathbf{K}_i^t \mathbf{u}_{t-1})\,, $$
where $\cdot$ denotes the Hadamard product (i.e. pointwise multiplication), and $\mathbf{m} \in \mathbb{R}^n$ is a vectorized mask indicating the inpainting domain $\mathcal{A}$, i.e.
$$ {m}_j = \begin{cases} 1 & j \in \mathcal{A}\,,\\ 0 & \mathrm{otherwise}\,, \end{cases} $$
where ${m}_j$ denotes the value at the $j^\text{th}$ position of the vector $\mathbf{m}$. The influence functions ${\boldsymbol\phi}_i^t: \mathbb{R}^n \rightarrow \mathbb{R}^n$ are formulated as linear combinations of triangular-shaped basis functions,
$$ \phi_i^t(\mathbf{v})_j = \xi_i^t({v}_j), \quad j \in \left\{ 1,\ldots,n \right\}, $$
where $\xi_i^t: \mathbb{R} \rightarrow \mathbb{R}$ is a piecewise linear function implemented efficiently via the slice transform. An alternative are Gaussian basis functions, which yield a smooth solution. Although the triangular-shaped parametrization with the slice transform is sensitive to the initialization, it provides competitive results with fast computation for image inpainting and image denoising. The output of the influence function $\phi_i^t$ at the $j^\text{th}$ position depends only on the $j^\text{th}$ component of the input variable. The main reason for using these simple basis functions is to reduce the computational complexity.
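A minimal NumPy sketch of one inpainting diffusion step, analogous to the denoising sketch above; the mask handling follows the equation, while the kernels and influence functions remain placeholder assumptions:

# One inpainting diffusion step: only pixels inside the inpainting domain
# (mask == 1) are updated; known pixels stay untouched.
# Kernels and influence functions are toy placeholders, not learned ones.
import numpy as np
from scipy.ndimage import convolve, correlate

def inpainting_stage(u, mask, kernels, influences):
    update = np.zeros_like(u)
    for k, phi in zip(kernels, influences):
        update += correlate(phi(convolve(u, k, mode='reflect')), k, mode='reflect')
    return u - mask * update    # Hadamard product with the mask m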
Qualitative inpainting result for $80\%$ random missing pixels.
Panels: clean image, mask, and close-up inpainting results of Laplacian, TV, Chen, Schmidt MRF, Schmidt FoE, and the proposed trained diffusion (TD) model.
The closeup views clearly show the advantage of the proposed trained diffusion (TD) model.

Quantitative inpainting results for $80\%$ random missing pixels

          Laplacian   TV        Chen      Schmidt MRF   Schmidt FoE   TD
PSNR      21.5998     20.8886   22.5123   21.4210       21.8630       22.6301
SSIM      0.7864      0.7376    0.8185    0.7781        0.8028        0.8267
GSIM      0.7417      0.7041    0.7947    0.7451        0.7844        0.8048

The above table provides quantitative results in terms of the mean PSNR, SSIM, and GSIM. We observe that the proposed diffusion model achieves excellent results and is able to outperform the competing methods.

Scene Flow Estimation from Light Fields via the Preconditioned Primal-Dual Algorithm

Stefan Heber and Thomas Pock
German Conference on Pattern Recognition (GCPR), 2014

In this paper we present a novel variational model to jointly estimate geometry and motion from a sequence of light fields captured with a plenoptic camera. The proposed model uses the so-called sub-aperture representation of the light field. Sub-aperture images represent images with slightly different viewpoints, which can be extracted from the light field. The sub-aperture representation allows us to formulate a convex global energy functional, which enforces multi-view geometry consistency, and piecewise smoothness assumptions on the scene flow variables. We optimize the proposed scene flow model by using an efficient preconditioned primal-dual algorithm. Finally, we also present synthetic and real world experiments.

@InProceedings{10.1007/978-3-319-11752-2_1,
author="Heber, Stefan and Pock, Thomas",
editor="Jiang, Xiaoyi and Hornegger, Joachim and Koch, Reinhard",
title="Scene Flow Estimation from Light Fields via the Preconditioned Primal-Dual Algorithm",
booktitle="Pattern Recognition",
year="2014",
publisher="Springer International Publishing",
address="Cham",
pages="3--14",
abstract="In this paper we present a novel variational model to jointly estimate geometry and motion from a sequence of light fields captured with a plenoptic camera. The proposed model uses the so-called sub-aperture representation of the light field. Sub-aperture images represent images with slightly different viewpoints, which can be extracted from the light field. The sub-aperture representation allows us to formulate a convex global energy functional, which enforces multi-view geometry consistency, and piecewise smoothness assumptions on the scene flow variables. We optimize the proposed scene flow model by using an efficient preconditioned primal-dual algorithm. Finally, we also present synthetic and real world experiments.",
isbn="978-3-319-11752-2"
}
Qualitative results for a synthetic scene
The figure shows, from left to right, an illustration of the motion and the center view of the light field at time $t_1$, the two disparity maps $d_1$ and $d_2$, and the color-coded optical flow $\boldsymbol{u}$ (Middlebury color code). For the variables $d_1$, $d_2$ and $\boldsymbol{u}$ we present the result of the proposed model, as well as the corresponding ground truth.

A Comparison of First-order Algorithms for Machine Learning

Wei Yu and Thomas Pock
The 38th Annual Workshop of the Austrian Association for Pattern Recognition, 2014

The Machine Learning problems considered in this paper can be formulated as a convex minimization problem consisting of a loss function $\mathcal{F}(\mathbf{x})$ and a regularizer $\mathcal{G}(\mathbf{x})$. We define the energy \begin{equation} \mathcal{E}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \lambda \mathcal{G}(\mathbf{x}), \end{equation} where $\lambda$ is a parameter controlling the trade-off between good generalization performance and over-fitting. The loss function measures the discrepancy between the prediction of a solution and the ground truth. This paper tackles several smooth and non-smooth Machine Learning problems that include a loss function plus a regularizer. We present tasks including dimensionality reduction via compressive sensing, Support Vector Machines, the group lasso regularizer for grouped feature selection, $\mathcal{L}_{1,\infty}$ regularization for multi-task learning, trace-norm regularization for max-margin matrix completion, and matrix factorization. The primal-dual algorithm [1] shows the best performance, with a fast empirical convergence rate, on all problems considered in this paper, and the experiments show that it is an efficient and robust solver for these ML problems.
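As a concrete instance of minimizing $\mathcal{E}(\mathbf{x}) = \mathcal{F}(\mathbf{x}) + \lambda \mathcal{G}(\mathbf{x})$ with one of the algorithms named above, the following sketch applies Forward-Backward Splitting to a lasso problem; the problem instance is an illustrative assumption, not one of the experiments of the paper:

# Forward-Backward Splitting (proximal gradient) for E(x) = F(x) + lam*G(x),
# illustrated on a lasso instance: F(x) = 0.5*||A x - b||^2, G(x) = ||x||_1.
import numpy as np

def fbs_lasso(A, b, lam, iters=500):
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2           # 1/L, L = Lipschitz constant of grad F
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                     # forward (gradient) step on F
        z = x - step * grad
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # backward: prox of lam*||.||_1
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true + 0.01 * rng.standard_normal(50)
x_hat = fbs_lasso(A, b, lam=0.1)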


[1] Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145.


Supp of "A Comparison of First-order Algorithms for Machine Learning"

@article{DBLP:journals/corr/WeiT14,
author = {Wei Yu and Thomas Pock},
title = {A Comparison of First-order Algorithms for Machine Learning},
journal = {CoRR},
volume = {abs/1404.6674},
year = {2014} }

The primal-dual algorithm for the kernel SVM

The primal-dual model for the kernel SVM is
\begin{equation} \min_{\mathbf{x}} \max_{\mathbf{y}} \left\{ \left\langle \mathbf{x}, \mathbf{y} \right\rangle - \frac{\mathbf{y}^{\top} \widehat{\mathbf{H}}^{-1} \mathbf{y}}{4p-2} - q \sum_{i=1}^{\mathrm{n}_\mathrm{trn}} x_i + \delta(\mathbf{x}) \right\}. \end{equation}
The primal-dual algorithm can then be summarized as
\begin{equation} \left\{ \begin{array}{rcl} \mathbf{y}^{n} &=& \left(\mathbf{I} + \frac{\sigma \widehat{\mathbf{H}}^{-1}}{2p-1}\right)^{-1} \left(\mathbf{y}^{n-1} + \sigma \overline{\mathbf{x}}^{n-1}\right) \\ \mathbf{x}^{n} &=& \left[\mathbf{x}^{n-1} - \tau \mathbf{y}^{n} + \tau q\right]_P. \end{array} \right. \end{equation}
To calculate $\left(\mathbf{I} + \frac{\sigma \widehat{\mathbf{H}}^{-1}}{2p-1}\right)^{-1}$, we use the Woodbury matrix identity, i.e.,
\begin{equation} \left(\mathbf{I} + \tfrac{\sigma \widehat{\mathbf{H}}^{-1}}{2p-1}\right)^{-1} = \tfrac{2p-1}{\sigma} \widehat{\mathbf{H}} - \tfrac{2p-1}{\sigma} \widehat{\mathbf{H}} \left(\mathbf{I} + \tfrac{2p-1}{\sigma} \widehat{\mathbf{H}}\right)^{-1} \tfrac{2p-1}{\sigma} \widehat{\mathbf{H}} \end{equation}
to avoid calculating the inverse matrix of $\widehat{\mathbf{H}}$.
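Since the step above pivots on applying the Woodbury identity so that $\widehat{\mathbf{H}}^{-1}$ never has to be formed, here is a small NumPy check of that identity; the matrix $\widehat{\mathbf{H}}$ and the parameter values are stand-ins for illustration, not data from the paper:

# Numerical check of the Woodbury-based expression used above:
# (I + c*H^{-1})^{-1} = (1/c)*H - (1/c)*H (I + (1/c)*H)^{-1} (1/c)*H,  c = sigma/(2p-1).
# H is a random SPD stand-in for the kernel matrix; values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((30, 30))
H = A @ A.T + 30 * np.eye(30)                                        # SPD "kernel" matrix
sigma, p = 0.5, 1.0
c = sigma / (2 * p - 1)
b = 1.0 / c
lhs = np.linalg.inv(np.eye(30) + c * np.linalg.inv(H))               # direct (uses H^{-1})
rhs = b * H - b * H @ np.linalg.inv(np.eye(30) + b * H) @ (b * H)    # Woodbury form (no H^{-1})
print(np.allclose(lhs, rhs))   # True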

Log-log plot of the error $e^n - \widehat{e}$ versus the number of iterations $n$ in the kernel SVM experiment.

We can see that the convergence rate of FISTA coincides with its theoretical rate after the $100^{\text{th}}$ iteration. The primal-dual algorithm and the online primal-dual algorithm converge at a much better rate and outperform FISTA and OSGA in the later iterations.

Shape from Light Field Meets Robust PCA

Stefan Heber and Thomas Pock
European Conference on Computer Vision (ECCV), 2014

In this paper we propose a new type of matching term for multi-view stereo reconstruction. Our model is based on the assumption that, if one warps the images of the various views to a common warping center and considers each warped image as one row in a matrix, then this matrix will have low rank. This also implies that we assume a certain amount of overlap between the views after the warping has been performed. Such an assumption is obviously met in the case of light field data, which motivated us to demonstrate the proposed model for this type of data. Our final model is a large-scale convex optimization problem, where the low-rank minimization is relaxed via the nuclear norm. We present qualitative and quantitative experiments, where the proposed model achieves excellent results.

@InProceedings{10.1007/978-3-319-10599-4_48,
author="Heber, Stefan and Pock, Thomas",
editor="Fleet, David and Pajdla, Tomas and Schiele, Bernt and Tuytelaars, Tinne",
title="Shape from Light Field Meets Robust PCA",
booktitle="Computer Vision -- ECCV 2014",
year="2014",
publisher="Springer International Publishing",
address="Cham",
pages="751--767",
abstract="In this paper we propose a new type of matching term for multi-view stereo reconstruction. Our model is based on the assumption, that if one warps the images of the various views to a common warping center and considers each warped image as one row in a matrix, then this matrix will have low rank. This also implies, that we assume a certain amount of overlap between the views after the warping has been performed. Such an assumption is obviously met in the case of light field data, which motivated us to demonstrate the proposed model for this type of data. Our final model is a large scale convex optimization problem, where the low rank minimization is relaxed via the nuclear norm. We present qualitative and quantitative experiments, where the proposed model achieves excellent results.",
isbn="978-3-319-10599-4"
}
Qualitative results of the proposed method
Qualitative results for light fields captured with a Lytro camera. The figure shows color-coded disparity maps as well as the corresponding center views of the light field.

Variational Shape from Light Field

Stefan Heber, Rene Ranftl and Thomas Pock
International Conference on Energy Minimization Methods in Computer Vision and Pattern Recognition (EMMCVPR), 2013

In this paper we propose an efficient method to calculate a high-quality depth map from a single raw image captured by a light field or plenoptic camera. The proposed model combines the main idea of Active Wavefront Sampling (AWS) with the light field technique, i.e. we extract so-called sub-aperture images out of the raw image of a plenoptic camera, in such a way that the virtual view points are arranged on circles around a fixed center view. By tracking an imaged scene point over a sequence of sub-aperture images corresponding to a common circle, one can observe a virtual rotation of the scene point on the image plane. Our model is able to measure a dense field of these rotations, which are inversely related to the scene depth.


In this paper we make use of the primal-dual algorithm proposed by Chambolle and Pock [1]. Let $X, Y$ be two finite-dimensional real vector spaces and let $\mathbf{K}: X \rightarrow Y$ be a continuous linear operator with the induced norm \begin{equation*} \left\lVert \mathbf{K} \right\rVert = \max \left\{ \left\lVert \mathbf{K}\mathbf{x} \right\rVert_2 : \mathbf{x} \in X,\ \left\lVert \mathbf{x} \right\rVert_2 \le 1 \right\}. \end{equation*} Further, given two proper, convex, lower-semicontinuous functions $\mathcal{G}$ and $\mathcal{F}^*$, the primal-dual algorithm aims to solve the generic saddle-point problem \begin{equation} \min_{\mathbf{x} \in X} \max_{\mathbf{y} \in Y} \left\{ \left\langle \mathbf{K}\mathbf{x}, \mathbf{y} \right\rangle + \mathcal{G}(\mathbf{x}) - \mathcal{F}^*(\mathbf{y}) \right\}. \end{equation} By convention $\mathbf{x}$ is called the primal variable and $\mathbf{y}$ the dual variable. The corresponding primal form of the saddle-point problem is \begin{equation} \underset{\mathbf{x} \in X}{\text{minimize}} \quad \mathcal{F}(\mathbf{K}\mathbf{x}) + \mathcal{G}(\mathbf{x}). \end{equation} To introduce the dual variable $\mathbf{y}$, one can use the Lagrange formalism, e.g. when the function represents hard constraints, or one can use the convex conjugate, $\mathcal{F}(\mathbf{K}\mathbf{x}) = \max_{\mathbf{y}} \left\langle \mathbf{K}\mathbf{x}, \mathbf{y} \right\rangle - \mathcal{F}^*(\mathbf{y})$.
The primal-dual algorithm iteratively maximizes with respect to the dual variable and minimizes with respect to the primal variable. The condition for convergence is $\tau \sigma \left\lVert \mathbf{K} \right\rVert^2 \le 1$. We summarize the primal-dual algorithm as follows.

$\bullet$ Initialization: choose $\tau, \sigma$ with $\tau \sigma \left\lVert \mathbf{K} \right\rVert^2 \le 1$, $\theta \in [0,1]$, $(\mathbf{x}^0, \mathbf{y}^0) \in X \times Y$, and set $\overline{\mathbf{x}}^0 = \mathbf{x}^0$.

$\bullet$ Iterations $n \ge 1$: update $\mathbf{x}^n$, $\mathbf{y}^n$, $\overline{\mathbf{x}}^n$ as follows,
\begin{equation} \left\{ \begin{array}{rcl} \mathbf{y}^{n} &=& (\mathbf{I} + \sigma \partial \mathcal{F}^*)^{-1}(\mathbf{y}^{n-1} + \sigma \mathbf{K} \overline{\mathbf{x}}^{n-1}) \\ \mathbf{x}^{n} &=& (\mathbf{I} + \tau \partial \mathcal{G})^{-1}(\mathbf{x}^{n-1} - \tau \mathbf{K}^\top \mathbf{y}^{n}) \\ \overline{\mathbf{x}}^{n} &=& \mathbf{x}^{n} + \theta (\mathbf{x}^{n} - \mathbf{x}^{n-1}) \end{array} \right. \end{equation} The resolvent steps $(\mathbf{I} + \sigma \partial \mathcal{F}^*)^{-1}$ and $(\mathbf{I} + \tau \partial \mathcal{G})^{-1}$ are the proximal operators of $\sigma \mathcal{F}^*$ and $\tau \mathcal{G}$, respectively. The extrapolation step $\overline{\mathbf{x}}^{n} = \mathbf{x}^{n} + \theta (\mathbf{x}^{n} - \mathbf{x}^{n-1})$ then takes one additional step in the direction $\mathbf{x}^{n} - \mathbf{x}^{n-1}$, scaled by $\theta$, which makes the scheme semi-implicit.
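A compact NumPy sketch of this iteration, written generically in terms of the two resolvent (proximal) operators; the toy problem in the usage lines, an $\ell_1$-regularized least-squares instance, is an illustrative assumption and not an experiment from the paper:

# Generic Chambolle-Pock primal-dual iteration, following the scheme above.
# prox_fstar and prox_g play the roles of (I + sigma*dF*)^-1 and (I + tau*dG)^-1.
import numpy as np

def chambolle_pock(K, prox_fstar, prox_g, x0, iters=500, theta=1.0):
    L = np.linalg.norm(K, 2)                 # operator norm ||K||
    tau = sigma = 1.0 / L                    # satisfies tau*sigma*||K||^2 <= 1
    x = x_bar = x0.copy()
    y = np.zeros(K.shape[0])
    for _ in range(iters):
        y = prox_fstar(y + sigma * (K @ x_bar), sigma)   # dual (resolvent) step
        x_new = prox_g(x - tau * (K.T @ y), tau)         # primal (resolvent) step
        x_bar = x_new + theta * (x_new - x)              # extrapolation step
        x = x_new
    return x

# usage: min_x 0.5*||A x - b||^2 + lam*||x||_1, with F(z) = 0.5*||z - b||^2, G = lam*||.||_1
rng = np.random.default_rng(0)
A = rng.standard_normal((40, 80)); b = rng.standard_normal(40); lam = 0.1
prox_fstar = lambda v, s: (v - s * b) / (1.0 + s)                      # prox of s*F*
prox_g = lambda v, t: np.sign(v) * np.maximum(np.abs(v) - t * lam, 0)  # prox of t*G
x_hat = chambolle_pock(A, prox_fstar, prox_g, np.zeros(80))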


[1] Chambolle, A. and Pock, T. (2011). A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145.

@InProceedings{10.1007/978-3-642-40395-8_6,
author="Heber, Stefan and Ranftl, Rene and Pock, Thomas",
editor="Heyden, Anders and Kahl, Fredrik and Olsson, Carl and Oskarsson, Magnus and Tai, Xue-Cheng",
title="Variational Shape from Light Field",
booktitle="Energy Minimization Methods in Computer Vision and Pattern Recognition",
year="2013",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="66--79",
abstract="In this paper we propose an efficient method to calculate a high-quality depth map from a single raw image captured by a light field or plenoptic camera. The proposed model combines the main idea of Active Wavefront Sampling (AWS) with the light field technique, i.e. we extract so-called sub-aperture images out of the raw image of a plenoptic camera, in such a way that the virtual view points are arranged on circles around a fixed center view. By tracking an imaged scene point over a sequence of sub-aperture images corresponding to a common circle, one can observe a virtual rotation of the scene point on the image plane. Our model is able to measure a dense field of these rotations, which are inversely related to the scene depth.",
isbn="978-3-642-40395-8"
}
Qualitative result of the proposed method
The figure shows depth map results in terms of 3D views and color-coded disparity maps as well as all-in-focus results for various scenes.

Approximate Envelope Minimization for Curvature Regularity

Stefan Heber, Rene Ranftl and Thomas Pock
European Conference on Computer Vision (ECCV), Workshops and Demonstrations, 2012

We propose a method for minimizing a non-convex function that can be split up into a sum of simple functions. The key idea of the method is the approximation of the convex envelopes of the simple functions, which leads to a convex approximation of the original function. A solution is obtained by minimizing this convex approximation. Cost functions that fulfill such a splitting property are ubiquitous in computer vision; therefore we explain the method based on one such problem, namely the non-convex problem of binary image segmentation based on Euler's Elastica.

@InProceedings{10.1007/978-3-642-33885-4_29,
author="Heber, Stefan and Ranftl, Rene and Pock, Thomas",
editor="Fusiello, Andrea and Murino, Vittorio and Cucchiara, Rita",
title="Approximate Envelope Minimization for Curvature Regularity",
booktitle="Computer Vision -- ECCV 2012. Workshops and Demonstrations",
year="2012",
publisher="Springer Berlin Heidelberg",
address="Berlin, Heidelberg",
pages="283--292",
abstract="We propose a method for minimizing a non-convex function, which can be split up into a sum of simple functions. The key idea of the method is the approximation of the convex envelopes of the simple functions, which leads to a convex approximation of the original function. A solution is obtained by minimizing this convex approximation. Cost functions, which fulfill such a splitting property are ubiquitous in computer vision, therefore we explain the method based on such a problem, namely the non-convex problem of binary image segmentation based on Euler's Elastica.",
isbn="978-3-642-33885-4"
}
Qualitative segmentation results
Panels: input image, followed by segmentation results for $\alpha=0.5$, $\beta=0$; $\alpha=1$, $\beta=0$; $\alpha=2$, $\beta=0$; $\alpha=0.1$, $\beta=0.5$; $\alpha=0.1$, $\beta=2$; $\alpha=0.1$, $\beta=4$; and $\alpha=0.1$, $\beta=8$.
Segmentation results for the Don Quixote image by Pablo Picasso ($256 \times 216$ px) using a cell complex that corresponds to an $8$-connected graph in graph-cut frameworks. We set $\lambda=1$ and show results for different $\alpha$ and $\beta$ values. The processing time is between $1$ and $5$ minutes, depending on the strength of the regularization.