Our approach models the extraction and classification of death causes as a two-step process. First, we employ a neural, multi-language sequence-to-sequence model to obtain a death cause description for a given death certificate line. A second classification model then assigns the respective ICD-10 codes to the obtained death cause. The remainder of this section gives a short introduction to recurrent neural networks, followed by a detailed explanation of our two models.

\subsection{Recurrent neural networks}

Recurrent neural networks (RNNs) are a widely used technique for sequence learning problems such as machine translation \cite{bahdanau_neural_2014,cho_learning_2014}, image captioning \cite{bengio_scheduled_2015}, named entity recognition \cite{lample_neural_2016,wei_disease_2016}, dependency parsing \cite{dyer_transition-based_2015} and POS tagging \cite{wang_part--speech_2015}. RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit at one time step depends on its internal state at the previous time step. These feedback connections enable the network to memorize information from recent time steps and to capture long-term dependencies. However, training RNNs can be difficult due to the vanishing gradient problem \cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread modifications of RNNs designed to overcome this problem are Long Short-Term Memory networks (LSTMs) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRUs) \cite{cho_learning_2014}. Both use gated memories which control and regulate the information flow between two recurrent units. A common LSTM unit consists of a memory cell and three gates: an input gate, an output gate and a forget gate. LSTM units are typically chained together by connecting the output of one unit to the input of the next. A further extension of the general RNN architecture are bidirectional networks, which make both the past and the future context available at every time step. A bidirectional LSTM consists of a forward chain, which processes the input data from left to right, and a backward chain, which consumes the data in the opposite direction. The final representation of a time step is typically the concatenation or a linear combination of the two states.
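For reference, a common formulation of the LSTM update at time step $t$ (the notation follows the standard literature and may differ slightly from the cited works) is
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
h_t &= o_t \odot \tanh(c_t),
\end{align*}
where $x_t$ denotes the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid function and $\odot$ element-wise multiplication. In the bidirectional case, the representation of time step $t$ combines the hidden states of the forward and backward chains, e.g. by concatenation, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.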