Our approach models the extraction and classification of death causes as a two-step process. First, we employ a neural, multi-language sequence-to-sequence model to obtain a death cause description for a given death certificate line. A second classification model then assigns the respective ICD-10 codes to the obtained death cause. The remainder of this section gives a short introduction to recurrent neural networks, followed by a detailed explanation of our two models.

\subsection{Recurrent neural networks}

Recurrent neural networks (RNNs) are a widely used technique for sequence learning problems such as machine translation \cite{bahdanau_neural_2014,cho_learning_2014}, image captioning \cite{bengio_scheduled_2015}, named entity recognition \cite{lample_neural_2016,wei_disease_2016}, dependency parsing \cite{dyer_transition-based_2015} and POS tagging \cite{wang_part--speech_2015}. RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit at one time step depends on its internal state at the previous time step. These feedback connections enable the network to memorize information from recent time steps and to capture long-term dependencies. However, training RNNs can be difficult due to the vanishing gradient problem \cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread modifications of RNNs designed to overcome this problem are Long Short-Term Memory networks (LSTMs) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRUs) \cite{cho_learning_2014}. Both use gated memories which control and regulate the information flow between two recurrent units. A common LSTM unit consists of a memory cell and three gates: an input gate, an output gate and a forget gate. LSTM units are typically chained together by connecting the output of one unit to the input of the next. A further extension of the general RNN architecture are bidirectional networks, which make both the past and the future context available at every time step. A bidirectional LSTM consists of a forward chain, which processes the input data from left to right, and a backward chain, which consumes the data in the opposite direction. The final representation of a time step is typically the concatenation or a linear combination of the two states.
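For reference, a common formulation of the LSTM update at time step $t$ (the notation follows the standard literature and may differ slightly from the cited works) is
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i),\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f),\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o),\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c),\\
h_t &= o_t \odot \tanh(c_t),
\end{align*}
where $x_t$ denotes the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid function and $\odot$ element-wise multiplication. In the bidirectional case, the representation of time step $t$ combines the hidden states of the forward and backward chains, e.g. by concatenation, $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$.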