In this section we present the experiments and results for the two developed models, both individually and combined in a pipeline setting.

\subsection{Training Data and Experiment Setup}
The CLEF e-Health 2018 Task 1 participants were provided with annotated death certificates for the three selected languages: French, Italian and Hungarian.
Each language is supported by annotated certificate lines as well as a dictionary of death cause descriptions and diagnoses for the different ICD-10 codes.
The provided training data sets are imbalanced concerning the different languages: the Italian corpus consists of 49,823 certificate lines, the French corpus of 77,348\footnote{For French we only took the provided data set from 2014.} and the Hungarian corpus of 323,175.
We split each data set into a training and a hold-out evaluation set. 
The complete training data set was then created by combining the certificate lines of all three languages into one data set. 
Besides the provided certificate data, no additional knowledge resources or annotated texts were used.

Due to time constraints during development, no cross-validation to optimize the (hyper-)parameters and the individual layers of our models was performed.
We either kept the default values of the hyper-parameters or set them to reasonable values according to existing work.
During model training we shuffle the training instances and use varying validation instances to validate each epoch.

As representation for the input tokens of our models we use pre-trained fastText word embeddings \cite{bojanowski_enriching_2016}. The embeddings were trained on Common Crawl and Wikipedia articles using the following parameter settings: CBOW with position-weights, embedding dimension size 300, character n-grams of length 5, a window of size 5 and 10 negatives.
Unfortunately, they are trained on corpora unrelated to the biomedical domain and therefore do not represent the best possible textual basis for an embedding space for biomedical information extraction.
The final embedding space used by our models is created by concatenating the individual embedding vectors of all three languages.
Thus, the input to our models is an embedding vector of size 900.
All models were implemented with the Keras\footnote{\url{https://keras.io/}} library.
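
To illustrate the construction of this shared embedding space, the following is a minimal sketch using the fastText Python bindings; the model file names and the helper function are illustrative assumptions, not part of our implementation.

\begin{verbatim}
# Sketch: language-independent token representation obtained by
# concatenating the French, Italian and Hungarian fastText vectors.
# Model paths are illustrative placeholders.
import numpy as np
import fasttext

models = [fasttext.load_model(path) for path in
          ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]

def embed_token(token):
    # Concatenate three 300-dim vectors into one 900-dim vector.
    return np.concatenate([m.get_word_vector(token) for m in models])

assert embed_token("anemia").shape == (900,)
\end{verbatim}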

\subsection{Death cause extraction model} 
To identify possible candidates for a death cause description, we use an encoder-decoder model.
The encoder uses an embedding layer with input masking on zero values and an LSTM layer with 256 units.
The encoder's output is used as the initial state of the decoder model.

Based on the input description from the dictionary and a special start token, the decoder generates a death cause word by word. 
This decoding process continues until a special end token is generated. 
The entire model is optimized using the Adam optimization algorithm \cite{kingma_adam:_2014} and a batch size of 700. 
Model training was performed either for 100 epochs or until an early stopping criterion was met (no change in validation loss for two epochs).
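
The following sketch shows how such an encoder-decoder can be assembled in Keras under the assumptions stated above (masked embedding layers, 256 LSTM units, Adam, batch size 700, early stopping with patience 2); the vocabulary size and the training tensors are placeholders, not our exact implementation.

\begin{verbatim}
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
from keras.callbacks import EarlyStopping

VOCAB, EMB_DIM, UNITS = 20000, 900, 256   # placeholder sizes

# Encoder: masked embedding and LSTM; only the final hidden and
# cell states are handed to the decoder as its initial state.
enc_in = Input(shape=(None,))
enc_emb = Embedding(VOCAB, EMB_DIM, mask_zero=True)(enc_in)
_, state_h, state_c = LSTM(UNITS, return_state=True)(enc_emb)

# Decoder: predicts the death cause description word by word,
# starting from the special start token.
dec_in = Input(shape=(None,))
dec_emb = Embedding(VOCAB, EMB_DIM, mask_zero=True)(dec_in)
dec_seq = LSTM(UNITS, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
next_word = Dense(VOCAB, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], next_word)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# With prepared index sequences enc_x, dec_x and shifted targets dec_y:
# model.fit([enc_x, dec_x], dec_y, batch_size=700, epochs=100,
#           callbacks=[EarlyStopping(monitor="val_loss", patience=2)])
\end{verbatim}

At inference time, the decoder is applied step by step, feeding the previously generated word back as input until the end token appears.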

As the provided data sets are imbalanced regarding the task's languages, we devised two different evaluation settings: (1) DCEM-Balanced, in which each language is supported by 49,823 randomly drawn instances (the size of the smallest corpus), and (2) DCEM-Full, in which all available data is used.
Table \ref{tab:s2s} shows the results obtained on the training and validation set.
The figures reveal that the distribution of training instances per language has a huge influence on the performance of the model.
The model trained on the full training data achieves an accuracy of 0.678 on the validation set.
In contrast, the model trained on the balanced data set reaches an accuracy of 0.899 (+32.5\%).
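
As a sketch, the balanced setting can be produced along the following lines, assuming all certificate lines reside in a pandas data frame with a language column (the frame and column names are illustrative):

\begin{verbatim}
import pandas as pd

# lines: DataFrame with columns such as "lang" and "text".
# Draw 49,823 instances (the size of the Italian corpus) per language.
balanced = (lines.groupby("lang")
                 .sample(n=49823, random_state=42)
                 .reset_index(drop=True))
\end{verbatim}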

\begin{table}[t!]
\centering
\begin{tabularx}{0.9\textwidth}{p{3cm}|c|c|c|c|c}
\toprule
\multirow{2}{*}{\textbf{Setting}} & \multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\ 
\cline{3-6}
&&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\
\hline
DCEM-Balanced &  18 & 0.958 & 0.205 & 0.899 & 0.634 \\
\hline
DCEM-Full &  9 &0.709 & 0.098 & 0.678 & 0.330  \\
\bottomrule
\end{tabularx}
\caption{Experiment results of our death cause extraction sequence-to-sequence
model concerning balanced (equal number of training instances per language) and full
data set setting.}
\label{tab:s2s}
\end{table}

\subsection{ICD-10 Classification Model}
The classification model is responsible for assigning an ICD-10 code to the death cause descriptions obtained during the first step.
Our model uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a hidden size of 256.
Thereafter, an attention layer builds an adaptive weighted average over all LSTM states.
The respective ICD-10 code is determined by a dense layer with softmax activation function.
We use the Adam optimizer to perform model training. 
The model was validated on 25\% of the data. 
As for the extraction model, no cross-validation or hyper-parameter optimization was performed.
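
A minimal sketch of this architecture in Keras is given below; the soft-attention layer is realized with standard layers (score, softmax over the time axis, weighted sum), all sizes are placeholders, and mask handling is simplified compared to a full implementation.

\begin{verbatim}
from keras.layers import (Input, Embedding, Bidirectional, LSTM,
                          Dense, Softmax, Dot, Flatten)
from keras.models import Model

VOCAB, EMB_DIM, UNITS, NUM_CODES = 20000, 900, 256, 9591

tokens = Input(shape=(None,))
# mask_zero is omitted in this simplified sketch; the real model
# masks padded tokens.
emb = Embedding(VOCAB, EMB_DIM)(tokens)
states = Bidirectional(LSTM(UNITS, return_sequences=True))(emb)

# Attention: score every LSTM state, normalize the scores over the
# time axis and build the weighted average of all states.
scores = Dense(1)(states)                 # (batch, time, 1)
weights = Softmax(axis=1)(scores)         # (batch, time, 1)
context = Dot(axes=1)([weights, states])  # (batch, 1, 2*UNITS)
context = Flatten()(context)

icd10 = Dense(NUM_CODES, activation="softmax")(context)
model = Model(tokens, icd10)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}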

Once again, we devised two approaches, mainly because of the lack of adequate training data in terms of coverage for the individual ICD-10 codes.
We therefore defined two training data settings: (1) a minimal one (ICD-10\_Minimal), where only ICD-10 codes with two or more supporting training instances are used.
This leaves us with 6,857 unique ICD-10 codes and discards 2,238 unique codes with a support of one.
This, of course, minimizes the number of ICD-10 codes in the label space.
Therefore, (2) an extended data set (ICD-10\_Extended) was defined, in which the original ICD-10 code mappings found in the supplied dictionaries are extended with the training instances from the individual certificate data of the three languages.
This yields 9,591 unique ICD-10 codes.
Finally, the data points of the remaining ICD-10 codes with only one supporting description are duplicated.
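
A sketch of this data preparation step is shown below, assuming the dictionary entries and certificate lines are available as pandas data frames; the frame and column names are hypothetical.

\begin{verbatim}
import pandas as pd

# Merge dictionary entries and certificate lines into one data set.
extended = pd.concat([dictionary[["text", "icd10"]],
                      certificates[["text", "icd10"]]],
                     ignore_index=True)

# Duplicate data points of codes with only one supporting description.
singletons = extended.groupby("icd10").filter(lambda g: len(g) == 1)
extended = pd.concat([extended, singletons], ignore_index=True)
\end{verbatim}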

The goal of this approach is to extend our possible label space to all available ICD-10 codes. 
The results obtained from the two approaches on the validation set are shown in Table \ref{tab:icd10Classification}. 
Using the \textit{minimal} data set the model achieves an accuracy of 0.937. 
In contrast, using the extended data set the model reaches an accuracy of 0.954, an improvement of 1.8\%.

\begin{table}[t!]
\centering
\begin{tabularx}{0.9\textwidth}{p{2.25cm}|c|c|c|c|c} 
\toprule
\multirow{2}{*}{\textbf{Setting}}&\multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\
\cline{3-6}
&&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\
\hline
ICD-10\_Minimal &  69 & 0.925 & 0.190 & 0.937 & 0.169 \\
\hline
ICD-10\_Extended\textbf{*} &  41 & 0.950 & 0.156 & 0.954 & 0.141 \\
\bottomrule
\end{tabularx}
\caption{Experiment results for our ICD-10 classification model regarding different data settings. 
The \textit{Minimal} setting uses only ICD-10 codes with two or more training instances in the supplied dictionary. 
In contrast, \textit{Extended} additionally takes the diagnosis texts from the certificate data and duplicates ICD-10 training instances with only one diagnosis text in the dictionary and certificate lines. \textbf{*} Used in final pipeline.}
\label{tab:icd10Classification}
\end{table}

\subsection{Complete Pipeline}
The two models were combined to create the final pipeline.
We tested both death cause extraction models (based on the balanced and the full data set) in the final pipeline, as their performance differs greatly.
In contrast, both ICD-10 classification models perform similarly, so we only used the extended ICD-10 classification model, with word-level tokens\footnote{Although models supporting character-level tokens were developed and evaluated, their performance fared poorly compared to the word-level models.}, in the final pipeline.
To evaluate the pipeline we built a training and a hold-out validation set during development.
The results obtained on the validation set are presented in Table \ref{tab:final_train}.
The scores are calculated using a prevalence-weighted macro-average across the output classes, i.e. we calculate precision, recall and F-score for each ICD-10 code and compute the average by weighting the individual scores by the number of occurrences of the code in the gold standard.
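
This corresponds to the \texttt{weighted} average in scikit-learn, shown here as a sketch where \texttt{y\_true} and \texttt{y\_pred} stand for the gold and predicted ICD-10 codes of the validation lines:

\begin{verbatim}
from sklearn.metrics import precision_recall_fscore_support

# Per-class precision/recall/F-score, averaged with the gold-standard
# support of each ICD-10 code as weight.
p, r, f, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted")
\end{verbatim}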

\begin{table}[t!]
\centering
\begin{tabular}{l|c|c|c}
\toprule
\textbf{Model} &  \textbf{Precision} & \textbf{Recall} & \textbf{F-score} \\
\hline
Final-Balanced & 0.73 & 0.61 & 0.61 \\
\hline
Final-Full & 0.74 & 0.62 & 0.63 \\
\bottomrule
\end{tabular}
\caption{Evaluation results of the final pipeline on the validation set of the training data. Reported figures represent
the prevalence-weighted macro-average across the output classes. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. 
Final-Full = DCEM-Full + ICD-10\_Extended.}
\label{tab:final_train}
\end{table}

Although the results of the individual models, as shown in Tables \ref{tab:s2s} and \ref{tab:icd10Classification}, are promising, the performance decreases considerably in the pipeline setting.
The pipeline model based on the balanced data set reaches an F-score of 0.61, whereas the full model achieves a slightly higher value of 0.63.
Both model configurations have a higher precision than recall (0.73 vs. 0.61 and 0.74 vs. 0.62, respectively).

This can be attributed to several factors.
First of all, a pipeline architecture always suffers from error propagation, i.e. errors in a previous step influence the performance of the following steps and generally lower the performance of the overall system.
Investigating the obtained results, we found that the imbalanced distribution of ICD-10 codes represents one of the main problems.
This severely impacts the encoder-decoder architecture used here, as the token generation is biased towards the available data points.
Therefore, the models very often misclassify certificate lines associated with ICD-10 codes that have only a small number of supporting training instances.

Results obtained on the test data set with our two officially submitted runs are shown in Table \ref{tab:final_test}.
Similar to the evaluation results during development, the model based on the full data set performs slightly better than the model trained on the balanced data set.
The full model reaches an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian.
All of our approaches perform below the mean and median averages of all participants.

Surprisingly, there is a substantial difference between the results obtained for the individual languages.
This confirms our assumptions about the (un)suitability of the proposed multi-lingual embedding space for this task.
The results also suggest that the size of the training corpora alone does not determine the final results.
The best results were obtained on the Italian data set, which has the smallest training corpus.
The worst results were obtained on the medium-sized French corpus, while Hungarian, the biggest corpus, ranks second.

\begin{table}[t!]
\centering
\begin{tabularx}{0.8\textwidth}{p{2cm}|p{3cm}|c|c|c}
\toprule
\textbf{Language} & \textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F-score}\\
\hline
\multirow{5}{*}{French}
& Final-Balanced & 0.494 & 0.246 & 0.329 \\
& Final-Full     & 0.512 & 0.253 & 0.339 \\
\cline{2-5}
& Baseline      & 0.341 & 0.200 & 0.253 \\
& Average       & 0.723 & 0.410 & 0.507 \\
& Median        & 0.798 & 0.475 & 0.579 \\
\hline

\multirow{5}{*}{Hungarian}
& Final-Balanced & 0.518 & 0.384 & 0.441 \\
& Final-Full     & 0.522 & 0.388 & 0.445 \\
\cline{2-5}
& Baseline      & 0.243 & 0.174 & 0.202 \\
& Average       & 0.827 & 0.783 & 0.803 \\
& Median        & 0.922 & 0.897 & 0.910 \\
\hline

\multirow{5}{*}{Italian}
& Final-Balanced & 0.857 & 0.685 & 0.761 \\
& Final-Full     & 0.862 & 0.689 & 0.766 \\
\cline{2-5}
& Baseline      & 0.165 & 0.172 & 0.169 \\
& Average       & 0.844 & 0.760 & 0.799 \\
& Median        & 0.900 & 0.824 & 0.863 \\
\bottomrule
\end{tabularx}
\caption{Test results of the final pipeline. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. Final-Full = DCEM-Full + ICD-10\_Extended.}
\label{tab:final_test}
\end{table}

We identified several possible reasons for the obtained results.
These also represent possible points for future work.
One of the main disadvantages of our approach is the quality of the used word embeddings as well as the properties of the proposed language-independent embedding space.
The use of out-of-domain word embeddings, which are not tailored to the biomedical domain, is likely a suboptimal solution to this problem.
We tried to alleviate this by finding suitable external corpora to train domain-dependent word embeddings for each of the supported languages; however, we were unable to find any significant amount of in-domain documents (e.g. a PubMed search for abstracts in French, Hungarian and Italian found 7,843, 786 and 1,659 articles, respectively).
Furthermore, we used a simple, heuristic solution by just concatenating the embeddings of all three languages to build a shared vector space.

Besides the issues with the used word embeddings, the inability to obtain full ICD-10 dictionaries for the selected languages also negatively influenced the results.
A final limitation of our approach is the lack of support for multi-label classification, i.e. the inability to recognize more than one death cause in a single input text.
