From 8872096b23acfd23cc878d8495bcea908ab952e7 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mario=20Sa=CC=88nger?= <mario.saenger@student.hu-berlin.de>
Date: Thu, 31 May 2018 14:51:32 +0200
Subject: [PATCH] Minor changes in experiments 4.4

---
 paper/31_methods_seq2seq.tex |  2 +-
 paper/40_experiments.tex     | 61 +++++++++++++++++++++---------------
 2 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/paper/31_methods_seq2seq.tex b/paper/31_methods_seq2seq.tex
index 282d185..f9c986a 100644
--- a/paper/31_methods_seq2seq.tex
+++ b/paper/31_methods_seq2seq.tex
@@ -19,7 +19,7 @@ Encoders' final state represents the semantic representation of the certificate
 \caption{Illustration of the encoder-decoder model for death cause extraction. The encoder processes a death certificate line token-wise from left to right. The final state of the encoder forms a semantic representation of the line and serves as initial input for the decoding process. The decoder will be trained to predict the death cause text from the provided ICD-10 dictionaries word by word (using special tags \textbackslash s and \textbackslash e for start resp. end of a sequence). All input tokens will be represented using the concatenation of the fastText embeddings %\cite{bojanowski_enriching_2016} of all three languages.}
 \label{fig:encoder_decoder}
-\end{figure} 
+\end{figure}
 
 For the decoder we utilize another LSTM model. The initial input of the decoder is the final state of the encoder model. Moreover, each token of the dictionary death cause text (padded with special start and end tag) serves as (sequential) input.
diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex
index 35a5dd1..29072e3 100644
--- a/paper/40_experiments.tex
+++ b/paper/40_experiments.tex
@@ -101,6 +101,11 @@ In contrast, \textit{Extended} additionally takes the diagnosis texts from the c
 \subsection{Complete Pipeline}
 \label{tab:final_train}
+The two models were combined to create the final pipeline.
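The decoder training setup described in 31_methods_seq2seq.tex (start/end tags, word-by-word prediction of the dictionary death cause text) can be sketched as follows. This is a minimal illustration of the input/target shifting used for teacher forcing; the function name is a hypothetical placeholder, not the authors' code:

```python
# Special start/end tags as used in the paper (rendered there as \s and \e).
START, END = "\\s", "\\e"

def make_decoder_sequences(death_cause_text):
    """Build the token sequences for teacher forcing: the decoder reads
    the start-tagged sequence and learns to predict the end-tagged one,
    shifted by one position (word by word)."""
    tokens = death_cause_text.split()
    decoder_input = [START] + tokens
    decoder_target = tokens + [END]
    return decoder_input, decoder_target

inp, tgt = make_decoder_sequences("acute myocardial infarction")
# inp: ['\\s', 'acute', 'myocardial', 'infarction']
# tgt: ['acute', 'myocardial', 'infarction', '\\e']
```

During inference the decoder instead starts from the start tag alone and generates tokens until it emits the end tag.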
+We tested both death cause extraction models (based on the balanced and the unbalanced data set) in the final pipeline, as their performance differs greatly.
+In contrast, both ICD-10 classification models perform similarly, so we used only the extended ICD-10 classification model, with word-level tokens\footnote{Although models supporting character-level tokens were developed and evaluated, their performance fared poorly compared to the word-level tokens.}, in the final pipeline.
+To evaluate the pipeline, we built a training and a hold-out validation set during development.
+The obtained results on the validation set are presented in Table \ref{tab:final_train}.
 
 \begin{table}[t!]
 \centering
@@ -113,18 +118,30 @@ Final-Balanced & 0.73 & 0.61 & 0.61 \\
 Final-Full & 0.74 & 0.62 & 0.63 \\
 \bottomrule
 \end{tabular}
-\caption{Final Pipeline Performance - Training Data. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. Final-Full = DCEM-Full + ICD-10\_Extended}
+\caption{Evaluation results of the final pipeline on the validation set of the training data. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. Final-Full = DCEM-Full + ICD-10\_Extended}
+\end{table}
+Although the individual models, as shown in Tables \ref{tab:s2s} and \ref{tab:icd10Classification}, are promising, the performance decreases considerably in a pipeline setting. %, by roughly a third.
+The pipeline model based on the balanced data set reaches an F-score of 0.61, whereas the full model achieves a slightly higher value of 0.63.
+Both model configurations have a higher precision than recall (0.73 vs. 0.61 and 0.74 vs. 0.62, respectively).
 
-\end{table}
-The two models where combined to create the final pipeline.
-We tested both neural models in the final pipeline, as their performance differs greatly.
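The two-step composition evaluated here (death cause extraction followed by ICD-10 classification) can be sketched as below. Both functions are hypothetical stand-ins for the trained DCEM and ICD-10 models, used only to show why the pipeline score drops: an error in the first step propagates into the second.

```python
def extract_death_cause(certificate_line):
    # Stand-in for the seq2seq death cause extraction model (DCEM):
    # maps a certificate line to a death cause phrase.
    return certificate_line.strip().lower()

def classify_icd10(death_cause):
    # Stand-in for the ICD-10 classification model: maps a death
    # cause phrase to an ICD-10 code ("R99" = ill-defined cause).
    lookup = {"myocardial infarction": "I21"}
    return lookup.get(death_cause, "R99")

def pipeline(certificate_line):
    # If the extraction step produces a wrong phrase, the
    # classification step inevitably receives a wrong input.
    return classify_icd10(extract_death_cause(certificate_line))

print(pipeline("Myocardial Infarction "))  # I21
```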
-As both ICD-10 classification models perform similarly, we used the extended ICD-10 classification model, with word level tokens\footnote{Although models supporting character level tokens were developed and evaluated, their performance faired poorly compared to the word level tokens.}, in the final pipeline.
-The results obtained during training are presented in Table \ref{tab:final_train}.
+This can be attributed to several factors.
+First of all, a pipeline architecture always suffers from error propagation, i.e. errors in an earlier step influence the performance of the following steps and generally lower the performance of the overall system.
+Investigating the errors of the models, we found that the imbalanced distribution of ICD-10 codes represents a main problem.
+This severely impacts the encoder-decoder architecture used here, as the token generation is biased towards the available data points.
+Therefore, the models very often misclassify certificate lines associated with ICD-10 codes that have only a small number of supporting training instances.
 
-Although the individual models, as shown in Tables \ref{tab:s2s} and \ref{tab:icd10Classification} are promising, the final pipeline decreases their performance on hold-out dataset created during training. %, by roughly a third.
-This can be contributed to several factors with the very imbalanced distribution of supporting ICD-10 codes, provided by the Organizers, the most influential reason.
-This severely impacts the decoder-encoder architecture used here as the token generation is biased towards the available data points.
+
+Results obtained on the test data set, resulting from the two officially submitted runs, are shown in Table \ref{tab:final_test}.
+Similar to the evaluation results during development, the model based on the full data set performs slightly better than the model trained on the balanced data set.
+The full model reaches an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian.
+All of our approaches perform below the mean and median averages of all participants.
+
+Surprisingly, there is a substantial difference in the results obtained for the individual languages.
+%This hints towards the unsuitability of out-of-domain WEs.
+This confirms our assumptions about the (un)suitability of the proposed multi-lingual embedding space for this task.
+The results also suggest that the size of the training corpora does not influence the final results.
+The best results were obtained on the Italian data set, which has the smallest training corpus.
+The worst results were obtained on the mid-sized French corpus, while the biggest corpus, Hungarian, ranks second.
 
 \begin{table}[]
 \centering
@@ -159,27 +176,21 @@ This severely impacts the decoder-encoder architecture used here as the token ge
 & Median & 0.900 & 0.824 & 0.863 \\
 \bottomrule
 \end{tabularx}
-\caption{Final Pipeline Perfromance - Evaluation Data. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. Final-Full = DCEM-Full + ICD-10\_Extended}
+\caption{Test results of the final pipeline. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. Final-Full = DCEM-Full + ICD-10\_Extended}
 \label{tab:final_test}
 \end{table}
 
-Results obtained on the evaluation dataset, resulting from the two submitted official runs, are shown in Table \ref{tab:final_test}.
-All of our approaches perform below the mean and median averages of all participants.
-Surprisingly, there is a substantial difference in results obtained between the individual languages.
-%This hints towards the unsuitability of out-of-domain WEs.
-This confirms our assumptions about the (un)suitability of the proposed multi-lingual embedding space for this task.
-The results also point that the size of the training corpora is not influencing the final results.
-As seen, best results were obtained on the Italian dataset were trained on the smallest corpora.
-Worst results were obtained on the middle, French, corpus while the biggest corpus, Hungarian, is in second place.
 
-We identified several possible reasons for the obtained results.
+We identified several possible reasons for the obtained results. These also represent (possible) points for future work.
-As the main disadvantage of our approach the quality of the used WEs as well as the properties of the proposed language-independent embedding space are identified.
-The use of out-of-domain WEs, as expected, proved to be suboptimal solution to this problem.
-Although we tried to alleviate this by finding suitable external corpora to train domain-dependent WEs for each of the supported languages, we were unable to find any significant amount of in-domain documents (e.g. PubMed search for abstracts in either French, Hungarian or Italian found 7.843, 786 and 1659 articles respectively).
-Combined with concatenating the three WEs representation of individual tokens in to an language-independent embedding space, we see work on language-independent WEs as the main focus point for future work.
+One of the main disadvantages of our approach is the quality of the used WEs as well as the properties of the proposed language-independent embedding space.
+The use of out-of-domain WEs, which are not targeted to the biomedical domain, is likely a suboptimal solution to this problem.
+Although we tried to alleviate this by finding suitable external corpora to train domain-dependent WEs for each of the supported languages, we were unable to find any significant amount of in-domain documents (e.g. a PubMed search for abstracts in French, Hungarian or Italian found 7,843, 786 and 1,659 articles, respectively).
+Furthermore, we used a simple heuristic solution, just concatenating the embeddings of all three languages to build a shared vector space.
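The concatenation heuristic mentioned above can be sketched as follows. The toy lookup tables are hypothetical stand-ins for the three fastText models (real fastText vectors are typically 300-dimensional), so the dimensions and example tokens are illustrative only:

```python
DIM = 4  # toy dimensionality; fastText typically uses 300

# Hypothetical per-language lookup tables standing in for the
# French, Hungarian and Italian fastText models.
fr_emb = {"infarctus": [0.1, 0.2, 0.3, 0.4]}
hu_emb = {"infarktus": [0.5, 0.6, 0.7, 0.8]}
it_emb = {"infarto": [0.9, 1.0, 1.1, 1.2]}

def embed(token):
    """Represent a token in the shared space by concatenating its
    embedding from all three languages; tokens unknown to a language
    contribute a zero vector for that segment."""
    vec = []
    for emb in (fr_emb, hu_emb, it_emb):
        vec.extend(emb.get(token, [0.0] * DIM))
    return vec

v = embed("infarto")  # zeros for the fr/hu segments, Italian vector last
assert len(v) == 3 * DIM
```

One consequence of this design is that a token present in only one language occupies a mostly-zero vector, which is one way the heterogeneity of the shared space can hurt downstream models.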
+Addressing this problem will be the main focus of future work.
+
+%Combined with concatenating the three WEs representation of individual tokens in to an language-independent embedding space, we see work on language-independent WEs as the main focus point for future work.
 %This point will be the main focus of future work on this problem.
 
-Besides the issues with the used WEs, inability to obtain full ICD-10 dictionaries for the selected languages has also negatively influenced the results.
+Besides the issues with the used WEs, the inability to obtain full ICD-10 dictionaries for the selected languages has also negatively influenced the results.
 As a final limitation of our approach, the lack of multi-label classification support has also been identified (i.e. the system cannot recognize more than one death cause in a single input text).
 %Our problems:
-- 
GitLab