From 8749ec3c557a5c9404d2bc8d65fc52a0db8391aa Mon Sep 17 00:00:00 2001
From: Jurica Seva <seva@informatik.hu-berlin.de>
Date: Thu, 24 May 2018 17:22:52 +0200
Subject: [PATCH] Started writeup.

---
 paper/40_experiments.tex | 82 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 4 deletions(-)

diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex
index 9c3016a..0a32f23 100644
--- a/paper/40_experiments.tex
+++ b/paper/40_experiments.tex
@@ -2,14 +2,88 @@ In this section we will present experiments and obtained results for the two dev
 As mentioned in Section \ref{sec:methods}, the proposed pipeline combines two NN models.
 
 \subsection{Available datasets}
-The CLEF e-Health 2018 Task 1 participants where provided with annotated death certificates for the three
+The CLEF e-Health 2018 Task 1 participants were provided with annotated death certificates for the three selected languages: French, Italian and Hungarian.
+Each of the languages is supported by several data sources.
+The provided data sets are imbalanced: the Italian corpus consists of 49,823, the French corpus of 77,348 and the Hungarian corpus of 323,175 data points.
+The data used in this approach was created by combining the available data sources and is explained separately for each of the models.
+No external data was used.
+Each dataset was split into a training and an evaluation part.
+Although no cross-validation was performed during training, our models shuffled the training data before each epoch.
+Additionally, no hyperparameter optimization was performed during training; the default parameter values of the individual layers were used.
+We used pretrained fastText\footnote{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}[CITATION] word embeddings, trained on Common Crawl and Wikipedia.
+The embeddings were trained with the following parameters: CBOW with position-weights, dimension 300, character n-grams of length 5, a window of size 5 and 10 negative samples.
+Unfortunately, they are trained on corpora unrelated to the biomedical domain and therefore do not represent the best possible embedding space.
+The final embedding space used by our models is created by concatenating the individual embedding vectors.
+All models were implemented with the Keras [CITATION] library.
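+
+A minimal sketch of how this concatenation can be realized is shown below; the loader and file names follow the public fastText Common Crawl releases and are assumptions, not our exact implementation.
+\begin{verbatim}
+# Hypothetical sketch: concatenating the per-language fastText
+# vectors into one embedding space (file names are those of the
+# public Common Crawl releases; the helper is illustrative only).
+import numpy as np
+from gensim.models.fasttext import load_facebook_vectors
+
+vectors = [load_facebook_vectors('cc.%s.300.bin' % lang)
+           for lang in ('fr', 'it', 'hu')]
+
+def embed(token):
+    # fastText embeds out-of-vocabulary tokens via character
+    # n-grams, so each lookup yields a 300-dim vector; the three
+    # languages together give a 900-dim concatenated vector.
+    return np.concatenate([v[token] for v in vectors])
+\end{verbatim}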
 
-\subsection{Named Entity Recognition}
+\subsection{Named Entity Recognition with Sequence2Sequence model}
 To identify tokens that are possible death cause candidates, we use a sequence-to-sequence model.
-The
+The generated sequence of tokens is then passed on to the next step for normalization to an ICD-10 code.
+This model consists of two parts: the encoder and the decoder.
+The encoder uses an embedding layer with input masking on zero values and an LSTM layer with 256 hidden units.
+The encoder's final state is used as the initial state of the decoder.
+The decoder employs the same architecture, followed by a dense layer with a softmax activation function.
+Based on the input sentence and a start token, the model generates tokens from the vocabulary until the end token is produced.
+The entire model is optimized using the Adam optimizer with a batch size of 700.
+The model trains for at most 100 epochs or until an early stopping criterion is met (no improvement in validation loss for two epochs).
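+A minimal Keras sketch of this encoder-decoder is given below; the vocabulary sizes, the loss and the teacher-forcing setup are assumptions rather than the exact implementation.
+\begin{verbatim}
+# Hypothetical sketch of the described S2S model (sizes assumed).
+from keras.layers import Input, Embedding, LSTM, Dense
+from keras.models import Model
+
+SRC_VOCAB, TGT_VOCAB, EMB, HIDDEN = 20000, 20000, 300, 256
+
+# Encoder: zero-masked embedding, LSTM returning its final states.
+enc_in = Input(shape=(None,))
+enc_emb = Embedding(SRC_VOCAB, EMB, mask_zero=True)(enc_in)
+_, h, c = LSTM(HIDDEN, return_state=True)(enc_emb)
+
+# Decoder: same architecture, initialized with the encoder states,
+# followed by a dense softmax layer over the target vocabulary.
+dec_in = Input(shape=(None,))
+dec_emb = Embedding(TGT_VOCAB, EMB, mask_zero=True)(dec_in)
+dec_seq = LSTM(HIDDEN, return_sequences=True)(dec_emb,
+                                              initial_state=[h, c])
+dec_out = Dense(TGT_VOCAB, activation='softmax')(dec_seq)
+
+model = Model([enc_in, dec_in], dec_out)
+model.compile(optimizer='adam', loss='categorical_crossentropy',
+              metrics=['accuracy'])
+# At inference time, decoding starts from the start token and feeds
+# each generated token back in until the end token is produced.
+\end{verbatim}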
 
-\subsection{Named Entity Normalization}
+As the available dataset is highly imbalanced, we devised two approaches: (1) balanced, where each language is supported by 49,823 randomly drawn data points (the size of the smallest corpus) and (2) extended, where all available data is used.
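+The balanced variant can be sketched as follows, with hypothetical per-language example lists:
+\begin{verbatim}
+# Hypothetical sketch of the 'balanced' sampling; fr_data, it_data
+# and hu_data stand for the per-language certificate lists.
+import random
+
+corpora = [fr_data, it_data, hu_data]
+min_size = min(len(c) for c in corpora)      # 49,823 in our data
+balanced = [ex for c in corpora
+            for ex in random.sample(c, min_size)]
+\end{verbatim}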
+The results, obtained on the validation set, are shown in Table \ref{tab:s2s}.
+
+\begin{table}[ht]
+\centering
+\begin{tabular}{l|l|l|l|l|l}
+Model & Epochs trained & Train Accuracy & Train Loss & Validation Accuracy & Validation Loss \\ \hline
+Balanced & 18 & 0.958 & 0.205 & 0.899 & 0.634 \\
+Extended & 9 & 0.709 & 0.098 & 0.678 & 0.330 \\
+\end{tabular}
+\caption{Named Entity Recognition: S2S model evaluation}
+\label{tab:s2s}
+\end{table}
+
+\subsection{Named Entity Normalization with ICD-10 Classification}
+As input, the model described here expects a string, which we generate in the previous step.
+The model uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a hidden size of 256.
+It is followed by an attention layer and a dense layer with a softmax activation function.
+Adam was used as the optimizer.
+The model was validated on 25\% of the data.
+Again, no cross-validation or hyperparameter optimization was performed.
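+A minimal Keras sketch of this classifier is given below; since Keras has no built-in attention layer, a simple self-attention pooling is substituted, and all sizes are assumptions rather than the exact implementation.
+\begin{verbatim}
+# Hypothetical sketch of the described ICD-10 classifier; the
+# attention layer is a simple self-attention pooling substitute.
+import keras.backend as K
+from keras.layers import (Input, Embedding, Bidirectional, LSTM,
+                          Dense, Layer)
+from keras.models import Model
+
+class AttentionPooling(Layer):
+    """Collapses the BiLSTM state sequence into one weighted vector."""
+    def __init__(self, **kwargs):
+        super(AttentionPooling, self).__init__(**kwargs)
+        self.supports_masking = True  # accept (and drop) the zero-mask
+
+    def build(self, input_shape):
+        self.w = self.add_weight(name='w',
+                                 shape=(input_shape[-1], 1),
+                                 initializer='glorot_uniform')
+        super(AttentionPooling, self).build(input_shape)
+
+    def call(self, h):
+        # One score per time step, softmax-normalized, then a
+        # weighted sum over the sequence dimension.
+        scores = K.softmax(K.squeeze(K.dot(h, self.w), axis=-1))
+        return K.batch_dot(scores, h)
+
+    def compute_mask(self, inputs, mask=None):
+        return None
+
+    def compute_output_shape(self, input_shape):
+        return (input_shape[0], input_shape[-1])
+
+VOCAB, EMB, HIDDEN, N_CODES = 20000, 300, 256, 2000  # assumed sizes
+
+x = Input(shape=(None,))
+h = Embedding(VOCAB, EMB, mask_zero=True)(x)
+h = Bidirectional(LSTM(HIDDEN, return_sequences=True))(h)
+h = AttentionPooling()(h)
+y = Dense(N_CODES, activation='softmax')(h)
+
+model = Model(x, y)
+model.compile(optimizer='adam', loss='categorical_crossentropy',
+              metrics=['accuracy'])
+\end{verbatim}
+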
+Once again, we devised two approaches.
+This was mainly influenced by the lack of adequate training data in terms of coverage of the individual ICD-10 codes.
+We therefore defined two datasets: (1) minimal, where only ICD-10 codes with two or more supporting data points are used.
+This, of course, reduces the number of ICD-10 codes in the label space.
+Therefore, a second, (2) extended, dataset was defined.
+Here, the original ICD-10 code mappings, found in the supplied dictionaries, are extended with the data from the individual language Causes Calcules.
+Finally, for the remaining ICD-10 codes with a support of one, we duplicate those data points.
+The goal of this approach is to extend the possible label space to all available ICD-10 labels.
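+The duplication of rarely supported codes can be sketched as follows, with hypothetical variable names:
+\begin{verbatim}
+# Hypothetical sketch of duplicating data points whose ICD-10
+# code has a support of one; 'pairs' is a list of (text, code).
+from collections import Counter
+
+support = Counter(code for _, code in pairs)
+extended = pairs + [(text, code) for text, code in pairs
+                    if support[code] == 1]
+\end{verbatim}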
+The results obtained from the two approaches are shown in Table \ref{tab:icd10Classification}.
+
+\begin{table}[ht]
+\centering
+\begin{tabular}{l|l|l|l|l|l|l}
+Input level & Model & Epochs trained & Train Accuracy & Train Loss & Validation Accuracy & Validation Loss \\ \hline
+Word & Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\
+Word & Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\
+Character & Minimal & & & & & \\
+\end{tabular}
+\caption{Named Entity Normalization: ICD-10 classification model evaluation}
+\label{tab:icd10Classification}
+\end{table}
 
 \subsection{Final Pipeline}
+The two models were combined to create the final pipeline.
+We tested both NER models in the final pipeline, as their performance differs significantly.
+As both NEN models perform similarly, we used the word-level, extended ICD-10 classification model in the final pipeline.
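+The glue between the two steps can be sketched as follows; the decoding and preprocessing helpers are assumptions:
+\begin{verbatim}
+# Hypothetical glue code for the final pipeline: the S2S model
+# generates the death cause token sequence, which the classifier
+# maps to an ICD-10 code; decode_sequence() and encode() stand
+# for the assumed decoding and preprocessing helpers.
+def predict_icd10(certificate_line):
+    tokens = decode_sequence(s2s_model, certificate_line)
+    code_probs = icd10_model.predict(encode(tokens))
+    return code_probs.argmax()
+\end{verbatim}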
+The results obtained during training are presented in Table \ref{tab:final_train}.
 
+\begin{table}[ht]
+\centering
+\begin{tabular}{l|l|l|l|l|l}
+Model & Epochs trained & Train Accuracy & Train Loss & Validation Accuracy & Validation Loss \\ \hline
+S2S\_balanced + ICD-10\_extended & & & & & \\
+S2S\_extended + ICD-10\_extended & & & & & \\
+\end{tabular}
+\caption{Final Pipeline Evaluation}
+\label{tab:final_train}
+\end{table}
 
-- 
GitLab