From 8749ec3c557a5c9404d2bc8d65fc52a0db8391aa Mon Sep 17 00:00:00 2001
From: Jurica Seva <seva@informatik.hu-berlin.de>
Date: Thu, 24 May 2018 17:22:52 +0200
Subject: [PATCH] Started writeup.

---
 paper/40_experiments.tex | 82 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 78 insertions(+), 4 deletions(-)

diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex
index 9c3016a..0a32f23 100644
--- a/paper/40_experiments.tex
+++ b/paper/40_experiments.tex
@@ -2,14 +2,88 @@ In this section we present the experiments and obtained results for the two dev
 As mentioned in Section \ref{sec:methods}, the proposed pipeline combines two NN models.
 
 \subsection{Available datasets}
-The CLEF e-Health 2018 Task 1 participants where provided with annotated death certificates for the three
+The CLEF e-Health 2018 Task 1 participants were provided with annotated death certificates for the three selected languages: French, Italian and Hungarian.
+Each language is supported by several data sources.
+The provided data sets are imbalanced: the Italian corpus consists of 49,823 data points, the French corpus of 77,348 and the Hungarian corpus of 323,175.
+The data used in this approach was created by combining the available data sources and is described separately for each of the models.
+No external data was used.
+Each dataset was split into a training and an evaluation part.
+Although no cross-validation was performed during training, our models shuffled the training data before each epoch.
+Additionally, no hyperparameter optimization was performed during training; the default parameter values of the individual layers were used.
+We used pretrained fastText\footnote{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}[CITATION] word embeddings, trained on Common Crawl and Wikipedia.
+The embeddings were trained with the following parameters: CBOW with position-weights, dimension 300, character n-grams of length 5, a window of size 5 and 10 negatives.
+Unfortunately, they are trained on corpora unrelated to the biomedical domain and therefore do not represent the best possible embedding space.
+The final embedding space used by our models is created by concatenating the individual embedding vectors.
+All models were implemented with the Keras[CITATION] library.
 
-\subsection{Named Entity Recognition}
+\subsection{Named Entity Recognition with Sequence2Sequence model}
 To identify possible tokens as candidates for death cause, we focused on the use of a sequence-to-sequence model.
-The
+The generated sequence of tokens is then passed on to the next step for normalization to an ICD-10 code.
+This model consists of two parts: the encoder and the decoder.
+The encoder uses an embedding layer with input masking on zero values and an LSTM with 256 dimensions.
+The encoder's output is used as the initial state of the decoder.
+The decoder employs the same architecture, followed by a dense layer with a softmax activation function.
+Starting from the input sentence and a start token, the model generates tokens from the vocabulary until it produces the end token.
+The entire model is optimized using the Adam optimizer with a batch size of 700.
+The model trains either for 100 epochs or until an early stopping criterion is met (no change in validation loss for two epochs).
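+
+The listing below gives a minimal Keras sketch of this encoder-decoder.
+It is an illustration rather than our exact training code: the vocabulary sizes, the dummy token matrices and the validation split are placeholders, and the pretrained fastText vectors would be passed to the embedding layers via their weights argument.
+
+\begin{verbatim}
+import numpy as np
+from keras.models import Model
+from keras.layers import Input, Embedding, LSTM, Dense
+from keras.callbacks import EarlyStopping
+
+SRC_VOCAB, TGT_VOCAB = 20000, 10000  # placeholder vocabulary sizes
+EMB_DIM, HIDDEN = 300, 256           # fastText dimension, LSTM size
+
+# Encoder: embedding with masking on zero (padding) values; only the
+# final LSTM states are kept and handed to the decoder.
+enc_in = Input(shape=(None,))
+enc_emb = Embedding(SRC_VOCAB, EMB_DIM, mask_zero=True)(enc_in)
+_, state_h, state_c = LSTM(HIDDEN, return_state=True)(enc_emb)
+
+# Decoder: same architecture, initialized with the encoder states and
+# followed by a dense layer with softmax over the target vocabulary.
+dec_in = Input(shape=(None,))
+dec_emb = Embedding(TGT_VOCAB, EMB_DIM, mask_zero=True)(dec_in)
+dec_seq = LSTM(HIDDEN, return_sequences=True)(
+    dec_emb, initial_state=[state_h, state_c])
+dec_out = Dense(TGT_VOCAB, activation="softmax")(dec_seq)
+
+model = Model([enc_in, dec_in], dec_out)
+model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
+              metrics=["accuracy"])
+
+# Dummy matrices stand in for the padded certificate lines and the
+# start-token-prefixed target sequences produced upstream.
+X_src = np.random.randint(1, SRC_VOCAB, (700, 40))
+X_tgt = np.random.randint(1, TGT_VOCAB, (700, 12))
+y_tgt = np.expand_dims(np.roll(X_tgt, -1, axis=1), -1)
+
+model.fit([X_src, X_tgt], y_tgt, batch_size=700, epochs=100,
+          validation_split=0.1,
+          callbacks=[EarlyStopping(monitor="val_loss", patience=2)])
+\end{verbatim}
+
+At inference time the decoder is run step by step: the start token and the encoder states seed the first step, and each predicted token is fed back in until the end token is generated.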
-\subsection{Named Entity Normalization}
+As the available dataset is highly imbalanced, we devised two approaches: (1) balanced, where each language is supported by 49,823 randomly drawn data points (the size of the smallest corpus), and (2) extended, where all available data is used.
+The results, obtained on the validation set, are shown in Table \ref{tab:s2s}.
+
+\begin{table}[ht]
+\centering
+\begin{tabular}{l|l|l|l|l|l}
+Model & Epochs trained & Train Accuracy & Train Loss & Validation Accuracy & Validation Loss \\ \hline
+Balanced & 18 & 0.958 & 0.205 & 0.899 & 0.634 \\
+Extended & 9 & 0.709 & 0.098 & 0.678 & 0.330 \\
+\end{tabular}
+\caption{Named Entity Recognition: S2S model evaluation}
+\label{tab:s2s}
+\end{table}
+
+\subsection{Named Entity Normalization with ICD-10 Classification}
+As input, the model described here expects a string, which we generate in the previous step.
+The model itself uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a hidden dimension of 256.
+This is followed by an attention layer and a dense layer with a softmax activation function; a Keras sketch of this classifier is given at the end of the section.
+Adam was used as the optimizer.
+The model was validated on 25\% of the data.
+Again, no cross-validation or hyperparameter optimization was performed.
+Once again, we devised two approaches, mainly because of the lack of adequate training data in terms of coverage for individual ICD-10 codes.
+We therefore defined two datasets: (1) minimal, where only ICD-10 codes with two or more supporting data points are used.
+This, of course, minimizes the number of ICD-10 codes in the label space.
+Therefore, the (2) extended dataset was defined.
+Here, the original ICD-10 code mappings found in the supplied dictionaries are extended with the data from the individual languages' Causes Calculées.
+Finally, for the remaining ICD-10 codes with a support of 1, we duplicate those data points.
+The goal of this approach is to extend the possible label space to all available ICD-10 labels.
+The results obtained with the two approaches are shown in Table \ref{tab:icd10Classification}.
+
+\begin{table}[ht]
+\centering
+\begin{tabular}{l|l|l|l|l|l|l}
+Input level & Model & Epochs trained & Train Accuracy & Train Loss & Validation Accuracy & Validation Loss \\ \hline
+Word & Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\
+Word & Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\
+Character & Minimal & & & & & \\
+\end{tabular}
+\caption{Named Entity Normalization: ICD-10 Classification}
+\label{tab:icd10Classification}
+\end{table}
 
 \subsection{Final Pipeline}
+The two models were combined to create the final pipeline.
+We tested both NER models in the final pipeline, as their performance differs significantly.
+As both NEN models perform similarly, we used the word-level, extended ICD-10 classification model in the final pipeline.
+The results obtained during training are presented in Table \ref{tab:final_train}.
+\begin{table}[ht]
+\centering
+\begin{tabular}{l|l|l|l|l|l}
+Model & Epochs trained & Train Accuracy & Train Loss & Validation Accuracy & Validation Loss \\ \hline
+S2S\_balanced + ICD-10\_extended & & & & & \\
+S2S\_extended + ICD-10\_extended & & & & & \\
+\end{tabular}
+\caption{Final Pipeline Evaluation}
+\label{tab:final_train}
+\end{table}
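+
+For reference, the listing below sketches the ICD-10 classification model described above in Keras.
+It is a sketch under assumptions: the vocabulary size, label count, batch size and dummy data are placeholders; because the Keras version we used ships no drop-in attention layer, the attention step is approximated with a soft-attention pooling built from standard layers, and input masking is omitted here to keep that pooling simple.
+
+\begin{verbatim}
+import numpy as np
+import keras.backend as K
+from keras.models import Model
+from keras.layers import (Input, Embedding, LSTM, Bidirectional,
+                          Dense, Softmax, Multiply, Lambda)
+
+VOCAB, NUM_CODES = 20000, 1500  # placeholder vocabulary / label sizes
+EMB_DIM, HIDDEN = 300, 256
+
+inp = Input(shape=(None,))
+emb = Embedding(VOCAB, EMB_DIM)(inp)
+seq = Bidirectional(LSTM(HIDDEN, return_sequences=True))(emb)
+
+# Soft attention: score each time step, normalize the scores with a
+# softmax over time and collapse the sequence into a weighted sum.
+scores = Dense(1, activation="tanh")(seq)
+weights = Softmax(axis=1)(scores)
+context = Lambda(lambda t: K.sum(t, axis=1))(Multiply()([seq, weights]))
+
+out = Dense(NUM_CODES, activation="softmax")(context)
+
+model = Model(inp, out)
+model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
+              metrics=["accuracy"])
+
+# Dummy token matrices stand in for the generated death-cause strings;
+# 25% of the data is held out for validation.
+X = np.random.randint(1, VOCAB, (1000, 30))
+y = np.random.randint(0, NUM_CODES, (1000, 1))
+model.fit(X, y, validation_split=0.25, epochs=100, batch_size=128)
+\end{verbatim}
--
GitLab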