In this section we will present experiments and obtained results for the two dev
As mentioned in Section \ref{sec:methods}, the proposed pipeline combines two neural network (NN) models.
\subsection{Available datasets}
The CLEF e-Health 2018 Task 1 participants were provided with annotated death certificates for the three selected languages: French, Italian and Hungarian.
Each language is supported by several data sources.
The provided datasets are imbalanced: the Italian corpus consists of 49,823 data points, the French corpus of 77,348, and the Hungarian corpus of 323,175.
The data used in this approach was created by combining the available data sources and is described separately for each of the models.
No external data was used.
Each dataset was split into a training and an evaluation part.
Although no cross-validation was performed during training, our models shuffled the training dataset before each epoch.
Additionally, no hyperparameter optimization was performed during training; the default parameter values of the individual layers were used.
We used pretrained fastText\footnote{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}[CITATION] word embeddings. The embeddings were trained on Common Crawl and Wikipedia.
The embeddings were trained with the following parameters: CBOW with position weights, dimension 300, character n-grams of length 5, a window of size 5, and 10 negatives.
Unfortunately, they were trained on corpora unrelated to the biomedical domain and therefore do not represent the best possible embedding space.
The final embedding space used by our models is created by concatenating the individual embedding vectors.
All models were implemented with the Keras[CITATION] library.
\subsection{Named Entity Recognition with Sequence2Sequence model}
To identify tokens that are candidates for a cause of death, we use a sequence-to-sequence model.
The generated sequence of tokens is then passed on to the next step for normalization to an ICD-10 code.
This model consists of two parts: the encoder and the decoder.
The encoder uses an embedding layer with input masking on zero values and an LSTM with 256 dimensions.
The encoder's output is used as the initial state of the decoder.
The decoder employs the same architecture, followed by a dense layer and a softmax activation function.
Given the input sentence and a start token, the model generates tokens from the vocabulary until it generates the end token.
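The generation loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the marker tokens, the \texttt{max\_len} cap, greedy selection of the next token, and the \texttt{next\_token} callable (standing in for one decoder step) are all hypothetical.

```python
from typing import Callable, List, Sequence

START, END = "<s>", "</s>"  # hypothetical start and end marker tokens


def greedy_decode(
    source: Sequence[str],
    next_token: Callable[[Sequence[str], Sequence[str]], str],
    max_len: int = 50,
) -> List[str]:
    """Generate tokens one at a time until the end token appears.

    `next_token(source, generated_so_far)` stands in for one decoder step
    (LSTM, dense layer, softmax, then picking the most probable token).
    `max_len` is an assumed safety cap on the number of generated tokens.
    """
    generated = [START]
    while len(generated) <= max_len:
        token = next_token(source, generated)
        if token == END:
            break
        generated.append(token)
    return generated[1:]  # drop the start token


# Toy stand-in for a trained decoder: copies the source, then stops.
def copy_decoder(source, generated):
    position = len(generated) - 1  # number of tokens emitted so far
    return source[position] if position < len(source) else END


print(greedy_decode(["infarctus", "du", "myocarde"], copy_decoder))
```

The length cap is a design choice of the sketch: it guarantees termination even if a poorly trained decoder never emits the end token.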
The entire model is optimized using the Adam optimizer, with a batch size of 700.
The model trains for up to 100 epochs or until an early stopping criterion is met (no change in validation loss for two epochs).
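The stopping rule can be illustrated in plain Python. The sketch assumes that ``no change'' means no new best validation loss, which is how a patience of two behaves in the Keras \texttt{EarlyStopping} callback; the function name and the example losses are hypothetical.

```python
from typing import Iterable


def epochs_until_stop(val_losses: Iterable[float], patience: int = 2,
                      max_epochs: int = 100) -> int:
    """Return how many epochs run before training halts."""
    best = float("inf")
    stale = 0  # consecutive epochs without a new best validation loss
    epochs_run = 0
    for loss in val_losses:
        if epochs_run == max_epochs:
            break
        epochs_run += 1
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run


# Made-up validation losses: the best value is reached in epoch 3,
# epochs 4 and 5 bring no improvement, so training halts after epoch 5.
print(epochs_until_stop([0.9, 0.7, 0.6, 0.61, 0.6, 0.5]))  # prints 5
```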
As the available data is highly imbalanced, we devised two approaches: (1) balanced, where each language is represented by 49,823 randomly drawn data points (the size of the smallest corpus), and (2) extended, where all available data is used.
The results, obtained on the validation set, are shown in Table \ref{tab:s2s}.
\begin{table}[]
\centering
\begin{tabular}{l|l|l|l|l|l}
Model & Epochs trained & Train accuracy & Train loss & Validation accuracy & Validation loss \\
Balanced & 18 & 0.958 & 0.205 & 0.899 & 0.634 \\
Extended & 9 & 0.709 & 0.098 & 0.678 & 0.330 \\
\end{tabular}
\caption{Named Entity Recognition: S2S model evaluation}
\label{tab:s2s}
\end{table}
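The balanced variant draws the same number of data points from every language. A minimal sketch of such a downsampling step is given below; the function name, the fixed seed, and the toy corpora are assumptions made for illustration, and sampling without replacement is one plausible reading of ``randomly drawn''.

```python
import random
from typing import Dict, List, Sequence


def balance_corpora(corpora: Dict[str, Sequence[str]],
                    seed: int = 0) -> Dict[str, List[str]]:
    """Downsample every corpus to the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    target = min(len(rows) for rows in corpora.values())
    # random.sample draws without replacement, so no row is repeated.
    return {lang: rng.sample(list(rows), target)
            for lang, rows in corpora.items()}


# Toy corpora with the same kind of imbalance as the real data.
toy = {
    "it": [f"it-{i}" for i in range(5)],
    "fr": [f"fr-{i}" for i in range(8)],
    "hu": [f"hu-{i}" for i in range(30)],
}
balanced = balance_corpora(toy)
print({lang: len(rows) for lang, rows in balanced.items()})
```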
\subsection{Named Entity Normalization with ICD-10 Classification}
The model described here expects as input a string generated in the previous step.
The model itself uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a 256-dimensional hidden layer.
It is followed by an attention layer and a dense layer with a softmax activation function.
Adam was used as the optimizer.
The model was validated on 25\% of the data.
Again, no cross-validation or hyperparameter optimization was performed.
Once again, we devised two approaches.
This was mainly motivated by the lack of adequate training data in terms of coverage for individual ICD-10 codes.
We therefore defined two datasets. The first is (1) minimal, where only ICD-10 codes with two or more supporting data points are used.
This, of course, reduces the number of ICD-10 codes in the label space.
Therefore, we also defined (2) an extended dataset.
Here, the original ICD-10 code mappings, found in the supplied dictionaries, are extended with the data from the individual language Causes Calcules.
Finally, for the remaining ICD-10 codes with a support of 1, we duplicate the corresponding data points.
The goal of this approach is to extend our possible label space to all available ICD-10 labels.
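The construction of the two datasets can be sketched as follows. This is an illustrative reading of the description above rather than our exact preprocessing: the function names, the representation of a data point as a (text, code) pair, and the toy entries are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Pair = Tuple[str, str]  # (text, ICD-10 code); an assumed representation


def minimal_dataset(pairs: List[Pair], min_support: int = 2) -> List[Pair]:
    """Keep only pairs whose code has at least `min_support` data points."""
    support = Counter(code for _, code in pairs)
    return [(text, code) for text, code in pairs
            if support[code] >= min_support]


def extended_dataset(pairs: List[Pair], extra_pairs: List[Pair]) -> List[Pair]:
    """Merge both sources, then duplicate pairs whose code occurs once."""
    merged = list(pairs) + list(extra_pairs)
    support = Counter(code for _, code in merged)
    singletons = [(text, code) for text, code in merged if support[code] == 1]
    return merged + singletons


# Toy entries, invented for illustration only.
dictionary_pairs = [
    ("infarctus", "I219"),
    ("infarctus du myocarde", "I219"),
    ("pneumonie", "J189"),
]
additional_pairs = [("choc septique", "A419")]

minimal = minimal_dataset(dictionary_pairs)
extended = extended_dataset(dictionary_pairs, additional_pairs)
print(len(minimal), len(extended))
```

In the toy example the minimal dataset keeps a single code, while the extended dataset keeps all three and gives each of them a support of at least two.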
The results obtained from the two approaches are shown in Table \ref{tab:icd10Classification}.
\begin{table}[]
\centering
\begin{tabular}{l|l|l|l|l|l|l}
Mode & Model & Epochs trained & Train accuracy & Train loss & Validation accuracy & Validation loss \\