Commit b2e0e924 authored by Mario Sänger

Updated introduction + added explanation of ICD-10 model + updated tables in experiments

parent ff54c4bd
Tutubalina \cite{miftakhutdinov_kfu_2017} in the last year's competition, we opt
for the development of a deep learning model for this year's task. Our work
introduces a language-independent approach for ICD-10 classification using
multi-language word embeddings and LSTM-based recurrent models. We divide the
classification into two tasks. First, we extract the death cause description
from a certificate line using an encoder-decoder model. Given the death cause
description, the actual ICD-10 classification is performed by a separate LSTM
model. Our work focuses on the introduction of, and experiments with, a
language-independent approach which requires as few additional resources as
possible and needs only a single model for all three languages.
which achieved the best results for English certificates in the last year's
competition. They use a neural LSTM-based encoder-decoder model that processes the raw
certificate text as input and encodes it into a vector representation.
Furthermore, a vector which captures the textual similarity between the
certificate line and the death cause or diagnosis texts of the individual
ICD-10 codes
is used to integrate prior knowledge into the model. The concatenation of both
vector representations is then used to output the characters and numbers of the
ICD-10 code in the decoding step. In contrast to their work, our approach
introduces a model for multi-language ICD-10 classification. We utilize two
separate recurrent neural networks, one sequence-to-sequence model for death cause
extraction and one for classification, to predict the ICD-10 codes for a
certificate text independent of the language it originates from.
Our approach models the extraction and classification of death causes as
a two-step process. First, we employ a neural, multi-language sequence-to-sequence
model to obtain a death cause description for a given death certificate line. We then
use a second classification model to assign the respective ICD-10 codes to the
obtained death cause. The remainder of this section gives a short introduction
to recurrent neural networks, followed by a detailed explanation of our two models.
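
To make the interplay of the two models explicit, the following sketch outlines
the prediction pipeline; \texttt{generate} and \texttt{classify} are
placeholder names and do not reflect the exact interface of our implementation.
\begin{verbatim}
# Hypothetical sketch of the two-step prediction pipeline.
def predict_icd10(certificate_line, extraction_model, classification_model):
    # Step 1: generate a death cause description for the certificate line
    # using the multi-language sequence-to-sequence model.
    death_cause = extraction_model.generate(certificate_line)

    # Step 2: assign an ICD-10 code to the generated description
    # using the attention-based LSTM classifier.
    return classification_model.classify(death_cause)
\end{verbatim}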
\subsection{Recurrent neural networks}
\subsection{Death Cause Extraction Model}
The first step in our pipeline is the extraction of the death cause description
from a given certificate line. We use the training certificate lines (with their
corresponding ICD-10 codes) and the ICD-10 dictionaries as the basis for our model.
The dictionaries provide us with a death cause or diagnosis text for each ICD-10
code. The goal of the model is to reassemble this dictionary death cause
description from the certificate line.
For this we adopt the encoder-decoder architecture proposed in
\cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates
the architecture of the model. As encoder we utilize an LSTM model which
processes the certificate line token by token; each token is represented by the
concatenated FastText embeddings of all three languages. The encoder's final
state represents the semantic meaning of the certificate line and serves as
initial input for the decoding process.
\begin{figure}
\includegraphics[width=\textwidth,trim={0 17cm 0 3cm},clip=true]{encoder-decoder-model.pdf}
\caption{Illustration of the neural encoder-decoder model for death cause
extraction. The encoder processes a death certificate line token-wise from left
to right. The final state of the encoder forms a semantic representation of the
line and serves as initial input for the decoding process. The decoder is
trained to predict the death cause description text from the provided ICD-10
dictionaries word by word (using the special tags \textbackslash s and
\textbackslash e for the start and end of a sequence, respectively). All input
tokens are represented using the concatenation of the FastText embeddings of
all three languages.}
\label{fig:encoder_decoder}
\end{figure}
As decoder we utilize another LSTM model. The initial input of the decoder is
the final state of the encoder. Moreover, each token of the dictionary death
cause description (padded with special start and end tags) serves as input for
the different time steps. Again, we use the FastText embeddings of all three
languages to represent the tokens. The decoder predicts one-hot-encoded words
of the death cause description. During test time we use the encoder to obtain a
semantic representation of the certificate line and decode the death cause
description word by word, starting with the special start tag. The decoding
process finishes when the decoder outputs the end tag.
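
A minimal sketch of this greedy decoding procedure is shown below; it assumes
separate Keras inference models \texttt{encoder\_model} and
\texttt{decoder\_model} (the usual way to run a trained encoder-decoder step by
step) and uses illustrative variable names rather than our exact implementation.
\begin{verbatim}
import numpy as np

def decode_death_cause(line_input, encoder_model, decoder_model,
                       start_id, end_id, id2word, max_len=30):
    # Obtain the semantic representation (final LSTM states) of the line.
    state_h, state_c = encoder_model.predict(line_input)

    token_id, words = start_id, []               # start with the \s tag
    for _ in range(max_len):
        # Predict the next word from the previous token and current states.
        probs, state_h, state_c = decoder_model.predict(
            [np.array([[token_id]]), state_h, state_c])
        token_id = int(np.argmax(probs[0, -1]))
        if token_id == end_id:                   # stop at the \e tag
            break
        words.append(id2word[token_id])
    return " ".join(words)
\end{verbatim}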
\subsection{ICD-10 Classification Model}
The second step in our pipeline is to assign an ICD-10 code to the obtained
death cause description. For this purpose we employ a bidirectional LSTM model
which is able to capture the past and future context for each token of a death
cause description. Just as in our encoder-decoder model, we encode each token
using the concatenation of the FastText embeddings of the word from all three
languages.
To enable our model to attend to different parts of the death cause description
we add an extra attention layer \cite{raffel_feed-forward_2015} to the model.
Through the attention mechanism our model learns a fixed-size embedding of the
death cause description by computing an adaptive weighted average of the state
sequence of the LSTM model. This allows the model to better integrate
information over time. Figure \ref{fig:classification-model} presents the
architecture of our ICD-10 classification model.
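
More precisely, the feed-forward attention of \cite{raffel_feed-forward_2015}
computes a scalar score $e_t$ for every LSTM state $h_t$, normalizes the scores
over the sequence and uses them to form the weighted average $c$ that serves as
death cause embedding:
\[
e_t = a(h_t), \qquad
\alpha_t = \frac{\exp(e_t)}{\sum_{k=1}^{T}\exp(e_k)}, \qquad
c = \sum_{t=1}^{T} \alpha_t h_t ,
\]
where $a$ is a small learnable feed-forward function, for instance
$a(h_t) = \tanh(w^{\top} h_t + b)$.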
\begin{figure}
\centering
\includegraphics[width=\textwidth,trim={0cm 16.5cm 0cm 3cm},clip=true]{classification-model.pdf}
\caption{Illustration of the neural ICD-10 classification model. The model
utilizes a bi-directional LSTM layer, which processes the death cause
description from left to right and vice versa. The attention layer summarizes
the whole description by computing an adaptive weighted average over the LSTM
states. The resulting death cause embedding is fed through a softmax layer to
obtain the final classification. Equivalent to our encoder-decoder model, all
input tokens are represented using the concatenation of the FastText embeddings
of all three languages.}
\label{fig:classification-model}
\end{figure}
We train the model using the provided ICD-10 dictionaries from all three
languages. During development we also experimented with character-level RNNs
for better ICD-10 classification, however we couldn't achieve any performance
improvements.
In this section we will present experiments and obtained results for the two
developed models, both individually as well as combined in a pipeline setting.
\subsection{Training Data and Experiment Setup}
The CLEF eHealth 2018 Task 1 participants were provided with annotated death
certificates for the three selected languages: French, Italian and Hungarian.
Each of the languages is supported by several data sources. The provided data
sets are imbalanced concerning the different languages: the Italian corpus
consists of 49,823, the French corpus of 77,348\footnote{For French we only
used the provided data from 2014.} and the Hungarian corpus of 323,175
certificate lines.
The training data used in this approach was created by combining the data
sources of all three languages. Apart from the provided certificate data we
used no further, external data sources. Each dataset was split into a training
and a hold-out evaluation set. We did not perform cross-validation during
development; however, we shuffled the training and validation data before each
training epoch. Moreover, no hyperparameter optimization was performed due to
time constraints during the development phase. Instead, we used the default
parameter values of the individual layers.
We used pre-trained
fastText\footnote{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}
word embeddings \cite{bojanowski_enriching_2016}. The embeddings were trained
on Common Crawl and a Wikipedia dump using CBOW with position-weights, an
embedding dimension of 300, character n-grams of length 5, a window of size 5
and 10 negative samples. Unfortunately, they are trained on corpora not related
to the biomedical domain and therefore do not represent the best possible
embedding space for biomedical information extraction. The final embedding
space used by our models is created by concatenating the individual embedding
vectors of all three languages. Thus, the input to our models is an embedding
vector of size 900. All models were implemented with the Keras
library\footnote{https://keras.io/}.
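
As an illustration, the embedding lookup can be sketched with the official
fastText Python bindings as follows; the file names refer to the published
Common Crawl/Wikipedia vectors, and the actual preprocessing in our
implementation may differ in detail.
\begin{verbatim}
import numpy as np
import fasttext

# Pre-trained 300-dimensional vectors for the three task languages.
models = [fasttext.load_model(path) for path in
          ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]

def embed_token(token):
    # Concatenate the French, Italian and Hungarian fastText vectors of the
    # token into a single 900-dimensional input representation.
    return np.concatenate([m.get_word_vector(token) for m in models])

print(embed_token("pneumonie").shape)  # -> (900,)
\end{verbatim}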
\subsection{Death Cause Extraction Model}
To identify possible tokens as candidates for a death cause description, we
focused on the use of an encoder-decoder model. The encoder uses an embedding
layer with input masking on zero values and an LSTM layer with 256 units. The
encoder's output is used as the initial state of the decoder.
The decoder generates, based on the input description from the dictionary and a
special start token, a death cause description word by word. This decoding
process continues until a special end token is generated. The entire model is
optimized using the Adam optimizer with a batch size of 700. Model training was
performed for at most 100 epochs or until an early stopping criterion was met
(no change in validation loss for two epochs).
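
A minimal Keras sketch of such a training model is given below. Vocabulary
sizes and the 900-dimensional embedding matrices are placeholders; the sketch
mirrors the configuration described above but is not our exact implementation.
\begin{verbatim}
import numpy as np
from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model
from keras.callbacks import EarlyStopping

# Placeholder sizes; the real vocabularies and fastText matrices are
# built from the training data and the pre-trained embeddings.
src_vocab, tgt_vocab, emb_dim = 5000, 3000, 900
src_matrix = np.zeros((src_vocab, emb_dim))
tgt_matrix = np.zeros((tgt_vocab, emb_dim))

# Encoder: masked embedding lookup followed by a 256-unit LSTM.
enc_in = Input(shape=(None,))
enc_emb = Embedding(src_vocab, emb_dim, weights=[src_matrix],
                    mask_zero=True, trainable=False)(enc_in)
_, state_h, state_c = LSTM(256, return_state=True)(enc_emb)

# Decoder: initialised with the encoder states, predicts the next word.
dec_in = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, emb_dim, weights=[tgt_matrix],
                    mask_zero=True, trainable=False)(dec_in)
dec_seq = LSTM(256, return_sequences=True)(
    dec_emb, initial_state=[state_h, state_c])
probs = Dense(tgt_vocab, activation="softmax")(dec_seq)

model = Model([enc_in, dec_in], probs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
early_stop = EarlyStopping(monitor="val_loss", patience=2)
# model.fit([enc_x, dec_x], dec_y_one_hot, batch_size=700, epochs=100,
#           validation_split=0.25, callbacks=[early_stop])
\end{verbatim}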
As the available datasets are imbalanced concerning the different languages,
we devised two approaches: (1) DCEM-Balanced, where each language was supported
by 49,823 randomly drawn data points (the size of the smallest corpus), and (2)
DCEM-Full, where all available data is used. The results, obtained on the
validation set, are shown in Table \ref{tab:s2s}.
\begin{table}[]
\centering
\begin{tabularx}{0.9\textwidth}{p{3cm}|c|c|c|c|c}
\toprule
\multirow{2}{*}{\textbf{Setting}} & \multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\
\cline{3-6}
&&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\
\hline
DCEM-Balanced & 18 & 0.958 & 0.205 & 0.899 & 0.634 \\
DCEM-Full & 9 &0.709 & 0.098 & 0.678 & 0.330 \\
\bottomrule
\end{tabularx}
\caption{Experiment results of our death cause extraction sequence-to-sequence
model concerning balanced (equal number of training data per language) and full
data set setting.}
\label{tab:s2s}
\end{table}
\subsection{ICD-10 Classification Model}
The classification model is responsible for assigning an ICD-10 code to the
death cause description obtained during the first step. Our model uses an
embedding layer with input masking on zero values, followed by a bidirectional
LSTM layer with 256 hidden units. Thereafter, an attention layer computes an
adaptive weighted average over all LSTM states. The ICD-10 code is then
determined by a dense layer with a softmax activation function.
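
The following Keras sketch illustrates this architecture with placeholder
vocabulary and label sizes; input masking is omitted for brevity and the
attention layer is written out explicitly as the weighted average described
above.
\begin{verbatim}
import numpy as np
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Lambda
from keras.models import Model
import keras.backend as K

# Placeholder sizes; the real vocabulary, 900-dimensional fastText matrix
# and ICD-10 label set are built from the provided dictionaries.
vocab, emb_dim, num_codes = 5000, 900, 2000
emb_matrix = np.zeros((vocab, emb_dim))

def attention_pool(tensors):
    # Feed-forward attention: normalise the scalar scores over the time
    # axis and build the weighted average of the BiLSTM states.
    states, scores = tensors
    weights = K.exp(scores)
    weights = weights / K.sum(weights, axis=1, keepdims=True)
    return K.sum(states * weights, axis=1)

tok_in = Input(shape=(None,))
emb = Embedding(vocab, emb_dim, weights=[emb_matrix],
                trainable=False)(tok_in)
states = Bidirectional(LSTM(256, return_sequences=True))(emb)
scores = Dense(1, activation="tanh")(states)       # one score per time step
pooled = Lambda(attention_pool)([states, scores])  # death cause embedding
probs = Dense(num_codes, activation="softmax")(pooled)

model = Model(tok_in, probs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
\end{verbatim}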
We use the Adam optimizer to perform model training. The model was validated
on 25\% of the data. As for the extraction model, no cross-validation or
hyperparameter optimization was performed due to time constraints during
development. Motivated by the lack of adequate training data in terms of
coverage for the individual ICD-10 codes, we once again defined two datasets:
(1) a minimal dataset, where only ICD-10 codes with two or more supporting data
points are used. This, of course, reduces the number of ICD-10 codes in the
label space. Therefore, we also defined (2) an extended dataset, where the
original ICD-10 code mappings found in the supplied dictionaries are extended
with the data from the individual languages' Causes Calcules. Finally, for the
remaining ICD-10 codes with a support of one we duplicate those data points.
The goal of this approach is to extend the possible label space to all
available ICD-10 codes. The results obtained with the two approaches are shown
in Table \ref{tab:icd10Classification}.
\begin{table}[]
\centering
\begin{tabularx}{\textwidth}{p{2.25cm}|p{1.75cm}|c|c|c|c|c}
\toprule
\multirow{2}{*}{\textbf{Tokenization}}&\multirow{2}{*}{\textbf{Model}}&\multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\
\cline{4-7}
&&&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\
\hline
Word & Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\
Word & Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\
Character & Minimal & 91 & 0.732 & 1.186 & 0.516 & 2.505 \\
\bottomrule
\end{tabularx}
\caption{Experiment results for our ICD-10 classification model regarding different settings.}
\label{tab:icd10Classification}
\end{table}
\subsection{Complete Pipeline}
The two models were combined to create the final pipeline. We tested both
death cause extraction models in the final pipeline, as their performance
differs greatly. As both ICD-10 classification models perform similarly, we
used the word-level, extended ICD-10 classification model in the final
pipeline. The results obtained during training are presented in Table
\ref{tab:final_train}. Results obtained on the evaluation dataset are shown in
Table \ref{tab:final_test}.
\begin{table}[]
\centering
\begin{tabular}{l|c|c|c}
\toprule
\textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F-score} \\
\hline
DCEM-Balanced + ICD-10 extended & 0.73 & 0.61 & 0.61 \\
DCEM-Full + ICD-10 extended & 0.74 & 0.62 & 0.63 \\
\bottomrule
\end{tabular}
\caption{Results of the complete pipeline on the held-out validation data.}
\label{tab:final_train}
\end{table}
\begin{table}[]
\centering
\begin{tabularx}{0.8\textwidth}{p{2cm}|p{3cm}|c|c|c}
\toprule
\textbf{Language} & \textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F-score}\\
\hline
\multirow{5}{*}{French}
& DCEM-Balanced & 0.494 & 0.246 & 0.329 \\
& DCEM-Full & 0.512 & 0.253 & 0.339 \\
\cline{2-5}
& Baseline & 0.341 & 0.200 & 0.253 \\
& Average & 0.723 & 0.410 & 0.507 \\
& Median & 0.798 & 0.475 & 0.579 \\
\hline
\multirow{5}{*}{Hungarian}
& DCEM-Balanced & 0.518 & 0.384 & 0.441 \\
& DCEM-Full & 0.522 & 0.388 & 0.445 \\
\cline{2-5}
& Baseline & 0.243 & 0.174 & 0.202 \\
& Average & 0.827 & 0.783 & 0.803 \\
& Median & 0.922 & 0.897 & 0.910 \\
\hline
\multirow{5}{*}{Italian}
& DCEM-Balanced & 0.857 & 0.685 & 0.761 \\
& DCEM-Full & 0.862 & 0.689 & 0.766 \\
\cline{2-5}
& Baseline & 0.165 & 0.172 & 0.169 \\
& Average & 0.844 & 0.760 & 0.799 \\
& Median & 0.900 & 0.824 & 0.863 \\
\bottomrule
\end{tabularx}
\caption{Results of the complete pipeline on the evaluation dataset for the three languages, compared to the baseline, average and median results.}
\label{tab:final_test}
\end{table}
In this paper we tackled the problem of information extraction of death causes
in a multilingual environment. The proposed solution focuses on
language-independent models and relies on word embeddings for each of the
languages.
The proposed pipeline is divided into two steps: (1) first, tokens describing the death cause are generated using a sequence-to-sequence model; then, (2) the generated token sequence is normalized to an ICD-10 code.
We detected several issues with the proposed pipeline, which also point to prospective future work.
Creating a unifying embedding space would yield a truly language-independent approach.
Additionally, it has been shown that in-domain embeddings improve the quality of the achieved results. This will be the main focus of our future work.
The normalization step also suffered from a lack of adequate training data.
Unfortunately, we were unable to obtain ICD-10 dictionaries for all languages
and can, therefore, not guarantee the completeness of the ICD-10 label space.
Another downside of the proposed pipeline is the lack of support for multi-label classification.
@article{bojanowski_enriching_2016,
title = {Enriching {Word} {Vectors} with {Subword} {Information}},
url = {http://arxiv.org/abs/1607.04606},
abstract = {Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character \$n\$-grams. A vector representation is associated to each character \$n\$-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.},
urldate = {2018-03-12},
journal = {arXiv:1607.04606 [cs]},
author = {Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
month = jul,
year = {2016},
note = {arXiv: 1607.04606},
keywords = {Read, Embeddings, Word Embeddings, FastText},
file = {arXiv\:1607.04606 PDF:/Users/mario/Zotero/storage/9WC5C7M6/Bojanowski et al. - 2016 - Enriching Word Vectors with Subword Information.pdf:application/pdf;arXiv.org Snapshot:/Users/mario/Zotero/storage/YPS6YZHR/1607.html:text/html}
}
@inproceedings{neveol_clef_2017,
title = {{CLEF} {eHealth} 2017 {Multilingual} {Information} {Extraction} task overview: {ICD}10 coding of death certificates in {English} and {French}},
shorttitle = {{CLEF} {eHealth} 2017 {Multilingual} {Information} {Extraction} task overview},
publisher = {CEUR-WS},
author = {Miftakhutdinov, Zulfat and Tutubalina, Elena},
year = {2017},
keywords = {Read, CLEF, ICD-10-Classification},
file = {Fulltext:/Users/mario/Zotero/storage/HRZ6Q8Q6/Miftakhutdinov und Tutubalina - 2017 - Kfu at clef ehealth 2017 task 1 Icd-10 coding of .pdf:application/pdf;Snapshot:/Users/mario/Zotero/storage/J8TXTUNT/Miftakhutdinov und Tutubalina - 2017 - Kfu at clef ehealth 2017 task 1 Icd-10 coding of .pdf:application/pdf}
}
author = {Ebersbach, Mike and Herms, Robert and Eibl, Maximilian},
year = {2017},
file = {Fulltext:/Users/mario/Zotero/storage/LKIZA2P4/Ebersbach et al. - 2017 - Fusion Methods for ICD10 Code Classification of De.pdf:application/pdf;Snapshot:/Users/mario/Zotero/storage/CIX48RIC/Ebersbach et al. - 2017 - Fusion Methods for ICD10 Code Classification of De.pdf:application/pdf}
}
@inproceedings{xu_show_2015,
title = {Show, attend and tell: {Neural} image caption generation with visual attention},
shorttitle = {Show, attend and tell},
booktitle = {International {Conference} on {Machine} {Learning}},
author = {Xu, Kelvin and Ba, Jimmy and Kiros, Ryan and Cho, Kyunghyun and Courville, Aaron and Salakhudinov, Ruslan and Zemel, Rich and Bengio, Yoshua},
year = {2015},
pages = {2048--2057},
file = {Fulltext:/Users/mario/Zotero/storage/QASCM4G3/Xu et al. - 2015 - Show, attend and tell Neural image caption genera.pdf:application/pdf;Snapshot:/Users/mario/Zotero/storage/VILIPKYC/Xu et al. - 2015 - Show, attend and tell Neural image caption genera.pdf:application/pdf}
}
@inproceedings{chan_listen_2016,
title = {Listen, attend and spell: {A} neural network for large vocabulary conversational speech recognition},
shorttitle = {Listen, attend and spell},
booktitle = {Acoustics, {Speech} and {Signal} {Processing} ({ICASSP}), 2016 {IEEE} {International} {Conference} on},
publisher = {IEEE},
author = {Chan, William and Jaitly, Navdeep and Le, Quoc and Vinyals, Oriol},
year = {2016},
pages = {4960--4964},
file = {Fulltext:/Users/mario/Zotero/storage/ZV5B2GQJ/Chan et al. - 2016 - Listen, attend and spell A neural network for lar.pdf:application/pdf;Snapshot:/Users/mario/Zotero/storage/RS8MBCM8/7472621.html:text/html}
}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{color}
\usepackage{multirow,tabularx}
\usepackage{booktabs}
% Used for displaying a sample figure. If possible, figure files should
% be included in EPS format.