RNNs have been applied successfully to a variety of NLP tasks, such as dependency parsing \cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit in one time step depends on the internal state of the unit in the previous time step.
These feedback connections enable the network to memorize information from recent time steps and to capture long-term dependencies.
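The recurrence can be illustrated by the common vanilla RNN update (a generic textbook formulation, not a model specific to this work), where $x_t$ denotes the input, $h_t$ the hidden state at time step $t$, and $W_x$, $W_h$ and $b$ are learned parameters:
\begin{equation}
h_t = \tanh(W_x x_t + W_h h_{t-1} + b)
\end{equation}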
However, training of RNNs can be difficult due to the vanishing gradient problem \cite{hochreiter_gradient_2001,bengio_learning_1994}.
The most widespread modifications of RNNs to overcome this problem are Long Short-Term Memory networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU) \cite{cho_learning_2014}.
Both modifications use gated memories which control and regulate the information flow between two recurrent units.
A common LSTM unit consists of a cell and three gates: an input gate, an output gate and a forget gate.
In general, LSTMs are chained together by connecting the outputs of the previous unit to the inputs of the next one.
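A standard formulation of these gates (again a generic formulation, not tied to a particular implementation) uses the logistic sigmoid $\sigma$ and element-wise multiplication $\odot$:
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \nonumber \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \nonumber \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \nonumber \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \nonumber \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \nonumber \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
The forget gate $f_t$ determines how much of the previous cell state $c_{t-1}$ is kept, the input gate $i_t$ how much of the candidate state $\tilde{c}_t$ is written to the cell, and the output gate $o_t$ which parts of the cell state are exposed as the hidden state $h_t$.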
Bidirectional networks are a further extension of the general RNN architecture; they make both the past and the future context available in every time step.
A bidirectional LSTM model consists of a forward chain, which processes the input data from left to right, and a backward chain, consuming the data in the opposite direction.
The final representation is typically the concatenation or a linear combination of both states.
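The following minimal sketch shows how such a bidirectional LSTM encoder can be built with the Keras API; the layer sizes and variable names are illustrative and not taken from our implementation.
\begin{verbatim}
# Minimal bidirectional LSTM encoder sketch (illustrative sizes).
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # hypothetical vocabulary size
EMB_DIM = 100        # hypothetical embedding dimension
HIDDEN = 128         # hypothetical LSTM state size

inputs = layers.Input(shape=(None,), dtype="int32")
embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
# Forward and backward chains; their final states are concatenated.
encoded = layers.Bidirectional(layers.LSTM(HIDDEN),
                               merge_mode="concat")(embedded)
model = models.Model(inputs, encoded)  # output dimension: 2 * HIDDEN
\end{verbatim}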
\subsection{Word Embeddings}
Distributional semantic models have been researched for decades in the area of natural language processing (NLP) \cite{}.
These models aim to represent words as real-valued vectors (also called embeddings), learned from huge amounts of unlabeled text, which capture syntactic and semantic similarities between words.
Since the publication of the work of Collobert et al. \cite{} in 2008, word embeddings have become one of the hot topics in NLP and a plethora of approaches have been proposed \cite{}.
The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{}.
The most popular embedding model is Word2Vec, introduced by Mikolov et al. \cite{}.
They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram: SkipGram is trained to reconstruct the context words given a center word, whereas CBOW predicts the center word from its context.
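As an illustration, the SkipGram variant maximizes the average log-probability of the context words within a window of size $c$ around each center word of a corpus $w_1, \dots, w_T$:
\begin{equation}
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t)
\end{equation}
CBOW reverses the roles and predicts the center word $w_t$ from a combination of its surrounding context words.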
In contrast, Pennington et al. \cite{} learn vector representations from the ratios of the co-occurrence probabilities of two words with a third probe word.
The most recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words.
For example, Pinter et al. \cite{} try to reconstruct pre-trained word embeddings by training a bi-directional LSTM model on the character level.
Similarly, Bojanowski et al. \cite{bojanowski_enriching_2016} adapt the SkipGram model by taking character n-grams into account.
They assign a vector representation to each character n-gram and represent a word as the sum of the representations of its n-grams.
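The following minimal sketch, using the FastText implementation of gensim (version 4.x) with illustrative data and hyperparameters, shows how the subword mechanism also yields vectors for out-of-vocabulary words.
\begin{verbatim}
# Sketch: training subword-aware embeddings and querying an OOV word.
from gensim.models import FastText

# Toy corpus for illustration; in practice a large unlabeled
# text collection is used.
sentences = [["cardiac", "arrest"], ["myocardial", "infarction"]]

model = FastText(sentences, vector_size=100, window=5,
                 min_count=1, min_n=3, max_n=6)  # n-gram lengths 3-6

# OOV word: its vector is composed from its character n-grams.
vec = model.wv["cardiomyopathy"]
\end{verbatim}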
\subsection{ICD-10 Classification}
The ICD-10 coding task has already been carried out in the 2016 \cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} editions of the eHealth lab.
Participating teams used a plethora of different approaches to tackle the classification problem.
The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
The former rely on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge bases according to a rule framework.
For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score.
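The following schematic sketch illustrates this kind of dictionary scoring; it is a simplified re-implementation with hypothetical data structures, not the original system.
\begin{verbatim}
# Schematic dictionary-based ICD-10 scoring: sum the (binary or
# tf-idf) weights of the line's terms for every dictionary entry
# and assign the code of the best-scoring entry.
from typing import Dict, List

def classify_line(line_terms: List[str],
                  dictionary: Dict[str, Dict[str, float]]) -> str:
    # dictionary maps an ICD-10 code to the term weights of its
    # dictionary description.
    best_code, best_score = None, float("-inf")
    for code, term_weights in dictionary.items():
        score = sum(term_weights.get(t, 0.0) for t in line_terms)
        if score > best_score:
            best_code, best_score = code, score
    return best_code
\end{verbatim}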
In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as an information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}.
The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features.
Most similar to our approach is the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in last year's competition.
They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation.
Furthermore, a vector which captures the textual similarity between the certificate line and the death cause or diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model.
The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
We utilize two separate recurrent neural networks, one sequence-to-sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text regardless of the language it originates from.
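The division of labour between the two models can be summarized by the following sketch; the function and model names are illustrative placeholders, not our actual implementation.
\begin{verbatim}
# Illustrative two-step pipeline: extraction followed by
# classification (both models are placeholders for the trained
# networks described in the following sections).
def predict_icd10(certificate_line, extraction_model, classifier):
    # Step 1: the sequence-to-sequence model generates the death
    # cause description text from the certificate line.
    death_cause_text = extraction_model.decode(certificate_line)
    # Step 2: a separate recurrent classifier maps the generated
    # description text to an ICD-10 code.
    return classifier.predict(death_cause_text)
\end{verbatim}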
The goal of the model is to reassemble the dictionary death cause description text.
For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the architecture of the model.
As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right.
Each token is represented using pre-trained fastText\footnote{\url{https://github.com/facebookresearch/fastText/}} word embeddings \cite{bojanowski_enriching_2016}.
fastText embeddings take sub-word information into account during training whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words.
We utilize fastText embeddings for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
Independent of the language a word originates from, we look up the word in all three embedding models and concatenate the obtained vectors.
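The language-independent token representation can be sketched as follows; this is a minimal illustration assuming the three pre-trained models have been downloaded and loaded with gensim, with illustrative file names.
\begin{verbatim}
# Sketch: build a language-independent token representation by
# concatenating the French, Italian and Hungarian fastText vectors.
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Illustrative paths to the pre-trained Common Crawl/Wikipedia models.
embeddings = [load_facebook_vectors(path) for path in
              ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]

def embed_token(token: str) -> np.ndarray:
    # Each model yields a vector even for OOV tokens (via subword
    # n-grams), so every token is represented in all three spaces.
    return np.concatenate([emb[token] for emb in embeddings])
\end{verbatim}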