This section highlights previous work related to our approach.
We give a brief introduction to the methodological foundations of our work, recurrent neural networks (RNNs) and word embeddings.
The section concludes with a summary of ICD-10 classification approaches used in previous eHealth Lab competitions.
\subsection{Recurrent Neural Networks (RNNs)}
RNNs are a widely used technique for sequence learning problems such as machine translation
\cite{bahdanau_neural_2018,cho_learning_2014}, image captioning
\cite{bengio_scheduled_2015}, NER \cite{lample_neural_2016,wei_disease_2016}, dependency parsing
\cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
RNNs model dynamic temporal behaviour in sequential data through recurrent
units, i.e. the hidden, internal state of a unit in one time step depends on the
internal state of the unit in the previous time step. These feedback connections
enable the network to memorize information from recent time steps and, in principle, to capture
long-term dependencies.
However, training of RNNs can be difficult due to the vanishing gradient problem
\cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread
modifications of RNNs to overcome this problem are long short-term memory networks (LSTMs) \cite{hochreiter_long_1997} and gated recurrent units (GRUs)
\cite{cho_learning_2014}. Both modifications use gated memories which control
and regulate the information flow between two recurrent units. A common LSTM
unit consists of a cell and three gates: an input gate, an output gate and a
forget gate. In general, LSTM units are chained together by connecting the outputs of
the previous unit to the inputs of the next one.
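For reference, a widely used parameterization of the gate and state updates at time step $t$, given input $x_t$, previous hidden state $h_{t-1}$ and previous cell state $c_{t-1}$, is
\[
\begin{array}{ll}
i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i), & f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o), & \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, & h_t = o_t \odot \tanh(c_t),
\end{array}
\]
where $\sigma$ denotes the logistic sigmoid function and $\odot$ element-wise multiplication.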
A further extension of the general RNN architecture is the bidirectional network,
which makes both the past and the future context available at every time step. A
bidirectional LSTM model consists of a forward chain, which processes the input
data from left to right, and a backward chain, consuming the data in the
opposite direction. The final representation is typically the concatenation or a
linear combination of both states.
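Using forward and backward hidden states $\overrightarrow{h}_t$ and $\overleftarrow{h}_t$, the concatenation variant, for instance, yields $h_t = [\overrightarrow{h}_t ; \overleftarrow{h}_t]$ as the representation of time step $t$.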
\subsection{Word Embeddings}
Distributional semantic models (DSMs) have been researched for decades in NLP \cite{turney_frequency_2010}.
Based on large amounts of unlabeled text, DSMs aim to represent each word as a real-valued vector (also called an embedding) which captures syntactic and semantic similarities between words.
Since the publication of the work of Collobert et al. \cite{collobert_natural_2011} in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, has been one of the most active research areas in NLP, and a plethora of approaches have been proposed \cite{bojanowski_enriching_2017,mikolov_distributed_2013,peters_deep_2018,pennington_glove_2014}.
The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{peters_semi-supervised_2017,peters_deep_2018,pinter_mimicking_2017}.
The most popular embedding model is the Word2Vec model introduced by Mikolov et al. \cite{mikolov_distributed_2013,mikolov_efficient_2013}.
They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram, which are trained to predict a word given its surrounding context and to predict the context words given the center word, respectively.
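For instance, given a training corpus $w_1, \ldots, w_T$ and a context window of size $c$, the SkipGram model maximizes the average log probability
\[
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} \mid w_t),
\]
where $p(w_{t+j} \mid w_t)$ is modeled by a softmax over the inner products of the corresponding word vectors.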
In contrast, Pennington et al. \cite{pennington_glove_2014} exploit ratios of co-occurrence probabilities of two words with a third probe word to learn vector representations.
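Their GloVe model fits word vectors $w_i$ and context vectors $\tilde{w}_j$ to the logarithm of the co-occurrence counts $X_{ij}$ by minimizing the weighted least-squares objective
\[
J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2,
\]
where $V$ is the vocabulary size and $f$ a weighting function that limits the influence of rare and very frequent co-occurrences.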
In \cite{peters_deep_2018}, multi-layer, bi-directional LSTM models are utilized to learn contextualized word embeddings, i.e. representations that also capture the different contexts a word appears in.
Several recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words.
For example, Pinter et al. \cite{pinter_mimicking_2017} try to reconstruct a pre-trained word embedding by learning a bi-directional LSTM model on the character level.
Similarly, Bojanowski et al. \cite{bojanowski_enriching_2017} adapt the SkipGram model by taking character n-grams into account.
Their fastText model assigns a vector representation to each character n-gram and represents a word as the sum of the representations of its n-grams.
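Formally, the representation of a word $w$ is given by $v_w = \sum_{g \in \mathcal{G}_w} z_g$, where $\mathcal{G}_w$ denotes the set of character n-grams occurring in $w$ (including the word itself) and $z_g$ the vector assigned to n-gram $g$.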
In addition to embeddings that capture word similarities in one language, multi-/cross-lingual approaches have also been investigated.
Proposed methods either learn a linear mapping between monolingual representations \cite{faruqui_improving_2014,xing_normalized_2015} or utilize word- \cite{guo_cross-lingual_2015,vyas_sparse_2016}, sentence- \cite{pham_learning_2015} or document-aligned \cite{sogaard_inverted_2015} corpora to build a shared embedding space.
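In the simplest case, the mapping-based methods learn a matrix $W$ from a seed lexicon of translation pairs $(x_i, z_i)$ of source and target language embeddings by minimizing $\sum_i \| W x_i - z_i \|^2$, optionally under additional constraints such as orthogonality of $W$ \cite{xing_normalized_2015}.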
\subsection{ICD-10 Classification}
The ICD-10 coding task has already been carried out in the 2016 \cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} editions of the eHealth lab.
Participating teams used a plethora of different approaches to tackle the classification problem.
The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
The former rely on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge bases according to a rule framework.
For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score.
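The following sketch illustrates this scoring scheme (our simplified illustration, not the original implementation; all names are placeholders):
\begin{verbatim}
# Sketch of a dictionary-based scorer: sum the (binary or tf-idf)
# weights of the terms a certificate line shares with each ICD-10
# dictionary entry and return the best-scoring code.
def score(line_terms, entry_terms, weights):
    return sum(weights.get(t, 0.0) for t in line_terms
               if t in entry_terms)

def classify(line_terms, dictionary, weights):
    # dictionary maps an ICD-10 code to the terms of its entry
    return max(dictionary,
               key=lambda code: score(line_terms, dictionary[code],
                                      weights))
\end{verbatim}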
In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as an information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}} to classify the individual lines.
The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017}, with diverse hand-crafted features.
Most similar to our approach is the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in last year's competition.
They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation.
Additionally, a vector which captures the textual similarity between the certificate line and the death causes of the individual ICD-10 codes is used to integrate prior knowledge into the model.
The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
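To illustrate the general architecture (a rough sketch under our own assumptions, not their concrete implementation; all dimensions and names are placeholders), such a model could be assembled in Keras as follows:
\begin{verbatim}
from keras.layers import Input, Embedding, LSTM, Dense, Concatenate
from keras.models import Model

# Placeholder sizes: word vocabulary, number of ICD-10 codes,
# output symbol alphabet, embedding and hidden dimensions
vocab_size, n_codes, alphabet, emb_dim, hidden = 10000, 2000, 40, 128, 256

# Encode the raw certificate line into a vector representation
text_in = Input(shape=(None,))
_, h, c = LSTM(hidden, return_state=True)(
    Embedding(vocab_size, emb_dim)(text_in))

# Concatenate the encoded text with the ICD-10 similarity vector
prior_in = Input(shape=(n_codes,))
init_h = Dense(hidden)(Concatenate()([h, prior_in]))
init_c = Dense(hidden)(Concatenate()([c, prior_in]))

# Decode the ICD-10 code symbol by symbol
dec_in = Input(shape=(None,))
dec_seq = LSTM(hidden, return_sequences=True)(
    Embedding(alphabet, emb_dim)(dec_in),
    initial_state=[init_h, init_c])
output = Dense(alphabet, activation='softmax')(dec_seq)

model = Model([text_in, prior_in, dec_in], output)
\end{verbatim}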
In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
Moreover, we divide the task into two distinct steps: death cause extraction and ICD-10 classification.