diff --git a/paper/20_related_work.tex b/paper/20_related_work.tex
index d63f3ab2f49447e316c263c48e91c4b1df7b3ee9..3c2aa48a85505f80df4f83ce4007a6f897b56fe1 100644
--- a/paper/20_related_work.tex
+++ b/paper/20_related_work.tex
@@ -1,39 +1,69 @@
-The ICD-10 coding task has already been carried out in the 2016
-\cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} edition of the
-eHealth lab. Participating teams used a plethora of different approaches to
-tackle the classification problem. The methods can essentially be divided into
-two categories: knowledge-based
-\cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and
-machine learning (ML) approaches
-\cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
-The former relies on lexical sources, medical terminologies and other ontologies
-to match (parts of) the certificate text with entries from the knowledge-bases
-according to a rule framework. For example, Di Nunzio et al.
-\cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry
-by summing the binary or tf-idf weights of each term of a certificate line
-segment and assign the ICD-10 code with the highest score. In contrast, Ho-Dac
-et al. \cite{ho-dac_litl_2017} treat the problem as information retrieval task
-and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}.
-
-The ML-based approaches employ a variety of techniques, e.g.
-Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent
-Dirichlet Analysis (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector
-Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features.
-
-Most similar to our approach is the work from Miftahutdinov and Tutbalina
-\cite{miftakhutdinov_kfu_2017}, which achieved the best results for English
-certificates in the last year's competition. They use a neural LSTM-based
-encoder-decoder model that processes the raw certificate text as input and
-encodes it into a vector representation. Furthermore a vector which captures the
-textual similarity between the certificate line and the death causes resp.
-diagnosis texts of the individual ICD-10 codes is used to integrate prior
-knowledge into the model. The concatenation of both vector representations is
-then used to output the characters and numbers of the ICD-10 code in the
-decoding step. In contrast to their work, our approach introduces a model for
-multi-language ICD-10 classification. We utilize two separate recurrent neural
-networks, one sequence to sequence model for death cause extraction and one for
-classification, to predict the ICD-10 codes for a certificate text independent
-from which language they originate.
+This section highlights previous work related to our approach.
+We give a brief introduction to the methodological foundations of our work: recurrent neural networks and word embeddings.
+The section concludes with a summary of ICD-10 classification approaches used in previous eHealth Lab competitions.
+
+\subsection{Recurrent neural networks}
+Recurrent neural networks (RNNs) are a widely used technique for sequence
+learning problems such as machine translation
+\cite{bahdanau_neural_2014,cho_learning_2014}, image captioning
+\cite{bengio_scheduled_2015}, named entity recognition
+\cite{lample_neural_2016,wei_disease_2016}, dependency parsing
+\cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
+RNNs model dynamic temporal behaviour in sequential data through recurrent
+units, i.e. the hidden, internal state of a unit in one time step depends on the
+internal state of the unit in the previous time step. These feedback connections
+enable the network to memorize information from recent time steps and capture
+long-term dependencies.
+
+However, training of RNNs can be difficult due to the vanishing gradient problem
+\cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread
+modifications of RNNs to overcome this problem are Long Short-Term Memory
+networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU)
+\cite{cho_learning_2014}. Both modifications use gated memories which control
+and regulate the information flow between two recurrent units. A common LSTM
+unit consists of a cell and three gates: an input gate, an output gate and a
+forget gate. In general, LSTMs are chained together by connecting the outputs of
+the previous unit to the inputs of the next one.
+
+A further extension of the general RNN architecture is the bidirectional network,
+which makes the past and future context available in every time step. A
+bidirectional LSTM model consists of a forward chain, which processes the input
+data from left to right, and a backward chain, consuming the data in the
+opposite direction. The final representation is typically the concatenation or a
+linear combination of both states.
+
+\subsection{Word Embeddings}
+Distributional semantic models have been researched for decades in the area of natural language processing (NLP) \cite{}.
+These models aim to represent words as real-valued vectors (also called embeddings), learned from large amounts of unlabeled text, which capture syntactic and semantic similarities between words.
+Since the work of Collobert et al. \cite{} in 2008, word embeddings have been one of the hot topics in NLP, and a plethora of approaches have been proposed \cite{}.
+
+The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{}.
+The most popular embedding model is the Word2Vec model introduced by Mikolov et al. \cite{}.
+They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram: the former is trained to predict a center word from its surrounding context, the latter to predict the context words given the center word.
+In contrast, Pennington et al. \cite{} learn vector representations from ratios of word-word co-occurrence probabilities.
+
+The most recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words.
+For example, Pinter et al. \cite{} try to reconstruct pre-trained word embeddings by training a bidirectional LSTM model on the character level.
+Similarly, Bojanowski et al. \cite{bojanowski_enriching_2016} adapt the SkipGram model by taking character n-grams into account.
+They assign a vector representation to each character n-gram and represent a word as the sum of the representations of its n-grams.
+
+
+\subsection{ICD-10 Classification}
+The ICD-10 coding task has already been carried out in the 2016 \cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} editions of the eHealth lab.
+Participating teams used a plethora of different approaches to tackle the classification problem.
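Two of the notions summarized in the added related-work text above, the bidirectional state concatenation and the fastText subword representation, can be written compactly. The following LaTeX sketch is illustrative only; the symbols ($\overrightarrow{h_t}$, $\overleftarrow{h_t}$, $\mathcal{G}_w$, $z_g$) are assumptions and are not defined in the paper.

% Illustrative sketch; the notation below is assumed, not taken from the paper.
% h_t combines the forward and backward LSTM states at time step t.
\begin{equation*}
  h_t = \left[\,\overrightarrow{h_t}\,;\,\overleftarrow{h_t}\,\right]
\end{equation*}
% A fastText word vector is the sum of the vectors z_g of the character
% n-grams g in G_w, the n-gram set of word w (including w itself).
\begin{equation*}
  v_w = \sum_{g \in \mathcal{G}_w} z_g
\end{equation*}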
+The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
+The former relies on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge bases according to a rule framework.
+For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score.
+In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as an information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}.
+
+The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features.
+
+Most similar to our approach is the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in last year's competition.
+They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation.
+Furthermore, a vector which captures the textual similarity between the certificate line and the death cause or diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model.
+The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
+In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
+We utilize two separate recurrent neural networks, one sequence-to-sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text independently of the language it originates from.
diff --git a/paper/30_methods_intro.tex b/paper/30_methods_intro.tex
index 1f4649c32602151a718891b28e6094820af6d640..44b90df52141eaffc368fe2f516375c73151ed7d 100644
--- a/paper/30_methods_intro.tex
+++ b/paper/30_methods_intro.tex
@@ -2,37 +2,5 @@ Our approach models the extraction and classification of death causes as
 two-step process. First, we employ a neural, multi-language sequence-to-sequence
 model to receive a death cause description for a given death certificate line.
 We will then use a second classification model to assign the respective ICD-10
-codes to the obtained death cause. The remainder of this section gives a short
-introduction to recurrent neural networks, followed by a detailed explanation of
-our two models.
-
-\subsection{Recurrent neural networks}
-Recurrent neural networks (RNNs) are a widely used technique for sequence
-learning problems such as machine translation
-\cite{bahdanau_neural_2014,cho_learning_2014}, image captioning
-\cite{bengio_scheduled_2015}, named entity recognition
-\cite{lample_neural_2016,wei_disease_2016}, dependency parsing
-\cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
-RNNs model dynamic temporal behaviour in sequential data through recurrent
-units, i.e. the hidden, internal state of a unit in one time step depends on the
-internal state of the unit in the previous time step. These feedback connections
-enable the network to memorize information from recent time steps and capture
-long-term dependencies.
-
-However, training of RNNs can be difficult due to the vanishing gradient problem
-\cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread
-modifications of RNNs to overcome this problem are Long Short-Term Memory
-networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU)
-\cite{cho_learning_2014}. Both modifications use gated memories which control
-and regulate the information flow between two recurrent units. A common LSTM
-unit consists of a cell and three gates, an input gate, an output gate and a
-forget gate. In general, LSTMs are chained together by connecting the outputs of
-the previous unit to the inputs of the next one.
-A further extension of the general RNN architecture are bidirectional networks,
-which make the past and future context available in every time step. A
-bidirectional LSTM model consists of a forward chain, which processes the input
-data from left to right, and and backward chain, consuming the data in the
-opposite direction. The final representation is typically the concatenation or a
-linear combination of both states.
-
-AREN'T WE MOVING THIS TO RELATED WORK?
\ No newline at end of file
+codes to the obtained death cause. The remainder of this section gives a
+detailed explanation of our two models.
\ No newline at end of file
diff --git a/paper/31_methods_seq2seq.tex b/paper/31_methods_seq2seq.tex
index 42cfde18daeedbffee0add5989b996cbafda651f..399b147e2bd49dfef2ed89047c7d6ce689523b3e 100644
--- a/paper/31_methods_seq2seq.tex
+++ b/paper/31_methods_seq2seq.tex
@@ -6,8 +6,7 @@ The goal of the model is to reassemble the dictionary death cause description te
 For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}.
 Figure \ref{fig:encoder_decoder} illustrates the architecture of the model.
 As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right.
-Each token is represented using pre-trained fastText\footnote{https://github.com/facebookresearch/fastText/} word embeddings\cite{bojanowski_enriching_2016}.
-Word embedding models represent words using a real-valued vector and capture syntactic and semantic similarities between them.
+Each token is represented using pre-trained fastText\footnote{https://github.com/facebookresearch/fastText/} word embeddings \cite{bojanowski_enriching_2016}. fastText embeddings take sub-word information into account during training, whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words. We utilize fastText embeddings for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}. Independently of the language a word originates from, we look up the word in all three embedding models and concatenate the obtained vectors.
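The token representation described in the last added line above (looking a word up in the French, Italian and Hungarian fastText models and concatenating the three vectors) could be sketched as follows. This is a hedged illustration, not the authors' implementation: the use of the official fastText Python bindings and the model file names cc.fr.300.bin, cc.it.300.bin and cc.hu.300.bin are assumptions based on the linked download page.

import numpy as np
import fasttext  # official fastText Python bindings (assumed tooling)

# Assumed file names of the pre-trained Common Crawl + Wikipedia models
# linked in the footnote above; they must be downloaded beforehand.
MODELS = {
    "fr": fasttext.load_model("cc.fr.300.bin"),
    "it": fasttext.load_model("cc.it.300.bin"),
    "hu": fasttext.load_model("cc.hu.300.bin"),
}

def embed_token(token):
    # Look the token up in all three embedding models, independently of its
    # source language, and concatenate the vectors (3 x 300 = 900 dimensions).
    return np.concatenate([MODELS[lang].get_word_vector(token)
                           for lang in ("fr", "it", "hu")])

def embed_line(line):
    # Represent a certificate line as a sequence of concatenated token
    # vectors, e.g. as input to the forward LSTM encoder described above.
    return np.stack([embed_token(tok) for tok in line.split()])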