RNNs have been applied successfully to a variety of NLP tasks, such as dependency parsing \cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit in one time step depends on the internal state of the unit in the previous time step.
These feedback connections enable the network to memorize information from recent time steps and to capture long-term dependencies.
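The recurrence can be illustrated by the common vanilla RNN update (a generic textbook formulation, not a model specific to this work), where $x_t$ denotes the input, $h_t$ the hidden state at time step $t$, and $W_x$, $W_h$ and $b$ are learned parameters:
\begin{equation}
h_t = \tanh(W_x x_t + W_h h_{t-1} + b)
\end{equation}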
However, training of RNNs can be difficult due to the vanishing gradient problem \cite{hochreiter_gradient_2001,bengio_learning_1994}.
The most widespread modifications of RNNs to overcome this problem are Long Short-Term Memory networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU) \cite{cho_learning_2014}.
Both modifications use gated memories which control and regulate the information flow between two recurrent units.
A common LSTM unit consists of a cell and three gates: an input gate, an output gate and a forget gate.
In general, LSTMs are chained together by connecting the outputs of the previous unit to the inputs of the next one.
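A standard formulation of these gates (again a generic formulation, not tied to a particular implementation) uses the logistic sigmoid $\sigma$ and element-wise multiplication $\odot$:
\begin{align}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \nonumber \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \nonumber \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \nonumber \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \nonumber \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \nonumber \\
h_t &= o_t \odot \tanh(c_t)
\end{align}
The forget gate $f_t$ determines how much of the previous cell state $c_{t-1}$ is kept, the input gate $i_t$ how much of the candidate state $\tilde{c}_t$ is written to the cell, and the output gate $o_t$ which parts of the cell state are exposed as the hidden state $h_t$.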
Bidirectional networks are a further extension of the general RNN architecture; they make both the past and the future context available in every time step.
A bidirectional LSTM model consists of a forward chain, which processes the input data from left to right, and a backward chain, consuming the data in the opposite direction.
The final representation is typically the concatenation or a linear combination of both states.
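The following minimal sketch shows how such a bidirectional LSTM encoder can be built with the Keras API; the layer sizes and variable names are illustrative and not taken from our implementation.
\begin{verbatim}
# Minimal bidirectional LSTM encoder sketch (illustrative sizes).
from tensorflow.keras import layers, models

VOCAB_SIZE = 10000   # hypothetical vocabulary size
EMB_DIM = 100        # hypothetical embedding dimension
HIDDEN = 128         # hypothetical LSTM state size

inputs = layers.Input(shape=(None,), dtype="int32")
embedded = layers.Embedding(VOCAB_SIZE, EMB_DIM)(inputs)
# Forward and backward chains; their final states are concatenated.
encoded = layers.Bidirectional(layers.LSTM(HIDDEN),
                               merge_mode="concat")(embedded)
model = models.Model(inputs, encoded)  # output dimension: 2 * HIDDEN
\end{verbatim}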
\subsection{Word Embeddings}
Distributional semantic models have been researched for decades in the area of natural language processing (NLP) \cite{}.
These models aim to represent words as real-valued vectors (also called embeddings), learned from huge amounts of unlabeled text, which capture syntactic and semantic similarities between words.
Since the publication of the work of Collobert et al. \cite{} in 2008, word embeddings have become one of the hot topics in NLP and a plethora of approaches have been proposed \cite{}.
The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{}.
The most popular embedding model is Word2Vec, introduced by Mikolov et al. \cite{}.
They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram: SkipGram is trained to reconstruct the context words given a center word, whereas CBOW predicts the center word from its context.
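As an illustration, the SkipGram variant maximizes the average log-probability of the context words within a window of size $c$ around each center word of a corpus $w_1, \dots, w_T$:
\begin{equation}
\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \neq 0} \log p(w_{t+j} \mid w_t)
\end{equation}
CBOW reverses the roles and predicts the center word $w_t$ from a combination of its surrounding context words.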
In contrast, Pennington et al. \cite{} learn vector representations from the ratios of the co-occurrence probabilities of two words with a third probe word.
The most recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words.
For example, Pinter et al. \cite{} try to reconstruct pre-trained word embeddings by training a bi-directional LSTM model on the character level.
Similarly, Bojanowski et al. \cite{bojanowski_enriching_2016} adapt the SkipGram model by taking character n-grams into account.
They assign a vector representation to each character n-gram and represent a word as the sum of the representations of its n-grams.
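The following minimal sketch, using the FastText implementation of gensim (version 4.x) with illustrative data and hyperparameters, shows how the subword mechanism also yields vectors for out-of-vocabulary words.
\begin{verbatim}
# Sketch: training subword-aware embeddings and querying an OOV word.
from gensim.models import FastText

# Toy corpus for illustration; in practice a large unlabeled
# text collection is used.
sentences = [["cardiac", "arrest"], ["myocardial", "infarction"]]

model = FastText(sentences, vector_size=100, window=5,
                 min_count=1, min_n=3, max_n=6)  # n-gram lengths 3-6

# OOV word: its vector is composed from its character n-grams.
vec = model.wv["cardiomyopathy"]
\end{verbatim}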
\subsection{ICD-10 Classification}
The ICD-10 coding task has already been carried out in the 2016 \cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} editions of the eHealth lab.
Participating teams used a plethora of different approaches to tackle the classification problem.
The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
The former rely on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge bases according to a rule framework.
For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score.
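The following schematic sketch illustrates this kind of dictionary scoring; it is a simplified re-implementation with hypothetical data structures, not the original system.
\begin{verbatim}
# Schematic dictionary-based ICD-10 scoring: sum the (binary or
# tf-idf) weights of the line's terms for every dictionary entry
# and assign the code of the best-scoring entry.
from typing import Dict, List

def classify_line(line_terms: List[str],
                  dictionary: Dict[str, Dict[str, float]]) -> str:
    # dictionary maps an ICD-10 code to the term weights of its
    # dictionary description.
    best_code, best_score = None, float("-inf")
    for code, term_weights in dictionary.items():
        score = sum(term_weights.get(t, 0.0) for t in line_terms)
        if score > best_score:
            best_code, best_score = code, score
    return best_code
\end{verbatim}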
In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as an information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}.
The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features.
Most similar to our approach is the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in last year's competition.
They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation.
Furthermore, a vector which captures the textual similarity between the certificate line and the death cause or diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model.
The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
We utilize two separate recurrent neural networks, one sequence-to-sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text regardless of the language it originates from.
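The division of labour between the two models can be summarized by the following sketch; the function and model names are illustrative placeholders, not our actual implementation.
\begin{verbatim}
# Illustrative two-step pipeline: extraction followed by
# classification (both models are placeholders for the trained
# networks described in the following sections).
def predict_icd10(certificate_line, extraction_model, classifier):
    # Step 1: the sequence-to-sequence model generates the death
    # cause description text from the certificate line.
    death_cause_text = extraction_model.decode(certificate_line)
    # Step 2: a separate recurrent classifier maps the generated
    # description text to an ICD-10 code.
    return classifier.predict(death_cause_text)
\end{verbatim}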
The goal of the model is to reassemble the dictionary death cause description text.
For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the architecture of the model.
As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right.
Each token is represented using pre-trained fastText\footnote{\url{https://github.com/facebookresearch/fastText/}} word embeddings \cite{bojanowski_enriching_2016}.
fastText embeddings take sub-word information into account during training whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words.
We utilize fastText embeddings for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
Independent of the language a word originates from, we look up the word in all three embedding models and concatenate the obtained vectors.
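The language-independent token representation can be sketched as follows; this is a minimal illustration assuming the three pre-trained models have been downloaded and loaded with gensim, with illustrative file names.
\begin{verbatim}
# Sketch: build a language-independent token representation by
# concatenating the French, Italian and Hungarian fastText vectors.
import numpy as np
from gensim.models.fasttext import load_facebook_vectors

# Illustrative paths to the pre-trained Common Crawl/Wikipedia models.
embeddings = [load_facebook_vectors(path) for path in
              ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]

def embed_token(token: str) -> np.ndarray:
    # Each model yields a vector even for OOV tokens (via subword
    # n-grams), so every token is represented in all three spaces.
    return np.concatenate([emb[token] for emb in embeddings])
\end{verbatim}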