Commit 8cee2614 authored by Mario Sänger

Add word embedding subsection

parent 9671067a
This section highlights previous work related to our approach.
We give a brief introduction to the methodological foundations of our work: recurrent neural networks and word embeddings.
The section concludes with a summary of ICD-10 classification approaches used in previous eHealth Lab competitions.

\subsection{Recurrent neural networks}
Recurrent neural networks (RNNs) are a widely used technique for sequence learning problems such as machine translation \cite{bahdanau_neural_2014,cho_learning_2014}, image captioning \cite{bengio_scheduled_2015}, named entity recognition \cite{lample_neural_2016,wei_disease_2016}, dependency parsing \cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit at one time step depends on the internal state of the unit at the previous time step.
These feedback connections enable the network to memorize information from recent time steps and capture long-term dependencies.

However, training of RNNs can be difficult due to the vanishing gradient problem \cite{hochreiter_gradient_2001,bengio_learning_1994}.
The most widespread modifications of RNNs to overcome this problem are Long Short-Term Memory networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU) \cite{cho_learning_2014}.
Both modifications use gated memories which control and regulate the information flow between two recurrent units.
A common LSTM unit consists of a cell and three gates: an input gate, an output gate and a forget gate.
In general, LSTMs are chained together by connecting the outputs of the previous unit to the inputs of the next one.
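For reference, a common formulation of the LSTM update is given below; the notation is ours and is meant as a sketch of the gating mechanism rather than a reproduction of the cited papers:
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \\
h_t &= o_t \odot \tanh(c_t),
\end{align*}
where $x_t$ denotes the input at time step $t$, $h_t$ the hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid function and $\odot$ element-wise multiplication.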
Bidirectional networks are a further extension of the general RNN architecture; they make the past and future context available at every time step.
A bidirectional LSTM model consists of a forward chain, which processes the input data from left to right, and a backward chain, consuming the data in the opposite direction.
The final representation is typically the concatenation or a linear combination of both states.

\subsection{Word Embeddings}
Distributional semantic models have been researched for decades in the area of natural language processing (NLP) \cite{}.
These models aim to represent each word with a real-valued vector (also called an embedding), learned from large amounts of unlabeled text, that captures syntactic and semantic similarities between words.
Starting with the work of Collobert et al. \cite{} in 2008, word embeddings have become one of the most active research topics in NLP, and a plethora of approaches have been proposed \cite{}.
The majority of today's embedding models are deep learning models trained to perform some kind of language modeling task \cite{}.
The most popular embedding model is Word2Vec, introduced by Mikolov et al. \cite{}.
They propose two shallow neural network models: continuous bag-of-words (CBOW), which is trained to predict the center word from its surrounding context, and SkipGram, which predicts the context words given the center word.
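To make the training objectives concrete, the SkipGram model maximizes the average log-probability of the context words around each center word; we state it here in the standard form (notation ours):
\[
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \leq j \leq c \\ j \neq 0}} \log p(w_{t+j} \mid w_t),
\]
where $T$ is the number of tokens in the corpus and $c$ the size of the context window; the CBOW objective swaps the roles of center and context words.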
In contrast, Pennington et al. \cite{} learn vector representations from the ratios of the co-occurrence probabilities of two words with a third context word.
More recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words.
For example, Pinter et al. \cite{} try to reconstruct pre-trained word embeddings by learning a bidirectional LSTM model on the character level.
Similarly, Bojanowski et al. \cite{bojanowski_enriching_2016} adapt the SkipGram model by taking character n-grams into account.
They assign a vector representation to each character n-gram and represent a word by summing over the representations of all its n-grams.
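In this scheme, a word $w$ is represented as the sum of the vectors of its character n-grams,
\[
v_w = \sum_{g \in \mathcal{G}_w} z_g,
\]
where $\mathcal{G}_w$ denotes the set of character n-grams of $w$ (including the word itself) and $z_g$ the vector assigned to n-gram $g$; the symbols are our paraphrase of \cite{bojanowski_enriching_2016}.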
\subsection{ICD-10 Classification}
The ICD-10 coding task has already been carried out in the 2016 \cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} editions of the eHealth lab.
Participating teams used a plethora of different approaches to tackle the classification problem.
The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
The former rely on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge bases according to a rule framework.
For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score.
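Expressed schematically (our paraphrase, not the notation of the cited work), a line segment $s$ is assigned the code
\[
\hat{c} = \operatorname*{arg\,max}_{c} \sum_{t \in s} w(t, c),
\]
where $t$ ranges over the terms of $s$ and $w(t, c)$ is the binary or tf-idf weight of term $t$ in the dictionary entry of code $c$.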
In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as an information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}.
The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features.
Most similar to our approach is the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in last year's competition.
They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation.
Furthermore, a vector which captures the textual similarity between the certificate line and the death cause or diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model.
The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
We utilize two separate recurrent neural networks, one sequence-to-sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text independently of the language it originates from.
@@ -2,37 +2,5 @@
Our approach models the extraction and classification of death causes as a two-step process.
First, we employ a neural, multi-language sequence-to-sequence model to receive a death cause description for a given death certificate line.
We will then use a second classification model to assign the respective ICD-10 codes to the obtained death cause.
The remainder of this section gives a detailed explanation of our two models.
\ No newline at end of file
\subsection{Recurrent neural networks}
Recurrent neural networks (RNNs) are a widely used technique for sequence
learning problems such as machine translation
\cite{bahdanau_neural_2014,cho_learning_2014}, image captioning
\cite{bengio_scheduled_2015}, named entity recognition
\cite{lample_neural_2016,wei_disease_2016}, dependency parsing
\cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}.
RNNs model dynamic temporal behaviour in sequential data through recurrent
units, i.e. the hidden, internal state of a unit in one time step depends on the
internal state of the unit in the previous time step. These feedback connections
enable the network to memorize information from recent time steps and capture
long-term dependencies.
However, training of RNNs can be difficult due to the vanishing gradient problem
\cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread
modifications of RNNs to overcome this problem are Long Short-Term Memory
networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU)
\cite{cho_learning_2014}. Both modifications use gated memories which control
and regulate the information flow between two recurrent units. A common LSTM
unit consists of a cell and three gates, an input gate, an output gate and a
forget gate. In general, LSTMs are chained together by connecting the outputs of
the previous unit to the inputs of the next one.
Bidirectional networks are a further extension of the general RNN architecture; they make the past and future context available at every time step.
A bidirectional LSTM model consists of a forward chain, which processes the input data from left to right, and a backward chain, consuming the data in the opposite direction.
The final representation is typically the concatenation or a linear combination of both states.
AREN'T WE MOVING THIS TO RELATED WORK?
\ No newline at end of file
@@ -6,8 +6,7 @@ The goal of the model is to reassemble the dictionary death cause description te
For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the architecture of the model.
As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right.
Each token is represented using pre-trained fastText\footnote{\url{https://github.com/facebookresearch/fastText/}} word embeddings \cite{bojanowski_enriching_2016}.
Word embedding models represent words using a real-valued vector and capture syntactic and semantic similarities between them.
fastText embeddings take sub-word information into account during training, whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words.
We utilize fastText embeddings for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
Independently of the language a word originates from, we look up the word in all three embedding models and concatenate the obtained vectors.
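As an illustration only (this sketch is not part of the system description above), the lookup-and-concatenate step could be implemented with the official fastText Python bindings roughly as follows; the model file names refer to the published Common Crawl vectors and are an assumption on our side:
\begin{verbatim}
import numpy as np
import fasttext

# Pre-trained Common Crawl/Wikipedia vectors for the three languages.
models = [fasttext.load_model(name) for name in
          ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]

def embed_token(token):
    # Look up the token in all three embedding models and concatenate
    # the obtained vectors (3 x 300 = 900 dimensions); fastText falls
    # back to character n-grams for out-of-vocabulary tokens.
    return np.concatenate([m.get_word_vector(token) for m in models])
\end{verbatim}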