diff --git a/paper/10_introduction.tex b/paper/10_introduction.tex
index 1f014ca94ec69d3d56b8366282855f5173ba6c20..045cff6251abbe30cd1200d2801275576fd28fb8 100644
--- a/paper/10_introduction.tex
+++ b/paper/10_introduction.tex
@@ -1,26 +1,25 @@
 Automatic extraction, classification and analysis of biological and medical concepts from unstructured texts, such as scientific publications or electronic health documents, is a highly important task to support many applications in research, daily clinical routine and policy-making.
-Computer-aided approaches can improve decision making and support clinical processes, for example, by giving a more sophisticated overview about a research area, providing detailed information about the aetiopathology of a patient or disease patterns.
-In the past years major advances have been made in the area of natural language processing.
+Computer-assisted approaches can improve decision making and support clinical processes, for example, by giving a more sophisticated overview of a research area or by providing detailed information about the aetiopathology of a patient or about disease patterns.
+In the past years, major advances have been made in the area of natural language processing (NLP).
 However, improvements in the field of biomedical text mining lag behind other domains, mainly due to privacy issues and concerns regarding the processed data (e.g. electronic health records).
 The CLEF eHealth lab attempts to circumvent this through the organization of various shared tasks
 %which aid and support the development of approaches to exploit electronically available medical content
 \cite{suominen_overview_2018}.
-In particular, Task 1\footnote{https://sites.google.com/view/clef-ehealth-2018/task-1-multilingual-information-extraction-icd10-coding} of the lab is concerned with the extraction and classification of death causes from death certificates originating from different languages \cite{neveol_clef_2018}.
+In particular, Task 1\footnote{\url{https://sites.google.com/view/clef-ehealth-2018/task-1-multilingual-information-extraction-icd10-coding}} of the lab is concerned with the extraction and classification of death causes from death certificates written in different languages \cite{neveol_clef_2018}.
 Participants were asked to classify the death causes mentioned in the certificates according to the International Classification of Diseases, version 10 (ICD-10).
 The task
 %has been carried out the last two years of the lab, however
 was concerned with French and English death certificates in previous years.
-In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian this year.
+In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian.
 The development of language-independent, multilingual approaches was encouraged.
-Inspired by the recent success of recurrent neural network models \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work from Miftahutdinov and Tutbalina \cite{miftakhutdinov_kfu_2017} in CLEF eHealth Task 1 2017 competition we opt for the development of a deep learning model for this year's task.
+Inspired by the recent success of recurrent neural network (RNN) models \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work from Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017} in the CLEF eHealth 2017 Task 1 competition, we opt for the development of a deep learning model for this year's task.
 Our work introduces a language-independent approach for ICD-10 classification using multi-language word embeddings and LSTM-based recurrent models.
 We divide the proposed pipeline
 %classification
 into two tasks.
 First, we perform Named Entity Recognition (NER), i.e. extract the death cause description from a certificate line, with an encoder-decoder model.
-Given the death cause, Named Entity Normalization (NEN), i.e. assigning an ICD-10 code to extracted death cause, is performed by a separate LSTM model.
+Given the death cause, Named Entity Normalization (NEN), i.e. assigning an ICD-10 code to the extracted death cause, is performed by a separate long short-term memory (LSTM) model.
 In this work we present the setup and evaluation of an initial, baseline language-independent approach which builds on a heuristic multi-language embedding space and therefore needs only a single model for all three data sets.
 Moreover, we tried to use as few additional external resources as possible.
-PARAGRAPH ABOUT EMBEDDINGS.
diff --git a/paper/20_related_work.tex b/paper/20_related_work.tex
index c743cbbd0464c1b120ba7ee35ce1a271973828c8..d2011cb4910cb9986290e7db6d4e9b44d1388748 100644
--- a/paper/20_related_work.tex
+++ b/paper/20_related_work.tex
@@ -1,10 +1,9 @@
 This section highlights previous work related to our approach.
-We give a brief introduction to the methodical foundations of our work, recurrent neural networks and word embeddings.
+We give a brief introduction to the methodical foundations of our work, RNNs and word embeddings.
 The section concludes with a summary of ICD-10 classification approaches used in previous eHealth Lab competitions.
-\subsection{Recurrent neural networks}
-Recurrent neural networks (RNNs) are a widely used technique for sequence
-learning problems such as machine translation
+\subsection{Recurrent neural networks (RNN)}
+RNNs are a widely used technique for sequence learning problems such as machine translation
 \cite{bahdanau_neural_2014,cho_learning_2014}, image captioning
 \cite{bengio_scheduled_2015}, named entity recognition
 \cite{lample_neural_2016,wei_disease_2016}, dependency parsing
@@ -17,8 +16,7 @@ long-term dependencies.
 However, training of RNNs can be difficult due to the vanishing gradient problem \cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread
-modifications of RNNs to overcome this problem are Long Short-Term Memory
-networks (LSTM) \cite{hochreiter_long_1997} and Gated Recurrent Units (GRU)
+modifications of RNNs to overcome this problem are LSTMs \cite{hochreiter_long_1997} and gated recurrent units (GRUs)
 \cite{cho_learning_2014}. Both modifications use gated memories which control and regulate the information flow between two recurrent units. A common LSTM unit consists of a cell and three gates: an input gate, an output gate and a
@@ -33,7 +31,7 @@ opposite direction.
 The final representation is typically the concatenation or a linear combination of both states.
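+As a minimal illustration of such a bidirectional encoding, the following Keras-style sketch (the layer sizes are placeholders, not a configuration used in our experiments) runs an LSTM over an input sequence in both directions and concatenates the two final representations:
+\begin{verbatim}
+# Minimal sketch of a bidirectional LSTM encoder (placeholder sizes).
+from tensorflow.keras.layers import Input, LSTM, Bidirectional
+from tensorflow.keras.models import Model
+
+seq_len, emb_dim, units = 20, 300, 128
+inputs = Input(shape=(seq_len, emb_dim))
+# merge_mode="concat" concatenates the forward and backward states.
+encoded = Bidirectional(LSTM(units), merge_mode="concat")(inputs)
+model = Model(inputs, encoded)  # output shape: (None, 2 * units)
+\end{verbatim}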
 \subsection{Word Embeddings}
-Distributional semantic models (DSMs) have been researched for decades in the area of natural language processing (NLP) \cite{turney_frequency_2010}.
+Distributional semantic models (DSMs) have been researched for decades in NLP \cite{turney_frequency_2010}.
 Based on large amounts of unlabeled text, DSMs aim to represent words using real-valued vectors (also called embeddings) which capture syntactic and semantic similarities between the units.
 Starting with the publication of the work from Collobert et al. \cite{collobert_natural_2011} in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, has been one of the hot topics in NLP, and a plethora of approaches have been proposed \cite{bojanowski_enriching_2016,mikolov_distributed_2013,peters_deep_2018,pennington_glove}.
@@ -66,7 +64,7 @@ They use a neural LSTM-based encoder-decoder model that processes the raw certif
 Furthermore, a vector which captures the textual similarity between the certificate line and the death causes resp. diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model.
 The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
 In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
-We utilize two separate recurrent neural networks, one sequence to sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text independent from which language they originate.
+We utilize two separate RNNs, a sequence-to-sequence model for death cause extraction and a classification model for ICD-10 code assignment, to predict the ICD-10 codes for a certificate text independent of the original language.
diff --git a/paper/30_methods_intro.tex b/paper/30_methods_intro.tex
index 1ebc32cf18625fb95b4ddd3bee033a1b084c114b..ae1fbedfb8e723ca5134802f280c4028df5cf8e2 100644
--- a/paper/30_methods_intro.tex
+++ b/paper/30_methods_intro.tex
@@ -1,6 +1,4 @@
-Our approach models the extraction and classification of death causes as
-two-step process. First, we employ a neural, multi-language sequence-to-sequence
-model to receive a death cause description for a given death certificate line.
-We will then use a second classification model to assign the respective ICD-10
-codes to the obtained death cause. The remainder of this section detailed
-explanation of our two models.
\ No newline at end of file
+Our approach models the extraction and classification of death causes as a two-step process.
+First, we employ a neural, multi-language sequence-to-sequence model to obtain a death cause description for a given death certificate line.
+We then use a second classification model to assign the respective ICD-10 codes to the obtained death cause, as sketched below.
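+The following sketch makes this two-step decomposition explicit; the function names are hypothetical placeholders for the two models, not our actual interface:
+\begin{verbatim}
+# Sketch of the two-step pipeline; both models are treated as opaque
+# callables here (names are illustrative placeholders).
+def predict_icd10(line, extract_death_cause, classify_icd10):
+    """Map one death certificate line to an ICD-10 code."""
+    # Step 1 (NER): generate the death cause description text.
+    death_cause = extract_death_cause(line)
+    # Step 2 (NEN): assign an ICD-10 code to that description.
+    return classify_icd10(death_cause)
+\end{verbatim}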
+The remainder of this section gives a detailed explanation of the architecture of the two models.
\ No newline at end of file
diff --git a/paper/31_methods_seq2seq.tex b/paper/31_methods_seq2seq.tex
index 399b147e2bd49dfef2ed89047c7d6ce689523b3e..bc34bcfef7ebd9ef19e93028c86ada2a06370ec9 100644
--- a/paper/31_methods_seq2seq.tex
+++ b/paper/31_methods_seq2seq.tex
@@ -5,11 +5,10 @@ The dictionaries provide us with death causes resp. diagnosis for each ICD-10 co
 The goal of the model is to reassemble the dictionary death cause description text from the certificate line.
 For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}.
 Figure \ref{fig:encoder_decoder} illustrates the architecture of the model.
-As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right.
-Each token is represented using pre-trained fastText\footnote{https://github.com/facebookresearch/fastText/} word embeddings\cite{bojanowski_enriching_2016}.
-fastText embeddings take sub-word information into account during training whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words.
-We utilize fastText embeddings for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
-Independently from the language a word originates from, we lookup the word in all three embedding models and concatenate the obtained vectors.
+As encoder we utilize a unidirectional LSTM model, which takes the individual words of a certificate line as inputs and scans the line from left to right.
+Each token is represented using pre-trained fastText\footnote{\url{https://github.com/facebookresearch/fastText/}} word embeddings \cite{bojanowski_enriching_2016}.
+We utilize fastText embedding models for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
+Independently of the original language of a word, we represent it by looking it up in all three embedding models and concatenating the obtained vectors.
 Through this we get a (basic) multi-language representation of the word.
 This heuristic composition constitutes a naive solution to build a multi-language embedding space.
 However, we opted to evaluate this approach as a simple baseline for future work.
@@ -22,7 +21,7 @@ of all three languages.}
 \label{fig:encoder_decoder}
 \end{figure}
-For the decoder with utilize another LSTM model. The initial input of the decoder is the final state of the encoder model.
+For the decoder we utilize another LSTM model. The initial input of the decoder is the final state of the encoder model.
 Moreover, each token of the dictionary death cause description (padded with special start and end tags) serves as input for the different time steps.
 Again, we use fastText embeddings of all three languages to represent the tokens.
 The decoder predicts one-hot-encoded words of the death cause description.
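+The following condensed, Keras-style sketch illustrates the two building blocks described above: the heuristic multi-language embedding, obtained by concatenating the fastText vectors of all three language models, and an encoder-decoder whose decoder is initialized with the final encoder state.
+It is a simplified illustration with placeholder dimensions and file paths, not our exact training setup:
+\begin{verbatim}
+# Illustrative sketch only: placeholder sizes and paths, no training loop.
+import numpy as np
+import fasttext
+from tensorflow.keras.layers import Input, LSTM, Dense
+from tensorflow.keras.models import Model
+
+# Heuristic multi-language embedding: load the pre-trained fastText
+# models for French, Italian and Hungarian (Common Crawl vector files).
+ft_models = [fasttext.load_model(path) for path in
+             ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]
+
+def embed(word):
+    """Concatenate the word's vectors from all three language models."""
+    return np.concatenate([m.get_word_vector(word) for m in ft_models])
+
+emb_dim = 3 * 300               # three concatenated 300-dim vectors
+units, vocab_size = 256, 10000  # placeholder hyper-parameters
+
+# Encoder: unidirectional LSTM over the embedded certificate line.
+encoder_inputs = Input(shape=(None, emb_dim))
+_, state_h, state_c = LSTM(units, return_state=True)(encoder_inputs)
+
+# Decoder: LSTM initialized with the final encoder state; at each time
+# step it receives the embedded target token and predicts a
+# one-hot-encoded word of the death cause description.
+decoder_inputs = Input(shape=(None, emb_dim))
+decoder_outputs = LSTM(units, return_sequences=True)(
+    decoder_inputs, initial_state=[state_h, state_c])
+predictions = Dense(vocab_size, activation="softmax")(decoder_outputs)
+
+model = Model([encoder_inputs, decoder_inputs], predictions)
+model.compile(optimizer="adam", loss="categorical_crossentropy")
+\end{verbatim}
+At prediction time, decoding would start from the special start tag and emit one token per step until the end tag is produced.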