Commit f5caafac authored by Mario Sänger

Integrate minor feedback

parent afcb8d68
Automatic extraction, classification and analysis of biological and medical concepts from unstructured texts, such as scientific publications or electronic health documents, is a highly important task to support many applications in research, daily clinical routine and policy-making.
Computer-assisted approaches can improve decision making and support clinical processes, for example, by giving a more sophisticated overview of a research area or providing detailed information about the aetiopathology of a patient or disease patterns.
In the past years major advances have been made in the area of natural language processing (NLP).
However, improvements in the field of biomedical text mining lag behind other domains, mainly due to privacy issues and concerns regarding the processed data (e.g. electronic health records).
The CLEF eHealth lab aims to circumvent this through the organization of various shared tasks %which aid and support the development of approaches
to exploit electronically available medical content \cite{suominen_overview_2018}.
In particular, Task 1\footnote{\url{https://sites.google.com/view/clef-ehealth-2018/task-1-multilingual-information-extraction-icd10-coding}} of the lab is concerned with the extraction and classification of death causes from death certificates originating from different languages \cite{neveol_clef_2018}.
Participants were asked to classify the death causes mentioned in the certificates according to the International Classification of Diseases, version 10 (ICD-10).
The task %has been carried out the last two years of the lab, however
was concerned with French and English death certificates in previous years.
In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian.
The development of language-independent, multilingual approaches was encouraged.
Inspired by the recent success of recurrent neural network (RNN) models \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017} in the CLEF eHealth 2017 Task 1 competition, we opt for the development of a deep learning model for this year's task.
Our work introduces a language-independent approach for ICD-10 classification using multi-language word embeddings and LSTM-based recurrent models.
We divide the proposed pipeline %$classification
into two tasks.
First, we perform Named Entity Recognition (NER), i.e. extract the death cause description from a certificate line, with an encoder-decoder model.
Given the death cause, Named Entity Normalization (NEN), i.e. assigning an ICD-10 code to the extracted death cause, is performed by a separate long short-term memory (LSTM) model.
In this work we present the setup and evaluation of an initial, baseline language-independent approach which builds on a heuristic multi-language embedding space and therefore needs only a single model for all three data sets.
Moreover, we tried to use as few additional external resources as possible.
PARAGRAPH ABOUT EMBEDDINGS.
......
This section highlights previous work related to our approach.
We give a brief introduction to the methodical foundations of our work, RNNs and word embeddings.
The section concludes with a summary of ICD-10 classification approaches used in previous eHealth Lab competitions.
\subsection{Recurrent neural networks (RNN)}
RNNs are a widely used technique for sequence learning problems such as machine translation
\cite{bahdanau_neural_2014,cho_learning_2014}, image captioning
\cite{bengio_scheduled_2015}, named entity recognition
\cite{lample_neural_2016,wei_disease_2016}, dependency parsing
...@@ -17,8 +16,7 @@ long-term dependencies.
However, training of RNNs can be difficult due to the vanishing gradient problem
\cite{hochreiter_gradient_2001,bengio_learning_1994}. The most widespread
modifications of RNNs to overcome this problem are LSTMs \cite{hochreiter_long_1997} and gated recurrent units (GRU)
\cite{cho_learning_2014}. Both modifications use gated memories which control
and regulate the information flow between two recurrent units. A common LSTM
unit consists of a cell and three gates, an input gate, an output gate and a
...@@ -33,7 +31,7 @@ opposite direction. The final representation is typically the concatenation or a
linear combination of both states.
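For completeness, one common formulation of the LSTM unit sketched above is the following, with $x_t$ the input at time step $t$, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $\sigma$ the logistic sigmoid and $\odot$ element-wise multiplication:
\begin{align*}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{align*}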
\subsection{Word Embeddings}
Distributional semantic models (DSMs) have been researched for decades in NLP \cite{turney_frequency_2010}.
Based on a huge amount of unlabeled texts, DSMs aim to represent words using a real-valued vector (also called embedding) which captures syntactic and semantic similarities between the units.
Starting with the publication of the work of Collobert et al. \cite{collobert_natural_2011} in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, has become one of the hot topics in NLP and a plethora of approaches have been proposed \cite{bojanowski_enriching_2016,mikolov_distributed_2013,peters_deep_2018,pennington_glove}.
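As a toy illustration of this property, the following Python snippet (with made-up three-dimensional vectors; real embeddings typically have 100--300 dimensions learned from unlabeled text) shows how the cosine similarity between embedding vectors is meant to reflect semantic relatedness:
\begin{verbatim}
import numpy as np

def cosine(u, v):
    # cosine similarity between two embedding vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# made-up toy embeddings; real models learn these from large corpora
emb = {
    "infarction": np.array([0.9, 0.1, 0.2]),
    "heart":      np.array([0.8, 0.2, 0.1]),
    "kidney":     np.array([0.1, 0.9, 0.3]),
}

print(cosine(emb["infarction"], emb["heart"]))   # high similarity
print(cosine(emb["infarction"], emb["kidney"]))  # lower similarity
\end{verbatim}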
...@@ -66,7 +64,7 @@ They use a neural LSTM-based encoder-decoder model that processes the raw certif
Furthermore, a vector which captures the textual similarity between the certificate line and the death causes resp. diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model.
The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step.
In contrast to their work, our approach introduces a model for multi-language ICD-10 classification.
We utilize two separate RNNs, a sequence-to-sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text independently of its original language.
......
Our approach models the extraction and classification of death causes as a two-step process.
First, we employ a neural, multi-language sequence-to-sequence model to receive a death cause description for a given death certificate line.
We then use a second classification model to assign the respective ICD-10 codes to the obtained death cause.
The remainder of this section gives a detailed explanation of the architecture of the two models.
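To make the overall flow concrete, the following Python-style sketch outlines the two steps; the function and object names (\texttt{s2s\_model}, \texttt{icd10\_classifier}, etc.) are illustrative placeholders and not part of the actual implementation:
\begin{verbatim}
def predict_icd10(certificate_line, s2s_model, icd10_classifier):
    # Step 1: the sequence-to-sequence model generates a death cause
    # description for the given certificate line
    death_cause = s2s_model.extract_death_cause(certificate_line)

    # Step 2: a separate classification model assigns an ICD-10 code
    # to the generated death cause description
    icd10_code = icd10_classifier.classify(death_cause)
    return icd10_code
\end{verbatim}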
...@@ -5,11 +5,10 @@ The dictionaries provide us with death causes resp. diagnosis for each ICD-10 co
The goal of the model is to reassemble the dictionary death cause description text from the certificate line.
For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the architecture of the model.
As encoder we utilize a unidirectional LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right.
Each token is represented using pre-trained fastText\footnote{\url{https://github.com/facebookresearch/fastText/}} word embeddings \cite{bojanowski_enriching_2016}.
fastText embeddings take sub-word information into account during training, whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words.
We utilize fastText embedding models for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
Independently of the original language of a word, we represent it by looking up the word in all three embedding models and concatenating the obtained vectors.
Through this we get a (basic) multi-language representation of the word.
This heuristic composition constitutes a naive solution to build a multi-language embedding space.
However, we opted to evaluate this approach as a simple baseline for future work.
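A minimal sketch of this lookup-and-concatenate step, assuming the official fastText Python bindings and pre-trained binaries for the three languages (the file paths are illustrative and follow the naming of the download page):
\begin{verbatim}
import numpy as np
import fasttext

# pre-trained fastText models for the three languages (illustrative paths)
models = [fasttext.load_model("cc.fr.300.bin"),
          fasttext.load_model("cc.it.300.bin"),
          fasttext.load_model("cc.hu.300.bin")]

def multilang_embedding(word):
    # look the word up in all three embedding models and concatenate the
    # vectors, yielding a 3 x 300 = 900-dimensional representation
    return np.concatenate([m.get_word_vector(word) for m in models])
\end{verbatim}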
...@@ -22,7 +21,7 @@ of all three languages.}
\label{fig:encoder_decoder}
\end{figure}
For the decoder we utilize another LSTM model. The initial input of the decoder is the final state of the encoder model.
Moreover, each token of the dictionary death cause description name (padded with special start and end tags) serves as input for the different time steps.
Again, we use fastText embeddings of all three languages to represent the token.
The decoder predicts one-hot-encoded words of the symptom name.
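As an illustration of how such an encoder-decoder can be wired up, the following Keras sketch mirrors the description above; the framework choice, the latent dimension and the vocabulary size are assumptions made for the example only, not a description of the actual implementation:
\begin{verbatim}
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

emb_dim = 900       # concatenated fr/it/hu fastText vectors (3 x 300)
latent_dim = 256    # illustrative hidden size
vocab_size = 10000  # illustrative size of the output token vocabulary

# encoder: reads the embedded certificate line, keeps only its final state
encoder_inputs = Input(shape=(None, emb_dim))
_, state_h, state_c = LSTM(latent_dim, return_state=True)(encoder_inputs)

# decoder: initialised with the encoder state, consumes the embedded
# dictionary death cause tokens and predicts the next token (one-hot)
decoder_inputs = Input(shape=(None, emb_dim))
decoder_outputs = LSTM(latent_dim, return_sequences=True)(
    decoder_inputs, initial_state=[state_h, state_c])
decoder_outputs = Dense(vocab_size, activation="softmax")(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy")
\end{verbatim}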
......