From 55f923c61cf47187b5c8c3daf62d7e6c2a891d58 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Mario=20Sa=CC=88nger?= <mario.saenger@student.hu-berlin.de> Date: Wed, 27 Jun 2018 15:16:18 +0200 Subject: [PATCH] Final, camera-ready version of paper --- paper/10_introduction.tex | 6 +++--- paper/20_related_work.tex | 12 ++++++------ paper/40_experiments.tex | 24 ++++++++++++------------ paper/50_conclusion.tex | 2 +- paper/references.bib | 2 +- paper/wbi-eclef18.tex | 12 ++++-------- 6 files changed, 27 insertions(+), 31 deletions(-) diff --git a/paper/10_introduction.tex b/paper/10_introduction.tex index 8debcb5..089feeb 100644 --- a/paper/10_introduction.tex +++ b/paper/10_introduction.tex @@ -12,11 +12,11 @@ was concerned with French and English death certificates in previous years. In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian. The development of language-independent, multilingual approaches was encouraged. -Inspired by the recent success of recurrent neural network models (RNN) \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work from Miftahutdinov and Tutbalina \cite{miftakhutdinov_kfu_2017} in the last edition of the lab, we opt for the development of a deep learning model for this year's competition. +Inspired by the recent success of recurrent neural network models (RNN) \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work from Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017} in the last edition of the lab, we opt for the development of a deep learning model for this year's competition. Our work introduces a prototypical, language independent approach for ICD-10 classification using multi-language word embeddings and long short-term memory models (LSTMs). We divide the proposed pipeline %$classification into two tasks. 
-First, we perform Name Entity Recognition (NER), i.e. extract the death cause description from a certificate line, with an an encoder-decoder model. -Given the death cause, Named Entity Normalization (NEN), i.e. assigning an ICD-10 code to extracted death cause, is performed by a separate LSTM. +First, we perform named entity recognition (NER), i.e. extract the death cause description from a certificate line, with an encoder-decoder model. +Given the death cause, named entity normalization (NEN), i.e. assigning an ICD-10 code to the extracted death cause, is performed by a separate LSTM. Our approach builds upon a heuristic multi-language embedding space and therefore only needs one single model for all three data sets. With this work we want to experiment and evaluate which performance can be achieved with such a simple shared embedding space. diff --git a/paper/20_related_work.tex b/paper/20_related_work.tex index b1a373a..54042b8 100644 --- a/paper/20_related_work.tex +++ b/paper/20_related_work.tex @@ -6,10 +6,10 @@ The section concludes with a summary of ICD-10 classification approaches used in RNNs are a widely used technique for sequence learning problems such as machine translation \cite{bahdanau_neural_2018,cho_learning_2014}, image captioning \cite{bengio_scheduled_2015}, NER \cite{lample_neural_2016,wei_disease_2016}, dependency parsing -\cite{dyer_transition-based_2015} and POS-tagging \cite{wang_part--speech_2015}. +\cite{dyer_transition-based_2015} and part-of-speech tagging \cite{wang_part--speech_2015}. RNNs model dynamic temporal behaviour in sequential data through recurrent units, i.e. the hidden, internal state of a unit in one time step depends on the -internal state of the unit in the previous time step. These feedback connections +state of the unit in the previous time step. These feedback connections enable the network to memorize information from recent time steps and add the ability to capture long-term dependencies.
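The recurrence described in the hunk above (the hidden state of a unit at one time step depending on the state at the previous step) can be sketched in a few lines. The tanh activation, the weight shapes and the random toy inputs are illustrative assumptions of this sketch, not the models used in the paper:

```python
import numpy as np

# Minimal sketch of a recurrent unit: h_t = tanh(W x_t + U h_{t-1} + b),
# i.e. each state feeds back into the computation of the next one.
def rnn_forward(inputs, W, U, b):
    hidden_dim = U.shape[0]
    h = np.zeros(hidden_dim)            # initial state h_0
    states = []
    for x in inputs:                    # one step per input vector
        h = np.tanh(W @ x + U @ h + b)  # feedback connection to h_{t-1}
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 0.1       # input-to-hidden weights (toy sizes)
U = rng.normal(size=(4, 4)) * 0.1       # hidden-to-hidden (recurrent) weights
b = np.zeros(4)
states = rnn_forward(rng.normal(size=(5, 3)), W, U, b)  # 5 steps, 3-dim inputs
print(states.shape)  # (5, 4): one 4-dim hidden state per time step
```

The feedback through `h` is what lets information from earlier steps influence later ones, which LSTMs extend with gating to capture longer-range dependencies.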
@@ -34,8 +34,8 @@ Distributional semantic models (DSMs) have been researched for decades in NLP \c Based on a huge amount of unlabeled texts, DSMs aim to represent words using a real-valued vector (also called embedding) which captures syntactic and semantic similarities between the words. Starting with the publication of the work from Collobert et al. \cite{collobert_natural_2011} in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, is one of the hot topics in NLP and a plethora of approaches have been proposed \cite{bojanowski_enriching_2017,mikolov_distributed_2013,peters_deep_2018,pennington_glove_2014}. -The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{peters_semi-supervised_2017,peters_deep_2018,pinter_mimicking_2017}. -The most popular embedding model is the Word2Vec model introduced by Mikolov et. al \cite{mikolov_distributed_2013,mikolov_efficient_2013}. +The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{peters_semi-supervised_2017,peters_deep_2018,pinter_mimicking_2017}. +The most popular embedding model is the Word2Vec model introduced by Mikolov et al. \cite{mikolov_efficient_2013,mikolov_distributed_2013}. They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram, that are trained to reconstruct the context given a center word and vice versa. In contrast, Pennington et al. \cite{pennington_glove_2014} use the ratio between co-occurrence probabilities of two words with another one to learn a vector representation. In \cite{peters_deep_2018} multi-layer, bi-directional LSTM models are utilized to learn word embeddings that also capture different contexts of it. @@ -45,7 +45,7 @@ For example, Pinter et al. \cite{pinter_mimicking_2017} try to reconstruct a pre Similarly, Bojanowski et al.
\cite{bojanowski_enriching_2017} adapt the SkipGram by taking character n-grams into account. Their fastText model assigns a vector representation to each character n-gram and represents words by summing over all of these representations of a word. -In addition to embeddings that capture word similarities in one language, multi-/cross-lingual approaches have also been investigated. +In addition to embeddings that capture word similarities in one language, multi- and cross-lingual approaches have also been investigated. Proposed methods either learn a linear mapping between monolingual representations \cite{faruqui_improving_2014,xing_normalized_2015} or utilize word- \cite{guo_cross-lingual_2015,vyas_sparse_2016}, sentence- \cite{pham_learning_2015} or document-aligned \cite{sogaard_inverted_2015} corpora to build a shared embedding space. \subsection{ICD-10 Classification} @@ -57,7 +57,7 @@ For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score fo In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}} to classify the individual lines. The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Analysis (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features. -Most similar to our approach is the work from Miftahutdinov and Tutbalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in the last year's competition. +Most similar to our approach is the work from Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in the last year's competition. 
They use a neural LSTM-based encoder-decoder model that processes the raw certificate text as input and encodes it into a vector representation. Additionally, a vector which captures the textual similarity between the certificate line and the death causes of the individual ICD-10 codes is used to integrate prior knowledge into the model. The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step. diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex index bc214fa..9d1d961 100644 --- a/paper/40_experiments.tex +++ b/paper/40_experiments.tex @@ -6,15 +6,15 @@ Each of the languages is supported by training certificate lines as well as a di The provided training data sets were imbalanced concerning the different languages: the Italian corpora consists of 49,823, French corpora of 77,348\footnote{For French we only took the provided data set from 2014.} and Hungarian corpora 323,175 certificate lines. We split each data set into a training and a hold-out evaluation set. The complete training data set was then created by combining the certificate lines of all three languages into one data set. -Beside the provided certificate data we used, no additional knowledge resources or annotated texts were used. +Besides the provided certificate data, we used no additional knowledge resources or annotated texts. Due to time constraints during development no cross-validation to optimize the (hyper-) parameters and the individual layers of our models was performed. We either keep the default values of the hyper-parameters or set them to reasonable values according to existing work. -During model training we shuffle the training instances and use varying validation instances to perform a validation of the epoch. +During model training we shuffle the training instances and use varying instances to validate each epoch.
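The data preparation described in the hunk above, pooling the certificate lines of all three languages into one training set and splitting off a hold-out evaluation set, could be sketched as follows. The dictionary structure, the 10% hold-out fraction and the toy corpora are assumptions of this sketch, not the authors' exact procedure:

```python
import random

def pool_and_split(corpora, holdout_frac=0.1, seed=42):
    """Combine the certificate lines of all languages into one data set
    and split off a hold-out evaluation set. `corpora` maps a language
    code to its list of certificate lines (hypothetical structure)."""
    rng = random.Random(seed)
    pooled = [line for lines in corpora.values() for line in lines]
    rng.shuffle(pooled)                       # mix the languages together
    n_holdout = int(len(pooled) * holdout_frac)
    return pooled[n_holdout:], pooled[:n_holdout]  # (train, hold-out)

# Toy corpora standing in for the French, Italian and Hungarian lines.
corpora = {"fr": [f"fr-{i}" for i in range(10)],
           "it": [f"it-{i}" for i in range(5)],
           "hu": [f"hu-{i}" for i in range(15)]}
train, holdout = pool_and_split(corpora)
print(len(train), len(holdout))  # 27 3
```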
%As representation for the input tokens of the model we use Pre-trained fastText word embeddings % \cite{bojanowski_enriching\_2016}. The embeddings were trained on Common Crawl and Wikipedia articles. Embeddings' -were trained using the following parameter settings: CBOW with position-weights, embedding dimension size 300, with character n-grams of length 5, a window of size 5 and 10 negatives. +were trained using the following parameter settings: CBOW with position-weights, embedding dimension size 300, with character n-grams of length 5, a window of size 5 and 10 negative samples. Unfortunately, they are trained on corpora not related with the biomedical domain and therefore do not represent the best possible textual basis for an embedding space for biomedical information extraction. Final embedding space used by our models is created by concatenating individual embedding vectors for all three languages. Thus the input of our model is embedding vector of size 900. @@ -32,7 +32,7 @@ Model training was performed either for 100 epochs or until an early stopping cr As the provided data set are imbalanced regarding the tasks' languages, we devised two different evaluation settings: (1) DCEM-Balanced, where each language was supported by 49.823 randomly drawn instances (size of the smallest corpus) and (2) DCEM-Full, where all available data is used. Table \ref{tab:s2s} shows the results obtained on the training and validation set. -The figures reveal that distribution of training instances per language have a huge influence on the performance of the model. +The figures indicate that the distribution of training instances per language has a huge influence on the performance of the model. The model trained on the full training data achieves an accuracy of 0.678 on the validation set. In contrast using the balanced data set the model reaches an accuracy of 0.899 (+ 32.5\%).
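The shared embedding space described in the hunk above, concatenating the three monolingual 300-dimensional fastText vectors into one 900-dimensional input, might look like this in outline. The dictionary-lookup interface and the zero-vector fallback for out-of-vocabulary tokens are assumptions of this sketch:

```python
import numpy as np

DIM = 300  # dimensionality of each monolingual fastText embedding

def multilingual_vector(token, fr_emb, it_emb, hu_emb):
    """Concatenate the token's French, Italian and Hungarian vectors into
    one shared 900-dimensional representation; tokens missing from a
    vocabulary fall back to a zero vector (an assumption of this sketch)."""
    parts = [emb.get(token, np.zeros(DIM)) for emb in (fr_emb, it_emb, hu_emb)]
    return np.concatenate(parts)

# Toy monolingual "embeddings" standing in for the pre-trained models.
rng = np.random.default_rng(2)
fr = {"pneumonie": rng.normal(size=DIM)}
it = {"polmonite": rng.normal(size=DIM)}
hu = {}
v = multilingual_vector("pneumonie", fr, it, hu)
print(v.shape)  # (900,): 3 x 300, the model's input size
```

A token known to only one language keeps zeros in the other two segments, which is the heuristic nature of this shared space the paper discusses.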
@@ -57,7 +57,7 @@ data set setting.} \subsection{ICD-10 Classification Model} The classification model is responsible for assigning a ICD-10 code to death cause description obtained during the first step. -Our model uses an embedding layer with input masking on zero values, followed by and bidirectional LSTM layer with 256 dimension hidden layer. +Our model uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a 256-dimensional hidden layer. Thereafter an attention layer builds an adaptive weighted average over all LSTM states. The respective ICD-10 code will be determined by a dense layer with softmax activation function. We use the Adam optimizer to perform model training. @@ -65,11 +65,11 @@ The model was validated on 25\% of the data. As for the extraction model, no cross-validation or hyper-parameter optimization was performed.% due to time constraints during development. Once again, we devised two approaches. This was mainly caused by the lack of adequate training data in terms of coverage for individual ICD-10 codes. -Therefore, we once again defined two training data settings: (1) minimal (ICD-10\_Minimal), where only ICD-10 codes with two or more supporting training instances are used. -This leaves us with 6.857 unique ICD-10 codes and discards 2.238 unique ICD-10 codes with support of one. +Therefore, we defined two training data settings: (1) minimal (ICD-10\_Minimal), where only ICD-10 codes with two or more supporting training instances are used. +This leaves us with 6,857 unique ICD-10 codes and discards 2,238 unique codes with support of one. This, of course, minimizes the number of ICD-10 codes in the label space. Therefore, (2) an extended (ICD-10\_Extended) data set was defined. Here, the original ICD-10 code mappings, found in the supplied dictionaries, are extended with the training instances from individual certificate data from the three languages. -This generates 9.591 unique ICD-10 codes.
+This generates 9,591 unique ICD-10 codes. Finally, for the remaining ICD-10 codes that have only one supporting description, we duplicate those data points. The goal of this approach is to extend our possible label space to all available ICD-10 codes. @@ -103,10 +103,10 @@ In contrast, \textit{Extended} additionally takes the diagnosis texts from the c \label{tab:final_train} The two models where combined to create the final pipeline. We tested both death cause extraction models (based on the balanced and unbalanced data set) in the final pipeline, as their performance differs greatly. -On the contrary, both ICD-10 classification models perform similarly, so we just used the extended ICD-10 classification model, with word level tokens\footnote{Although models supporting character level tokens were developed and evaluated, their performance faired poorly compared to the word level tokens.}, in the final pipeline. +In contrast, both ICD-10 classification models perform similarly, so we just used the extended ICD-10 classification model, with word level tokens\footnote{Although models supporting character level tokens were developed and evaluated, their performance fared poorly compared to the word level tokens.}, in the final pipeline. To evaluate the pipeline we build a training and a hold-out validation set during development. The obtained results on the validation set are presented in Table \ref{tab:final_train}. -The scores are calculated using a prevalence-weighted macro-average across the output classes, i.e. we calculated precision, recall and F-score for each ICD-10 code and build the average by weighting the scores by the number occurrences of the code in the gold standard into account. +The scores are calculated using a prevalence-weighted macro-average across the output classes, i.e. we calculated precision, recall and F-score for each ICD-10 code and build the average by weighting the scores by the number of occurrences of the code in the gold standard.
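The prevalence-weighted macro-average described in the hunk above can be sketched as follows; the gold and predicted ICD-10 code lists are hypothetical examples, not the task data:

```python
from collections import Counter

def prevalence_weighted_scores(gold, pred):
    """Per-code precision/recall/F-score, averaged with weights
    proportional to each code's number of occurrences in the gold
    standard (a sketch of the scoring described above)."""
    support = Counter(gold)
    total = sum(support.values())
    avg_p = avg_r = avg_f = 0.0
    for code, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == p == code)
        fp = sum(1 for g, p in zip(gold, pred) if p == code and g != code)
        fn = n - tp
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n / total            # prevalence in the gold standard
        avg_p += weight * p
        avg_r += weight * r
        avg_f += weight * f
    return avg_p, avg_r, avg_f

gold = ["J18.9", "J18.9", "I25.9", "R54"]   # hypothetical ICD-10 codes
pred = ["J18.9", "I25.9", "I25.9", "R54"]
p, r, f = prevalence_weighted_scores(gold, pred)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.875 0.75 0.75
```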
\begin{table}[t!] \centering @@ -174,7 +174,7 @@ Worst results were obtained on the middle, French, corpus while the biggest corp & Final-Balanced & 0.857 & 0.685 & 0.761 \\ & Final-Full & 0.862 & 0.689 & 0.766 \\ \cline{2-5} -& Baseline & 0,165 & 0.172 & 0.169 \\ +& Baseline & 0.165 & 0.172 & 0.169 \\ & Average & 0.844 & 0.760 & 0.799 \\ & Median & 0.900 & 0.824 & 0.863 \\ \bottomrule @@ -187,7 +187,7 @@ We identified several possible reasons for the obtained results. These also represent (possible) points for future work. One of the main disadvantages of our approach is the quality of the used word embeddings as well as the properties of the proposed language-independent embedding space. The usage of out-of-domain word embeddings which aren't targeted to the biomedical domain are likely a suboptimal solution to this problem. -We tried to alleviate this by finding suitable external corpora to train domain-dependent word embeddings for each of the supported languages, however we were unable to find any significant amount of in-domain documents (e.g. PubMed search for abstracts in either French, Hungarian or Italian found 7.843, 786 and 1659 articles respectively). +We tried to alleviate this by finding suitable external corpora to train domain-dependent word embeddings for each of the supported languages, however we were unable to find any significant amount of in-domain documents (e.g. a PubMed search for abstracts in either French, Hungarian or Italian found 7,843, 786 and 1,659 articles, respectively). Furthermore, we used a simple, heuristic solution by just concatenating the embeddings of all three languages to build a shared vector space. %This will be the main focus of future investigations on this problem.
diff --git a/paper/50_conclusion.tex b/paper/50_conclusion.tex index 3813bf6..8f8b4ff 100644 --- a/paper/50_conclusion.tex +++ b/paper/50_conclusion.tex @@ -3,7 +3,7 @@ The proposed solution was focused on the setup and evaluation of an initial lang heuristic mutual word embedding space for all three languages. The proposed pipeline is divided in two steps: possible token describing the death cause are generated by using a sequence to sequence model first. Afterwards the generated token sequence is normalized to a ICD-10 code using a distinct LSTM-based classification model with attention mechanism. -During evaluation our best model achieves an f-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. +During evaluation our best model achieves an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. The obtained results are encouraging for further investigation however can't compete with the solutions of the other participants yet. We detected several issues with the proposed pipeline. 
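The attention mechanism summarized in the conclusion above, an adaptive weighted average over all LSTM states feeding the classification model, can be sketched as below. The dot-product scoring against a single learned vector and the toy shapes are assumptions of this sketch, not the paper's exact layer:

```python
import numpy as np

def attention_pool(states, w):
    """Adaptive weighted average over all LSTM states: a vector w scores
    each state, the scores are softmax-normalised, and the states are
    averaged with those weights."""
    scores = states @ w                      # one scalar score per time step
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ states                  # weighted average of the states

rng = np.random.default_rng(3)
states = rng.normal(size=(7, 16))  # 7 time steps, 16-dim LSTM outputs (toy)
w = rng.normal(size=16)            # learned attention parameter
pooled = attention_pool(states, w)
print(pooled.shape)  # (16,): a single fixed-size summary of the sequence
```

The pooled vector replaces taking only the last LSTM state, letting the classifier weight the most informative tokens of the line.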
diff --git a/paper/references.bib b/paper/references.bib index b60d956..4defc4f 100644 --- a/paper/references.bib +++ b/paper/references.bib @@ -25,7 +25,7 @@ title = {Kfu at clef ehealth 2017 task 1: {Icd}-10 coding of english death certificates with recurrent neural networks}, booktitle = {{CLEF} 2017 {Online} {Working} {Notes}}, publisher = {CEUR-WS}, - author = {Miftakhutdinov, Zulfat and Tutubalina, Elena}, + author = {Miftahutdinov, Zulfat and Tutubalina, Elena}, year = {2017}, keywords = {Read, CLEF, ICD-10-Classification}, file = {Fulltext:/Users/mario/Zotero/storage/HRZ6Q8Q6/Miftakhutdinov und Tutubalina - 2017 - Kfu at clef ehealth 2017 task 1 Icd-10 coding of .pdf:application/pdf;Snapshot:/Users/mario/Zotero/storage/J8TXTUNT/Miftakhutdinov und Tutubalina - 2017 - Kfu at clef ehealth 2017 task 1 Icd-10 coding of .pdf:application/pdf} diff --git a/paper/wbi-eclef18.tex b/paper/wbi-eclef18.tex index 2609470..075ec72 100644 --- a/paper/wbi-eclef18.tex +++ b/paper/wbi-eclef18.tex @@ -27,10 +27,6 @@ \begin{document} -\newcommand{\nm}[1]{\textcolor{green}{Mario: #1}\\} -\newcommand{\nj}[1]{\textcolor{blue}{Jurica: #1}\\} -\newcommand{\td}[1]{\textcolor{red}{\uppercase{#1}}} - \title{WBI at CLEF eHealth 2018 Task 1: Language-independent ICD-10 coding using multi-lingual embeddings and recurrent neural networks} % If the paper title is too long for the running head, you can set @@ -39,15 +35,15 @@ \author{Jurica \v{S}eva\inst{1} \and Mario Sänger\inst{1} \and -Ulf Leser\inst{1}} +Ulf Leser\inst{1}} % First names are abbreviated in the running head. % If there are more than two authors, 'et al.' is used. 
\authorrunning{\v{S}eva et al.} -\institute{Humboldt-Universität zu Berlin, Knowledge Management in +\institute{\inst{1}Humboldt-Universität zu Berlin, Knowledge Management in Bioinformatics, \\ Berlin, Germany\\ -\email{seva,saengema,leser@informatik.hu-berlin.de}} +\email{\{seva,saengema,leser\}@informatik.hu-berlin.de}} % \maketitle % typeset the header of the contribution % @@ -58,7 +54,7 @@ The approach builds on two recurrent neural networks models to extract and class First, we employ a LSTM-based sequence-to-sequence model to obtain a death cause from each death certificate line. We then utilize a bidirectional LSTM model with attention mechanism to assign the respective ICD-10 codes to the received death cause description. Both models take multi-language word embeddings as inputs. -During evaluation our best model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. +During evaluation our best model achieves an F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. The results are encouraging for future work as well as the extension and improvement of the proposed baseline system. \keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model