diff --git a/paper/10_introduction.tex b/paper/10_introduction.tex
index 1a645e78812b37043c01665e8017b11e99ce9c77..e8f8928beee6faf369fc27b7a78d382cdcafd63b 100644
--- a/paper/10_introduction.tex
+++ b/paper/10_introduction.tex
@@ -31,11 +31,12 @@ for the development of a deep learning model for this year's task. Our work introduces a language independent approach for ICD-10 classification using multi-language word embeddings and LSTM-based recurrent models. We divide the the classification into two tasks. First, we extract the death cause description
-from a certificate line backed by an encoder-decoder model. Given the death cause
-the actual ICD-10 classification will be performed by a separate LSTM model. Our
-work focus on the introduction of and the experiment with an
-language-independent approach which requires as little as possible additional
-resources and only needs one single model for all three languages.
+from a certificate line backed by an encoder-decoder model. Given the death
+cause the actual ICD-10 classification will be performed by a separate LSTM
+model. Our work focuses on the setup and evaluation of a first, baseline
+language-independent approach which builds on a heuristic multi-language
+embedding space and therefore only needs a single model for all three data
+sets. Moreover, we tried to use as few additional external resources as possible.
diff --git a/paper/20_related_work.tex b/paper/20_related_work.tex
index ec2edc8317b8d635d412d225b1a43aff7a430c73..bbd7b4f59c1ed331ea6d98b969736c2f8b340ad6 100644
--- a/paper/20_related_work.tex
+++ b/paper/20_related_work.tex
@@ -13,25 +13,28 @@ according to a rule framework. For example, Di Nunzio et al. by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score. In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as information retrieval task
-and utilze the SOLR search engine.
+and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}.
The machine learning based approaches employ a variety techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent
-Dirichlet Analysis (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines
-(SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features. Most
-similar to our approach is the work from Miftahutdinov and Tutbalina \cite{miftakhutdinov_kfu_2017},
-which achieved the best results for English certificates in the last year's
-competition. They use a neural LSTM-based encoder-decoder model that processes the raw
-certificate text as input and encodes it into a vector representation.
-Furthermore a vector which captures the textual similarity between the
-certificate line and the death causes resp. diagnosis texts of the individual ICD-10 codes
-is used to integrate prior knowledge into the model. The concatenation of both
-vector representations is then used to output the characters and numbers of the
-ICD-10 code in the decoding step. In contrast to their work, our approach
-introduces a model for multi-language ICD-10 classification. We utilitize two
-separate recurrent neural networks, one sequence to sequence model for death cause
-extraction and one for classification, to predict the ICD-10 codes for a
-certificate text independent from which language they originate.
+Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector
+Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features.
+
+
+Most similar to our approach is the work from Miftahutdinov and Tutubalina
+\cite{miftakhutdinov_kfu_2017}, which achieved the best results for English
+certificates in last year's competition. They use a neural LSTM-based
+encoder-decoder model that processes the raw certificate text as input and
+encodes it into a vector representation. Furthermore, a vector which captures the
+textual similarity between the certificate line and the death causes resp.
+diagnosis texts of the individual ICD-10 codes is used to integrate prior
+knowledge into the model. The concatenation of both vector representations is
+then used to output the characters and numbers of the ICD-10 code in the
+decoding step. In contrast to their work, our approach introduces a model for
+multi-language ICD-10 classification. We utilize two separate recurrent neural
+networks, one sequence-to-sequence model for death cause extraction and one for
+classification, to predict the ICD-10 codes for a certificate text independent
+from which language they originate.
diff --git a/paper/30_methods_intro.tex b/paper/30_methods_intro.tex
index ed6e653bf1666d7d300ba8d7dd8b7f2ec0beb96c..0226452b5b412f1af81a958892ea17bb3570f3c9 100644
--- a/paper/30_methods_intro.tex
+++ b/paper/30_methods_intro.tex
@@ -1,9 +1,10 @@ Our approach models the extraction and classification of death causes as two-step process. First, we employ a neural, multi-language sequence-to-sequence
-model to receive a death cause description for a given death certificate line. We will then
-use a second classification model to assign the respective ICD-10 codes to the
-obtained death cause. The remainder of this section gives a short introduction
-to recurrent neural networks, followed by a detailed explanation of our two models.
+model to obtain a death cause description for a given death certificate line.
+We will then use a second classification model to assign the respective ICD-10
+codes to the obtained death cause. The remainder of this section gives a short
+introduction to recurrent neural networks, followed by a detailed explanation of
+our two models.
\subsection{Recurrent neural networks}
Recurrent neural networks (RNNs) are a widely used technique for sequence
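The two-step process just described can be summarised in a few lines of Python. This is purely an illustrative sketch: the function names and the signature below (`predict_icd10`, `extract_death_cause`, `classify_icd10`) are hypothetical placeholders, not part of the submitted system.

```python
from typing import Callable

def predict_icd10(certificate_line: str,
                  extract_death_cause: Callable[[str], str],
                  classify_icd10: Callable[[str], str]) -> str:
    """Two-step prediction for a single death certificate line."""
    # Step 1: a sequence-to-sequence model generates a death cause description
    #         from the raw certificate text (French, Italian or Hungarian).
    death_cause = extract_death_cause(certificate_line)
    # Step 2: a Bi-LSTM classifier with attention assigns an ICD-10 code
    #         to the generated description.
    return classify_icd10(death_cause)
```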
diff --git a/paper/31_methods_seq2seq.tex b/paper/31_methods_seq2seq.tex
index bf836157918f4e34bd8e76c2542efc861a570838..e18d6c1ab14f3592a035dc9a21eab955e9c65eab 100644
--- a/paper/31_methods_seq2seq.tex
+++ b/paper/31_methods_seq2seq.tex
@@ -10,16 +10,19 @@ For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the architecture of the model. As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from
-left to right. Each token will be represented using pre-trained FastText
+left to right. Each token will be represented using pre-trained fastText
word embeddings. Word embedding models represent words using a real-valued
-vector and caputure syntactic and semantic similiarities between them. FastText
+vector and capture syntactic and semantic similarities between them. fastText
embeddings take sub-word information into account during training whereby the model is able to provide suitable representations even for unseen,
-out-of-vocabulary words. We utilize FastText embeddings for French, Italian and
-Hungarian trained on Wikipedia corpora. Independently from which lanugage a word
-originates we lookup the word in all three embedding models and concatenate the
-obtained vectors. Through this we get an language-independent representation of
-the word. The encoders final state represents the semantic meaning of the
+out-of-vocabulary words. We utilize fastText embeddings for French, Italian and
+Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}.
+Independently of the language a word originates from, we look up the word in all
+three embedding models and concatenate the obtained vectors. Through this we get
+a rough multi-language representation of the word. This heuristic
+composition constitutes a naive solution to build a multi-language embedding
+space; however, we opted to evaluate this approach as a simple baseline for future
+investigations. The encoder's final state represents the semantic meaning of the
certificate line and serves as intial input for decoding process. \begin{figure}
@@ -32,7 +35,7 @@ decoding process. The decoder will be trained to predict the death cause description text from the provided ICD-10 dictionaries word by word (using special tags \textbackslash s and \textbackslash e for start resp. end of a sequence). All input tokens will be represented using the concatenation of the
-FastText embeddings of all three languages.}
+fastText embeddings \cite{bojanowski_enriching_2016} of all three languages.}
\label{fig:encoder_decoder}
\end{figure}
diff --git a/paper/32_methods_icd10.tex b/paper/32_methods_icd10.tex
index 6c72dfba5fa0b1d4b9e3ffc738171080b9382d0f..89c2c572dfac5b09cf8785cf4484716852a89c31 100644
--- a/paper/32_methods_icd10.tex
+++ b/paper/32_methods_icd10.tex
@@ -3,7 +3,7 @@ The second step in our pipeline is to assign a ICD-10 code to the obtained death cause description. For this purpose we employ a bidirectional LSTM model which is able to capture the past and future context for each token of a death cause description. Just as in our encoder-decoder model we encode each token using the
-concatenation of the FastText embeddings of the word from all three languages.
+concatenation of the fastText embeddings of the word from all three languages.
To enable our model to attend to different parts of the death cause description we add an extra attention layer \cite{raffel_feed-forward_2015} to the model. Through the attention mechanism our model learns a fixed-sized embedding of the
@@ -22,7 +22,7 @@ The attention layer summarizes the whole description by computing an adaptive weighted average over the LSTM states. The resulting death cause embedding will be feed through a softmax layer to get the final classification. Equivalent to our encoder-decoder model all input tokens will be represented using the
-concatenation of the FastText embeddings of all three languages.}
+concatenation of the fastText embeddings of all three languages.}
\label{fig:classification-model}
\end{figure}
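To make the classification architecture more concrete, the following Keras sketch builds a bidirectional LSTM whose states are summarised by a simple feed-forward attention layer in the spirit of Raffel and Ellis, followed by a softmax over the ICD-10 label space. All sizes (sequence length, hidden units, number of codes) and the exact attention formulation are assumptions for illustration, not the authors' published implementation.

```python
from tensorflow.keras.layers import (Input, Bidirectional, LSTM, Dense,
                                     Softmax, Dot, Flatten)
from tensorflow.keras.models import Model

MAX_LEN = 30      # assumed maximum length of a death cause description
EMB_DIM = 900     # concatenated fastText vectors (3 x 300)
NUM_CODES = 1000  # assumed size of the ICD-10 label space

inputs = Input(shape=(MAX_LEN, EMB_DIM))
# Bi-LSTM captures past and future context of every token.
states = Bidirectional(LSTM(256, return_sequences=True))(inputs)

# Feed-forward attention: score each time step, normalise the scores over the
# sequence and compute an adaptive weighted average of the LSTM states.
scores = Dense(1, activation="tanh")(states)      # (batch, MAX_LEN, 1)
weights = Softmax(axis=1)(scores)                 # attention weights per time step
death_cause_emb = Dot(axes=1)([weights, states])  # weighted average of the states
death_cause_emb = Flatten()(death_cause_emb)

# Softmax layer producing the final ICD-10 classification.
outputs = Dense(NUM_CODES, activation="softmax")(death_cause_emb)
model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```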
diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex
index 68281f9d5a1b8f5648db8182d0851ed2852be219..91483925c18196318265bd3454413c9b3d43f0c4 100644
--- a/paper/40_experiments.tex
+++ b/paper/40_experiments.tex
@@ -4,31 +4,36 @@ developed models, both individually as well as combined in a pipeline setting. \subsection{Training Data and Experiment Setup} The CLEF e-Health 2018 Task 1 participants where provided with annotated death certificates for the three selected languages: French, Italian and Hungarian.
-Each of the languages is supported by several data sources. Provided data sets
-were imbalanced concerning the different languages: the Italian corpora consists
-of 49,823, French corpora of 77,348\footnote{For French we only took the
-provided data from 2014.} and Hungarian corpora 323,175 certificate lines.
-
-The training data used in this approach was created by combining the data
-sources of all three languages. Despite the provided certificate data we used no
-further, external data sources. Each dataset was split into a train and
-a hold-out evaluation set. We didn't perform cross-validation during development, however
-we shuffle the train and validation dataset before each training epoch.
-Moreover, no hyperparameter optimization was performed due to time constraints
-during the development phase. Instead we set default
-the default parameters values for individual layers being used.
-
-We used pre-trained fastText\footnote{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}
+Each of the languages is supported by training certificate lines as well as a
+dictionary with death cause descriptions resp. diagnoses for the different ICD-10
+codes. The provided training data sets were imbalanced concerning the different
+languages: the Italian corpus consists of 49,823, the French corpus of 77,348\footnote{For
+French we only took the provided data set from 2014.} and the Hungarian corpus of 323,175
+certificate lines. We split each data set into a training and a hold-out evaluation set. The
+complete training data set was then created by combining the certificate lines
+of all three languages into one data set. Apart from the provided certificate
+data, no further external knowledge resources or annotated texts were
+incorporated.
+
+Due to time constraints during development we did not perform cross-validation
+to optimize the (hyper-)parameters and the individual layers of our models. We
+either kept the default values of the hyperparameters or set them to reasonable
+values according to existing work. During model training we shuffle the training
+instances and use varying validation instances to validate each epoch.
+
+As representation for the input tokens of the model we use pre-trained fastText
word embeddings \cite{bojanowski_enriching_2016}. The embeddings were trained on
-Common Crawl and a Wikipedia dump. The embeddings were trained with the
-following parameters: CBOW with position-weights, embedding dimension size 300,
-with character n-grams of length 5, a window of size 5 and 10 negatives.
-Unfortunately, they are trained on corpora not related with the biomedical
-domain and therefore do not represent the best possible embedding space for
-biomedical information extraction. Final embedding space used by our models is
-created by concatenating individual embedding vectors for all three languages.
-Thus the input of our model is embedding vector of size 900. All models were
-implemented with the Keras library \footnote{https://keras.io/}.
+Common Crawl and Wikipedia articles. For the training of the embeddings,
+Bojanowski et al. used the following parameter settings: CBOW with
+position-weights, embedding dimension size 300, with character n-grams of length
+5, a window of size 5 and 10 negatives. Unfortunately, they are trained on
+corpora not related to the biomedical domain and therefore do not represent
+the best possible textual basis for an embedding space for biomedical
+information extraction. The final embedding space used by our models is created by
+concatenating the individual embedding vectors for all three languages. Thus the
+input of our model is an embedding vector of size 900. All models were implemented
+with the Keras library \footnote{\url{https://keras.io/}} in Version X.X.
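The concatenation of the three pre-trained embedding spaces can be illustrated with a short sketch using the fasttext Python package. The model file names below refer to the published Common Crawl + Wikipedia vectors; the whitespace tokenisation and all variable names are simplifying assumptions rather than the authors' actual preprocessing.

```python
import numpy as np
import fasttext  # official Python bindings of the fastText library

# Pre-trained 300-dimensional Common Crawl + Wikipedia vectors (assumed file names).
models = [fasttext.load_model(path)
          for path in ("cc.fr.300.bin", "cc.it.300.bin", "cc.hu.300.bin")]

def embed_token(token: str) -> np.ndarray:
    """Look up a token in all three embedding models and concatenate the vectors."""
    return np.concatenate([m.get_word_vector(token) for m in models])  # shape: (900,)

def embed_line(line: str) -> np.ndarray:
    """Embed a whole certificate line (naive whitespace tokenisation)."""
    return np.stack([embed_token(tok) for tok in line.split()])  # shape: (n_tokens, 900)
```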
\subsection{Death cause extraction model}
To identify possible tokens as candidates for a death cause description, we
@@ -39,17 +44,22 @@ encoders output is used as the initial state of the decoder. The decoder generates, based on the input description from the dictionary and a special start token, a death cause word by word. This decoding process continues until a special end token is generated. The entire model is optimized using the
-Adam optimizer and a batch size of 700. Model training was performed either for
-100 epochs or if an early stopping criteria is met (no change in validation loss
-for two epochs).
+Adam optimization algorithm \cite{kingma_adam:_2014} and a batch size of 700. Model
+training was performed for a maximum of 100 epochs or until an early stopping
+criterion was met (no change in validation loss for two epochs).
As the available dataset are imbalanced concerning the different languages, we
-devised two approaches: (1) DCEM-Balanced, where each language was supported by
-49.823 randomly drawn data points (size of the smallest corpus) and (2) DCEM-Full,
-where all available data is used. The results, obtained on the validation set,
-are shown in Table \ref{tab:s2s}.
+devised two different evaluation settings: (1) DCEM-Balanced, where each
+language was supported by 49,823 randomly drawn instances (size of the smallest
+corpus) and (2) DCEM-Full, where all available data is used. The results,
+obtained on the training and validation set, are shown in Table \ref{tab:s2s}.
+The figures reveal that the distribution of training instances per language has a
+strong influence on the performance of the model. The model trained on the
+full training data achieves an accuracy of 0.678 on the validation set. In contrast,
+using the balanced data set, the model reaches an accuracy of 0.899 (+32.5\%).
\begin{table}[]
+\label{tab:s2s}
\centering \begin{tabularx}{0.9\textwidth}{p{3cm}|c|c|c|c|c} \toprule
@@ -64,7 +74,6 @@ DCEM-Full & 9 &0.709 & 0.098 & 0.678 & 0.330 \\ \caption{Experiment results of our death cause extraction sequence-to-sequence model concerning balanced (equal number of training data per language) and full data set setting.}
-\label{tab:s2s}
\end{table}
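A compact Keras sketch of such an encoder-decoder model and of the training regime reported above (Adam, batch size 700, at most 100 epochs, early stopping after two epochs without improvement of the validation loss) could look as follows. The hidden size, vocabulary size and variable names are assumptions, and the `fit` call is only schematic because the preprocessed tensors are not defined here; this is not the authors' published code.

```python
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.callbacks import EarlyStopping

EMB_DIM = 900       # concatenated fastText vectors (3 x 300)
HIDDEN = 256        # assumed LSTM hidden size
VOCAB_SIZE = 20000  # assumed size of the death cause target vocabulary

# Encoder: reads the embedded certificate line and keeps only its final state.
enc_inputs = Input(shape=(None, EMB_DIM), name="certificate_line")
_, state_h, state_c = LSTM(HIDDEN, return_state=True)(enc_inputs)

# Decoder: generates the death cause description word by word, initialised with
# the encoder state; its inputs are the embedded dictionary descriptions framed
# by the special start/end tokens mentioned in the text.
dec_inputs = Input(shape=(None, EMB_DIM), name="death_cause_description")
dec_states = LSTM(HIDDEN, return_sequences=True)(
    dec_inputs, initial_state=[state_h, state_c])
dec_outputs = Dense(VOCAB_SIZE, activation="softmax")(dec_states)

model = Model([enc_inputs, dec_inputs], dec_outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Training settings as described in the text.
early_stopping = EarlyStopping(monitor="val_loss", patience=2)
# model.fit([X_encoder, X_decoder], y_decoder, validation_data=val_data,
#           batch_size=700, epochs=100, callbacks=[early_stopping])
```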
\subsection{ICD-10 Classification Model}
@@ -91,6 +100,7 @@ label space to all available ICD-10 codes. The results obtained from the two approaches are shown in Table \ref{tab:icd10Classification}.
\begin{table}[]
+\label{tab:icd10Classification}
\centering \begin{tabularx}{\textwidth}{p{2.25cm}|p{1.75cm}|c|c|c|c|c} \toprule
@@ -104,7 +114,6 @@ Character & Minimal & 91 & 0.732 & 1.186 & 0.516 & 2.505 \\ \bottomrule \end{tabularx} \caption{Experiment results for our ICD-10 classification model regarding different settings.}
-\label{tab:icd10Classification}
\end{table}
\subsection{Complete Pipeline}
diff --git a/paper/encoder-decoder-model.docx b/paper/encoder-decoder-model.docx
index 85151b5c87c960837289f4a33fe37ad2bf70726a..177b925107d623612a9466888ebfda4f69fe3b59 100644
Binary files a/paper/encoder-decoder-model.docx and b/paper/encoder-decoder-model.docx differ
diff --git a/paper/references.bib b/paper/references.bib
index 28bddd4a79b708a1f82fc67e01a2e1ce0babc85f..7244f07ab4eb06dcd4e824ebbc80972738299367 100644
--- a/paper/references.bib
+++ b/paper/references.bib
@@ -353,4 +353,13 @@ The system proposed in this study provides automatic identification and characte year = {2016}, pages = {4960--4964}, file = {Fulltext:/Users/mario/Zotero/storage/ZV5B2GQJ/Chan et al. - 2016 - Listen, attend and spell A neural network for lar.pdf:application/pdf;Snapshot:/Users/mario/Zotero/storage/RS8MBCM8/7472621.html:text/html}
+}
+
+@article{kingma_adam:_2014,
+ title = {Adam: {A} method for stochastic optimization},
+ shorttitle = {Adam},
+ journal = {arXiv preprint arXiv:1412.6980},
+ author = {Kingma, Diederik P. and Ba, Jimmy},
+ year = {2014},
+ file = {Snapshot:/Users/mario/Zotero/storage/YSR9BL4W/1412.html:text/html}
}
\ No newline at end of file
diff --git a/paper/wbi-eclef18.tex b/paper/wbi-eclef18.tex
index 3c12529851cda0ca43e4699a1aed8a2b7f33e675..acbb005ab21d7d5d851fac481d891b58dacdcc6e 100644
--- a/paper/wbi-eclef18.tex
+++ b/paper/wbi-eclef18.tex
@@ -8,6 +8,7 @@ \usepackage{color} \usepackage{multirow,tabularx} \usepackage{booktabs}
+\usepackage{hyperref}
% Used for displaying a sample figure. If possible, figure files should % be included in EPS format.
@@ -15,7 +16,7 @@ % If you use the hyperref package, please uncomment the following line % to display URLs in blue roman font according to Springer's eBook style:
-% \renewcommand\UrlFont{\color{blue}\rmfamily}
+\renewcommand\UrlFont{\color{blue}\rmfamily}
\begin{document}
@@ -44,13 +45,22 @@ Bioinformatics, \\ Berlin, Germany\\ \maketitle % typeset the header of the contribution % \begin{abstract}
-This paper describes the participation of the WBI team in the CLEF eHealth 2018 shared task 1 (``Multilingual Information Extraction - ICD-10 coding'').
-Our approach builds on two recurrent neural networks models to extract and classify causes of death from French, Italian and Hungarian death certificates.
-First, we employ a LSTM-based sequence-to-sequence model to obtain a symptom name from each death certificate line.
-We then utilize a bidirectional LSTM model with attention mechanism to assign the respective ICD-10 codes to the received symptom names.
-Our model achieves \ldots
+This paper describes the participation of the WBI team in the CLEF eHealth 2018
+shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). Our
+contribution focuses on the setup and evaluation of a baseline language-independent
+neural architecture for ICD-10 classification as well as a simple, heuristic
+multi-language word embedding technique. The approach builds on two recurrent
+neural network models to extract and classify causes of death from French,
+Italian and Hungarian death certificates. First, we employ an LSTM-based
+sequence-to-sequence model to obtain a death cause description from each death
+certificate line. We then utilize a bidirectional LSTM model with an attention
+mechanism to assign the respective ICD-10 codes to the received death cause
+description. Both models take multi-language word embeddings as inputs. During
+evaluation, our best model achieves an F-measure of 0.34 for French, 0.45 for
+Hungarian and 0.77 for Italian.
-\keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model \and Represention learning \and Attention mechanism}
+\keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model
+\and Representation learning \and Recurrent neural network \and Attention mechanism \and Multi-language embeddings}
\end{abstract}