diff --git a/paper/10_introduction.tex b/paper/10_introduction.tex index 3913217d8d118294a8b5bd3d83828f5babcf907b..1f014ca94ec69d3d56b8366282855f5173ba6c20 100644 --- a/paper/10_introduction.tex +++ b/paper/10_introduction.tex @@ -1,43 +1,26 @@ -Automatic extraction, classification and analysis of biological and medical -concepts from unstructured texts, such as scientific publications or electronic -health documents, is a highly important task to support many applications in -research, daily clinical routine and policy-making. Computer-aided approaches -can improve decision making and support clinical processes, for example, by -giving a more sophisticated overview about a research area, providing detailed -information about the aetiopathology of a patient or disease patterns. In the -past years major advances have been made in the area of natural language -processing. However, improvements in the field of biomedical text mining lag -behind other domains mainly due to privacy issues and concerns regarding the -processed data (e.g. electronic health records). +Automatic extraction, classification and analysis of biological and medical concepts from unstructured texts, such as scientific publications or electronic health documents, is a highly important task to support many applications in research, daily clinical routine and policy-making. +Computer-aided approaches can improve decision making and support clinical processes, for example, by giving a more sophisticated overview of a research area or by providing detailed information about the aetiopathology of a patient or about disease patterns. +In the past years, major advances have been made in the area of natural language processing. +However, improvements in the field of biomedical text mining lag behind other domains, mainly due to privacy issues and concerns regarding the processed data (e.g. electronic health records). -The CLEF eHealth lab attends to this circumstance through organization of -various shared tasks which aid and support the development of approaches to -exploit electronically available medical content \cite{suominen_overview_2018}. -In particular, Task 1 of the lab was concerned with the extraction and -classification of causes of death from death certificates originating from -different languages \cite{neveol_clef_2018}. Participants were asked to classify -the death causes mentioned in the certificates according to the International -Classification of Disease version 10 (ICD-10). The task has been carried out the -last two years of the lab, however was only concerned with French and English -certificates. In contrast, the organizers provided annotated death reports as -well as ICD-10 dictionaries for French, Italian and Hungarian this year. The -development of language-independent, multilingual approaches was encouraged. - -Inspired by the recent success of recurrent neural network models -\cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in -general and the convincing performance of the work from Miftahutdinov and -Tutbalina \cite{miftakhutdinov_kfu_2017} in the last year's competition we opt -for the development of a deep learning model for this year's task. Our work -introduces a language independent approach for ICD-10 classification using -multi-language word embeddings and LSTM-based recurrent models. We divide the -the classification into two tasks. First, we extract the death cause description -from a certificate line backed by an encoder-decoder model.
Given the death -cause the actual ICD-10 classification will be performed by a separate LSTM -model. Our work focus on the setup and evaluation of an initial, baseline -language-independent approach which builds on a heuristic multi-language -embedding space and therefore only needs one single model for all three data -sets. Moreover, we tried to as little as possible additional external resources. +The CLEF eHealth lab addresses this circumstance through the organization of various shared tasks %which aid and support the development of approaches +to exploit electronically available medical content \cite{suominen_overview_2018}. +In particular, Task 1\footnote{\url{https://sites.google.com/view/clef-ehealth-2018/task-1-multilingual-information-extraction-icd10-coding}} of the lab is concerned with the extraction and classification of death causes from death certificates originating from different languages \cite{neveol_clef_2018}. +Participants were asked to classify the death causes mentioned in the certificates according to the International Classification of Disease version 10 (ICD-10). +The task %has been carried out the last two years of the lab, however +was concerned with French and English death certificates in previous years. +In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian. +The development of language-independent, multilingual approaches was encouraged. +Inspired by the recent success of recurrent neural network models \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work from Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017} in the CLEF eHealth 2017 Task 1 competition, we opt for the development of a deep learning model for this year's task. +Our work introduces a language-independent approach for ICD-10 classification using multi-language word embeddings and LSTM-based recurrent models. +We divide the proposed pipeline %$classification +into two tasks. +First, we perform Named Entity Recognition (NER), i.e. extracting the death cause description from a certificate line, with an encoder-decoder model. +Given the death cause, Named Entity Normalization (NEN), i.e. assigning an ICD-10 code to the extracted death cause, is performed by a separate LSTM model. +In this work we present the setup and evaluation of an initial, baseline language-independent approach which builds on a heuristic multi-language embedding space and therefore only needs one single model for all three data sets. +Moreover, we tried to use as few additional external resources as possible. +PARAGRAPH ABOUT EMBEDDINGS. diff --git a/paper/20_related_work.tex b/paper/20_related_work.tex index bbd7b4f59c1ed331ea6d98b969736c2f8b340ad6..d63f3ab2f49447e316c263c48e91c4b1df7b3ee9 100644 --- a/paper/20_related_work.tex +++ b/paper/20_related_work.tex @@ -4,7 +4,7 @@ eHealth lab. Participating teams used a plethora of different approaches to tackle the classification problem. The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and -machine learning approaches +machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
The former relies on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge-bases @@ -13,14 +13,13 @@ according to a rule framework. For example, Di Nunzio et al. by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score. In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as an information retrieval task -and utilze the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}. +and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}}. -The machine learning based approaches employ a variety techniques, e.g. +The ML-based approaches employ a variety of techniques, e.g. Conditional Random Fields (CRFs) \cite{ho-dac_litl_2016}, Labeled Latent Dirichlet Allocation (LDA) \cite{dermouche_ecstra-inserm_2016} and Support Vector Machines (SVMs) \cite{ebersbach_fusion_2017} with diverse hand-crafted features. - Most similar to our approach is the work from Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017}, which achieved the best results for English certificates in the last year's competition. They use a neural LSTM-based @@ -31,7 +30,7 @@ diagnosis texts of the individual ICD-10 codes is used to integrate prior knowledge into the model. The concatenation of both vector representations is then used to output the characters and numbers of the ICD-10 code in the decoding step. In contrast to their work, our approach introduces a model for -multi-language ICD-10 classification. We utilitize two separate recurrent neural +multi-language ICD-10 classification. We utilize two separate recurrent neural networks, one sequence-to-sequence model for death cause extraction and one for classification, to predict the ICD-10 codes for a certificate text independent of the language it originates from. diff --git a/paper/30_methods_intro.tex b/paper/30_methods_intro.tex index 0226452b5b412f1af81a958892ea17bb3570f3c9..1f4649c32602151a718891b28e6094820af6d640 100644 --- a/paper/30_methods_intro.tex +++ b/paper/30_methods_intro.tex @@ -33,4 +33,6 @@ which make the past and future context available in every time step. A bidirectional LSTM model consists of a forward chain, which processes the input data from left to right, and a backward chain, consuming the data in the opposite direction. The final representation is typically the concatenation or a -linear combination of both states. \ No newline at end of file +linear combination of both states. + +AREN'T WE MOVING THIS TO RELATED WORK? \ No newline at end of file diff --git a/paper/31_methods_seq2seq.tex b/paper/31_methods_seq2seq.tex index e18d6c1ab14f3592a035dc9a21eab955e9c65eab..42cfde18daeedbffee0add5989b996cbafda651f 100644 --- a/paper/31_methods_seq2seq.tex +++ b/paper/31_methods_seq2seq.tex @@ -1,52 +1,33 @@ \subsection{Death Cause Extraction Model} -The first step in our pipeline is the extraction of the death cause description -from a given certificate line. We use the training certificate lines (with their -corresponding ICD-10 codes) and the ICD-10 dictionaries as basis for our model. -The dictionaries provide us with death causes resp. diagnosis for each ICD-10 -code. The goal of the model is to reassemble the dictionary death cause -description text from the certificate line. +The first step in our pipeline is the extraction of the death cause description from a given certificate line.
+We use the training certificate lines (with their corresponding ICD-10 codes) and the ICD-10 dictionaries as the basis for our model. +The dictionaries provide us with death causes resp. diagnoses for each ICD-10 code. +The goal of the model is to reassemble the dictionary death cause description text from the certificate line. -For this we adopt the encoder-decoder architecture proposed in -\cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the -architecture of the model. As encoder we utilize a forward LSTM model, which -takes the single words of a certificate line as inputs and scans the line from -left to right. Each token will be represented using pre-trained fastText -word embeddings. Word embedding models represent words using a real-valued -vector and caputure syntactic and semantic similiarities between them. fastText -embeddings take sub-word information into account during training whereby the -model is able to provide suitable representations even for unseen, -out-of-vocabulary words. We utilize fastText embeddings for French, Italian and -Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}. -Independently from which lanugage a word originates we lookup the word in all -three embedding models and concatenate the obtained vectors. Through this we get -a (kind of) multi-language representation of the word. This heuristic -composition constitutes a naive solution to build a multi-language embedding -space, however we opted to evaluate this approach as simple baseline for future -investigations. The encoders final state represents the semantic meaning of the -certificate line and serves as intial input for decoding process. +For this we adopt the encoder-decoder architecture proposed in \cite{sutskever_sequence_2014}. Figure \ref{fig:encoder_decoder} illustrates the architecture of the model. +As encoder we utilize a forward LSTM model, which takes the single words of a certificate line as inputs and scans the line from left to right. +Each token is represented using pre-trained fastText\footnote{\url{https://github.com/facebookresearch/fastText/}} word embeddings \cite{bojanowski_enriching_2016}. +Word embedding models represent words using a real-valued vector and capture syntactic and semantic similarities between them. +fastText embeddings take sub-word information into account during training, whereby the model is able to provide suitable representations even for unseen, out-of-vocabulary (OOV) words. +We utilize fastText embeddings for French, Italian and Hungarian trained on Common Crawl and Wikipedia articles\footnote{\url{https://github.com/facebookresearch/fastText/blob/master/docs/crawl-vectors.md}}. +Independently of the language a word originates from, we look up the word in all three embedding models and concatenate the obtained vectors. +Through this we get a (basic) multi-language representation of the word. +This heuristic composition constitutes a naive solution to build a multi-language embedding space. +However, we opted to evaluate this approach as a simple baseline for future work. +The encoder's final state captures the semantics of the certificate line and serves as the initial input for the decoding process. \begin{figure} -\includegraphics[width=\textwidth,trim={0 17cm 0 -3cm},clip=true]{encoder-decoder-model.pdf} \caption{Illustration of the neural -encoder-decoder model for death cause extraction.
The encoder processes a death -certificate line token-wise from left to right. The final state of the encoder -forms a semantic representation of the line and serves as initial input for the -decoding process. The decoder will be trained to predict the death cause -description text from the provided ICD-10 dictionaries word by word (using -special tags \textbackslash s and \textbackslash e for start resp. end of a -sequence). All input tokens will be represented using the concatenation of the -fastText embeddings \cite{bojanowski_enriching_2016} of all three languages.} +\includegraphics[width=\textwidth,trim={0 17cm 0 3cm},clip=true]{encoder-decoder-model.pdf} +\caption{Illustration of the neural encoder-decoder model for death cause extraction. The encoder processes a death certificate line token-wise from left to right. The final state of the encoder forms a semantic representation of the line and serves as initial input for the decoding process. The decoder will be trained to predict the death cause description text from the provided ICD-10 dictionaries word by word (using special tags \textbackslash s and \textbackslash e for start resp. end of a sequence). All input tokens will be represented using the concatenation of the fastText embeddings %\cite{bojanowski_enriching_2016} +of all three languages.} \label{fig:encoder_decoder} \end{figure} -As decoder with utilize another LSTM model. The initial input of the decoder is -the final state of the encoder. Moreover, each token of the dictionary death -cause description name (padded with special start and end tag) serves as input -for the different time steps. Again, we use FastEmbeddngs of all three languages -to represent the token. The decoder predicts one-hot-encoded words of the -symptom name. During test time we use the encoder to obtain a semantic -representation of the certificate line and decode the death cause description -word by word starting with the special start tag. The decoding process finishs -when the decoder outputs the end tag. +For the decoder we utilize another LSTM model. The initial input of the decoder is the final state of the encoder model. +Moreover, each token of the dictionary death cause description (padded with special start and end tags) serves as input for the different time steps. +Again, we use fastText embeddings of all three languages to represent the token. +The decoder predicts one-hot-encoded words of the death cause description. +During test time we use the encoder to obtain a semantic representation of the certificate line and decode the death cause description word by word, starting with the special start tag. +The decoding process finishes when the decoder outputs the end tag. diff --git a/paper/32_methods_icd10.tex b/paper/32_methods_icd10.tex index 89c2c572dfac5b09cf8785cf4484716852a89c31..263a1826755cfc900787bcc7a68f7da664eebfea 100644 --- a/paper/32_methods_icd10.tex +++ b/paper/32_methods_icd10.tex @@ -1,33 +1,22 @@ \subsection{ICD-10 Classification Model} -The second step in our pipeline is to assign a ICD-10 code to the obtained death -cause description. For this purpose we employ a bidirectional LSTM model which -is able to capture the past and future context for each token of a death cause description. -Just as in our encoder-decoder model we encode each token using the -concatenation of the fastText embeddings of the word from all three languages.
-To enable our model to attend to different parts of the death cause description -we add an extra attention layer \cite{raffel_feed-forward_2015} to the model. -Through the attention mechanism our model learns a fixed-sized embedding of the -death cause description by computing an adaptive weighted average of the state -sequence of the LSTM model. This allows the model to better integrate -information over time. Figure \ref{fig:classification-model} presents the -architecture of our ICD-10 classification model. +The second step in our pipeline is to assign an ICD-10 code to the generated death cause description. +For this, we employ a bidirectional LSTM model which is able to capture the past and future context for each token of a death cause description. +Just as in our encoder-decoder model we encode each token using the concatenation of the fastText embeddings of the word from all three languages. +To enable our model to attend to different parts of the death cause description we add an extra attention layer \cite{raffel_feed-forward_2015} to the model. +Through the attention mechanism our model learns a fixed-sized embedding of the death cause description by computing an adaptive weighted average of the state sequence of the LSTM model. +This allows the model to better integrate information over time. Figure \ref{fig:classification-model} presents the architecture of our ICD-10 classification model. \begin{figure} \centering -\includegraphics[width=\textwidth,trim={0cm 16.5cm 0cm -3cm},clip=true]{classification-model.pdf} \caption{Illustration of the neural -ICD-10 classification model. The model utilizes a bi-directional LSTM layer, -which processes the death cause description from left to right and vice versa. -The attention layer summarizes the whole description by computing an adaptive -weighted average over the LSTM states. The resulting death cause embedding will -be feed through a softmax layer to get the final classification. Equivalent to -our encoder-decoder model all input tokens will be represented using the -concatenation of the fastText embeddings of all three languages.} +\includegraphics[width=\textwidth,trim={0cm 16.5cm 0cm 3cm},clip=true]{classification-model.pdf} +\caption{Illustration of the neural ICD-10 classification model. The model utilizes a bi-directional LSTM layer, which processes the death cause description from left to right and vice versa. +The attention layer summarizes the whole description by computing an adaptive weighted average over the LSTM states. +The resulting death cause embedding will be fed through a softmax layer to get the final classification. +As in our encoder-decoder model, all input tokens will be represented using the concatenation of the fastText embeddings of all three languages.} \label{fig:classification-model} \end{figure} -We train the model using the provided ICD-10 dictionaries from all three -languages. During development we also experimented with character-level RNNs for -better ICD-10 classification, however couldn't achieve any performance +We train the model using the provided ICD-10 dictionaries from all three languages. +During development we also experimented with character-level RNNs for better ICD-10 classification; however, we could not achieve any performance improvements.
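+
+To make the architecture more tangible, the following minimal Keras sketch outlines the death cause extraction model described above. It is an illustrative approximation rather than our exact implementation: the constants (e.g. \texttt{MAX\_LINE\_LEN}, \texttt{MAX\_DESC\_LEN}, \texttt{VOCAB\_SIZE}) are hypothetical, input masking is omitted for brevity, and only the layer sizes and training settings reported in the experiments section (256 LSTM units, Adam, batch size 700, early stopping after two epochs without change in validation loss) are taken from the paper.
+\begin{verbatim}
+from tensorflow.keras import layers, models
+
+EMB_DIM = 3 * 300       # concatenated French/Italian/Hungarian fastText vectors
+MAX_LINE_LEN = 30       # assumed maximum certificate line length (tokens)
+MAX_DESC_LEN = 20       # assumed maximum dictionary description length (tokens)
+VOCAB_SIZE = 20000      # assumed size of the description output vocabulary
+
+# Multi-language token representation (computed outside the network):
+# look up each token in the three fastText models and concatenate the vectors,
+# e.g. with the official `fasttext` package:
+#   vec = np.concatenate([ft_fr.get_word_vector(tok),
+#                         ft_it.get_word_vector(tok),
+#                         ft_hu.get_word_vector(tok)])
+
+# Encoder: reads the embedded certificate line from left to right.
+enc_in = layers.Input(shape=(MAX_LINE_LEN, EMB_DIM))
+_, state_h, state_c = layers.LSTM(256, return_state=True)(enc_in)
+
+# Decoder: generates the dictionary death cause description word by word,
+# initialised with the final encoder state.
+dec_in = layers.Input(shape=(MAX_DESC_LEN, EMB_DIM))
+dec_seq = layers.LSTM(256, return_sequences=True)(
+    dec_in, initial_state=[state_h, state_c])
+dec_out = layers.Dense(VOCAB_SIZE, activation="softmax")(dec_seq)
+
+model = models.Model([enc_in, dec_in], dec_out)
+model.compile(optimizer="adam", loss="categorical_crossentropy",
+              metrics=["accuracy"])
+# model.fit([lines, descs_in], descs_out, batch_size=700, epochs=100,
+#           callbacks=[EarlyStopping(monitor="val_loss", patience=2)])
+\end{verbatim}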
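+
+Analogously, the sketch below outlines the ICD-10 classification model, i.e. a bidirectional LSTM whose states are summarized by a feed-forward attention layer in the spirit of \cite{raffel_feed-forward_2015} before a softmax output layer assigns the ICD-10 code. Again, this is only a simplified illustration under assumptions: \texttt{NUM\_ICD10\_CODES} is a hypothetical constant, the attention scoring is reduced to a single dense layer, and the input masking used in our implementation is left out.
+\begin{verbatim}
+import tensorflow as tf
+from tensorflow.keras import layers, models
+
+EMB_DIM = 3 * 300        # concatenated multi-language fastText vectors
+MAX_DESC_LEN = 20        # assumed maximum death cause description length
+NUM_ICD10_CODES = 1800   # assumed number of ICD-10 codes in the label space
+
+desc_in = layers.Input(shape=(MAX_DESC_LEN, EMB_DIM))
+states = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(desc_in)
+
+# Feed-forward attention: score every LSTM state, normalise the scores over
+# the time axis and compute the adaptive weighted average of the states.
+scores = layers.Dense(1, activation="tanh")(states)
+weights = layers.Softmax(axis=1)(scores)
+summary = layers.Lambda(
+    lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([states, weights])
+
+icd10_out = layers.Dense(NUM_ICD10_CODES, activation="softmax")(summary)
+model = models.Model(desc_in, icd10_out)
+model.compile(optimizer="adam", loss="categorical_crossentropy",
+              metrics=["accuracy"])
+\end{verbatim}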
diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex index cbe5dbeff9355537627fb59fb7fdaa4e1d360370..2f96df92e98b2d6ed3a7638f05d02667761d7b80 100644 --- a/paper/40_experiments.tex +++ b/paper/40_experiments.tex @@ -1,62 +1,39 @@ -In this section we will present experiments and obtained results for the two -developed models, both individually as well as combined in a pipeline setting. +In this section we present the experiments and obtained results for the two developed models, both individually as well as combined in a pipeline setting. \subsection{Training Data and Experiment Setup} -The CLEF e-Health 2018 Task 1 participants where provided with annotated death -certificates for the three selected languages: French, Italian and Hungarian. -Each of the languages is supported by training certificate lines as well as a -dictionary with death cause descriptions resp. diagnosises for the different ICD-10 -codes. The provided training data sets were imbalanced concerning the different -languages: the Italian corpora consists of 49,823, French corpora of 77,348\footnote{For -French we only took the provided data set from 2014.} and Hungarian corpora 323,175 -certificate lines. We split each data set into a training and a hold-out evaluation set. The -complete training data set was then created by combining the certificate lines -of all three languages into one data set. Despite the provided certificate data -we used no further, external knowledge resources or annotated texts were -incorporated. - -Due to time constraints during developement we didn't perform cross-validation -to optimize the (hyper-) parameters and the inidividual layers of our models. We -either keep the default values of the hyperparameters or set them to reasonable -values according to existing work. During model training we shuffle the training -instances and use varying validation instances to perform a validation of the -epoch. - -As representation for the input tokens of the model we use pre-trained fastText -word embeddings \cite{bojanowski_enriching_2016}. The embeddings were trained on -Common Crawl and Wikipedia articles. For the training of the embeddings, -Bojanowski et al. used the following parameter settings: CBOW with -position-weights, embedding dimension size 300, with character n-grams of length -5, a window of size 5 and 10 negatives. Unfortunately, they are trained on -corpora not related with the biomedical domain and therefore do not represent -the best possible textual basis for an embedding space for biomedical -information extraction. Final embedding space used by our models is created by -concatenating individual embedding vectors for all three languages. Thus the -input of our model is embedding vector of size 900. All models were implemented -with the Keras library \footnote{\url{https://keras.io/}} in Version X.X. +The CLEF e-Health 2018 Task 1 participants were provided with annotated death certificates for the three selected languages: French, Italian and Hungarian. +Each of the languages is supported by training certificate lines as well as a dictionary with death cause descriptions resp. diagnoses for the different ICD-10 codes. +The provided training data sets were imbalanced concerning the different languages: the Italian corpus consists of 49,823, the French corpus of 77,348\footnote{For French we only took the provided data set from 2014.} and the Hungarian corpus of 323,175 certificate lines. +We split each data set into a training and a hold-out evaluation set.
+The complete training data set was then created by combining the certificate lines of all three languages into one data set. +Beyond the provided certificate data, no additional knowledge resources or annotated texts were used. + +Due to time constraints during development, no cross-validation to optimize the (hyper-) parameters and the individual layers of our models was performed. +We either keep the default values of the hyper-parameters or set them to reasonable values according to existing work. +During model training we shuffle the training instances and use varying validation instances to validate each epoch. + +As representation for the input tokens of the model we use pre-trained fastText word embeddings. %\cite{bojanowski_enriching_2016} +The embeddings were trained on Common Crawl and Wikipedia articles using the following parameter settings: CBOW with position-weights, embedding dimension size 300, with character n-grams of length 5, a window of size 5 and 10 negatives. +Unfortunately, they are trained on corpora not related to the biomedical domain and therefore do not represent the best possible textual basis for an embedding space for biomedical information extraction. +The final embedding space used by our models is created by concatenating the individual embedding vectors of all three languages. +Thus, the input of our models is an embedding vector of size 900. +All models were implemented with the Keras library\footnote{\url{https://keras.io/}}.% in Version X.X. \subsection{Death cause extraction model} -To identify possible tokens as candidates for a death cause description, we -focused on the use of an encoder-decoder model. The encoder uses an embedding -layer with input masking on zero values and a LSTM layer with 256 units. The -encoders output is used as the initial state of the decoder. - -The decoder generates, based on the input description from the dictionary and a -special start token, a death cause word by word. This decoding process continues -until a special end token is generated. The entire model is optimized using the -Adam optimization algorithm \cite{kingma_adam:_2014} and a batch size of 700. Model -training was performed either for 100 epochs or if an early stopping criteria is -met (no change in validation loss for two epochs). - -As the available dataset are imbalanced concerning the different languages, we -devised two different evaluation settings: (1) DCEM-Balanced, where each -language was supported by 49.823 randomly drawn instances (size of the smallest -corpus) and (2) DCEM-Full, where all available data is used. The results, -obtained on the training and validation set, are shown in Table \ref{tab:s2s}. -The figures reveal that distribution of training instances per language have a -huge influence on the performance of the model. The model trained on the -full training data achieves an accuracy of 0.678 on the validation set. In contrast -using the balanced data set the model reaches an accuracy of 0.899 (+ 32.5\%). +To identify possible candidates for a death cause description, we focus on the use of an encoder-decoder model. +The encoder model uses an embedding layer with input masking on zero values and an LSTM layer with 256 units. +The encoder's output is used as the initial state of the decoder model. + +Based on the input description from the dictionary and a special start token, the decoder generates a death cause word by word.
+This decoding process continues until a special end token is generated. +The entire model is optimized using the Adam optimization algorithm \cite{kingma_adam:_2014} and a batch size of 700. +Model training was performed either for 100 epochs or until an early stopping criterion was met (no change in validation loss for two epochs). + +As the available data sets are imbalanced concerning the different languages, we devised two different evaluation settings: (1) DCEM-Balanced, where each language is supported by 49,823 randomly drawn instances (the size of the smallest corpus), and (2) DCEM-Full, where all available data is used. +The results, obtained on the training and validation set, are shown in Table \ref{tab:s2s}. +The figures reveal that the distribution of training instances per language has a huge influence on the performance of the model. +The model trained on the full training data achieves an accuracy of 0.678 on the validation set. +In contrast, using the balanced data set, the model reaches an accuracy of 0.899 (+ 32.5\%). \begin{table}[] \label{tab:s2s} @@ -68,6 +45,7 @@ using the balanced data set the model reaches an accuracy of 0.899 (+ 32.5\%). &&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\ \hline DCEM-Balanced & 18 & 0.958 & 0.205 & 0.899 & 0.634 \\ +\hline DCEM-Full & 9 & 0.709 & 0.098 & 0.678 & 0.330 \\ \bottomrule \end{tabularx} @@ -77,36 +55,28 @@ data set setting.} \end{table} \subsection{ICD-10 Classification Model} -The classification model is responsible for assigning a ICD-10 code to death -cause description obtained during the first step. Our model uses an embedding -layer with input masking on zero values, followed by and bidirectional LSTM -layer with 256 dimension hidden layer. Thereafter an attention layer builds an -adaptive weighted average over all LSTM states. The respective ICD-10 code will -be determined by a dense layer with softmax activation function. We use the Adam -optimizer to perform model training. The model was validated on 25\% of the -data. As for the extraction model, no cross-validation or hyperparameter -optimization was performed due to time contraints during development. - -Once again, we devised two approaches. This was mainly caused by the lack of -adequate training data in terms of coverage for individual ICD-10 codes. -Therefore, we once again defined two training data settings: (1) minimal, where -only ICD-10 codes with two or more supporting training instances are used. This, -of course, minimizes the number of ICD-10 codes in the label space. Therefore, -(2) an extended dataset was defined. Here, the original ICD-10 code mappings, -found in the supplied dictionaries, are extended with the training instances -from individual certificate data from the three languages. Finally, for the -remaining ICD-10 codes that have only one supporting diagnosis text resp. death -cause description, we duplicate those data points. The goal of this approach is -to extend our possible label space to all available ICD-10 codes. The results -obtained from the two approaches on the validation set are shown in Table -\ref{tab:icd10Classification}. Using the \textit{minimal} data set the model -achieves an accuracy of 0.937. In contrast, using the extended data set the -model reaches an accuracy of 0.954 which represents an improvment of 1.8\%. +The classification model is responsible for assigning an ICD-10 code to the death cause description obtained during the first step.
+Our model uses an embedding layer with input masking on zero values, followed by a bidirectional LSTM layer with a hidden size of 256. +Thereafter, an attention layer builds an adaptive weighted average over all LSTM states. +The respective ICD-10 code will be determined by a dense layer with a softmax activation function. +We use the Adam optimizer to perform model training. +The model was validated on 25\% of the data. +As for the extraction model, no cross-validation or hyper-parameter optimization was performed due to time constraints during development. + +Once again, we devised two approaches. This was mainly caused by the lack of adequate training data in terms of coverage for individual ICD-10 codes. +Therefore, we once again defined two training data settings: (1) minimal (ICD-10\_Minimal), where only ICD-10 codes with two or more supporting training instances are used. +This, of course, minimizes the number of ICD-10 codes in the label space. +To counter this, (2) an extended (ICD-10\_Extended) dataset was defined. Here, the original ICD-10 code mappings, found in the supplied dictionaries, are extended with the training instances from the individual certificate data of the three languages. +Finally, for the remaining ICD-10 codes that have only one supporting diagnosis text resp. death cause description, we duplicate those data points. +The goal of this approach is to extend our possible label space to all available ICD-10 codes. +The results obtained from the two approaches on the validation set are shown in Table \ref{tab:icd10Classification}. +Using the \textit{minimal} data set, the model achieves an accuracy of 0.937. +In contrast, using the extended data set, the model reaches an accuracy of 0.954, which represents an improvement of 1.8\%. \begin{table}[] \label{tab:icd10Classification} \centering -\begin{tabularx}{0.85\textwidth}{p{2.25cm}|c|c|c|c|c} +\begin{tabularx}{0.9\textwidth}{p{2.25cm}|c|c|c|c|c} \toprule %\multirow{2}{*}{\textbf{Tokenization}}&\multirow{2}{*}{\textbf{Model}}&\multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\ %\cline{4-7} @@ -114,33 +84,36 @@ model reaches an accuracy of 0.954 which represents an improvment of 1.8\%. \cline{3-6} &&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\ \hline -Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\ -Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\ +ICD-10\_Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\ +\hline +ICD-10\_Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\ %Character & Minimal & 91 & 0.732 & 1.186 & 0.516 & 2.505 \\ \bottomrule \end{tabularx} -\caption{Experiment results for our ICD-10 classification model regarding different data settings. The \textit{Minimal} -setting uses only ICD-10 codes with two or more training instances in the supplied dictionary. In contrast, -\textit{Extended} addtionally takes the diagnosis texts from the certificate data and duplicates -ICD-10 training instances with only one diagnosis text in the dictionary and certificate lines.} +\caption{Experiment results for our ICD-10 classification model regarding different data settings. +The \textit{Minimal} setting uses only ICD-10 codes with two or more training instances in the supplied dictionary.
+In contrast, \textit{Extended} additionally takes the diagnosis texts from the certificate data and duplicates ICD-10 training instances with only one diagnosis text in the dictionary and certificate lines.} \end{table} \subsection{Complete Pipeline} -The two models where combined to create the final pipeline. We tested both -neural models in the final pipeline, as their performance differs greatly. -As both ICD-10 classification models perform similarly, we used the word and -extended ICD-10 classification model in the final pipeline. The results obtained -during training are presented in Table \ref{tab:final_train}. Results obtained -on the evaluation dataset are shown in Table \ref{tab:final_test}. +The two models were combined to create the final pipeline. +We tested both death cause extraction models in the final pipeline, as their performance differs greatly. +As both ICD-10 classification models perform similarly, we only used the word-level model trained on the extended data set (ICD-10\_Extended) in the final pipeline. +The results obtained during training are presented in Table \ref{tab:final_train}. +Results obtained on the evaluation data set are shown in Table \ref{tab:final_test}. \begin{table}[] \centering -\begin{tabular}{|l|l|l|l|} -Model & Precision & Recall & F-score \\ -S2S balanced + ICD-10 extended & 0.73 & 0.61 & 0.61 \\ -S2S extended + ICD-10 extended & 0.74 & 0.62 & 0.63 \\ +\begin{tabular}{|l|c|c|c|} +\toprule +\textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F-score} \\ +\hline +Final-Balanced & 0.73 & 0.61 & 0.61 \\ +\hline +Final-Full & 0.74 & 0.62 & 0.63 \\ +\bottomrule \end{tabular} -\caption{Final Pipeline Evaluation} +\caption{Final Pipeline Performance - Training Data. Final-Balanced = DCEM-Balanced + ICD-10\_Extended. Final-Full = DCEM-Full + ICD-10\_Extended.} \label{tab:final_train} \end{table} @@ -151,8 +124,8 @@ S2S extended + ICD-10 extended & 0.74 & 0.62 & 0.63 \\ \textbf{Language} & \textbf{Model} & \textbf{Precision} & \textbf{Recall} & \textbf{F-score}\\ \hline \multirow{2}{*}{French} -& DCEM-Balanced & 0.494 & 0.246 & 0.329 \\ -& DCEM-Full & 0.512 & 0.253 & 0.339 \\ +& Final-Balanced & 0.494 & 0.246 & 0.329 \\ +& Final-Full & 0.512 & 0.253 & 0.339 \\ \cline{2-5} & Baseline & 0.341 & 0.200 & 0.253 \\ & Average & 0.723 & 0.410 & 0.507 \\ @@ -160,8 +133,8 @@ S2S extended + ICD-10 extended & 0.74 & 0.62 & 0.63 \\ \hline \multirow{2}{*}{Hungarian} -& DCEM-Balanced & 0.518 & 0.384 & 0.441 \\ -& DCEM-Full & 0.522 & 0.388 & 0.445 \\ +& Final-Balanced & 0.518 & 0.384 & 0.441 \\ +& Final-Full & 0.522 & 0.388 & 0.445 \\ \cline{2-5} & Baseline & 0.243 & 0.174 & 0.202 \\ & Average & 0.827 & 0.783 & 0.803 \\ @@ -169,15 +142,15 @@ S2S extended + ICD-10 extended & 0.74 & 0.62 & 0.63 \\ \hline \multirow{3}{*}{Italian} -& DCEM-Balanced & 0.857 & 0.685 & 0.761 \\ -& DCEM-Full & 0.862 & 0.689 & 0.766 \\ +& Final-Balanced & 0.857 & 0.685 & 0.761 \\ +& Final-Full & 0.862 & 0.689 & 0.766 \\ \cline{2-5} & Baseline & 0.165 & 0.172 & 0.169 \\ & Average & 0.844 & 0.760 & 0.799 \\ & Median & 0.900 & 0.824 & 0.863 \\ \bottomrule \end{tabularx} -\caption{Final Pipeline Evaluation} +\caption{Final Pipeline Performance - Evaluation Data. Final-Balanced = DCEM-Balanced + ICD-10\_Extended.
Final-Full = DCEM-Full + ICD-10\_Extended.} \label{tab:final_test} \end{table} diff --git a/paper/50_conclusion.tex b/paper/50_conclusion.tex index f4e76cfdc61572028f81e0c58251fa9605f01609..c071a5bc14e3ed0fb51b9c56f3dffdbfd34a5fc5 100644 --- a/paper/50_conclusion.tex +++ b/paper/50_conclusion.tex @@ -1,33 +1,24 @@ -In this paper we tackled the problem of information extraction of death causes -in an multilingual environment. The proposed solution was focused on the setup -and evaluation of an initial language-independent model which relies on a -heuristic mutual word embedding space for all three languages. The proposed pipeline -is divided in two steps: possible token describing the death cause are generated -by using a sequence to sequence model first. Afterwards the generated token -sequence is normalized to a ICD-10 code using a distinct LSTM-based -classification model with attention mechanism. During evaluation our best model -achieves an f-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for -Italian. The obtained results are encouraging for furthur investigation however -can't compete with the solutions of the other participants yet. +In this paper we tackled the problem of information extraction of death causes in a multilingual environment. +The proposed solution was focused on the setup and evaluation of an initial language-independent model which relies on a +heuristic mutual word embedding space for all three languages. +The proposed pipeline is divided into two steps: first, possible tokens describing the death cause are generated using a sequence-to-sequence model. +Afterwards, the generated token sequence is normalized to an ICD-10 code using a distinct LSTM-based classification model with an attention mechanism. +During evaluation our best model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. +The obtained results are encouraging for further investigation; however, they cannot compete with the solutions of the other participants yet. -We detected several issues with the proposed pipeline. These issues serve as -prospective future work to us. First of all the representation of the input -words can be improved in several ways. The word embeddings we used are not -optimized to the biomedical domain but are trained on general text. Existing -work was proven that in-domain embeddings improve the quality of achieved -results. Although this was our initial approach, the difficulties of finding adequate -in-domain corpora for selected languages has proven to be to a hard to tackle. -Moreover, the multi-language embedding space is currently heuristically defined -as concatenation of the three word embeddings models for individual tokens. -Creating an unified embedding space would create a truly language-independent -token representation. The improvement of the input layer will be the main focus -of our future work. +We detected several issues with the proposed pipeline. +These issues serve as starting points for prospective future work. +First of all, the representation of the input words can be improved in several ways. +The word embeddings we used are not optimized for the biomedical domain but are trained on general text. +Existing work has shown that in-domain embeddings improve the quality of the achieved results. +Although this was our initial approach, finding adequate in-domain corpora for the selected languages has proven to be hard.
+Moreover, the multi-language embedding space is currently heuristically defined as the concatenation of the three word embedding models for individual tokens. +Creating a unified embedding space would yield a truly language-independent token representation. +The improvement of the input layer will be the main focus of our future work. -The ICD-10 classification step also suffers from lack of adequate training -data. Unfortunately, we were unable to obtain extensive ICD-10 dictinaries for all -languages and therefore can't guarantee the completeness of the ICD-10 label -space. Another disadvantage of the current pipeline is the missing support for -mutli-label classification. +The ICD-10 classification step also suffers from a lack of adequate training data. +Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for all languages and therefore cannot guarantee the completeness of the ICD-10 label space. +Another disadvantage of the current pipeline is the missing support for multi-label classification. diff --git a/paper/wbi-eclef18.tex b/paper/wbi-eclef18.tex index a11ff54a5d9eb43e2fa973673e80e28db0c6f5bd..d6d00dae21cfdef89a224629f7b5a27bcbdba6e7 100644 --- a/paper/wbi-eclef18.tex +++ b/paper/wbi-eclef18.tex @@ -45,20 +45,14 @@ Bioinformatics, \\ Berlin, Germany\\ \maketitle % typeset the header of the contribution % \begin{abstract} -This paper describes the participation of the WBI team in the CLEF eHealth 2018 -shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). Our -contribution focus on the setup and evaluation of a baseline language-independent -neural architecture for ICD-10 classification as well as a simple, heuristic -multi-language word embedding space. The approach builds on two recurrent -neural networks models to extract and classify causes of death from French, -Italian and Hungarian death certificates. First, we employ a LSTM-based -sequence-to-sequence model to obtain a death cause from each death certificate -line. We then utilize a bidirectional LSTM model with attention mechanism to -assign the respective ICD-10 codes to the received death cause description. Both -models take multi-language word embeddings as inputs. During evaluation our best -model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for -Italian. The results are encouraging for future work as well as the extension and -improvement of the proposed baseline system. +This paper describes the participation of the WBI team in the CLEF eHealth 2018 shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). +Our contribution focuses on the setup and evaluation of a baseline language-independent neural architecture for ICD-10 classification as well as a simple, heuristic multi-language word embedding space. +The approach builds on two recurrent neural network models to extract and classify causes of death from French, Italian and Hungarian death certificates. +First, we employ an LSTM-based sequence-to-sequence model to obtain a death cause from each death certificate line. +We then utilize a bidirectional LSTM model with an attention mechanism to assign the respective ICD-10 codes to the extracted death cause description. +Both models take multi-language word embeddings as inputs. +During evaluation, our best model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian. +The results are encouraging for future work as well as the extension and improvement of the proposed baseline system.
\keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model \and Representation learning \and Recurrent neural network \and Attention mechanism \and Multi-language embeddings}