diff --git a/paper/10_introduction.tex b/paper/10_introduction.tex
index e8f8928beee6faf369fc27b7a78d382cdcafd63b..3913217d8d118294a8b5bd3d83828f5babcf907b 100644
--- a/paper/10_introduction.tex
+++ b/paper/10_introduction.tex
@@ -33,7 +33,7 @@ multi-language word embeddings and LSTM-based recurrent models. We divide the
-the classification into two tasks. First, we extract the death cause description
+classification into two tasks. First, we extract the death cause description
 from a certificate line backed by an encoder-decoder model. Given the death
-cause the actual ICD-10 classification will be performed by a separate LSTM
-model. Our work focus on the setup and evaluation of a first, baseline
+cause, the actual ICD-10 classification is performed by a separate LSTM
+model. Our work focuses on the setup and evaluation of an initial, baseline
 language-independent approach which builds on a heuristic multi-language
 embedding space and therefore only needs one single model for all three data
-sets. Moreover, we tried to as little as possible additional external resources.
+sets. Moreover, we tried to use as few additional external resources as possible.
diff --git a/paper/50_conclusion.tex b/paper/50_conclusion.tex
index 59416fd9fed72d962f9185d302aba07b2ba3386c..f4e76cfdc61572028f81e0c58251fa9605f01609 100644
--- a/paper/50_conclusion.tex
+++ b/paper/50_conclusion.tex
@@ -1,20 +1,33 @@
 In this paper we tackled the problem of information extraction of death causes
-in an multilingual environment. The proposed solution was focused in language-independent models and relies on
-word embeddings for each of the languages.
-The proposed pipeline is divided in two steps: (1) first, possible token describing the death cause are generated by using a sequence to sequence model with attention mechanism; then, (2) generated token sequence is normalized to a possible ICD-10 code.
+in a multilingual environment. Our approach focuses on the setup and evaluation
+of an initial, language-independent model which relies on a heuristic
+multi-language word embedding space for all three languages. The proposed
+pipeline is divided into two steps: first, candidate tokens describing the
+death cause are generated by a sequence-to-sequence model. Afterwards, the
+generated token sequence is normalized to an ICD-10 code using a distinct
+LSTM-based classification model with an attention mechanism. During evaluation
+our best model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and
+0.77 for Italian. The obtained results are encouraging for further
+investigation; however, they cannot yet compete with the solutions of the other
+participants.
+
+We detected several issues with the proposed pipeline, which point to
+prospective future work. First of all, the representation of the input words
+can be improved in several ways. The word embeddings we used are not optimized
+for the biomedical domain but are trained on general text. Existing work has
+shown that in-domain embeddings improve the quality of the achieved results.
+Although this was our initial approach, the difficulty of finding adequate
+in-domain corpora for the selected languages proved too hard to overcome.
+Moreover, the multi-language embedding space is currently heuristically defined
+as the concatenation of the three word embedding models' vectors for individual
+tokens. Creating a unified embedding space would yield a truly
+language-independent token representation. Improving the input layer will be
+the main focus of our future work.
 
-We detected several issues with the proposed pipeline. These issues also serve as prospecitve future work.
-The word embeddings we used are not optimized to the problem domain but are trained in general text.
-The mutual embeddings space is currently defined as concatenation of the the word embeddings models for individual tokens.
-In this aspect, several possible improvements of the proposed pipeline are detected.
-First, the use of in-domain target language embeddings as initial token embeddings.
-Although this was our initial approach, the difficulties of finding adequate in-domain corpora for selected languages has proven to be to ohard to tackle.
-Our current embedding space is merely a concatenation of the three target language embeddings.
-Creating an unifying embeddings space would create a truly language-independent token representation.
-Additionally, it was shown that in-domain embeddings improve the quality of achieved results. This will be the main focus on our future work.
-The normalization step also suffered from lack of adequate training data.
-Unfortunately, we were unable to obtain ICD-10 dictinaries for all languages and can, therefore, not guarantee the completeness of the ICD-10 label space.
-Another downside of the proposed pipeline is the lack fo support for mutli-label classification.
+The ICD-10 classification step also suffers from a lack of adequate training
+data. Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for
+all languages and therefore cannot guarantee the completeness of the ICD-10
+label space. Another disadvantage of the current pipeline is the lack of
+support for multi-label classification.
diff --git a/paper/wbi-eclef18.tex b/paper/wbi-eclef18.tex
index acbb005ab21d7d5d851fac481d891b58dacdcc6e..d89780adfae6fa3acb07c84cdca638e2d4567621 100644
--- a/paper/wbi-eclef18.tex
+++ b/paper/wbi-eclef18.tex
@@ -47,7 +47,7 @@ Bioinformatics, \\ Berlin, Germany\\
 \begin{abstract}
 This paper describes the participation of the WBI team in the CLEF eHealth
 2018 shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). Our
-contribution focus on the setup and evaluation of a basline language-independent
+contribution focuses on the setup and evaluation of a baseline language-independent
 neural architecture for ICD-10 classification as well as a simple, heuristic
 multi-language word embedding technique. The approach builds on two recurrent
-neural networks models to extract and classify causes of death from French,
+neural network models to extract and classify causes of death from French,
@@ -57,7 +57,8 @@ line. We then utilize a bidirectional LSTM model with attention mechanism to
 assign the respective ICD-10 codes to the received death cause description. Both
 models take multi-language word embeddings as inputs. During evaluation our best
 model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for
-Italian.
+Italian. The results are encouraging and motivate further extension and
+improvement of the proposed baseline system.
 \keywords{ICD-10 coding \and Biomedical information extraction \and
-Multi-lingual sequence-to-sequence model \and Represention learning \and
+Multi-lingual sequence-to-sequence model \and Representation learning \and
 Recurrent neural network \and Attention mechanism \and Multi-language embeddings}
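
Note on the heuristic multi-language embedding space revised above: the Python
snippet below is a minimal sketch of the per-token concatenation the conclusion
describes, not the authors' implementation. The dictionaries stand in for the
three pre-trained monolingual embedding models (in practice these would be
loaded from pre-trained vector files), and the zero-vector fallback for
out-of-vocabulary tokens is an assumption, since the text does not specify OOV
handling.

    import numpy as np

    DIM = 100  # per-language embedding dimensionality (assumed, not from the paper)

    # Stand-ins for the three pre-trained monolingual embedding models.
    french_emb = {"cancer": np.ones(DIM) * 0.1}
    hungarian_emb = {"rak": np.ones(DIM) * 0.2}
    italian_emb = {"cancro": np.ones(DIM) * 0.3}

    def multilang_embedding(token):
        """Concatenate the token's vector from each monolingual model.
        Tokens unknown to a model contribute a zero vector (an assumption;
        the paper does not specify out-of-vocabulary handling)."""
        parts = [model.get(token, np.zeros(DIM))
                 for model in (french_emb, hungarian_emb, italian_emb)]
        return np.concatenate(parts)  # shape: (3 * DIM,)

    print(multilang_embedding("cancer").shape)  # -> (300,)

Because every token is mapped into the same concatenated space regardless of
its source language, a single model can consume certificate lines from all
three corpora, which is the property the conclusion refers to as language
independence.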