Commit 9f123f75 authored by Mario Sänger's avatar Mario Sänger
Updated conclusion text

parent da3a7509
......@@ -33,7 +33,7 @@ multi-language word embeddings and LSTM-based recurrent models. We divide the
classification into two tasks. First, we extract the death cause description
from a certificate line backed by an encoder-decoder model. Given the death
cause, the actual ICD-10 classification is performed by a separate LSTM
model. Our work focuses on the setup and evaluation of an initial, baseline
language-independent approach which builds on a heuristic multi-language
embedding space and therefore only needs one single model for all three data
sets. Moreover, we tried to use as few additional external resources as possible.
......
In this paper we tackled the problem of information extraction of death causes
in a multilingual environment. The proposed solution was focused on the setup
and evaluation of an initial language-independent model which relies on a
heuristic mutual word embedding space for all three languages. The proposed pipeline
is divided into two steps: first, candidate tokens describing the death cause are
generated using a sequence-to-sequence model. Afterwards, the generated token
sequence is normalized to an ICD-10 code using a distinct LSTM-based
classification model with attention mechanism. During evaluation, our best
model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for
Italian. The obtained results are encouraging for further investigation;
however, they cannot yet compete with the solutions of the other participants.
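The paper does not spell out which attention variant the classification model
uses; purely as an illustration, a plain dot-product attention step over LSTM
encoder states can be sketched in NumPy (all names and dimensions below are
invented for the sketch, not taken from the actual system):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(states, query):
    """Attend over encoder states with a single query vector.

    states: (seq_len, hidden_dim) LSTM outputs, one row per input token
    query:  (hidden_dim,) vector the classifier attends with
    Returns the attention weights and the weighted context vector.
    """
    scores = states @ query    # (seq_len,) unnormalized relevance scores
    weights = softmax(scores)  # normalized, sums to 1
    context = weights @ states # (hidden_dim,) weighted sum of states
    return weights, context

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))
query = rng.normal(size=8)
w, ctx = dot_product_attention(states, query)
print(w.sum(), ctx.shape)  # weights sum to 1.0, context has shape (8,)
```

The context vector replaces the fixed final hidden state as classifier input,
letting the model weight individual tokens of the death cause description.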
We detected several issues with the proposed pipeline. These issues serve as
prospective future work to us. First of all, the representation of the input
words can be improved in several ways. The word embeddings we used are not
optimized for the biomedical domain but are trained on general text. Existing
work has shown that in-domain embeddings improve the quality of the achieved
results. Although this was our initial approach, finding adequate in-domain
corpora for the selected languages proved too difficult.
Moreover, the multi-language embedding space is currently heuristically defined
as the concatenation of the three word embedding models for individual tokens.
Creating a unified embedding space would yield a truly language-independent
token representation. The improvement of the input layer will be the main focus
of our future work.
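The heuristic embedding space described above, concatenating the three
monolingual vectors per token, can be sketched as follows. The embedding
tables, their dimension, and the zero-filling of tokens unknown to a language
are illustrative assumptions, not the exact setup used in the system:

```python
import numpy as np

# Hypothetical per-language embedding tables; real monolingual models
# for French, Hungarian and Italian would stand in here.
DIM = 4
fr = {"deces": np.full(DIM, 0.1)}
hu = {"halal": np.full(DIM, 0.2)}
it = {"morte": np.full(DIM, 0.3)}

def multilang_vector(token):
    """Heuristic multi-language embedding: concatenate the token's vector
    from each monolingual model, zero-filling languages in which the token
    is unknown (the zero-fill is an assumption for this sketch)."""
    parts = [model.get(token, np.zeros(DIM)) for model in (fr, hu, it)]
    return np.concatenate(parts)  # shape (3 * DIM,)

v = multilang_vector("morte")
print(v.shape)  # (12,)
```

A learned, unified space would instead map all three languages into one shared
vector space of dimension DIM rather than tripling the input width.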
The ICD-10 classification step also suffers from a lack of adequate training
data. Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for
all languages and therefore cannot guarantee the completeness of the ICD-10
label space. Another disadvantage of the current pipeline is the missing
support for multi-label classification.
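To illustrate why an incomplete dictionary limits the label space, here is a
hypothetical dictionary-based normalization with a fallback code. The ICD-10
codes shown are real, but the dictionary fragment and the fallback choice are
invented for this sketch; the actual pipeline uses a learned LSTM classifier
rather than a lookup:

```python
# Hypothetical fragment of an ICD-10 dictionary mapping death cause
# descriptions (Italian, here) to codes.
icd10_dict = {
    "infarto miocardico acuto": "I21.9",  # acute myocardial infarction
    "polmonite": "J18.9",                 # pneumonia, unspecified
}

# R99: ill-defined and unknown cause of mortality (fallback assumption)
UNKNOWN = "R99"

def normalize(description):
    """Map a generated death cause description to an ICD-10 code, falling
    back to an unspecified-cause code when the label space does not cover it."""
    return icd10_dict.get(description.strip().lower(), UNKNOWN)

print(normalize("Polmonite"))          # J18.9
print(normalize("causa sconosciuta"))  # R99
```

Any description missing from the dictionaries can neither be normalized nor
used as a training example, which is exactly the completeness gap noted above.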
......
......@@ -47,7 +47,7 @@ Bioinformatics, \\ Berlin, Germany\\
\begin{abstract}
This paper describes the participation of the WBI team in the CLEF eHealth 2018
shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). Our
contribution focuses on the setup and evaluation of a baseline language-independent
neural architecture for ICD-10 classification as well as a simple, heuristic
multi-language word embedding technique. The approach builds on two recurrent
neural network models to extract and classify causes of death from French,
......@@ -57,7 +57,8 @@ line. We then utilize a bidirectional LSTM model with attention mechanism to
assign the respective ICD-10 codes to the received death cause description. Both
models take multi-language word embeddings as inputs. During evaluation our best
model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for
Italian. The results are encouraging for future work as well as extension and
improvement of the proposed baseline system.
\keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model
\and Representation learning \and Recurrent neural network \and Attention mechanism \and Multi-language embeddings}
......