Commit 9f123f75 authored by Mario Sänger's avatar Mario Sänger
Updated conclusion text

parent da3a7509
......@@ -33,7 +33,7 @@ multi-language word embeddings and LSTM-based recurrent models. We divide the
classification into two tasks. First, we extract the death cause description
from a certificate line backed by an encoder-decoder model. Given the death
cause, the actual ICD-10 classification is performed by a separate LSTM
model. Our work focuses on the setup and evaluation of an initial, baseline
language-independent approach which builds on a heuristic multi-language
embedding space and therefore only needs one single model for all three data
sets. Moreover, we tried to use as few additional external resources as possible.
......
In this paper we tackled the problem of information extraction of death causes
in a multilingual environment. The proposed solution was focused on the setup
and evaluation of an initial language-independent model which relies on a
heuristic mutual word embedding space for all three languages. The proposed pipeline
is divided into two steps: first, candidate tokens describing the death cause are
generated using a sequence-to-sequence model. Afterwards, the generated token
sequence is normalized to an ICD-10 code using a distinct LSTM-based
classification model with attention mechanism. During evaluation, our best
model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for
Italian. The obtained results are encouraging for further investigation;
however, they cannot yet compete with the solutions of the other participants.
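The paper does not spell out which attention variant the classification model
uses; purely as an illustration, a plain dot-product attention step over LSTM
encoder states can be sketched in NumPy (all names and dimensions below are
invented for the sketch, not taken from the actual system):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector
    e = np.exp(x - np.max(x))
    return e / e.sum()

def dot_product_attention(states, query):
    """Attend over encoder states with a single query vector.

    states: (seq_len, hidden_dim) LSTM outputs, one row per input token
    query:  (hidden_dim,) vector the classifier attends with
    Returns the attention weights and the weighted context vector.
    """
    scores = states @ query    # (seq_len,) unnormalized relevance scores
    weights = softmax(scores)  # normalized, sums to 1
    context = weights @ states # (hidden_dim,) weighted sum of states
    return weights, context

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 8))
query = rng.normal(size=8)
w, ctx = dot_product_attention(states, query)
print(w.sum(), ctx.shape)  # weights sum to 1.0, context has shape (8,)
```

The context vector replaces the fixed final hidden state as classifier input,
letting the model weight individual tokens of the death cause description.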
We detected several issues with the proposed pipeline. These issues serve as
prospective future work to us. First of all, the representation of the input
words can be improved in several ways. The word embeddings we used are not
optimized for the biomedical domain but are trained on general text. Existing
work has shown that in-domain embeddings improve the quality of the achieved
results. Although this was our initial approach, finding adequate in-domain
corpora for the selected languages proved too difficult.
Moreover, the multi-language embedding space is currently heuristically defined
as the concatenation of the three word embedding models for individual tokens.
Creating a unified embedding space would yield a truly language-independent
token representation. The improvement of the input layer will be the main focus
of our future work.
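The heuristic embedding space described above, concatenating the three
monolingual vectors per token, can be sketched as follows. The embedding
tables, their dimension, and the zero-filling of tokens unknown to a language
are illustrative assumptions, not the exact setup used in the system:

```python
import numpy as np

# Hypothetical per-language embedding tables; real monolingual models
# for French, Hungarian and Italian would stand in here.
DIM = 4
fr = {"deces": np.full(DIM, 0.1)}
hu = {"halal": np.full(DIM, 0.2)}
it = {"morte": np.full(DIM, 0.3)}

def multilang_vector(token):
    """Heuristic multi-language embedding: concatenate the token's vector
    from each monolingual model, zero-filling languages in which the token
    is unknown (the zero-fill is an assumption for this sketch)."""
    parts = [model.get(token, np.zeros(DIM)) for model in (fr, hu, it)]
    return np.concatenate(parts)  # shape (3 * DIM,)

v = multilang_vector("morte")
print(v.shape)  # (12,)
```

A learned, unified space would instead map all three languages into one shared
vector space of dimension DIM rather than tripling the input width.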
The ICD-10 classification step also suffers from a lack of adequate training
data. Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for
all languages and therefore cannot guarantee the completeness of the ICD-10
label space. Another disadvantage of the current pipeline is the missing
support for multi-label classification.
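To illustrate why an incomplete dictionary limits the label space, here is a
hypothetical dictionary-based normalization with a fallback code. The ICD-10
codes shown are real, but the dictionary fragment and the fallback choice are
invented for this sketch; the actual pipeline uses a learned LSTM classifier
rather than a lookup:

```python
# Hypothetical fragment of an ICD-10 dictionary mapping death cause
# descriptions (Italian, here) to codes.
icd10_dict = {
    "infarto miocardico acuto": "I21.9",  # acute myocardial infarction
    "polmonite": "J18.9",                 # pneumonia, unspecified
}

# R99: ill-defined and unknown cause of mortality (fallback assumption)
UNKNOWN = "R99"

def normalize(description):
    """Map a generated death cause description to an ICD-10 code, falling
    back to an unspecified-cause code when the label space does not cover it."""
    return icd10_dict.get(description.strip().lower(), UNKNOWN)

print(normalize("Polmonite"))          # J18.9
print(normalize("causa sconosciuta"))  # R99
```

Any description missing from the dictionaries can neither be normalized nor
used as a training example, which is exactly the completeness gap noted above.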
......
......@@ -47,7 +47,7 @@ Bioinformatics, \\ Berlin, Germany\\
\begin{abstract}
This paper describes the participation of the WBI team in the CLEF eHealth 2018
shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). Our
contribution focuses on the setup and evaluation of a baseline language-independent
neural architecture for ICD-10 classification as well as a simple, heuristic
multi-language word embedding technique. The approach builds on two recurrent
neural network models to extract and classify causes of death from French,
......@@ -57,7 +57,8 @@ line. We then utilize a bidirectional LSTM model with attention mechanism to
assign the respective ICD-10 codes to the received death cause description. Both
models take multi-language word embeddings as inputs. During evaluation our best
model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for
Italian. The results are encouraging for future work as well as extension and
improvement of the proposed baseline system.
\keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model
\and Representation learning \and Recurrent neural network \and Attention mechanism \and Multi-language embeddings}
......