In this paper we tackled the problem of extracting death causes in a
multilingual environment. The proposed solution focused on the setup and
evaluation of an initial language-independent model which relies on a
heuristic mutual word embedding space for all three languages. The proposed
pipeline is divided into two steps: (1) first, tokens describing a possible
death cause are generated using a sequence-to-sequence model with attention
mechanism; (2) afterwards, the generated token sequence is normalized to an
ICD-10 code using a distinct LSTM-based classification model with attention
mechanism. During evaluation our best model achieves an F-measure of 0.34 for
French, 0.45 for Hungarian and 0.77 for Italian. The obtained results are
encouraging for further investigation, however they cannot compete with the
solutions of the other participants yet.
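To make the second step more concrete, the following is a minimal sketch of an LSTM-based classifier with dot-product attention over the encoder states, assuming a Keras-style implementation; the vocabulary size, layer dimensions and label-space size are illustrative placeholders, not the settings used in our experiments.

```python
import tensorflow as tf

# Minimal sketch of the second pipeline step: an LSTM encoder over the
# generated death-cause tokens, dot-product self-attention over the hidden
# states, and a softmax over the ICD-10 label space. All sizes below are
# illustrative assumptions.
VOCAB_SIZE = 20000      # size of the multilingual token vocabulary
NUM_ICD10_CODES = 2000  # size of the (incomplete) ICD-10 label space
MAX_LEN = 30            # maximum length of a generated token sequence

tokens = tf.keras.Input(shape=(MAX_LEN,), dtype="int32")
embedded = tf.keras.layers.Embedding(VOCAB_SIZE, 300)(tokens)
states = tf.keras.layers.LSTM(256, return_sequences=True)(embedded)
# Dot-product self-attention over the LSTM states.
attended = tf.keras.layers.Attention()([states, states])
pooled = tf.keras.layers.GlobalAveragePooling1D()(attended)
# Single-label output: exactly one ICD-10 code per token sequence.
codes = tf.keras.layers.Dense(NUM_ICD10_CODES, activation="softmax")(pooled)

model = tf.keras.Model(tokens, codes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```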
We detected several issues with the proposed pipeline. These issues serve as
prospective future work. First of all, the representation of the input words
can be improved in several ways. The word embeddings we used are not optimized
for the biomedical domain but are trained on general text. Existing work has
shown that in-domain embeddings improve the quality of the achieved results.
Although this was our initial approach, the difficulty of finding adequate
in-domain corpora for the selected languages proved too hard to tackle.
Moreover, the multi-language embedding space is currently heuristically defined
as the concatenation of the three word embedding models for individual tokens.
Creating a unified embedding space would yield a truly language-independent
token representation. The improvement of the input layer will be the main focus
of our future work.
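For illustration, the heuristic embedding space described above amounts to the following construction; the per-language models, the lookup interface and the embedding dimensionality are hypothetical placeholders.

```python
import numpy as np

EMBED_DIM = 100  # assumed per-language embedding dimensionality

def lookup(model, token):
    # One monolingual embedding model, sketched here as a plain dict from
    # token to vector; out-of-vocabulary tokens map to a zero vector.
    return model.get(token, np.zeros(EMBED_DIM, dtype=np.float32))

def multilingual_embedding(token, fr_model, hu_model, it_model):
    # Heuristic multi-language space: a token's representation is simply
    # the concatenation of its French, Hungarian and Italian vectors
    # (dimensionality 3 * EMBED_DIM).
    return np.concatenate([
        lookup(fr_model, token),
        lookup(hu_model, token),
        lookup(it_model, token),
    ])
```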
The ICD-10 classification step also suffers from a lack of adequate training
data. Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for
all languages and therefore cannot guarantee the completeness of the ICD-10
label space. Another disadvantage of the current pipeline is the missing
support for multi-label classification.
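Continuing the classification sketch above, multi-label support could, for example, be obtained by replacing the softmax output with independent sigmoid units and a binary cross-entropy loss; this is one possible direction, not part of the current pipeline.

```python
# Hypothetical multi-label variant of the classifier sketched earlier:
# each ICD-10 code gets an independent sigmoid, so several codes can be
# predicted for the same token sequence (reuses `tokens` and `pooled`
# from the sketch above).
codes = tf.keras.layers.Dense(NUM_ICD10_CODES, activation="sigmoid")(pooled)
multi_label_model = tf.keras.Model(tokens, codes)
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
```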