 Automatic extraction, classification and analysis of biological and medical concepts from unstructured texts, such as scientific publications or electronic health documents, is a highly important task to support many applications in research, daily clinical routine and policy-making.
Computer-assisted approaches can improve decision making and support clinical processes, for example, by providing a more comprehensive overview of a research area or detailed information about the aetiopathology of a patient or about disease patterns.
In recent years, major advances have been made in the area of natural language processing (NLP).
However, improvements in the field of biomedical text mining lag behind other domains, mainly due to privacy issues and concerns regarding the processed data (e.g., electronic health records).

The CLEF eHealth lab\footnote{\url{https://sites.google.com/site/clefehealth/}} aims to remedy this situation by organizing various shared tasks %which aid and support the development of approaches
that exploit electronically available medical content \cite{suominen_overview_2018}.
In particular, Task 1\footnote{\url{https://sites.google.com/view/clef-ehealth-2018/task-1-multilingual-information-extraction-icd10-coding}} of the lab is concerned with the extraction and classification of death causes from death certificates written in different languages \cite{neveol_clef_2018}.
Participants were asked to classify the death causes mentioned in the certificates according to the International Classification of Diseases, 10th revision (ICD-10)\footnote{\url{http://www.who.int/classifications/icd/en/}}.
The task %has been carried out the last two years of the lab, however
was concerned with French and English death certificates in previous years.
In contrast, this year the organizers provided annotated death reports as well as ICD-10 dictionaries for French, Italian and Hungarian.
The development of language-independent, multilingual approaches was encouraged.

Inspired by the recent success of recurrent neural network (RNN) models \cite{cho_learning_2014,lample_neural_2016,dyer_transition-based_2015} in general and the convincing performance of the work of Miftahutdinov and Tutubalina \cite{miftakhutdinov_kfu_2017} in the last edition of the lab, we opt for the development of a deep learning model for this year's competition.
Our work introduces a prototypical, language-independent approach for ICD-10 classification using multi-language word embeddings and long short-term memory models (LSTMs).
We divide the proposed pipeline %$classification
into two tasks.
First, we perform Named Entity Recognition (NER), i.e., extract the death cause description from a certificate line, with an encoder-decoder model.
Given the death cause, Named Entity Normalization (NEN), i.e., assigning an ICD-10 code to the extracted death cause, is performed by a separate LSTM.
Our approach builds upon a heuristic multi-language embedding space and therefore needs only a single model for all three data sets.
With this work, we want to evaluate what performance can be achieved with such a simple shared embedding space.