Commit b043ef91, authored 6 years ago by Mario Sänger

Minor changes to experiments + wording in introduction

Parent: 9f123f75
Showing 2 changed files with 38 additions and 29 deletions:
- paper/40_experiments.tex (+36, −27)
- paper/wbi-eclef18.tex (+2, −2)
paper/40_experiments.tex (+36, −27)
...
@@ -44,7 +44,7 @@ encoder's output is used as the initial state of the decoder.
 The decoder generates, based on the input description from the dictionary and a
 special start token, a death cause word by word. This decoding process continues
 until a special end token is generated. The entire model is optimized using the
-Adam optimization algorithm \cite{kingma_adam} and a batch size of 700. Model
+Adam optimization algorithm \cite{kingma_adam:_2014} and a batch size of 700. Model
 training was performed either for 100 epochs or until an early stopping
 criterion was met (no change in validation loss for two epochs).
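The decoding loop described in this hunk (start from a special start token, emit a death cause word by word until the special end token appears) can be sketched in plain Python. The helper name `greedy_decode` and the toy transition table are illustrative assumptions, not names from the actual model code:

```python
def greedy_decode(step, start_token="<s>", end_token="</s>", max_len=20):
    """Generate a death cause word by word: starting from the start token,
    repeatedly ask the decoder `step(prev_word)` for the next word until
    the special end token is produced (or a length cap is reached)."""
    words = []
    prev = start_token
    while len(words) < max_len:
        nxt = step(prev)
        if nxt == end_token:
            break
        words.append(nxt)
        prev = nxt
    return words

# Toy "decoder" that maps each word deterministically to its successor;
# the real model conditions on the encoder state and all previous words.
table = {"<s>": "cardiac", "cardiac": "arrest", "arrest": "</s>"}
cause = greedy_decode(table.__getitem__)  # → ["cardiac", "arrest"]
```

The `max_len` cap is a common safeguard in such decoders; whether the actual model uses one is not stated in the text.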
...
@@ -72,7 +72,7 @@ DCEM-Full & 9 & 0.709 & 0.098 & 0.678 & 0.330 \\
 \bottomrule
 \end{tabularx}
 \caption{Experiment results of our death cause extraction sequence-to-sequence
-model concerning balanced (equal number of training data per language) and full
+model concerning balanced (equal number of training instances per language) and full
 data set setting.}
 \end{table}
...
@@ -80,40 +80,49 @@ data set setting.}
 The classification model is responsible for assigning an ICD-10 code to the death
 cause description obtained during the first step. Our model uses an embedding
 layer with input masking on zero values, followed by a bidirectional LSTM
-layer with a 256-dimensional hidden layer. Thereafter a attention layer builds an
-adaptive weighted average over all LSTM states. They ICD-10 code will be
-determined by a dense layer with softmax activation function.
-We use the Adam optimizer to perform model training. The model was validated on
-25\% od the data. As for the extraction model, no cross-validation or
-hyperparameter was performed due to time contraints during development. Once
-again, we devised two approaches. This was manly influenced by the lack of
+layer with a 256-dimensional hidden layer. Thereafter an attention layer builds an
+adaptive weighted average over all LSTM states. The respective ICD-10 code will
+be determined by a dense layer with softmax activation function. We use the Adam
+optimizer to perform model training. The model was validated on 25\% of the
+data. As for the extraction model, no cross-validation or hyperparameter
+optimization was performed due to time constraints during development. Once
+again, we devised two approaches. This was mainly caused by the lack of
 adequate training data in terms of coverage for individual ICD-10 codes.
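The attention step described in this hunk (an adaptive weighted average over all LSTM states, before the softmax-activated dense layer) can be sketched in plain Python. This is a simplified dot-product scoring variant for illustration only; `attention_average` and the learned `query` vector are assumptions, not names from the actual model code:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_average(states, query):
    """Score each LSTM state against a query vector, normalize the scores
    with a softmax, and return the weighted average of the states."""
    scores = [sum(q * s for q, s in zip(query, state)) for state in states]
    weights = softmax(scores)
    dim = len(states[0])
    return [sum(w * state[i] for w, state in zip(weights, states))
            for i in range(dim)]

# Two toy 2-dimensional LSTM states; the query favours the second state,
# so the pooled vector leans towards it.
states = [[1.0, 0.0], [0.0, 1.0]]
pooled = attention_average(states, query=[0.0, 2.0])
```

In the real model the query is a trained parameter and the scoring function is a layer, but the pooling logic is the same: one weight per time step, summing to one.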
-Therefore, we once again defined two datasets: (1) minimal, where only ICD-10
-codes with 2 or more supporting data points are used. This, of course, minimizes
-the number of ICD-10 codes in the label space. Therefore, (2) an extended
-dataset was defined. Here, the original ICD-10 codes mappings, found in the
-supplied dictionaries, are extended with the data from individual langugae
-Causes Calcules. Finally, for the remaining ICD-10 codes with support of 1 we
-duplicate those datapoints. The goal of this approach is to extend our possible
-label space to all available ICD-10 codes. The results obtained from the two
-approaches are shown in Table \ref{tab:icd10Classification}.
+Therefore, we once again defined two training data settings: (1) minimal, where
+only ICD-10 codes with two or more supporting training instances are used. This,
+of course, minimizes the number of ICD-10 codes in the label space. Therefore,
+(2) an extended dataset was defined. Here, the original ICD-10 code mappings,
+found in the supplied dictionaries, are extended with the training instances
+from the individual certificate data of the three languages. Finally, for the
+remaining ICD-10 codes that have only one supporting diagnosis text resp. death
+cause description, we duplicate those data points. The goal of this approach is
+to extend our possible label space to all available ICD-10 codes. The results
+obtained from the two approaches on the validation set are shown in Table
+\ref{tab:icd10Classification}. Using the \textit{minimal} data set the model
+achieves an accuracy of 0.937. In contrast, using the extended data set the
+model reaches an accuracy of 0.954, which represents an improvement of 1.8\%.
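The two training data settings added in this hunk can be sketched as a small data-preparation routine in plain Python. The function name, record layout, and toy examples are illustrative assumptions, not taken from the actual pipeline:

```python
from collections import Counter

def build_datasets(dictionary, certificates):
    """Build the minimal and extended training data settings.

    Both arguments are lists of (text, icd10_code) pairs.
    Minimal: keep only codes with two or more supporting instances
    in the dictionary. Extended: merge dictionary and certificate
    instances, then duplicate the instances of codes that still have
    only one supporting text, so the label space covers every
    available ICD-10 code.
    """
    support = Counter(code for _, code in dictionary)
    minimal = [(t, c) for t, c in dictionary if support[c] >= 2]

    extended = list(dictionary) + list(certificates)
    merged_support = Counter(code for _, code in extended)
    for text, code in list(extended):
        if merged_support[code] == 1:
            extended.append((text, code))  # duplicate singleton codes
    return minimal, extended

# Toy records: I46 has two dictionary entries, A41 gains a second
# instance from the certificates, N19 stays a singleton and is duplicated.
dictionary = [("cardiac arrest", "I46"), ("heart failure", "I46"),
              ("sepsis", "A41")]
certificates = [("septic shock", "A41"), ("renal failure", "N19")]
minimal, extended = build_datasets(dictionary, certificates)
```

Under this sketch `minimal` keeps only the two I46 pairs, while `extended` contains all five merged pairs plus a duplicate of the N19 singleton.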
 \begin{table}[]
 \label{tab:icd10Classification}
 \centering
-\begin{tabularx}{\textwidth}{p{2.25cm}|p{1.75cm}|c|c|c|c|c}
+\begin{tabularx}{0.85\textwidth}{p{2.25cm}|c|c|c|c|c}
 \toprule
-\multirow{2}{*}{\textbf{Tokenization}}&\multirow{2}{*}{\textbf{Model}}&\multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\
-\cline{4-7}
-&&&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\
+%\multirow{2}{*}{\textbf{Tokenization}}&\multirow{2}{*}{\textbf{Model}}&\multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\
+%\cline{4-7}
+\multirow{2}{*}{\textbf{Setting}}&\multirow{2}{*}{\textbf{Trained Epochs}}&\multicolumn{2}{c|}{\textbf{Train}}&\multicolumn{2}{c}{\textbf{Validation}} \\
+\cline{3-6}
+&&\textbf{Accuracy}&\textbf{Loss}&\textbf{Accuracy}&\textbf{Loss} \\
 \hline
-Word & Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\
-Word & Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\
-Character & Minimal & 91 & 0.732 & 1.186 & 0.516 & 2.505 \\
+Minimal & 69 & 0.925 & 0.190 & 0.937 & 0.169 \\
+Extended & 41 & 0.950 & 0.156 & 0.954 & 0.141 \\
+%Character & Minimal & 91 & 0.732 & 1.186 & 0.516 & 2.505 \\
 \bottomrule
 \end{tabularx}
-\caption{Experiment results for our ICD-10 classification model regarding different settings.}
+\caption{Experiment results for our ICD-10 classification model regarding different data settings. The \textit{Minimal} setting uses only ICD-10 codes with two or more training instances in the supplied dictionary. In contrast, \textit{Extended} additionally takes the diagnosis texts from the certificate data and duplicates ICD-10 training instances with only one diagnosis text in the dictionary and certificate lines.}
 \end{table}
 \subsection{Complete Pipeline}
...
paper/wbi-eclef18.tex (+2, −2)
...
@@ -49,7 +49,7 @@ This paper describes the participation of the WBI team in the CLEF eHealth 2018
 shared task 1 (``Multilingual Information Extraction - ICD-10 coding''). Our
 contribution focuses on the setup and evaluation of a baseline language-independent
 neural architecture for ICD-10 classification as well as a simple, heuristic
-multi-language word embedding technique. The approach builds on two recurrent
+multi-language word embedding space. The approach builds on two recurrent
 neural network models to extract and classify causes of death from French,
 Italian and Hungarian death certificates. First, we employ an LSTM-based
 sequence-to-sequence model to obtain a death cause from each death certificate
...
@@ -57,7 +57,7 @@ line. We then utilize a bidirectional LSTM model with an attention mechanism to
 assign the respective ICD-10 codes to the received death cause description. Both
 models take multi-language word embeddings as inputs. During evaluation our best
 model achieves an F-measure of 0.34 for French, 0.45 for Hungarian and 0.77 for
-Italian. The results are encouraging for future work as well as extension and
+Italian. The results are encouraging for future work as well as the extension and
 improvement of the proposed baseline system.
 \keywords{ICD-10 coding \and Biomedical information extraction \and Multi-lingual sequence-to-sequence model
...