From 22fa417e7ce621741eb50ae07ff669349f818816 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Mario=20Sa=CC=88nger?= <mario.saenger@student.hu-berlin.de>
Date: Mon, 25 Jun 2018 17:05:56 +0200
Subject: [PATCH] Fix some typos

---
 paper/10_introduction.tex    |  6 +-----
 paper/20_related_work.tex    |  8 ++++----
 paper/31_methods_seq2seq.tex |  2 +-
 paper/40_experiments.tex     | 14 +++++++-------
 paper/50_conclusion.tex      |  8 +-------
 5 files changed, 14 insertions(+), 24 deletions(-)

diff --git a/paper/10_introduction.tex b/paper/10_introduction.tex
index 53b092b..8debcb5 100644
--- a/paper/10_introduction.tex
+++ b/paper/10_introduction.tex
@@ -19,8 +19,4 @@ into two tasks.
 First, we perform Name Entity Recognition (NER), i.e. extract the death cause description from a certificate line, with an an encoder-decoder model.
 Given the death cause, Named Entity Normalization (NEN), i.e. assigning an ICD-10 code to extracted death cause, is performed by a separate LSTM.
 Our approach builds upon a heuristic multi-language embedding space and therefore only needs one single model for all three data sets.
-With this work we want to experiment and evaluate which performance can be achieved with such a simple shared embedding space.
-
-
-
-
+With this work we want to experiment and evaluate which performance can be achieved with such a simple shared embedding space.
diff --git a/paper/20_related_work.tex b/paper/20_related_work.tex
index 4145a26..b1a373a 100644
--- a/paper/20_related_work.tex
+++ b/paper/20_related_work.tex
@@ -32,9 +32,9 @@ linear combination of both states.
 \subsection{Word Embeddings}
 Distributional semantic models (DSMs) have been researched for decades in NLP \cite{turney_frequency_2010}.
 Based on a huge amount of unlabeled texts, DSMs aim to represent words using a real-valued vector (also called embedding) which captures syntactic and semantic similarities between the words.
-Starting with the publication of the work from Collobert et al. \cite{collobert_natural_2011} in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, ist one of the hot topics in NLP and a plethora of approaches have been proposed \cite{bojanowski_enriching_2017,mikolov_distributed_2013,peters_deep_2018,pennington_glove_2014}.
+Starting with the publication of the work from Collobert et al. \cite{collobert_natural_2011} in 2011, learning embeddings for linguistic units, such as words, sentences or paragraphs, is one of the hot topics in NLP and a plethora of approaches have been proposed \cite{bojanowski_enriching_2017,mikolov_distributed_2013,peters_deep_2018,pennington_glove_2014}.
 
-The majority of todays embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{peters_semi-supervised_2017,peters_deep_2018,pinter_mimicking_2017}.
+The majority of today's embedding models are based on deep learning models trained to perform some kind of language modeling task \cite{peters_semi-supervised_2017,peters_deep_2018,pinter_mimicking_2017}.
 The most popular embedding model is the Word2Vec model introduced by Mikolov et. al \cite{mikolov_distributed_2013,mikolov_efficient_2013}.
 They propose two shallow neural network models, continuous bag-of-words (CBOW) and SkipGram, that are trained to reconstruct the context given a center word and vice versa.
 In contrast, Pennington et al. \cite{pennington_glove_2014} use the ratio between co-occurrence probabilities of two words with another one to learn a vector representation.
@@ -42,7 +42,7 @@ In \cite{peters_deep_2018} multi-layer, bi-directional LSTM models are utilized
 
 Several recent models focus on the integration of subword and morphological information to provide suitable representations even for unseen, out-of-vocabulary words.
 For example, Pinter et al. \cite{pinter_mimicking_2017} try to reconstruct a pre-trained word embedding by learning a bi-directional LSTM model on character level.
-Similarily, Bojanowski et al. \cite{bojanowski_enriching_2017} adapt the SkipGram by taking character n-grams into account.
+Similarly, Bojanowski et al. \cite{bojanowski_enriching_2017} adapt the SkipGram by taking character n-grams into account.
 Their fastText model assigns a vector representation to each character n-gram and represents words by summing over all of these representations of a word.
 
 In addition to embeddings that capture word similarities in one language, multi-/cross-lingual approaches have also been investigated.
@@ -52,7 +52,7 @@ Proposed methods either learn a linear mapping between monolingual representatio
 The ICD-10 coding task has already been carried out in the 2016 \cite{neveol_clinical_2016} and 2017 \cite{neveol_clef_2017} edition of the eHealth lab.
 Participating teams used a plethora of different approaches to tackle the classification problem.
 The methods can essentially be divided into two categories: knowledge-based \cite{cabot_sibm_2016,jonnagaddala_automatic_2017,van_mulligen_erasmus_2016} and machine learning (ML) approaches \cite{dermouche_ecstra-inserm_2016,ebersbach_fusion_2017,ho-dac_litl_2016,miftakhutdinov_kfu_2017}.
-The former relies on lexical sources, medical terminologies and other ontologies to match (parts of) the certificate text with entries from the knowledge-bases according to a rule framework.
+The former relies on lexical sources, medical terminologies and other dictionaries to match (parts of) the certificate text with entries from the knowledge-bases according to a rule framework.
 For example, Di Nunzio et al. \cite{di_nunzio_lexicon_2017} calculate a score for each ICD-10 dictionary entry by summing the binary or tf-idf weights of each term of a certificate line segment and assign the ICD-10 code with the highest score.
 In contrast, Ho-Dac et al. \cite{ho-dac_litl_2017} treat the problem as information retrieval task and utilize the Apache Solr search engine\footnote{\url{http://lucene.apache.org/solr/}} to classify the individual lines.
 
diff --git a/paper/31_methods_seq2seq.tex b/paper/31_methods_seq2seq.tex
index f9c986a..14f3c9b 100644
--- a/paper/31_methods_seq2seq.tex
+++ b/paper/31_methods_seq2seq.tex
@@ -26,4 +26,4 @@ Moreover, each token of the dictionary death cause text (padded with special sta
 Again, we use fastText embeddings of all three languages to represent the input tokens.
 The decoder predicts one-hot-encoded words of the death cause.
 During test time we use the encoder to obtain a semantic representation of the certificate line and decode the death cause description word by word starting with the special start tag.
-The decoding process finishes when the decoder outputs the end tag.
+The decoding process finishes when the decoder outputs the end tag.
diff --git a/paper/40_experiments.tex b/paper/40_experiments.tex
index 171c3af..0793bcf 100644
--- a/paper/40_experiments.tex
+++ b/paper/40_experiments.tex
@@ -30,7 +30,7 @@ This decoding process continues until a special end token is generated.
 
 The entire model is optimized using the Adam optimization algorithm \cite{kingma_adam:_2014} and a batch size of 700.
 Model training was performed either for 100 epochs or until an early stopping criteria is met (no change in validation loss for two epochs).
-As the provided dataset are imbalanced regarding the tasks' languages, we devised two different evaluation settings: (1) DCEM-Balanced, where each language was supported by 49.823 randomly drawn instances (size of the smallest corpus) and (2) DCEM-Full, where all available data is used.
+As the provided data sets are imbalanced regarding the tasks' languages, we devised two different evaluation settings: (1) DCEM-Balanced, where each language was supported by 49.823 randomly drawn instances (size of the smallest corpus) and (2) DCEM-Full, where all available data is used.
 Table \ref{tab:s2s} shows the results obtained on the training and validation set.
 The figures reveal that distribution of training instances per language have a huge influence on the performance of the model.
 The model trained on the full training data achieves an accuracy of 0.678 on the validation set.
@@ -68,7 +68,7 @@ Once again, we devised two approaches. This was mainly caused by the lack of ade
 Therefore, we once again defined two training data settings: (1) minimal (ICD-10\_Minimal), where only ICD-10 codes with two or more supporting training instances are used.
 This leaves us with 6.857 unique ICD-10 codes and discards 2.238 unique ICD-10 codes with support of one.
 This, of course, minimizes the number of ICD-10 codes in the label space.
-Therefore, (2) an extended (ICD-10\_Extended) dataset was defined. Here, the original ICD-10 code mappings, found in the supplied dictionaries, are extended with the training instances from individual certificate data from the three languages.
+Therefore, (2) an extended (ICD-10\_Extended) data set was defined. Here, the original ICD-10 code mappings, found in the supplied dictionaries, are extended with the training instances from individual certificate data from the three languages.
 This generates 9.591 unique ICD-10 codes.
 Finally, for the remaining ICD-10 codes that have only one supporting description, we duplicate those data points.
 
@@ -130,20 +130,20 @@ Both model configurations have a higher precision than recall (0.73/0.61 resp. 0
 
 This can be contributed to several factors.
 First of all, a pipeline architecture always suffers from error-propagation, i.e. errors in a previous step will influence the performance of the following layers and generally lower the performance of the overall system.
-Investigating the obtained results, we found that the imbalanced distribution of ICD-10 codes represents one the main problem.
+Investigating the obtained results, we found that the imbalanced distribution of ICD-10 codes represents one of the main problems.
 This severely impacts the decoder-encoder architecture used here as the token generation is biased towards the available data points.
 Therefore the models misclassify certificate lines associated with ICD-10 codes that only have a small number of supporting training instances very often.
 
 Results obtained on the test data set, resulting from the two submitted official runs, are shown in Table \ref{tab:final_test}.
-Similiar to the evaluation results during development, the model based on the full data set performs slighly better than the model trained on the balanced data set.
+Similar to the evaluation results during development, the model based on the full data set performs slightly better than the model trained on the balanced data set.
 The full model reaches a F-score of 0.34 for French, 0.45 for Hungarian and 0.77 for Italian.
 All of our approaches perform below the mean and median averages of all participants.
 
 Surprisingly, there is a substantial difference in results obtained between the individual languages.
 %This hints towards the unsuitability of out-of-domain word embeddings.
-This confirms our assumptions about the (un)suitability of the proposed multi-lingual embedding space for this task.
-The results also suggeest that the size of the training corpora is not influencing the final results.
-As seen, best results were obtained on the Italian dataset were trained on the smallest corpora.
+This confirms our assumptions about the (un-)suitability of the proposed multi-lingual embedding space for this task.
+The results also suggest that the size of the training corpora does not influence the final results.
+As seen, the best results were obtained on the Italian data set, which was trained on the smallest corpus.
 Worst results were obtained on the middle, French, corpus while the biggest corpus, Hungarian, is in second place.
 
 \begin{table}[]
diff --git a/paper/50_conclusion.tex b/paper/50_conclusion.tex
index c071a5b..3813bf6 100644
--- a/paper/50_conclusion.tex
+++ b/paper/50_conclusion.tex
@@ -18,10 +18,4 @@ The improvement of the input layer will be the main focus of our future work.
 
 The ICD-10 classification step also suffers from lack of adequate training data.
 Unfortunately, we were unable to obtain extensive ICD-10 dictionaries for all languages and therefore can't guarantee the completeness of the ICD-10 label space.
-Another disadvantage of the current pipeline is the missing support for mutli-label classification.
-
-
-
-
-
-
+Another disadvantage of the current pipeline is the missing support for multi-label classification.
-- 
GitLab
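
Editor's note (not part of the patch): the sections touched above describe a two-step pipeline, i.e. an encoder-decoder model that generates the death cause description from a certificate line, followed by an LSTM classifier that assigns an ICD-10 code. The following is a minimal, hypothetical sketch of that setup in Keras; vocabulary size, hidden dimension, and all variable names are illustrative assumptions and not taken from the authors' implementation, while the 300-dimensional fastText embeddings, the Adam optimizer, the batch size of 700, and the 9.591 unique ICD-10 codes of the extended setting are figures mentioned in the patched text.

    from keras.layers import Input, LSTM, Dense, Embedding
    from keras.models import Model

    VOCAB_SIZE = 20000      # shared multi-language token vocabulary (assumed size)
    EMBED_DIM = 300         # fastText embeddings used in the paper are 300-dimensional
    HIDDEN_DIM = 256        # LSTM state size (assumed)
    NUM_ICD10_CODES = 9591  # unique ICD-10 codes in the ICD-10_Extended setting

    # Step 1: encoder-decoder model generating the death cause description
    encoder_inputs = Input(shape=(None,), dtype="int32")
    encoder_embed = Embedding(VOCAB_SIZE, EMBED_DIM)(encoder_inputs)  # init with fastText
    _, state_h, state_c = LSTM(HIDDEN_DIM, return_state=True)(encoder_embed)

    decoder_inputs = Input(shape=(None,), dtype="int32")  # padded with start/end tags
    decoder_embed = Embedding(VOCAB_SIZE, EMBED_DIM)(decoder_inputs)
    decoder_out = LSTM(HIDDEN_DIM, return_sequences=True)(
        decoder_embed, initial_state=[state_h, state_c])
    word_probs = Dense(VOCAB_SIZE, activation="softmax")(decoder_out)  # one-hot word prediction
    seq2seq = Model([encoder_inputs, decoder_inputs], word_probs)
    # the paper reports training with Adam and a batch size of 700
    seq2seq.compile(optimizer="adam", loss="categorical_crossentropy")

    # Step 2: separate LSTM that maps the generated death cause to an ICD-10 code
    cause_inputs = Input(shape=(None,), dtype="int32")
    cause_embed = Embedding(VOCAB_SIZE, EMBED_DIM)(cause_inputs)
    cause_repr = LSTM(HIDDEN_DIM)(cause_embed)
    icd10_probs = Dense(NUM_ICD10_CODES, activation="softmax")(cause_repr)
    icd10_classifier = Model(cause_inputs, icd10_probs)
    icd10_classifier.compile(optimizer="adam", loss="categorical_crossentropy")

At test time, decoding would start from the special start tag and generate words until the end tag is produced, mirroring the procedure described in 31_methods_seq2seq.tex above.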