4.7 Article

Identifying beta-thalassemia carriers using a data mining approach: The case of the Gaza Strip, Palestine

Journal

ARTIFICIAL INTELLIGENCE IN MEDICINE
Volume 88, Issue -, Pages 70-83

Publisher

ELSEVIER
DOI: 10.1016/j.artmed.2018.04.009

Keywords

Thalassemia; Data mining; Classification; SMOTE; Oversampling; Imbalance; Medical dataset

Ask authors/readers for more resources

Thalassemia is considered one of the most common genetic blood disorders that has received excessive attention in the medical research fields worldwide. Under this context, one of the greatest challenges for healthcare professionals is to correctly differentiate normal individuals from asymptomatic thalassemia carriers. Usually, thalassemia diagnosis is based on certain measurable characteristic changes to blood cell counts and related indices. These characteristic changes can be derived easily when performing a complete blood count test (CBC) using a special fully automated blood analyzer or counter. However, the reliability of the CBC test alone is questionable with possible candidate characteristics that could be seen in other disorders, leading to misdiagnosis of thalassemia. Therefore, other costly and time-consuming tests should be performed that may cause serious consequences due to the delay in the correct diagnosis. To help overcoming these challenging diagnostic issues, this work presents a new novel dataset collected from Palestine Avenir Foundation for persons tested for thalassemia. We aim to compile a gold standard dataset for thalassemia and make it available for researchers in this field. Moreover, we use this dataset to predict the specific type of thalassemia known as beta thalassemia beta-thalassemia) based on hybrid data mining model. The proposed model consists of two main steps. First, to overcome the problem of the highly imbalanced class distribution in the dataset, a balancing technique called SMOTE is proposed and applied to handle this problem. In the second step, four classification models, namely k nearest neighbors (k-NN), naive Bayesian (NB), decision tree (DT) and the multilayer perceptron (MLP) neural network are used to differentiate between normal persons and those patients carrying beta-thalassemia. Different evaluation metrics are used to assess the performance of the proposed model. The experimental results show that the SMOTE oversampling method can effectively improve the identification ratio of beta-thalassemia carriers in a highly imbalanced class distribution. The results reveal also that the NB classifier achieved the best performance in differentiating between normal and beta-thalassemia carriers at over sampling SMOTE ratio of 400%. This combination shows a specificity of 99.47% and a sensitivity of 98.81%.

Authors

I am an author on this paper
Click your name to claim this paper and add it to your profile.

Reviews

Primary Rating

4.7
Not enough ratings

Secondary Ratings

Novelty
-
Significance
-
Scientific rigor
-
Rate this paper

Recommended

No Data Available
No Data Available