EDA, ML and Flask API creation with medical data

October 3, 2021

Recent advancements in data science enable scientists to sort, manage and analyze large amounts of data more effectively and efficiently. One of the sectors where these new technologies have been applied is the healthcare industry, which is becoming increasingly data-reliant due to vast quantities of patient and medical data. As a result, medical practices and patient care can be significantly improved.

In this notebook, we apply various Exploratory Data Analysis (EDA) techniques to medical data from different datasets and then train several machine learning models to predict important medical factors. Afterwards, we export the best-performing model, create a Flask API that serves predictions from the exported model and, finally, test the functionality of the created API via Postman.

Exploratory Data Analysis

1st Dataset

The first investigated dataset consists of several thousand measurements of height and weight of males and females. The main information of the imported dataset can be seen below:

Figure 1: Basic statistics of imported dataset.

We observe that the dataset contains 4231 measurements. We can get more detailed information about the values of each sex:

Figure 2: Mean and median height and weight per sex.

Since the mean and median values are larger for the group with sex=1, we can conclude that sex=1 corresponds to men’s height and weight.
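The per-sex summary described above can be sketched with a pandas group-by. This is a minimal illustration on toy data, not the notebook's actual code: the column names `sex`, `height` and `weight` and all values are assumptions.

```python
import pandas as pd

# Hypothetical toy data; the real dataset and its column names may differ.
df = pd.DataFrame({
    "sex":    [0, 0, 0, 1, 1, 1],
    "height": [160.0, 165.0, 170.0, 175.0, 180.0, 185.0],
    "weight": [55.0, 60.0, 68.0, 72.0, 80.0, 91.0],
})

# Mean and median of each measurement, computed separately for each sex code.
per_sex = df.groupby("sex")[["height", "weight"]].agg(["mean", "median"])
print(per_sex)
```

Comparing the two rows of `per_sex` is what allows a sex code to be matched to men or women when the dataset itself does not label the codes.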

Data visualisation

A scatter plot is normally fairly informative and very fast to plot.

Figure 3: Scatter plot that indicates the sex of each person in each point.

We can see again that women’s weight and height values are mostly smaller than those of men.

We can plot the data based on the sex of each person by taking into account the mean and standard deviation of height and weight.

Figure 4: Scatter plot that indicates the sex of each person in each point.

Figure 5: Scatter plot that indicates the sex of each person in each point.

As the plots above show, the mean height and weight of men are greater than those of women.
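One way to produce plots of this kind is sketched below with matplotlib: a scatter plot coloured by sex, with the mean of each group overlaid and error bars of one standard deviation. The data is synthetic and the group parameters are assumptions chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumed): sex code -> (heights, weights).
groups = {
    0: (rng.normal(162, 7, 200), rng.normal(62, 9, 200)),
    1: (rng.normal(176, 7, 200), rng.normal(80, 10, 200)),
}

fig, ax = plt.subplots()
stats = {}
for sex, (height, weight) in groups.items():
    # One scatter series per sex so that each gets its own colour and legend entry.
    ax.scatter(height, weight, s=8, alpha=0.4, label=f"sex={sex}")
    stats[sex] = (height.mean(), weight.mean(), height.std(), weight.std())
    mean_h, mean_w, std_h, std_w = stats[sex]
    # Group mean with error bars of one standard deviation in each direction.
    ax.errorbar(mean_h, mean_w, xerr=std_h, yerr=std_w,
                fmt="o", color="black", capsize=4)

ax.set_xlabel("height")
ax.set_ylabel("weight")
ax.legend()
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` can be used to write it to a file.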

2nd Dataset

The second dataset of this analysis concerns certain diagnostic measurements originating from the National Institute of Diabetes and Digestive and Kidney Diseases Diabetes dataset. This dataset was obtained from Kaggle. More specifically, all patients here are females at least 21 years old of Pima Indian heritage. The dataset consists of several medical predictor variables and one target variable, Outcome. Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

As a first step, let’s explore the imported dataset.

Figure 6: Part of the second dataset.

We observe that some rows contain zero values, which most likely mark missing measurements, so we can omit these rows. After that, we can also get a better summary of the imported data.

Figure 7: Summary of second Diabetes dataset

After this filtering, 392 rows remain.
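The filtering step can be sketched as follows. The column names are those of the Kaggle dataset, but the rows are toy values, and the choice of which columns to check for zeros is an assumption: a zero in Pregnancies or Outcome is a valid value, so only the measured quantities are checked here.

```python
import pandas as pd

# Toy rows using the Kaggle column names; the values are made up.
df = pd.DataFrame({
    "Pregnancies":   [0, 2, 1, 3],
    "Glucose":       [120, 0, 95, 140],
    "BloodPressure": [70, 64, 0, 82],
    "SkinThickness": [25, 30, 20, 35],
    "Insulin":       [90, 100, 85, 0],
    "BMI":           [28.1, 31.0, 24.5, 33.6],
    "Age":           [25, 40, 31, 52],
    "Outcome":       [0, 1, 0, 1],
})

# Assumption: a zero in these columns marks a missing measurement.
measured = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Keep only the rows in which every measured column is non-zero.
clean = df[(df[measured] != 0).all(axis=1)]

print(len(df), "->", len(clean), "rows")
print(clean.describe())
```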

Data visualisation

We can explore the correlation of the imported data through a heatmap.

Figure 8: Summary of second Diabetes dataset

In a heatmap like this, the closer a correlation value is to 1 or -1, the more strongly the two variables are correlated. For instance, Age and Pregnancies are strongly correlated, with a correlation value of 0.68.
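The matrix behind such a heatmap can be computed with `DataFrame.corr()`, which returns pairwise Pearson correlations by default. The sketch below uses synthetic data in which Pregnancies is deliberately tied to Age and Glucose is independent; these relationships are assumptions for illustration and do not reproduce the real dataset's values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic columns (assumed): Pregnancies grows with Age, Glucose does not.
age = rng.integers(21, 70, size=300)
pregnancies = (age - 21) // 6 + rng.integers(0, 3, size=300)
glucose = rng.normal(120, 30, size=300)

df = pd.DataFrame({"Age": age, "Pregnancies": pregnancies, "Glucose": glucose})

# Pairwise Pearson correlation; the result is a symmetric matrix with 1s on the diagonal.
corr = df.corr()
print(corr.round(2))
```

A matrix like `corr` can then be passed to a plotting function such as `seaborn.heatmap(corr, annot=True)` to draw the heatmap itself.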

Training of machine learning model

Now, let’s keep only three variables that are strongly correlated with the outcome of the dataset (Diabetes/not Diabetes).

Figure 9: Fraction of dataset used for ML training

We can explore the basic statistics of this dataset.

Figure 10: Statistics of dataset used for ML training

Creation of machine learning model in order to predict diabetes cases based on unseen data

Using the narrowed dataset above, we trained several models based on various classification algorithms, namely:

K-Nearest Neighbours
Support Vector Classifier
Nu Support Vector Classifier
Decision Trees
Random Forest
AdaBoost
Gradient Boosting
Gaussian Naive Bayes
Linear Discriminant Analysis
Quadratic Discriminant Analysis
Multilayer Perceptron.

Figure 11: Accuracies of the trained models

According to the calculated accuracies, the best model was Gradient Boosting.

Figure 12: Details of model trained with Gradient Boosting algorithm

Figure 13: Classification matrix of model trained with Gradient Boosting algorithm

Based on the classifiers’ accuracies, the best-performing model was the Gradient Boosting classifier, with an accuracy of 79.41% and a low log loss of 0.52154. This model, however, had a comparatively high fitting time of 0.0518 sec.
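A comparison of this kind can be sketched with scikit-learn as below. This is not the notebook's code: the data is synthetic (three features standing in for the narrowed dataset), only three of the listed classifiers are included for brevity, and the split ratio and default hyperparameters are assumptions, so the numbers it prints will not match those reported here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the narrowed dataset: 3 features, binary target.
X, y = make_classification(n_samples=392, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

models = {
    "K-Nearest Neighbours": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
}

results = {}
for name, model in models.items():
    # Time only the fit call, then score on the held-out split.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_time = time.perf_counter() - start
    results[name] = {
        "accuracy": accuracy_score(y_test, model.predict(X_test)),
        "log_loss": log_loss(y_test, model.predict_proba(X_test)),
        "fit_time_s": fit_time,
    }

# Pick the model with the highest test accuracy.
best = max(results, key=lambda name: results[name]["accuracy"])
for name, metrics in results.items():
    print(name, metrics)
print("best:", best)
```

Recording log loss and fitting time alongside accuracy makes the trade-off visible: a model can win on accuracy while being slower to fit than simpler alternatives.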

Best model

Best Model	Accuracy (%)	Log Loss	Fitting Time (s)
Gradient Boosting	79.41	0.52154	0.0518

Creation of Flask API

As a next step, we exported the trained model based on the Gradient Boosting classifier and created a Flask API. This API takes arbitrary values of Glucose, BMI and Age as input parameters, and the trained model makes a prediction based on those values. A schematic representation of how this API works is shown below:

Figure 14: Schematic representation of Flask API's functionality
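A minimal sketch of such an API is given below. It is an illustration under stated assumptions rather than the project's implementation: the `/predict` route, the JSON field names and the use of `pickle` for export are hypothetical, and the model is trained on synthetic data so that the example is self-contained.

```python
import pickle

import numpy as np
from flask import Flask, jsonify, request
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in for the exported model: train on synthetic data, then round-trip
# through pickle to mimic exporting the model and loading it in the API.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(120, 30, 200),   # Glucose
    rng.normal(32, 6, 200),     # BMI
    rng.integers(21, 70, 200),  # Age
])
y = (X[:, 0] > 125).astype(int)  # synthetic rule, not a medical criterion
model = pickle.loads(pickle.dumps(
    GradientBoostingClassifier(random_state=0).fit(X, y)))

FEATURES = ["Glucose", "BMI", "Age"]  # assumed JSON field names
app = Flask(__name__)


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    payload = request.get_json(silent=True) or {}
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        return jsonify(error=f"missing fields: {missing}"), 400
    # Build a single-row feature matrix in the order the model was trained on.
    row = [[float(payload[name]) for name in FEATURES]]
    return jsonify(prediction=int(model.predict(row)[0]))


# Exercise the endpoint in-process with Flask's test client.
with app.test_client() as client:
    response = client.post(
        "/predict", json={"Glucose": 150, "BMI": 33.6, "Age": 50})
    print(response.status_code, response.get_json())
```

To test with Postman instead, the app would be started with `app.run()` and a POST request with the same JSON body sent to the `/predict` URL of the local server.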

Afterwards, once we execute the Python script that creates the API, we can run several tests through Postman. Some of these tests are illustrated below:

Figure 15: Indicative example of how predictions are made via Flask API

Figure 16: Indicative examples of Flask API requests

A more detailed description of this project’s implementation in Python can be found in this GitHub repository: Link to GitHub repository
