A fine-needle aspiration-based protein signature discriminates benign from malignant breast lesions
Bo Franzen et al., Molecular Oncology 12 (2018) 1415–1428
The data consist of 72 samples/patients, with a few patient duplicates, each characterized by 172 different protein biomarkers. This is therefore an n << p data set, which is not uncommon when using state-of-the-art analytical technologies.
A better balance between n and p is possible using feature extraction or data compression. However, the data presented here are already pre-processed, as outlined below. Some biomarkers have many missing values, but all variables are kept in the first round of analysis. In the paper by Franzen et al., missing values are assigned the seemingly arbitrary value of -3.
In this context, a missing value means, in most cases, a biomarker that was not detected. A non-detected biomarker can itself carry valuable information: we can assume that some biomarkers are highly indicative of one or the other of the classes examined, so the absence of a particular biomarker could be informative. The problem we face is how to impute the missing values, and which numerical value to choose for the imputation.
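One way to keep the "not detected" information separate from the imputed value is to add a binary missingness flag per feature. The following is a minimal sketch on a made-up toy matrix (the values and the name `X_toy` are hypothetical, not taken from the Franzen data), using the `add_indicator` option of scikit-learn's `SimpleImputer`. It is not part of the analysis in this post, only an illustration of the idea.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix: 4 samples x 3 biomarkers, NaN = not detected
X_toy = np.array([
    [1.2, np.nan, 0.5],
    [0.8, 2.1, np.nan],
    [np.nan, 1.9, 0.7],
    [1.0, np.nan, 0.6],
])

# add_indicator=True appends one 0/1 column for every feature that
# contained missing values when the imputer was fitted
imputer = SimpleImputer(strategy="constant", fill_value=0, add_indicator=True)
X_out = imputer.fit_transform(X_toy)

print(X_out.shape)  # 3 imputed columns followed by 3 indicator columns
print(X_out[0])     # first sample: imputed values, then missingness flags
```

With the flags present, a classifier can learn from the absence of a biomarker regardless of which constant is used to fill the gap.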
This blog post examines various imputation techniques to see which are more or less suitable from a machine learning perspective.
An initial classification has been published using XGBoost, with the categorical values Benign and Cancer coded as 0 and 1, respectively. https://jdaco.se/xgboost-for-classification-of-biomarker-dataset/
Essential read:
Chapter 8, "Statistical Imputation", in Data Preparation for Machine Learning by Jason Brownlee
https://machinelearningmastery.com/data-preparation-for-machine-learning/
# Import libraries
from numpy import mean
from numpy import std
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from matplotlib import pyplot
import re
# Excel tabular data is read from a local folder
data = pd.read_excel("C:\\Data\\PEA_data_diagnosis_BCa_mv.xlsx")
"""
1) to rename and simplify long feature names
2) change categorical to int values for the target
3) replace "Excluded " notations in the data with -3
"""
# keep only the first whitespace-delimited token of each column name
new_names = [re.split(r'\s', str(col))[0] for col in data.columns]
# assign directly; set_axis(..., inplace=True) is no longer supported in pandas 2.x
data.columns = new_names
data=data.replace(to_replace="B",value=0)
data=data.replace(to_replace="C",value=1)
data=data.replace(to_replace="Excluded", value=-3)
data.head(1)
| | SAMPLE | Patient | Short | Final | 101_IL-8 | 102_TNFRSF9 | 103_TIE2 | 105_CCL7 | 106_CD40-L | 107_IL-1 | ... | 185_ADAM-TS | 187_RSPO3 | 188_FR-gamma | 189_CEACAM5 | 190_VEGFR-3 | 191_MUC-16 | 192_WIF-1 | 194_FCRLB | 195_ANXA1 | 196_FR-alpha |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 104 | 0 | Benign | 9.65 | 5.39 | 1.0 | 1.49 | 2.27 | 0.18 | ... | 0.9 | 1.1 | 5.09 | 0.83 | -0.34 | 6.9 | 6.06 | -0.26 | 5.6 | 9.22 |
1 rows × 175 columns
pd.set_option('display.max_rows', data.shape[1]) # to be able to see all features/independent variables
missing_biomarkers=data.isnull().sum() # finds number of NaN for each feature (column wise)
#print(missing_biomarkers)
pyplot.figure(figsize=[10, 6])
pyplot.xlabel('Feature id')
pyplot.ylabel('Number of missing values')  # isnull().sum() gives counts, not percentages
pyplot.title('Missing values per feature')
missing_biomarkers.plot(kind='bar', xticks = np.arange(0, 180, 10))
pyplot.show()
n_miss = data.isnull().sum(axis=1)  # number of missing values in each row
perc = round((n_miss / data.shape[1]) * 100, 1)  # percentage of missing values in each row
# print('Percentage missing values in each row')
# print(perc)
#pyplot.scatter(perc[0],perc[1])
pyplot.figure(figsize=[10, 6])
pyplot.xlabel('Patient / sample number')
pyplot.ylabel('% missing values')
pyplot.title('Missing biomarkers per sample/patient')
perc.plot(kind='bar', xticks = np.arange(0, 75, 5))
pyplot.show()
Clearly, there are many missing values, both per sample and per biomarker. Is there a preferred way to replace the NaNs? Since Franzen et al. kept all independent variables, the same is done in this study.
The Scikit-Learn SimpleImputer can impute mean, median, most_frequent, and constant values for the NaNs.
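To make the four strategies concrete, here is a small sketch on a single made-up column (the numbers are hypothetical and chosen only so that the strategies give different answers). It records the value each strategy puts in place of the one NaN. Note that `strategy='constant'` without an explicit `fill_value` fills numeric data with 0.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single feature with one missing entry (last row)
col = np.array([[1.0], [2.0], [2.0], [7.0], [np.nan]])

filled = {}
for s in ["mean", "median", "most_frequent", "constant"]:
    imp = SimpleImputer(strategy=s)
    # value written into the position of the NaN
    filled[s] = float(imp.fit_transform(col)[-1, 0])

print(filled)
# {'mean': 3.0, 'median': 2.0, 'most_frequent': 2.0, 'constant': 0.0}
```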
Let's examine the cross-validated accuracy results for these different replacements of NaNs using a RandomForestClassifier and LogisticRegression, respectively.
#Split the data into X and y
X = data.iloc[:,4:175]
y = data.iloc[:,2]
# evaluate each strategy on the dataset
results = list()
strategies = ['mean', 'median', 'most_frequent', 'constant']
for s in strategies:
    # create the modeling pipeline
    pipeline = Pipeline(steps=[('i', SimpleImputer(strategy=s)), ('m', RandomForestClassifier())])
    # evaluate the model
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # store results
    results.append(scores)
    print('<>%s %.3f (%.2f)' % (s, mean(scores), std(scores)))
<>mean 0.908 (0.09)
<>median 0.895 (0.08)
<>most_frequent 0.937 (0.08)
<>constant 0.928 (0.08)
# plot model performance for comparison
pyplot.boxplot(results, labels=strategies, showmeans = True, vert=False)
pyplot.show()
from sklearn.linear_model import LogisticRegression
results1 = list()
for s in strategies:
    # create the modeling pipeline
    pipeline = Pipeline(steps=[('i', SimpleImputer(strategy=s)), ('m', LogisticRegression())])
    # evaluate the model
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # store results
    results1.append(scores)
    print('>%s %.3f (%.3f)' % (s, mean(scores), std(scores)))
>mean 0.932 (0.078)
>median 0.936 (0.078)
>most_frequent 0.936 (0.068)
>constant 0.959 (0.063)
# plot model performance for comparison
pyplot.boxplot(results1, labels=strategies, showmeans = True, vert=False)
pyplot.show()
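Of the two classifiers studied, logistic regression delivers the best accuracy, reaching 0.959 with constant imputation. For the random forest, most_frequent (0.937) scores marginally higher than constant (0.928), but the difference is well within one standard deviation, so a constant is a reasonable choice for both methods. With strategy='constant' and no fill_value given, the constant used for numeric data is 0.
In the next step, logistic regression is studied with imputation constants in the range -99 to 9.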
results3 = list()
constants = [-99, -9, -3, 0, 0.1, 0.3, 3, 9]
for s in constants:
    # create the modeling pipeline
    pipeline = Pipeline(steps=[('i', SimpleImputer(strategy='constant', fill_value=s)),
                               ('m', LogisticRegression())])
    # evaluate the model
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
    scores = cross_val_score(pipeline, X, y, scoring='accuracy', cv=cv, n_jobs=-1)
    # store results
    results3.append(scores)
    print('>imputation constant %s >accuracy %.3f (std %.3f)' % (s, mean(scores), std(scores)))
>imputation constant -99 >accuracy 0.881 (std 0.112)
>imputation constant -9 >accuracy 0.945 (std 0.072)
>imputation constant -3 >accuracy 0.948 (std 0.067)
>imputation constant 0 >accuracy 0.956 (std 0.065)
>imputation constant 0.1 >accuracy 0.953 (std 0.065)
>imputation constant 0.3 >accuracy 0.956 (std 0.064)
>imputation constant 3 >accuracy 0.916 (std 0.084)
>imputation constant 9 >accuracy 0.861 (std 0.118)
pyplot.boxplot(results3, labels=constants, showmeans = True, vert=False)
pyplot.xlabel('Accuracy')
pyplot.ylabel('Imputation constant')
pyplot.show()
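Conclusion
Imputation with a constant in the range of -3 to 0.3 seems to be an acceptable choice when using logistic regression for classification: accuracy stays between 0.948 and 0.956 in this range and drops markedly toward the extremes (0.881 at -99 and 0.861 at 9).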