XGBoost for classification of biomarker dataset


An initial application of XGBoost classification to data published in:

A fine-needle aspiration-based protein signature discriminates benign from malignant breast lesions. Bo Franzen et al.
Molecular Oncology 12 (2018) 1415–1428
The data consist of 72 samples/patients, including a few patient duplicates, each characterized by 172 different protein biomarkers. Thus, this is an n << p data set, which is not uncommon when using state-of-the-art analytical technologies.
A better balance between n and p can be obtained using feature extraction and/or data compression; however, the data presented here are simply pre-processed as outlined below.
The data contain many missing values for some of the biomarkers; however, all variables are kept in the first round of analysis. Missing values are assigned the arbitrary value of -3.
All scripts were written in Python 3.7 and run in Jupyter Notebook 6.0.3.
The categorical values Benign and Cancer are coded as 0 and 1, respectively.
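
The pre-processing described above could be sketched as follows. This is a minimal illustration on a made-up table, not the original notebook: the column names `Diagnosis`, `ProteinA`, and `ProteinB` and all values are hypothetical. Only the fill value -3 and the 0/1 coding come from the text.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the biomarker table (hypothetical column names and values)
df = pd.DataFrame({
    "Diagnosis": ["Benign", "Cancer", "Cancer", "Benign"],
    "ProteinA": [1.2, np.nan, 0.7, 2.1],
    "ProteinB": [np.nan, 3.4, np.nan, 0.5],
})

# Code the categorical outcome: Benign -> 0, Cancer -> 1
y = df["Diagnosis"].map({"Benign": 0, "Cancer": 1})

# Replace missing biomarker values with the arbitrary constant -3
X = df.drop(columns="Diagnosis").fillna(-3)
```

XGBoost can also treat NaN entries as missing on its own, so the constant fill is a modelling choice rather than a requirement.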

Some basic model settings:

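from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# split data into train and test sets (60 % train, 40 % test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40, random_state=7)
model = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=99)
# track classification error and log loss on both sets during training
# (XGBoost 2.0 and later expect eval_metric in the XGBClassifier constructor instead)
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=False)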

Cross validation approach:

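from sklearn.model_selection import RepeatedStratifiedKFold

# Cross-validation intended to keep patients together: 10 folds, 5 repeats
# (RepeatedStratifiedKFold stratifies by class only; it does not group samples by patient)
#model = XGBClassifier()
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)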

Confusion matrix
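
A confusion matrix for the two classes can be computed with scikit-learn. The labels below are made up for illustration and are not results from the data set. With the model fitted above, the inputs would presumably be `y_test` and `model.predict(X_test)`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only (0 = Benign, 1 = Cancer)
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

sensitivity = tp / (tp + fn)  # fraction of cancers correctly flagged
specificity = tn / (tn + fp)  # fraction of benign lesions correctly cleared
```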

Feature Importance