The current enormous interest in AI and machine learning (ML) is welcome and promotes the development of new applications as well as the field itself. We have almost three decades of experience with machine learning, statistics and multivariate data analysis, primarily on the border between chemistry and medicine. The intention with this and future blog posts is to teach you how to perform advanced modeling and data analysis, and to make you proficient in spotting and avoiding the most common mistakes with machine learning and other advanced techniques. Code examples of modeling patterns will be given. Our hope is that you will be able to spot the gold nuggets among the glimmering stones.
For a newcomer to the field of AI and machine learning, it can be seductively easy to believe in all the flashy success stories, which give the impression that more or less anything can be achieved with ease given a deep neural network and enough computational power. But beware, there is a lot more to modeling than applying one fancy ML tool. The ML tool should be chosen based on the type and structure of the data, on how well it lets us incorporate any prior knowledge we may have and, not least, with the end result in mind.
Important issues to be aware of:
1. The problem is really easy, and no machine learning is necessary
- Usually a good sign from a scientific perspective – you are measuring the right things with respect to the research question.
- When used to demonstrate the power of machine learning it is a way of cheating or at least overstating.
2. The data are not representative of what the application is intended to be used for
- Archetypical manifestation – everything looked excellent and promising when developing the application – but once deployed, the AI could not live up to the promises.
3. Lack of validation
- Creates overly optimistic ideas about the model’s performance.
4. Improper validation
- This mistake is subtler and harder to spot than a lack of validation. The result is, as always, a too-good-to-be-true estimate of model performance.
With this said, let us have a first look at a breast cancer dataset (Franzen et al., 2018). This dataset contains Olink protein expression data from 38 samples from 33 patients with benign breast lesions and 34 samples from 25 breast cancer patients. The two Olink panels have measured concentrations for 167 different proteins. In this post we will start out with two simple but effective tools, namely principal component analysis and t-tests.
Principal Component Analysis (PCA)
One of the first actions with a new dataset comprising four or more variables is always to plot the scores from a principal component analysis (see Figure 1). Coloring the samples by benign or cancerous tells us that there are major differences in protein profile between the benign and cancerous lesions.
PCA is a projection technique that shows the major directions of variance in a dataset. It can be likened to the old Egyptian paintings, which show people from their most characteristic and informative perspective. It is a very good technique for assessing whether the problem is a very easy one. Structure that shows up in the first few components represents the major patterns, the ones most easily seen in the dataset.
Import the data and take care of excluded values by setting them to an artificially low value (here -3, since the data have already been log-scaled). Split the dataset into measurements, X, and reference data (groups, patients, etc.), Ref.
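The original import cell is not reproduced here, so the following is only a minimal sketch of the preparation step. It assumes that excluded values are stored as NaN and uses a tiny made-up table; the column names (`sample`, `patient`, `group`, `protein_A`, `protein_B`) are hypothetical and not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the imported table; the real file and its
# column names are not given in this post.
raw = pd.DataFrame({
    "sample": ["s1", "s2", "s3", "s4"],
    "patient": ["p1", "p1", "p2", "p3"],
    "group": ["benign", "benign", "cancer", "cancer"],
    "protein_A": [1.2, np.nan, 3.4, 2.8],
    "protein_B": [0.5, 0.7, np.nan, 1.9],
})

# Assumed set of reference (non-measurement) columns.
ref_columns = ["sample", "patient", "group"]

# Excluded values (assumed to be NaN here) are set to an artificially
# low value, -3, on the already log-scaled data.
LOW_VALUE = -3.0
X = raw.drop(columns=ref_columns).fillna(LOW_VALUE)
Ref = raw[ref_columns].copy()

print(X)
print(Ref)
```

X and Ref keep the same row index, so group labels in Ref can later be matched to rows of X.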
Fit a PCA model using variables centered and scaled to unit variance.
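As an illustration of what such a fit involves, here is a minimal sketch that computes PCA directly with a singular value decomposition in NumPy. This is not necessarily the library call used for the figures in this post, and the data are random numbers standing in for X.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for X: 20 samples x 5 variables (not the Olink data).
X = rng.normal(size=(20, 5))

# Center each variable and scale it to unit variance.
# Note: constant variables have zero standard deviation and would have
# to be dropped before this step.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition: X_scaled = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(X_scaled, full_matrices=False)

n_components = 3
scores = U[:, :n_components] * S[:n_components]  # sample coordinates
loadings = Vt[:n_components].T                   # variable weights
explained = S**2 / np.sum(S**2)                  # variance fraction per PC

print(scores.shape, loadings.shape)
print(explained[:n_components].round(3))
```

The scores are what gets plotted in a scores plot (one point per sample), while the loadings describe how each variable contributes to each component.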
Plot scores using ggplot from plotnine.
Figure 1. Scores from principal component analysis
Dig a bit deeper by looking at scores for PC 2 vs PC 3. (Skip PC 1 here because we saw most of the separation along PC 2.)
Figure 2. Scores for PCA model, part 2.
Scores are nearly perfectly separable by a line – this is a relatively simple dataset that should be possible to predict well with almost any ML technique. Let’s check the loadings …
Figure 3. PCA loadings. Variables with loading values far from the origin of coordinates in the direction of one of the groups are important for group separation.
The variables “163_FGF-BP1” and “156a_HGF” seem to be important and higher in the benign lesions than in the cancerous tumours.
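The visual reading of a loadings plot can also be approximated numerically. The sketch below is one possible way to do it, not the procedure used for Figure 3: it projects each variable's loading onto the direction in the score plot that points from the benign group mean to the cancer group mean. The data are synthetic, the variable names are hypothetical, and PC 1 and PC 2 are used because that is where the separation lands in this toy example.

```python
import numpy as np

rng = np.random.default_rng(42)
names = [f"var_{i}" for i in range(6)]      # hypothetical variable names
is_cancer = np.repeat([False, True], 20)    # 20 benign, 20 cancer samples

# Synthetic data: var_0 and var_1 are made lower in the cancer group.
X = rng.normal(size=(40, 6))
X[is_cancer, :2] -= 5.0

# Center, scale to unit variance, and decompose (PCA via SVD).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
U, S, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U[:, :2] * S[:2]
loadings = Vt[:2].T

# Unit vector in the score plot pointing from the benign group mean
# towards the cancer group mean.
direction = scores[is_cancer].mean(axis=0) - scores[~is_cancer].mean(axis=0)
direction /= np.linalg.norm(direction)

# Projection of each variable's loading onto that direction:
# positive = higher in cancer, negative = higher in benign.
projection = loadings @ direction
order = np.argsort(-np.abs(projection))
for i in order:
    print(f"{names[i]}: {projection[i]:+.2f}")
```

Variables at the top of the printed list lie far from the origin along the line connecting the two groups, which is the same criterion as the one used when reading the loadings plot by eye.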
The t-test – a simple but effective tool for ranking variables
For a binary classification problem like this one it is also useful to run a t-test for each variable (with Bonferroni correction for multiple testing) and rank the variables by how well they discriminate between the classes. Let’s get on with it: run a simple t-test on all columns and plot the five columns that best separate the cancers from the benign lesions.
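A minimal sketch of such a ranking is shown below, using `scipy.stats.ttest_ind` on synthetic data. The variable names and the injected group difference are made up for illustration, and the default Student's t-test (equal variances) is an assumption about which variant is meant.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n_per_group, n_vars = 30, 8

# Synthetic stand-in for X with hypothetical variable names.
X = pd.DataFrame(
    rng.normal(size=(2 * n_per_group, n_vars)),
    columns=[f"var_{i}" for i in range(n_vars)],
)
group = np.array(["benign"] * n_per_group + ["cancer"] * n_per_group)

# Inject a clear difference: var_3 is lower in the cancer group.
X.loc[group == "cancer", "var_3"] -= 3.0

benign = X[group == "benign"].to_numpy()
cancer = X[group == "cancer"].to_numpy()

# One two-sample t-test per column (axis=0 tests every variable at once).
t_stat, p_val = stats.ttest_ind(benign, cancer, axis=0)

results = pd.DataFrame({"t": t_stat, "p": p_val}, index=X.columns)

# Bonferroni correction: multiply by the number of tests, cap at 1.
results["p_bonferroni"] = np.minimum(results["p"] * n_vars, 1.0)

# Smallest p-values first; NaN results (e.g. from constant columns)
# are placed last by sort_values.
top5 = results.sort_values("p").head(5)
print(top5)
```

A positive t statistic here means that the variable is higher in the benign group than in the cancer group.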
Runtime warnings occur because some variables are constant and therefore incompatible with the t-test. We can safely ignore these warnings for now.
Again we plot using ggplot from plotnine. (Because of a bug we need to supply the fill aesthetic for geom_point to avoid an error.)
Figure 4. Box-and-whisker plots together with a scatter of data points, grouped by type of lesion, for the five best separating variables according to a t-test.
No single variable displays perfect group separation, which is what we expected, but all five still separate the groups quite well. At least the first three are found in the lower left corner of the PCA loadings plot (as expected, because they are lower in the cancerous lesions group).
So far we have seen that the collected data are highly relevant for the task of classifying lesions as either benign or cancerous. From a data science perspective the problem is easy and only borderline warrants the use of more advanced ML tools. We will go ahead and use them anyway in future blog posts, to demonstrate other modeling aspects.
- Bo Franzen et al. Molecular Oncology 12 (2018) 1415–1428
- Wikipedia, Principal Component Analysis, https://en.wikipedia.org/wiki/Principal_component_analysis (Accessed: 2020-07-08)