The generation and analysis of data has always been fundamental to the practice of science.
About a decade ago, a Cambrian explosion of new data-generating hardware technologies began, and it is still expanding rapidly.
This is driving an exponential increase in the use of data, and it is revolutionizing the practice of data analysis.
This means that science students need to learn additional concepts that are fundamental to these emerging technologies, mid-career scientists will need to refresh their understanding of what they learned in school, and legacy businesses will need to adapt their scientific infrastructure and data infrastructure to remain competitive.
The buzzwords in the general economy today are artificial intelligence (AI) and machine learning (ML). However, there are other data methods that are more fundamental than AI and ML.
These are statistical methods and modelling & simulation (M&S).
The previous post was about how statistical methods and M&S are tools for technology innovation and perhaps even social innovation.
This post is about how these computational tools are growing more essential to the practice of science and technology itself.
I will use examples of how statistical methods are becoming important.
The power of multivariate analysis is growing
Chemometrics uses mathematics and computational methods to interpret data obtained from chemical analysis.
This is important, not just to chemistry, but also to chemical engineering, medicine, life sciences, manufacturing, astronomy and any other field that deals with matter.
There are many different chemometric methods. One important type is multivariate analysis. It can be a very effective method of analysis when a sample is measured with multiple methods.
An earlier post explained that technological advances are increasing the number of new hyphenated methods, each of which is the marriage of two or more analytical techniques. These methods, in turn, enable the study of complex systems, and thus require familiarity with and skill in multivariate analysis.
The most common tool for multivariate analysis is Principal Component Analysis (PCA). The slide below, courtesy of Professor Peter Wentzell of Dalhousie University, is a meta-analysis of select publications where PCA was used. It shows increasing usage of PCA since 2002 in the chemometrics community. Now PCA is crossing over into broad use.
The power of multivariate analysis such as with PCA can be illustrated in the following examples.
Fingerprinting of complex systems is enabled by multivariate analysis
Let’s talk about this over red wine. How do you characterize a wine, establish lot-to-lot consistency, and distinguish one from another? The old way is the subjective assessment of a winemaker or a professional wine taster.
There are also industry-established analytical methods to measure pH, sugar content, turbidity, and colour. Though objective, these fall short of what a human can distinguish.
The current state-of-the-art is a reductionist approach of separating the components in a wine and measuring their quantities. This uses GC-MS (gas chromatography – mass spectrometry) for the volatile compounds or HPLC-MS (high performance liquid chromatography – mass spectrometry) for the wine itself. Presently, the quality of the results depends on the skill of the scientists and on the configuration of instruments they choose. For now, this method is expensive and is in the domain of R&D.
This progression from human, to simple instruments, to advanced instruments is the universal trajectory of technology. All fields are going this way. Scientists need to know how to handle this data when the advanced instruments come into their field of practice.
Early in my own training, I got away with simple linear regression, which was introduced in high school. In university, I learned error analysis. In graduate studies, I had to curve fit single and double exponential decay data. Fortunately, I had smart and friendly postdocs who showed me what to do. The statistics toolkit has grown since then. Multivariate analysis, and specifically PCA, will be one of its important tools.
The following application notes are from Agilent Technologies, a manufacturer of GC-MS and HPLC-MS instruments.
Figure 2 (top) is a GC-MS chromatogram of a typical red wine. Each peak corresponds to a component in the wine, characterized by the time it comes off the chromatograph (Figure 2, bottom). The area under each peak is roughly proportional to the quantity of that component.
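To make the peak-area idea concrete, here is a minimal sketch that integrates two simulated peaks with the trapezoidal rule. The Gaussian peak shape, the retention time of 5 minutes and the peak width are all assumptions chosen for illustration (real chromatographic peaks often tail), and the function names are hypothetical rather than taken from any instrument software.

```python
import math

def gaussian_peak(t, height, center, width):
    """Idealized peak shape (an assumption; real peaks are rarely perfect Gaussians)."""
    return height * math.exp(-0.5 * ((t - center) / width) ** 2)

def peak_area(times, signal):
    """Area under a sampled signal using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(times)):
        area += 0.5 * (signal[i] + signal[i - 1]) * (times[i] - times[i - 1])
    return area

# Hypothetical time axis: 0 to 10 minutes in steps of 0.01 minutes.
times = [i * 0.01 for i in range(1001)]

# Two simulated runs of the same compound; the second has twice the signal height.
small = [gaussian_peak(t, 1.0, 5.0, 0.2) for t in times]
large = [gaussian_peak(t, 2.0, 5.0, 0.2) for t in times]

# If area tracks quantity, doubling the signal should double the area.
ratio = peak_area(times, large) / peak_area(times, small)
```

In this idealized case the ratio of the two areas matches the ratio of the peak heights. On a real instrument the relationship is only approximate, which is why the text says "roughly".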
This method generates a lot of data for each wine. Multivariate analysis allows different wines to be compared.
Let’s take 55 different wines from four different French chateaux (anonymized B1 to B4), and analyze each by HPLC-MS. Each would generate a profile like that of Figure 2. Chemometrics software analyzes this data from all 55 of these samples, converts them into a set of uncorrelated variables called components, and the results are plotted along two dimensions or three dimensions.
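The conversion into uncorrelated components can be sketched in a few lines of NumPy. This is not the chemometrics software used in the application note; it is a minimal PCA by singular value decomposition on made-up data, where the number of groups, the number of wines per group and the 50 "peak areas" per wine are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 3 producers, 10 wines each, 50 measured peak areas per wine.
n_groups, n_per_group, n_features = 3, 10, 50
group_profiles = rng.normal(0.0, 1.0, size=(n_groups, n_features))
labels = np.repeat(np.arange(n_groups), n_per_group)
X = group_profiles[labels] + rng.normal(0.0, 0.2, size=(labels.size, n_features))

def pca_scores(X, n_components):
    """PCA via SVD of the mean-centered data.

    Returns the scores (one row per sample) and the fraction of total
    variance explained by each retained component.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return scores, explained[:n_components]

scores, explained = pca_scores(X, n_components=3)

# The components are uncorrelated: the covariance matrix of the scores is diagonal.
cov = np.cov(scores, rowvar=False)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, coloured by `labels`, gives the kind of two-component plot described in the text. Each wine is reduced from 50 numbers to two or three.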
Figure 3 shows the PCA analysis of these 55 wines analyzed by HPLC-MS, made from the four chateaux.
Wines from B2 and B3 are close together under components 1 and 2 (X, Y plane). So we need to include a third dimension, where they are further separated under components 2 and 3 (Y, Z plane). This means that B2 and B3 are relatively similar, but all wines cluster into distinct sets in a three-component plot. This type of analysis can distinguish which chateau in France made each wine.
Figure 4 adds data from two batches of wine from China (C1 and C2). They are also clearly distinguishable from the French wines. Another batch of wine of unknown origin (Unk1) is clearly not from the same source as any of the other wines.
These graphs also demonstrate the importance of data visualization. Trying to display the HPLC-MS data of over 80 wines in a traditional table does not convey a clear message to the audience. Figures 3 and 4 do.
Advanced instruments requiring multivariate analysis continue to be introduced
To emphasize this endless technological trajectory, just two years ago an exciting new technology was introduced by Horiba Scientific using a different analytical strategy for complex mixtures.
Rather than deconstruct a mixture using chromatography, Horiba uses multiple spectroscopic analyses to capture a molecular “fingerprint” of the composition.
Phenolics are a key class of compounds in wine, comprising tannins, flavonols, anthocyanins and catechins. They affect the unique color, flavor, aroma and mouthfeel of a wine. They can be detected by their absorption of light and by their fluorescence. All of these compounds have broad, overlapping fluorescence excitation and emission spectra.
The Horiba “A-TEEM spectroscopy” technology measures the light absorption and transmission, and fluorescence excitation and emission of the mixture very rapidly. In this approach, the “A-TEEM fingerprint” data must be analyzed by multivariate statistics to obtain scores of each component within an individual sample.
For this technology, the choice of multivariate analysis depends on sample type. Horiba recommends either PCA (Principal Component Analysis), PARAFAC (Parallel Factor Analysis), CLS (Classical Least Squares) or PLS (Partial Least Squares). This is what I mean by being familiar with a wide set of statistical tools.
Here is how hardware technology has changed. Figure 5 shows a fluorescence spectrum (green) of a wine sample. The inset shows the excitation spectrum (blue) and the fluorescence spectrum (green).
Back in my graduate school days, each of these spectra would take as much as 15 minutes to acquire. The excitation wavelength is fixed at a selected value, then the fluorescence spectrum is recorded one wavelength at a time using a monochromator. For the excitation spectrum, the fluorescence wavelength is fixed and the excitation spectrum is recorded one wavelength at a time.
Shortly thereafter, fluorescence spectrometers shortened the acquisition time to seconds by capturing the entire spectrum at once using a charge coupled device (CCD). The acquisition time soon dropped to sub-seconds as integrated circuit technology and computing power advanced.
The Horiba A-TEEM system captures a fluorescence spectrum from each excitation wavelength. Where it took 15 minutes to record one spectrum years ago, it now takes just over a minute to record over 300 spectra. How do you visualize such a result? Figure 6 is an example of the output.
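One way to picture this kind of output is as a matrix per sample, with one row per excitation wavelength and one column per emission wavelength. The sketch below only illustrates the shape of such data and how it is commonly unfolded for a two-way method like PCA; the grid sizes are hypothetical, the values are random placeholders, and none of this is Horiba's file format or software.

```python
import numpy as np

# Hypothetical grid sizes, chosen only for illustration.
n_samples, n_excitation, n_emission = 6, 300, 250

rng = np.random.default_rng(1)

# One excitation-emission matrix per sample (placeholder values, not real spectra):
# axis 0 = sample, axis 1 = excitation wavelength, axis 2 = emission wavelength.
eem_stack = rng.random((n_samples, n_excitation, n_emission))

# Unfold each matrix into one long row, so that the data becomes an ordinary
# samples-by-variables table that PCA can work on.
unfolded = eem_stack.reshape(n_samples, n_excitation * n_emission)
```

A three-way method such as PARAFAC works on the stacked array directly instead of unfolding it, which is one reason the choice of method depends on the data.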
It is very hard to compare different wines when they are presented as in Figure 6. Figure 7 is a simpler visualization of different red wines analyzed by this “fingerprinting method.” Two cognacs are also served here to better imbibe this visual concept.
The differences between each of the Figure 7 spectra of the wines are not clear to the naked eye.
Multivariate analysis of the data allows the wine samples to be distinguished. Figure 8 shows the results for these wines obtained from PARAFAC analysis (a) and PCA analysis (b). These analyses clearly distinguish where these wines were produced.
The Horiba technology and the HPLC-MS technology are two different approaches to this wine example. Their outputs serve different wine analysis problems that are too detailed (and out of scope) to explain here. The point is that the march of complex technologies continues, and with that, the importance of multivariate analysis continues to grow.
Multivariate analysis is also a powerful tool in life sciences
A previous post explained that cell culture media is a high value product. There are many different manufacturers of cell culture media. Their compositions are closely guarded trade secrets.
Cell culture media is used to grow cells that produce valuable biological medicines, and to grow stem cells that are now being developed as new therapies. To any manufacturer of these high value products, cell culture media is a critical raw material where lot-to-lot consistency must be checked by quality control.
The Horiba A-TEEM spectroscopy technology is an instrument that can be used for material fingerprinting and quality control.
Figure 9 shows A-TEEM spectra of cell culture media from three different suppliers. These spectra are not distinguishable to the naked eye.
Figure 10 is a 2-component PCA analysis of A-TEEM spectra of cell culture media from 4 different suppliers. Like the red wine example, products from different suppliers are clearly distinguishable.
Figure 11 is a 3-component PCA analysis of the A-TEEM spectra of cell culture media from one supplier, but of three different types. While all three of these media types are clustered together in a 2-component analysis (Figure 10), each one is clearly distinguishable in this analysis.
Figure 12 shows the effect of ageing on one specific cell culture media. The A-TEEM spectral analysis shows that the culture media is stable when freshly made, and also when it is stored in the cold. When left at ambient (room) temperature, it shifts upwards on component 2. This is clearly distinguishable on day 2 and day 5.
This method of analysis is now being used in biopharmaceutical applications.
When PCA doesn’t work, there are other multivariate methods
Big-leaf mahogany is a highly valued and widely traded tropical timber species, growing predominantly in the Brazilian rainforest, and also in pockets stretching northwards up to Mexico. After decades of extensive logging, it is now listed in Appendix II of the Convention on International Trade in Endangered Species of Wild Fauna and Flora (CITES), subjecting it to trade restrictions.
It can be confused with other species of mahogany. Hence, a rapid and low-cost method is required to identify different mahogany and other wood species.
Genetic testing would be highly accurate, but it is too costly and would take too long to perform to be of use here.
This is where near infrared spectroscopy (NIRS) can be a potential field-portable wood identification tool.
NIRS has been studied for applications since the 1950s, but its wider use did not occur until other optical and detector technologies became available in the 1990s. [SpectroscopyNow.com, July 1, 2014]
NIRS detects multicomponent molecular vibrations. Its spectra show complicated overlapping absorption bands. Hence, chemometrics is an essential tool for analyzing NIR spectra. Most of the early work used multiple linear regression analysis. NIRS became a powerful analytical tool when other multivariate analysis methods became available. [Anal Sci (2012) 28, 545]
Figure 13 shows some NIR spectra of different wood, superimposed using different colours on one chart. They all look almost the same.
Figure 14 (top left frame) shows the PCA analysis of these NIR spectra. No clustering of the data is observed.
That’s where investigators turned to Professor Peter Wentzell at Dalhousie University. He is an expert in chemometrics and signal analysis.
Projection Pursuit Analysis (PPA) is another long-standing multivariate analysis tool. Like PCA, it projects the data onto a small number of directions. Where PCA seeks the directions of maximum variance, PPA searches for “interesting” projections of the data, as scored by a projection index, and there is no unique definition for that index. One of Peter’s graduate students chose kurtosis as the projection index. Furthermore, the search for optimal projection vectors is a nonlinear problem which is difficult to solve using any existing software. Peter’s graduate student found an algorithm based on quasi-power methods. The computation is not as fast as PCA, but it is now possible to run a kurtosis-based PPA (kPPA) analysis.
Kurtosis is based on the fourth statistical moment (the first statistical moment is the mean, the second is the variance, the third is the skewness). Using kurtosis as the projection index in PPA seeks to find non-normality within the data, such as a flat or two-humped distribution.
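For readers who want to see the number itself, here is a minimal sketch of one common definition of kurtosis: the fourth central moment divided by the square of the variance. Under this definition a normal distribution scores 3, and a sample made of two well-separated groups scores much lower. The sample sizes and cluster positions below are made up for illustration.

```python
import random

def kurtosis(values):
    """Fourth central moment divided by the squared variance (population form)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2)

random.seed(0)

# A single bell-shaped group of measurements.
normal_sample = [random.gauss(0.0, 1.0) for _ in range(20000)]

# Two well-separated groups, as you would see along a direction that splits two clusters.
two_clusters = (
    [random.gauss(-3.0, 0.5) for _ in range(10000)]
    + [random.gauss(3.0, 0.5) for _ in range(10000)]
)

k_normal = kurtosis(normal_sample)    # close to 3
k_clusters = kurtosis(two_clusters)   # well below 3
```

This is why a low kurtosis along some direction is a hint that the data splits into groups along that direction.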
Figure 14 (bottom left frame) shows that PPA analysis is able to distinguish NIR spectra of different wood species.
Indeed, Peter notes that kPPA can identify clusters in data in some interesting cases where PCA fails to do so. kPPA has been used to distinguish NIR spectra of inks in forgeries, and NMR spectra of blood plasma taken from different salmon.
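A toy example shows how this can happen. The sketch below builds two-dimensional data with a large, featureless spread along one axis and two tight clusters along the other. PCA follows the large spread, while a search for the lowest-kurtosis direction finds the clusters. The search here is a brute-force scan over angles, used only because it is easy to read; it is not the quasi-power algorithm mentioned above, and the data and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500

# Hypothetical data: wide unimodal spread along x, two tight clusters along y.
x = rng.normal(0.0, 5.0, size=2 * n)
y = np.concatenate([rng.normal(-1.0, 0.2, size=n), rng.normal(1.0, 0.2, size=n)])
data = np.column_stack([x, y])
data = data - data.mean(axis=0)

def kurtosis(values):
    """Fourth central moment divided by the squared variance."""
    centered = values - values.mean()
    return np.mean(centered ** 4) / np.mean(centered ** 2) ** 2

def direction(angle_deg):
    """Unit vector at the given angle from the x axis."""
    theta = np.deg2rad(angle_deg)
    return np.array([np.cos(theta), np.sin(theta)])

# PCA: the first principal direction is the covariance eigenvector with the
# largest eigenvalue (eigh returns eigenvalues in ascending order).
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(data, rowvar=False))
pc1 = eigenvectors[:, -1]

# Brute-force projection pursuit: try every whole-degree direction and keep
# the one whose projected data has the lowest kurtosis.
angles = np.arange(0, 180)
kurtoses = np.array([kurtosis(data @ direction(a)) for a in angles])
best_angle = int(angles[np.argmin(kurtoses)])

k_along_pc1 = kurtosis(data @ pc1)
k_along_best = float(kurtoses.min())
```

Here `pc1` points almost exactly along x, where the two clusters overlap completely, while `best_angle` lands near 90 degrees, the direction that separates them. In real spectra with hundreds of variables an angle scan is not practical, which is why a dedicated optimization algorithm was needed.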
Multivariate analysis and other computational methods will be critical tools in a data-rich future
As we explore increasingly complex materials, and as we solve increasingly challenging problems, we will need advanced analytical methods.
It is not just the instruments that are important. We will also need to analyze complex data sets to understand the experimental results.
For a student of science or a mid-career scientist or a corporate R&D leader strategizing for the future, this means having knowledge and experience in modelling and simulation, or in statistical methods, and in the visual presentation of those results.