Machine learning (ML) and artificial intelligence (AI) are top of mind in every scientific field and in every industry these days. How does this affect scientists working in the lab and in research departments of companies?
A dire prediction of AI imagines a machine doing the scientist’s job, prompting suggestions that AI will lead to job losses. That’s a premature conclusion that does not reflect what is really happening. Rather, scientists working at the forefront of their fields are learning how to use these new tools to make their work more efficient and to develop new lines of research.
Let’s focus on ML, which is where the applications are for now. Machine learning means having software analyze large sets of data to look for trends or correlations, and using those findings to help characterize new observations and, in some cases, to perform tasks.
Research publications that I see now emerging in the physical sciences show that scientists are just starting to experiment with machine learning as a new tool for analyzing their data.
In this manner, machine learning is being used like other new tools in the practice of science.
Enabling AI and ML requires access to past, present and future data sets
Last year, it was common for business leaders at conference fireside chats and industry interviews to say “we have millions of data points from our many years of business and we are now applying AI to them to improve what we do.” This can indeed yield results sooner when AI/ML is applied to business data such as advertising, marketing, sales, logistics, etc.
In an R&D setting, the aspiration would be for all data generated in a lab to be in a form useful for ML.
This capability has many use cases. For example, when I started my first research job in DuPont’s Packaging and Industrial Polymers business, hundreds of polymer blends were made and tested every month to customize products for each new end-use application of its many customers. These were variations in the recipe: different combinations of plastics, fillers, stabilizers, and other additives. Finding patterns in this trove of past and current recipes would save materials, time, costs, and wastage in the development of future recipes. This example is applicable to any industry where formulations are developed: from paints and coatings, to composite materials, to prepared foods… That is just one category of use.
Setting up or converting a research lab into one where its data is usable for ML is a huge endeavour. Unlike business data, data generated in research labs is much more variable and complex. It comes from different instruments, from different types of experiments, and most of it is unstructured.
Nevertheless, this is starting to happen, certainly in new labs, and slowly within long-established labs.
Let’s look at what these future capabilities will include. This impacts what scientists need to know, and how the modern lab will operate.
A data-enabled lab needs four core capabilities
1. Understanding information sources and database practices (this is about talent)
A lot of useful data is already accessible in public and commercial domains. Scientists, lab managers, and technical leaders are most effective if they know about the latest sources and how to access them. Proficiency in using these tools is just as important as proficiency in using lab instruments and hardware. It is also important to know how to procure these sources.
An earlier post described the type of training in database and information access that is becoming important in the curriculum and continuing education of scientists and engineers.
2. Electronic lab notebooks and efficient data infrastructure (this is about infrastructure)
For machine learning to take place, the data has to be in a database. The efficient capture of this data means that electronic lab notebooks (ELNs) will become standard in a modern lab. Many labs have not yet migrated to ELNs. Another previous post described the path to adoption of ELNs.
ELNs are just the starting point. The back end of a lab’s data infrastructure also needs to be built. While business data might be suitable in a data warehouse, the variability of how research data is used means it will likely reside in a data lake. This allows data to be stored in its raw format, allowing it to be analyzed in any number of unanticipated ways in the future.
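As a rough illustration of the raw-format idea, here is a minimal Python sketch that keeps an instrument file byte-for-byte and writes a small metadata record beside it so the file can be found later. Everything in it is an assumption made for illustration: the function names `store_raw` and `find_by`, the metadata keys, the instrument name, and the use of a local folder in place of a real data lake service.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def store_raw(lake_dir, raw_bytes, metadata):
    """Save a file unchanged, plus a JSON metadata sidecar (hypothetical layout)."""
    digest = hashlib.sha256(raw_bytes).hexdigest()  # content hash used as file ID
    lake = Path(lake_dir)
    lake.mkdir(parents=True, exist_ok=True)
    (lake / f"{digest}.raw").write_bytes(raw_bytes)  # raw data, never reformatted
    (lake / f"{digest}.json").write_text(json.dumps(metadata, indent=2))
    return digest


def find_by(lake_dir, key, value):
    """Return the IDs of all stored files whose metadata has key == value."""
    return sorted(
        path.stem
        for path in Path(lake_dir).glob("*.json")
        if json.loads(path.read_text()).get(key) == value
    )


# Usage with made-up data: store one instrument trace, then look it up again.
raw_trace = b"time,signal\n0.0,1.2\n0.1,1.4\n"
with tempfile.TemporaryDirectory() as lake:
    file_id = store_raw(
        lake, raw_trace, {"instrument": "HPLC-2", "experiment_id": "EXP-001"}
    )
    matches = find_by(lake, "instrument", "HPLC-2")
    restored = (Path(lake) / f"{file_id}.raw").read_bytes()
```

The point of the design is that nothing is decided about the analysis at storage time: the raw bytes stay intact, and only the searchable description is structured.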
Managing lab infrastructure—such as facilities, waste disposal, equipment maintenance—will now include working with the IT department to ensure they set up and manage this data lake appropriately. That previous post on ELNs also covered this lab infrastructure.
3. Creating and maintaining “FAIR” data (this is about the working culture)
All of the data generated by different individuals, in the present and the future, needs to be complete and structured. The emerging practice to achieve this, discussed in an earlier post, is to adopt FAIR data standards: data that is Findable, Accessible, Interoperable and Reusable.
Presently, a common way to do this is to create templates within electronic lab notebooks for scientists to enter their experimental procedures and results. This ensures that all records are complete and consistent. However, a consequence of this approach is, to varying degrees, a user-unfriendly information entry system. The exploratory nature of science means that lots of experiments can present exceptions to templated record-keeping, and users can argue that this discourages the creativity that is fundamental to the scientific process.
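To make the template idea concrete, here is a minimal Python sketch of a record template that reports which required fields are still empty. The field names and the `ExperimentRecord` class are hypothetical; a real ELN template would be defined inside the ELN product itself rather than in code like this.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ExperimentRecord:
    """Hypothetical ELN template: every field must be filled before saving."""
    experiment_id: Optional[str] = None
    scientist: Optional[str] = None
    solvent: Optional[str] = None
    temperature_c: Optional[float] = None
    outcome: Optional[str] = None


def missing_fields(record):
    """List the template fields that have not been filled in yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


complete = ExperimentRecord("EXP-001", "A. Chemist", "ethanol", 25.0, "successful")
partial = ExperimentRecord(experiment_id="EXP-002", solvent="ethanol")

print(missing_fields(complete))  # []
print(missing_fields(partial))   # ['scientist', 'temperature_c', 'outcome']
```

The same rigidity that guarantees complete records is what makes exploratory work awkward: an experiment that does not fit the five fields has nowhere to go.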
Ironically, it will be some AI technology that enables FAIR data to be captured seamlessly without distracting from the scientist’s work. However, that technology is a long way off.
Until then, there are limited options. For existing data, one can go back and manually clean up the data; this is time intensive and not always successful. For future data, the approach is to focus on only certain types of experiments that will always have FAIR data collection standards, such as procedures that are performed repeatedly.
4. Understanding other statistical tools such as multivariate analysis (this is also about the talent)
When the objective is to find inherent value in large volumes of data, one needs a complete toolkit, rather than using the same tool for everything. This is why the most important talent is not in AI or ML alone, but in data analytics as a whole.
An earlier post covered the broader topic of statistical methods. Existing data analytics tools are often sufficient for a given purpose, without resorting to machine learning. For example, Principal Component Analysis (PCA), an unsupervised multivariate method, is becoming an accepted method for analyzing certain data sets. The important asset is scientists who know how to use these methods for the right purpose, and in complementary ways.
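For readers who have not used PCA, here is a minimal sketch in plain Python that finds the first principal component of a small table by power iteration on the covariance matrix. The numbers are invented (filler percentage, stabilizer percentage, and a measured stiffness for five imaginary blends), and the helper names are my own. In practice one would use an established statistics package and would usually standardize the variables first, which this sketch skips for brevity.

```python
import math

# Invented data: [filler %, stabilizer %, stiffness] for five imaginary blends.
blends = [
    [10.0, 1.0, 2.1],
    [20.0, 1.2, 3.9],
    [30.0, 0.9, 6.2],
    [40.0, 1.1, 8.0],
    [50.0, 1.0, 9.8],
]


def center(rows):
    """Subtract each column's mean so the data is centered on zero."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def first_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = mat_vec(cov, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return v, eigenvalue


cov = covariance(center(blends))
direction, variance = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance / total_variance  # share of variance on the first component

print([round(x, 3) for x in direction], round(explained, 4))
```

In this toy table the first component is dominated by filler content and stiffness moving together, which is the kind of pattern PCA is meant to surface without any labels.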
What implementation of ML looks like among early adopter labs
Here are two case studies of how machine learning is being implemented in industrial research labs. These have been presented at public conferences by two pharmaceutical companies. The approach would be the same in any other industry and even in an academic lab.
Case study 1: implementing a system to capture FAIR data into the data lake
At Amgen, a global pharmaceutical company, chemists have been making mostly unstructured entries into their electronic lab notebooks.
In 2016, a decision was made to create a screening decision tree for a specific type of experiment which they do repeatedly. There were already established procedures for these experiments and a sequence of steps to use. For these experiments, chemists were migrated to a tiered list of solvents and steps to use.
The objective of doing this was to create a consistent data collection process. The experiments are tiered, the complete data set for the conditions is collected, and the experimental results are rank-scored into result categories.
The system was operational in 2017. Since then, they have accumulated 3,000 experiments on 76 compounds and 11 processes.
Before this change, it took 30 minutes to query all the different ELNs to access the necessary data and to analyze the results. After this change, it takes 30 seconds. They use Spotfire for the visualization tool.
Their next step is to obtain detailed analytical data of the products of the reaction. This is a much larger effort.
The program director emphasized that stakeholder engagement was important throughout this process, as was the active encouragement of this practice. People in the audience were also asking how well this is going, because they also anticipate inertia. While the culture will affect the success of implementation, our collective experience of technology adoption shows that it does happen eventually. This is where this work practice is headed.
Case study 2: implementing ML to predict probability of success for a specific type of experiment
A group at Pfizer, a global pharmaceutical company, was interested in applying machine learning to their historical data to see whether it could predict the same process conditions developed by their scientists.
They have 12 years of experimental data. The project started with a clean-up of this data. The clean-up and curation took two years, not as continuous work but as time allowed:
- They had to clean up notations and terminology in each electronic lab notebook, because scientists all used different notations and formats.
- They had to quality check the data.
- They structured the data by reaction conditions and by classifying the experiments: heating methods and heating rate, cooling methods (such as liquid nitrogen), process methods, and so on. Some steps had to be assigned a score.
- They assigned consistent scores for reaction success factors.
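The kind of clean-up listed above can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the synonym table, the numeric score values, and the function names are hypothetical and are not taken from the Pfizer project.

```python
# Hypothetical lookup tables; a real project would curate these with the chemists.
SOLVENT_SYNONYMS = {
    "etoh": "ethanol",
    "ethyl alcohol": "ethanol",
    "meoh": "methanol",
    "ipa": "isopropanol",
    "2-propanol": "isopropanol",
}
OUTCOME_SCORES = {
    "successful": 3,
    "mixed success": 2,
    "unsuccessful": 1,
    "unknown": None,  # kept as a category, but given no numeric score
}


def normalize_solvent(raw):
    """Map the many ways a solvent is written onto one canonical name."""
    key = " ".join(raw.strip().lower().split())
    return SOLVENT_SYNONYMS.get(key, key)


def clean_record(raw):
    """Return a structured record with consistent notation and scoring."""
    outcome = raw.get("outcome", "").strip().lower()
    if outcome not in OUTCOME_SCORES:
        outcome = "unknown"  # anything unrecognized is flagged, not guessed
    return {
        "solvent": normalize_solvent(raw["solvent"]),
        "outcome": outcome,
        "score": OUTCOME_SCORES[outcome],
    }


cleaned = clean_record({"solvent": " EtOH ", "outcome": "Successful"})
flagged = clean_record({"solvent": "Ethyl  Alcohol", "outcome": "see page 12"})
print(cleaned)  # {'solvent': 'ethanol', 'outcome': 'successful', 'score': 3}
print(flagged)  # {'solvent': 'ethanol', 'outcome': 'unknown', 'score': None}
```

The code is trivial; the two years went into building and agreeing on the lookup tables, and into checking the records they could not resolve.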
When completed, the database had 11,000 experiments, 88 unique compounds, and 42 solvent systems. 58% of these experiments were scored as “successful”, 19% had “mixed success”, 6.5% were “unknown”, and 16.5% were “unsuccessful”. Having negative results is important for machine learning. For most labs, getting data from failed experiments is a challenge. Collecting all experimental results in this manner provides an important pool of this type of data.
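A quick back-of-the-envelope check of that split, in Python, shows how the four categories translate into experiment counts. The counts are approximate, since I am assuming the 11,000 total is a rounded figure.

```python
total_experiments = 11_000  # assumed to be a rounded total
shares_percent = {
    "successful": 58.0,
    "mixed success": 19.0,
    "unknown": 6.5,
    "unsuccessful": 16.5,
}

# Approximate number of experiments in each category.
approx_counts = {
    label: round(total_experiments * pct / 100)
    for label, pct in shares_percent.items()
}

# How many successful experiments there are for each unsuccessful one.
ratio = approx_counts["successful"] / approx_counts["unsuccessful"]

print(sum(shares_percent.values()))  # 100.0
print(approx_counts)
print(round(ratio, 1))               # 3.5
```

So the pool of clearly negative results is on the order of 1,800 experiments, roughly one for every three and a half successes.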
The ML experiment was to predict the impact of the different reactants, conditions and procedural steps on experimental success. They tried three machine learning models: Random Forest, Support Vector Machine, and an Artificial Neural Network.
Their results are beyond the scope here. They gained some general insights from this project, and they felt it is worthwhile to continue building on this database with these experiments.
Their scientists are now recording experiments and results differently. In order to allow future data to be used by ML tools, all of the group’s experiments are recorded as structured data to allow auditing and proper classification and scoring.
This project shows how much time and effort were required. In this case, the team was motivated to continue with their new record entry process because they appreciated the value of producing FAIR data.
It’s not what you know, it’s how you think*
The adoption of machine learning in the research lab is but one facet of a broader expansion of research into more data and data analytics. This is the key message of this and the previous four posts (and the next post).
Big data and neural machines will indeed alter the workflow and the skill sets required of scientists, but AI and ML will not displace science jobs nor threaten the practice of science. The history of technology shows us that it changes the nature of work, and ML is another example.
The two case studies above show that what matters more is the culture of the research lab, and hence the mindsets of the scientists.
Historians Will and Ariel Durant spent two lifetimes studying the lessons of civilizations. Philosopher Zat Rana notes that the essence of what the Durants learned can be summarized in one sentence from their summary work:
“The only real revolution is in the enlightenment of the mind and the improvement of character…”
When we are confronted with a world that is changing, Zat provides an eloquent explanation about the power of the mind to adapt:
Despite all that has and continues to change in our external environment, the real battle is still internal. Real change doesn’t happen until we face our minds and our thoughts.
[This] ties into larger questions of what progress is and how subjects relate to objects. Our thoughts — and their ability to change our minds — play a pivotal role in our experience of reality. How we think affects everything from our ability to solve problems to how we understand meaning, value, and purpose.
When it comes to the human mind, there are still no concrete theories of how thought emerges. At its core, a thinking pattern is an implicit rule of thumb for the way we connect aspects of our reality. Given the complexity of this reality, the more diverse our trained thinking patterns are, the more accurately we will be able to interact with information around us.
How we think about what is happening around us is arguably more important than what is actually happening around us.
There are also no concrete theories about how artificial neural networks work. At their core, their nodes connect with other nodes. Yet the accepted future is one where AI will penetrate into all aspects of society. Will it overwhelm our ability to control our own lives? At the level of basic needs, will it take away jobs from anyone who can’t keep up (which can be any of us)?
I will say no, because human minds have the ability to be enlightened. Humans have a soul, a spirit capable of compassion and sacrifice and endurance, which can provide creativity beyond any neural machine. This is how the endeavour of humans will always be more compelling than the endeavour of machines.
Think about it differently. A scientist experimenting and using a neural machine is doing work, using it to explore. At a fundamental level, these neural machines are additional tools to seek solutions to problems to make the world and people better.