Research labs are handling increasing volumes of data coming from new instruments that continue to get more sophisticated and powerful.

This means that the data infrastructure of a lab will be increasingly critical to its operation.

As labs develop their data infrastructure, the possibilities become more profound. This post is about how that infrastructure will enable innovation in how labs communicate about science and technology, an effect that can cross over into social significance.

To understand how this future unfolds, and to provide insight into where this future innovation will come from, let’s look at the evolution of data infrastructure in the lab.

This evolution proceeds along two different paths: the private sector and the public sector.

The private sector is important because it leads in infrastructure adoption, drives the establishment of best practices, and helps bring costs down. However, the private sector’s data is confidential and will remain largely inaccessible.

The social impact will come from how the public sector eventually uses its data infrastructure in different ways.

The outline of this post is as follows:

  1. I will describe what this data infrastructure is by showing how it is shaping up in the private sector.
  2. Then I will describe the progress in the public sector, which on average is running a few years behind what the private sector has so far implemented.
  3. With this background as context, I will describe how the application of the infrastructure can diverge between these two sectors. This is where the profound possibilities will present themselves.

Electronic Lab Notebooks are fundamentally about productivity – of the scientist and the team

Our starting point is the electronic lab notebook (or ELN). The ELN was introduced twenty years ago. It started to gain commercial attention about ten years ago. The use case was—and still is—about productivity.

  • First is legibility. Many scientists struggle to interpret their own scribblings in a paper notebook. Looking up someone else’s notes in their paper notebook is often impossible.
  • Second, a big issue with paper notebooks is that everyone uses their own personal notations, shorthand abbreviations, and labeling schemes.
  • Third, as the volume and the complexity of experimental information being generated increases, it becomes impractical to paste all of your spectra, printouts, images, and other information formats into a paper notebook.

Derek Lowe, a popular pharmaceutical industry blogger, noted in this 2005 post his own improvement in productivity in going from paper notebooks to an ELN:

I’d go the post-it-note route… to remind me when I got around to writing it up… It got to the point at my former job that I once took a lab notebook home over Christmas… a mighty violation of good sense and legal protocol. But there I was, back at my parent’s kitchen table, writing up [experiments], one after the other. The electronic notebook has kept me honest. I type much faster than I can write by hand, so I actually put a few more details into my experimental procedures. All the analytical data is tied to the procedure, so I can’t manage to lose that, either. The structures are all done with chemical drawing software, which means that you cut, copy, and paste when you’re working in a related series of compounds.

Three years later, Derek expressed even stronger support for the productivity use case in this 2008 post:

It’s clear to me, that software lab notebooks are the only way to go. Attaching all the data files from LC/MS and NMR, the ease of retrieval for patent filing purposes, the ability to search structures across a whole organization’s experience – there’s no substitute. Going back to paper would be agonizing.

Note the enhancement in productivity over this period: he became able to search for information generated by others across the company. As ELNs gained functionality, they extended the productivity of individuals to the productivity of teams.

These ELNs allowed scientists to look up information faster, and they allowed teams to collaborate more efficiently by sharing data. Even in my own experience from this era, sharing assay results among three R&D sites, 3000 km (1800 miles) and 6100 km (3800 miles) apart and spanning two and six time zones respectively, vastly accelerated the drug discovery effort.

This type of productivity improvement is important to companies that have large cross-functional research teams developing highly valuable products.

Hence, as Derek’s 2008 post notes, ELNs from this period were custom-built by large companies. For these companies, the increased productivity of their projects and teams justified the expense of building these tools in-house, as no better ELNs existed commercially.

Today, the ELN market is oversaturated with commercial providers. A 2017 study found 72 available products. Here is the study, which also provides the business case for investing in ELNs and how to select an appropriate one: J Cheminform 9, 31 (2017) doi:10.1186/s13321-017-0221-3.

Data infrastructure extends beyond the Electronic Lab Notebook

This proliferation of ELN products is not a bad thing. It provides choices to suit the particular needs of different research environments.

Furthermore, as ELNs and other software tools become available for license, corporate IT departments no longer need to custom-build each part of the infrastructure. They can continue assembling the data infrastructure on the front end and the back end of the ELN.

At the front end of ELNs, scientists need computational tools to analyze their data and present their integrated results in charts and other visual representations that are easy to grasp.

  • The most important of these computational tools are statistical tools and modelling and simulation tools. They range from the simple software packages that come with lab instruments, to standalone statistical software, to a long list of standalone specialty packages that cater to different sub-disciplines. In chemistry, some examples are ACD Labs, Materials Studio, and the Cambridge Structural Database.
  • Another important set of these computational tools are called visualization tools. Around 2015, the largest pharmaceutical companies started embedding these visualization tools into the scientists’ digital toolkits. For example, TIBCO Spotfire and Tableau are two giants in the business enterprise market that offer visualization tools for this purpose. There are also simpler and cheaper alternatives. (A minimal sketch of this kind of charting follows below.)
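
To make the front end concrete, here is a minimal sketch of what this kind of charting can look like, written in Python with pandas and matplotlib rather than a commercial tool like Spotfire or Tableau. The exported file name and column names are assumptions for illustration only, not the output of any particular ELN.

```python
# Minimal front-end charting sketch (hypothetical file and column names).
import pandas as pd
import matplotlib.pyplot as plt

# Assume the back end exports assay results as a tidy table.
results = pd.read_csv("assay_results.csv")  # columns: compound, site, potency_nM

# Summarize median potency per compound across the R&D sites.
summary = (results
           .groupby(["compound", "site"])["potency_nM"]
           .median()
           .unstack("site"))

# One chart that any team member, at any site, can read at a glance.
summary.plot(kind="bar", logy=True)
plt.ylabel("Median potency (nM, log scale)")
plt.title("Assay potency by compound and site")
plt.tight_layout()
plt.savefig("potency_by_site.png")
```

In practice a commercial visualization tool would sit on a live connection to the back end rather than a CSV export, but the workflow of load, summarize, and chart is the same.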

Just as workers needed to learn Excel and PowerPoint decades ago, workers need to learn these new tools today. I was at a scientific symposium three months ago where a senior scientific manager from a global pharmaceutical company mentioned that it is important for them to train their scientists in these tools, and that younger scientists coming from certain schools already arrive with this experience.

The back end of the ELN deals with all the different data formats generated by the many instruments made by hundreds of manufacturers. Attaching files to ELNs was just the start. There needs to be a way to read these files, manipulate the data, and integrate different data types into the eventual front-end presentation. This is a big, industry-wide task that will take years, if not decades, to accomplish.
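
As a rough illustration of what this back-end work involves, here is a minimal sketch, assuming two hypothetical vendors that export the same kind of measurement in different layouts. A thin translation layer maps each export into one common record structure. Real instrument formats are usually binary and far messier; the file layouts, field names, and parser functions below are invented for illustration only.

```python
# Hypothetical back-end normalization: map two vendor-specific export
# layouts into one common record structure.
import csv
import json
from typing import Dict, List

def parse_vendor_a(path: str) -> List[Dict]:
    # Vendor A exports CSV files with headers "sample,absorbance".
    with open(path, newline="") as f:
        return [{"sample_id": row["sample"],
                 "measurement": float(row["absorbance"]),
                 "units": "AU",
                 "source": "vendor_a"}
                for row in csv.DictReader(f)]

def parse_vendor_b(path: str) -> List[Dict]:
    # Vendor B exports JSON with a nested "readings" list.
    with open(path) as f:
        doc = json.load(f)
    return [{"sample_id": r["id"],
             "measurement": float(r["value"]),
             "units": doc.get("units", "AU"),
             "source": "vendor_b"}
            for r in doc["readings"]]

# A registry lets new instrument formats be added without touching the ELN.
PARSERS = {"vendor_a": parse_vendor_a, "vendor_b": parse_vendor_b}

def load(fmt: str, path: str) -> List[Dict]:
    return PARSERS[fmt](path)
```

The design point is the registry: supporting a new instrument means adding one parser, without changing the ELN or the front end.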

There are different strategies on how to do this.

  • At one extreme, some companies try to build this in-house, which is possible if the scope is very narrow.
  • Another approach is the consortium: the Allotrope Foundation is an international consortium of scientific research-intensive industries that is developing a new data architecture for acquiring, exchanging, and managing laboratory data. This consortium believes that a new data architecture is required to enable the intelligent analytical laboratory of the future, where different data, methods, and hardware can be shared. The first initiative of Allotrope is to create the Allotrope Data Format, the “.adf” file format, for storing data from any vendor, instrument, and platform. (A rough sketch of this single-container idea follows below.)
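
As a rough sketch of that single-container idea, the snippet below packages raw readings and descriptive metadata into one HDF5 file using h5py. To be clear, this is not the Allotrope .adf format or its API; the group names, attributes, and values are assumptions chosen only to illustrate the principle of one vendor-neutral container per experiment.

```python
# Sketch of the "one container, any instrument" idea using plain HDF5.
# NOT the Allotrope .adf format; illustrative structure and values only.
import h5py
import numpy as np

wavelengths = np.linspace(200, 800, 601)   # nm, illustrative spectrum axis
absorbance = np.random.rand(601)           # placeholder readings

with h5py.File("uv_vis_run_001.h5", "w") as container:
    data = container.create_group("data")
    data.create_dataset("wavelength_nm", data=wavelengths)
    data.create_dataset("absorbance_au", data=absorbance)

    meta = container.create_group("metadata")
    meta.attrs["technique"] = "UV-Vis spectroscopy"
    meta.attrs["instrument_vendor"] = "any vendor"
    meta.attrs["operator"] = "analyst_01"
    meta.attrs["timestamp"] = "2020-02-04T10:30:00Z"
```

Any downstream tool that understands the container layout can then read the data without knowing which instrument produced it.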

This slide shows the list of current Allotrope consortium members. It includes many of the major scientific instrument vendors, and industry representation is dominated by pharmaceutical companies. The pharma industry engages the largest breadth of science sub-disciplines, and hence has a high need for data interoperability.

This back end of the data infrastructure can occupy someone for an entire career in IT (and indeed some scientists switch to an IT career path after graduation).

The key takeaway of this section is to understand this data infrastructure with a simple framework:

  1. The Electronic Lab Notebook is the starting point, because it is the primary recording tool of the scientist who is doing the experiments and generating the data.
  2. There is a back-end of integrating all the different data types, which is the domain of IT experts.
  3. There is the front end, which comprises the computational tools scientists use to analyze and present their work.

In business strategy, there is the concept of the value chain. A product or set of services is transformed by different businesses along this value chain as it is distributed from the starting point to the final customer. The most profitable businesses operate at the most valuable points along this value chain.

Over the last two decades, the most valuable point in the value chain of most industries has been migrating towards the final customer. Even in B2B businesses, a product or service strategy that focuses on the final customer has been found to capture more value. The reason is technology: point-of-service systems, e-commerce, and social media all allow better targeting and improved servicing of the final customer, maximizing revenue and profitability.

Let’s apply this same perspective to how the work of science is delivered. When we look at the value chain of the data infrastructure of the lab—back end, ELN, front end—I would say that the most valuable part is the front end.

Statistics, modelling and simulation, and visualization are becoming the most important data tools.

Before I describe why and how they will be even more powerful, we need to see the state of data infrastructure in public sector research.

Implementation of ELNs in the public sector has started in the larger institutions

In 2005, Nature magazine described electronic lab notebooks from the public sector perspective: Nature 436, 20–21 (2005).

Academia has been slower [than the private sector] to respond. Apart from a handful of big international collaborations that have database managers, such as CERN’s Large Hadron Collider, most academic research is investigator-driven and small-scale. Academics simply don’t have funds for software that needs customizing to their situation, and there are almost as many different situations as there are labs… experimental protocols are continually changing.

By 2012, Nature magazine [Nature 481, 430–431 (2012)] showed more developed ELN usage among early adopters:

[A scientist at Stanford University] uses her iPad to jot down notes, check protocols and monitor the progress of her experiments, declaring that “paper has nothing to offer me.” It’s a refrain heard in more and more labs. Some groups have ditched notebooks in favour of software from Google, such as free-to-use tools for sharing documents, spreadsheets and calendars. Others are finding that software designed specifically for lab workers has evolved to the point where it can reliably do a range of tasks, from tracking reagent supplies to sharing protocols. At a synthetic-biology lab at Harvard Medical School, researchers store records on a password-protected network of interlinked pages that any group member can edit using MediaWiki, the free-to-use software that powers Wikipedia. A page for an ongoing experiment, for example, can easily be updated to note changes in a protocol or to include a graph of the latest results. Team members log on to the site through laptops or their iPads. The era of the paperless lab, decades in the making, seems finally to have arrived.

By 2018, Nature magazine was encouraging adoption of ELNs with an article on “How to pick an electronic laboratory notebook”: Nature 560, 269–270 (2018), citing the drivers of ELN adoption:

  • One reason is demographics: “today’s early-career researchers, who have grown up with digital technology, tend to expect — and to embrace — electronic solutions.”
  • Another is the explosion of data, as explained at the beginning of this post.
  • Most significantly, apart from productivity, the incentive for adoption is now also being driven by funding agencies that require better data management. This is excellent public policy, driving modernization and accountability: “Concerns over reproducibility, as well as more stringent requirements on data management from funding agencies, have motivated improvements in the documentation of lab work.”

The state of data infrastructure in public sector research today is that ELNs are still in early adoption. These early adopters are predominantly groups at top-tier research institutions, primarily in the life sciences. For example: Stanford, Harvard Medical School, the Karolinska Institute, Cambridge University, and the Helmholtz Association of German Research Centres.

As recently as 2019, Harvard Medical School published best practices on ELNs (here), and Rockefeller University shared their experience in transitioning to ELNs (here).

ELNs are only now spreading across the public sector. The front end has yet to take shape. This is an opportune time to strategically consider and plan for future innovations. Imagine what is possible!

Storytelling and engagement will always be the most powerful influencers of social behaviour

The earliest adopters of ELNs were not in fact companies, but that rarefied segment of public sector research that focused on Big Science, for example:

  • particle accelerators seeking the most elementary of particles in our universe and the origins of the universe itself,
  • large array telescopes probing deep into the universe and into the beginning of time itself,
  • the Human Genome Project, a 13-year effort to map all the genes of the human genome,
  • sending humans, robots and space probes into our solar system to be touched by the mysteries and miracles beyond our world.

These generate petabytes of data and involve thousands of scientists, often across multiple countries. They could not operate without ELNs.

Yet they also accomplish something more: we have all heard about these projects, and some of what we have heard is emotionally captivating. This is because these projects have been able to communicate their results to audiences beyond just scientists and their funders.

This communication has always been important. In prior eras, scientists received their funding by communicating their results to patrons verbally, in writing, and by demonstration. The Big Science organizations of recent generations have added press releases, photos, and videos to their communication.

As data representation, simulation, and visualization become more powerful with new computational tools, it becomes possible to reach new audiences, more influential audiences, and more meaningful audiences.

This ability to communicate to wider audiences is the power of the front end of data infrastructure.

It is not just about communicating. The past was about one-way communication to passive audiences. Technology has allowed communication to become interactive, and this engagement is what leads to action and impact.

Compelling storytelling and meaningful engagement can drive social innovation

Nature magazine noted that these technologies “herald much deeper changes to the way science is practiced. For example, once scientists start making their ELNs available online — if only to remote collaborators — there will be a revolution in data sharing. As one scientist proclaims: rather than spending time collecting their own data, scientists will organize themselves around shared data sets, much as astronomers do.”

The next steps will be more profound.

Imagine a high school student, or any student isolated from meaningful opportunities, keen to learn about science but without access to instruments and labs. This student will have access to a laptop and cloud software to analyze data and to conduct modelling and simulation with data. This student will be able to visualize data, represent it in creative ways, and verify the reproducibility of some analyses.

This data will come from modern labs with the right data architecture for sharing it. This is where the democratization of science means that even the smallest labs, doing experiments in solubility, cell cultures, polymer blends, bear ecology, or population health, are all relevant to making a difference in accessible education and engagement.
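
As a small, hypothetical example of what that access could look like: a student with nothing but a laptop downloads a dataset that a lab has published openly and reproduces a simple analysis. The URL, file, and column names below are placeholders, not a real repository.

```python
# Hypothetical example: a student explores a dataset a lab has shared openly.
# The URL and column names are placeholders, not a real repository.
import pandas as pd
import matplotlib.pyplot as plt

DATA_URL = "https://example.org/openlab/solubility_measurements.csv"

# columns assumed: compound, temperature_C, solubility_g_per_L
df = pd.read_csv(DATA_URL)

# Reproduce a simple analysis: solubility versus temperature for one compound.
subset = df[df["compound"] == "compound_A"].sort_values("temperature_C")
plt.plot(subset["temperature_C"], subset["solubility_g_per_L"], marker="o")
plt.xlabel("Temperature (°C)")
plt.ylabel("Solubility (g/L)")
plt.title("Solubility of compound_A (data shared by the lab)")
plt.savefig("solubility_compound_A.png")
```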

In studying natural ecosystems, biology labs can enable citizen scientists, and so on…

Here is an example of what I mean: the number of blogs from research labs and graduate students has been rising slowly but consistently. That is one-way communication. Their software development work, and even some of their data, is being posted onto GitHub. Now, this is starting to become two-way communication. The engagement on GitHub so far is of a specific, limited type, because GitHub was not designed with broader engagement in mind. This is just the beginning of more compelling platforms to come. (If you are interested in developing something like this and want help, contact me.)

I have posted on this blog about populism as a backlash to elitism, and as a reminder of the commitment that institutions need to have to their local community. This is why building data architecture is not just the domain of top-tier research institutions. It also applies to local community colleges and state colleges. As anchor institutions, they can do research that is meaningful to their communities, and the data architecture will allow two-way engagement.

We have seen social media–most notably Facebook, Twitter and Instagram–transform society. They have revolutionized consumer behavior, and they have amplified our more instinctive behaviors.

Those platforms are the early technology cycles of social media. There will be more technology cycles to come, branching into other human endeavors. I believe that all of us aspire to more than just accumulating likes and followers on a social media platform. We want to engage with platforms that elevate us to do better things.

Scientific social platforms, enabled by the data infrastructure of labs, will be an important set of these future platforms for social innovation.


Update on 4 February 2020:

Here is a pertinent example of what I mean about the importance of visualization. Fast Company magazine published this article about “How design can stop the spread of the Wuhan coronavirus”.

The point is that good visualization of data can educate, inspire, and call people to action.

A classic on data visualization is “The Visual Display of Quantitative Information” by Edward Tufte, who is, not surprisingly, a statistician by training. Published in 1983, the book describes design principles that remain fundamental today.

Another classic is “How to Lie with Maps” by Mark Monmonier. First published in 1991, it is now in its third edition, published in 2018. It also points out fundamental and important design principles.

When it comes to engaging large audiences about scientific, technical, or health information, I believe new visualization platforms will have breakthrough potential.

As always, if you want to collaborate on this, please do get in touch with me.
