GxP Lifeline

GxP Approaches for Data Infrastructure and AI



Modern data infrastructure and artificial intelligence (AI) platforms are bringing the future forward faster than ever. But without careful thought and planning around deployment, regulated industries can be left behind. Particularly in regimes like good manufacturing practices (GMP), the inherent lack of explainability and the black-box nature of most AI infrastructure inhibit deployment in ways that actually touch regulated workflows, and limit its usability as “good practice” (GxP) software. At best, models are relegated to after-the-fact optimizations as “digital twins” that can’t affect production directly; at worst, models can’t be deployed at all.

Ganymede has seen a wide range of approaches to remedy this in the life sciences and manufacturing, and has its own paradigm as well. In this blog post, we’re excited to share some challenges, opportunities, and best practices for building a robust data platform that enables AI in a way that supports GxP compliance. As always, it comes down to data availability and infrastructure: ensuring that your models are reliable, secure, and easily accessible for regulatory scrutiny.

The Necessity of GxP-Ready Data Infrastructure

GxP encompasses a series of regulations and guidelines designed to ensure that products are safe and effective. Compliance is critical not just for passing audits, but for the very integrity of the data used to make pivotal business decisions. And while core systems of record like a laboratory information management system (LIMS) or modern manufacturing execution system (MES) like MasterControl are naturally geared toward GxP compliance, so much of the world of data infrastructure isn’t. This leads to several key challenges with leveraging data science and AI:

1. Data Integrity Risks

Inadequate data management can lead to integrity issues, including inaccurate data capture, manipulation, or loss. Much data infrastructure and biotech automation, especially for high-scale systems where storing large data is expensive, treats data as an ephemeral commodity for processing in a model. In a regulated space, traceability to data inputs is just as important as the outputs themselves.

2. Lack of Explainability

Data science models and AI also suffer from being difficult to explain. While they can make powerful predictions, the lack of clarity on how or why they came to those answers makes them difficult to rely on even if answers are correct.

3. Hallucination

Perhaps even more critically, explainability problems correlate with the risk of “hallucination” where models accidentally share inaccurate results. After all, the models are trained on patterns and can misinterpret them; they aren’t actually running logic or deduction the same way as humans do, and so can be prone to unexpected errors, making them difficult to use as GxP software.

4. Black Box Infrastructure

Many AI approaches leverage serverless workflows or software as a service (SaaS). While these introduce scalability and ease of implementation, they further cloud the validation and qualification story: it becomes harder to understand how and where the models are running, and on what infrastructure, and therefore harder to have certainty about how reliable the platform will be. Yet this is solvable, and most LIMS or MES for the life sciences have made great strides here, as discussed below.

A Modern Approach for AI in GxP: Data Infra-as-Code and Data Interfaces

The solutions we’ve seen to these GxP compliance problems in AI over time revolve around two approaches:

  • Isolating model behavior to traceable inputs and traceable, explainable outputs through creating well-defined data “interfaces” that capture not just data for retention, but also data in a schematized structure.
  • Deploying models in a purely infrastructure-as-code driven environment, which is a far higher bar for infra sophistication.

Whether you build all of this yourself or use a platform like Ganymede, we take a deeper dive into both approaches below.

1. Data Interfaces: Outputs and Explainability

“Data Interfaces” are our term for developing traceable inputs and outputs to models that are:

  • Schematized to allow for parsing and machine-readable interpretation outside the one machine learning model that consumes or emits them, allowing for meta-analysis and portability.
  • Captured and versioned immutably, linked with the particular model run, for traceability.
  • Crucially, explainable through being structured in a format that offers the necessary level of interpretation when compared to existing manual processes.
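As a rough illustration of the first two properties, the sketch below validates rows against a declared schema and then captures them as an immutable, content-hashed record linked to a model run. It is a minimal sketch under stated assumptions, not Ganymede’s implementation: the field names, `InterfaceRecord`, and `capture` are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import List

# Hypothetical schema for one output row; the field names are illustrative only.
REQUIRED_FIELDS = {"sample_id": str, "retention_time_min": float, "area": float}


def validate(row: dict) -> None:
    """Raise ValueError if the row does not match the declared schema."""
    if set(row) != set(REQUIRED_FIELDS):
        mismatch = sorted(set(row) ^ set(REQUIRED_FIELDS))
        raise ValueError(f"missing or unexpected fields: {mismatch}")
    for name, expected in REQUIRED_FIELDS.items():
        if not isinstance(row[name], expected):
            raise ValueError(f"{name} must be of type {expected.__name__}")


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after capture
class InterfaceRecord:
    model_run_id: str
    schema_version: str
    payload_json: str  # canonical JSON, readable outside the model
    sha256: str        # content hash, usable as an immutable version ID


def capture(rows: List[dict], model_run_id: str, schema_version: str = "1") -> InterfaceRecord:
    """Validate rows, serialize them canonically, and link them to a model run."""
    for row in rows:
        validate(row)
    # sort_keys gives the same bytes (and hash) regardless of key order.
    payload = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return InterfaceRecord(model_run_id, schema_version, payload, digest)
```

Because the hash is computed over a canonical serialization, two captures of the same data yield the same version identifier, which makes later comparison and audit straightforward.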

Digging into explainability further: GxP validation of AI/machine learning (ML) models and broader biotech/life science automation is an extreme frontier of modern computer systems assurance (CSA) approaches. In the few cases where we’ve seen it succeed, we find that the best thing for a machine learning model to predict is not the final outcome, but instead the method by which a human would deterministically arrive at that outcome.

For instance, in our recent webinar, we discussed this approach for chromatography, one of the most common workflows in the life sciences. Humans traditionally apply analytical “methods” (essentially an algorithm) to find peaks in a chromatogram. For an AI model, rather than directly predicting the peaks, you might want to predict the method (algorithm) itself, since that then offers a smooth, explainable, and validatable approach to how someone might get the peaks. If that method is then applied automatically to find peaks, you can compare to the human process, allowing for explainability and auditing of the approach. Ganymede solves this by designing models with direct integration to a system of record or data format for methods/algorithms that people are already running today.
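To make the idea concrete, here is a minimal sketch in which the “method” is a small set of parameters and peak finding itself is a plain deterministic function. The `PeakMethod` fields and the simple local-maximum algorithm are assumptions chosen for illustration; real chromatography integration methods are considerably richer, and the predicted method is hard-coded here where a model’s output would go.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PeakMethod:
    """Hypothetical analytical method: the parameters an analyst would set."""
    min_height: float  # ignore local maxima below this signal level
    min_spacing: int   # minimum number of points between reported peaks


def find_peaks(signal: List[float], method: PeakMethod) -> List[int]:
    """Deterministically return indices of local maxima that satisfy the method."""
    peaks: List[int] = []
    for i in range(1, len(signal) - 1):
        is_local_max = signal[i] > signal[i - 1] and signal[i] >= signal[i + 1]
        if not is_local_max or signal[i] < method.min_height:
            continue
        if peaks and i - peaks[-1] < method.min_spacing:
            # Too close to the previous peak: keep whichever is taller.
            if signal[i] > signal[peaks[-1]]:
                peaks[-1] = i
            continue
        peaks.append(i)
    return peaks


def methods_agree(signal: List[float], predicted: PeakMethod, human: PeakMethod) -> bool:
    """Audit check: does the predicted method find the same peaks as the human one?"""
    return find_peaks(signal, predicted) == find_peaks(signal, human)


chromatogram = [0.0, 1.0, 5.0, 1.0, 0.0, 2.0, 9.0, 2.0, 0.0, 0.5, 0.0]
human_method = PeakMethod(min_height=1.5, min_spacing=2)
predicted_method = PeakMethod(min_height=1.0, min_spacing=2)  # stand-in for a model output
print(find_peaks(chromatogram, predicted_method))  # [2, 6]
print(methods_agree(chromatogram, predicted_method, human_method))  # True
```

The model’s output is a small, reviewable object rather than an opaque list of peaks, and everything downstream of it can be validated the same way as the existing manual process.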

Schematization and versioning of data are critical parts of the data pipeline as well. These approaches are better known and more widely available in broader data infrastructure outside the GxP compliance space, though they are usually reserved for higher-scale, more built-out implementations of AI models. Ganymede offers this out of the box.

2. Data Interfaces: Inputs and System of Record Integration (LIMS/MES)

Similarly, these concepts apply to the input data. While all input data should be versioned and schematized as part of any GxP software approach, it also needs to be associated further upstream with the system of record that kicked off the process and defined any process variables. This is where many data infrastructure and modeling deployments fall down: the pipeline that runs the model is not actually linked into the sample lineage or batch record in the LIMS/MES.

In this sense, your LIMS and MES for life sciences provide the data backbone that covers the entire process, so tying in the model run, along with its inputs and outputs, is critical. A robust, automated integration between the model and the system of record is key. It is best facilitated by triggering the pipeline from the LIMS or MES where possible, via an application programming interface (API) call or event when a batch record is created, or from the instrument/asset when an output dataset is created, as long as that output dataset carries the batch or sample ID the data should be tied to. In turn, every datapoint going in or out of the ML model should be associated with that batch or sample ID in the data infrastructure layer.

Ganymede’s approach is to isolate all input and output datasets as flat, schematized files for the model, and then to require that every file is tagged with at least a “unit of account” such as a sample ID or batch ID in Ganymede’s tagging framework. In turn, every model run is captured as a pipeline run, and files are associated with it as inputs and outputs, allowing this context to be threaded through for maximized GxP compliance.
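The tagging idea can be sketched generically as follows. This is not Ganymede’s tagging framework or API; `TaggedFile`, `tag_file`, `PipelineRun`, and the tag names `batch_id` and `sample_id` are hypothetical names used only to show how a required unit of account lets lineage be queried per batch or sample.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed tag keys that count as a "unit of account".
UNIT_OF_ACCOUNT_KEYS = {"batch_id", "sample_id"}


@dataclass(frozen=True)
class TaggedFile:
    path: str
    tags: Tuple[Tuple[str, str], ...]  # e.g. (("batch_id", "B-001"),)


def tag_file(path: str, **tags: str) -> TaggedFile:
    """Refuse to register a file that has no batch or sample ID attached."""
    if not UNIT_OF_ACCOUNT_KEYS & set(tags):
        raise ValueError(f"{path}: needs one of {sorted(UNIT_OF_ACCOUNT_KEYS)}")
    return TaggedFile(path, tuple(sorted(tags.items())))


@dataclass
class PipelineRun:
    run_id: str
    inputs: List[TaggedFile] = field(default_factory=list)
    outputs: List[TaggedFile] = field(default_factory=list)

    def lineage(self, key: str, value: str) -> Dict[str, List[str]]:
        """Return the input and output paths tied to one batch or sample ID."""
        def match(files: List[TaggedFile]) -> List[str]:
            return [f.path for f in files if (key, value) in f.tags]
        return {"inputs": match(self.inputs), "outputs": match(self.outputs)}


run = PipelineRun(run_id="run-42")
run.inputs.append(tag_file("raw/chromatogram.csv", batch_id="B-001"))
run.outputs.append(tag_file("results/peaks.csv", batch_id="B-001", sample_id="S-7"))
print(run.lineage("batch_id", "B-001"))
```

Rejecting untagged files at registration time, rather than reconciling IDs afterward, is the design choice that keeps every model run traceable back to the batch record.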

3. Infrastructure-as-Code and Infrastructure Versioning

Infrastructure-as-Code (IaC) frameworks like Terraform have become the modern standard for deploying applications in a robust and traceable way. Rather than manually configuring and setting up infrastructure, IaC allows for significant automation and versioning of the infrastructure itself.

While in the world of software development IaC is usually used to enable automation, it also provides auditability that can be very powerful for validation. Certainly in the security world, it’s quickly become a standard for platform security as well. We strongly recommend having some sort of code-based definition of your data infrastructure that allows for fully tracing exactly what infrastructure and configuration was used, since it will significantly improve auditability.

In the Ganymede platform, our native use of IaC lets you know not just what data was input to or output from an AI model run, but also what infrastructure version and configuration the model ran in. Indeed, the version of the platform infrastructure is directly available in the web application user interface (UI), alongside the versions of the model.
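One generic way to get this kind of traceability, sketched here as an assumption rather than a description of how Ganymede does it, is to fingerprint the IaC definition files and store that fingerprint on every model run record. The function names, record fields, and file contents below are hypothetical.

```python
import hashlib
from typing import Dict


def infra_fingerprint(iac_files: Dict[str, str]) -> str:
    """Hash IaC file names and contents in a stable order."""
    h = hashlib.sha256()
    for name in sorted(iac_files):
        # A separator byte keeps name/content boundaries unambiguous.
        h.update(name.encode("utf-8") + b"\0")
        h.update(iac_files[name].encode("utf-8") + b"\0")
    return h.hexdigest()


def run_record(run_id: str, model_version: str, iac_files: Dict[str, str]) -> dict:
    """Attach the infrastructure version to the model run for later audit."""
    return {
        "run_id": run_id,
        "model_version": model_version,
        "infra_version": infra_fingerprint(iac_files),
    }


# Illustrative file contents only; in practice these would be read from the IaC repository.
files = {"main.tf": 'resource "example" "bucket" {}', "variables.tf": 'variable "region" {}'}
print(run_record("run-42", "1.3.0", files))
```

In practice a version-control commit ID of the IaC repository can serve the same purpose; the point is that the run record names the exact infrastructure definition it ran under.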

Conclusion and Value

While everything we’ve covered here is abstract, at the end of the day, it all comes down to enabling the incredibly powerful impacts that AI can have on workflows in terms of automating scientists’ work and deriving scientific insights. The industry is quickly dividing into “haves” and “have-nots” in terms of the AI readiness that can save scientists a significant fraction of their time. And as the cost of drug development increases further and further, these time-savings and prediction-improvement optimizations are some of the few ways we can actually decrease costs.

At the same time, perfecting your data infrastructure for AI readiness has value beyond just running the AI applications. By investing in these more robust data infrastructure solutions, companies can mitigate the risks associated with noncompliance, improve data integrity, and streamline workflows. All of our recommendations above significantly improve data integrity and reduce risk in their own right. And as the industry continues to evolve, those who prioritize GxP compliance (in the context of AI or not) through innovative data solutions will be better positioned to succeed and get products to market faster and less expensively. GxP and time-savings practices converge through automation.

Finally, as you consider your approach to GxP compliance, partnering with providers like Ganymede can provide the infrastructure you need to navigate the complexities of lab data management efficiently and effectively, without having to figure out so much of this yourself. The infrastructure demands of validatable AI modeling are vastly greater than those of traditional data science approaches, and not what you want your engineers to be focusing on. You want your data scientists to focus on the business logic of the AI modeling and your organization to focus on what truly matters: delivering safe and effective products to market.

Contact Us, or Join Us at Masters Summit 2024!

If you’re interested to learn more about Ganymede or just get perspectives on data infrastructure and GxP in the life sciences and manufacturing, reach out to us at [email protected] or visit our website at www.ganymede.bio to request a demo.

We’ll also be at Masters Summit 2024! Reach out to us at [email protected] to schedule an in-person meeting or deep-dive consultation.

 

Nathan Clark
Nathan Clark is the founder and CEO of Ganymede, a modern data platform and cloud infrastructure designed for science. Prior to Ganymede, Nathan was a product manager at Benchling, where he led several data products, including the Insights BI tool and the Machine Learning team. He also has a strong background in machine learning and data systems across both financial and general technology sectors.
