The effective application of machine learning to antibody discovery depends more on well-structured, interconnected experimental data than on algorithmic sophistication. Many organizations have experimental data scattered across separate LIMS, ELN, and Excel files, which limits cross-program analysis and the development of ML models. Adimab’s Atlas platform is a database designed specifically to address this challenge. Atlas has been capturing antibody discovery data since 2009 and has accumulated information from 1,431 programs and about 2.85 million clones.
Approach and outcomes
- Atlas is organized into four integrated modules: a PostgreSQL-based Data Integration Layer that enforces naming conventions and entity relationships at creation time; an Assay Data Warehouse spanning 34+ assay types with standardized units and QC flags; a Natural Language Interface via the Model Context Protocol (MCP); and an ML Prediction Engine integrated with AWS SageMaker for sequence-to-property prediction.
- End-to-end clone lineage is bidirectional and tracked across six stages: Library Design, Selection Campaigns, Clone Identification, Sample Production, Characterization Assays, and Delivery. Users can trace forward from a library to all derived clones, or backward from a delivered clone to its source library. Clone origins include selection-acquired (73%), library-acquired (25%), and other sources (2%).
- The Characterization Data Warehouse contains 4.1 million measurements across 16 assay platforms, all linked to their source clone and project. This includes 11,000+ site-specific accelerated stress measurements across 700+ antibodies, enabling ML models for deamidation and isomerization liability prediction that improve prediction scores 2–2.5x over baseline.
- For HIC retention time, a key developability indicator, Atlas supplied the linked sequence and characterization data used to train a predictive model: 103,000 experimental measurements accumulated over 11 years. The model was deployed on AWS SageMaker in 2022. Predicted values are stored alongside experimental results, creating a closed feedback loop in which each validated prediction supports future model refinement. The model now generates approximately 104,000 predictions per year, roughly 10x the rate of new experimental measurements (about 11,000 per year), substantially expanding developability assessment throughput.
- The Atlas platform is accessible via MCP, which translates natural language queries into validated SQL and returns structured results. Researchers without database expertise can query 17 years of linked experimental data directly. For example, users can ask "What are the HIC retention times for clones from project X?" and receive tabulated results without writing SQL. All queries are read-only and audit-logged.
- Atlas captures data across 1,431 projects spanning Type I (52%), Type II (37%), Type III (10%), and other antibody formats, providing the cross-program dataset breadth needed to train models that generalize across target classes and therapeutic formats.
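
The bidirectional lineage tracing and creation-time relationship enforcement described above can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs without a server, and it models only three of the six stages. All table names, column names, and identifiers (library, selection, clone, LIB-A, CL-001, and so on) are hypothetical and chosen for illustration; the actual Atlas schema is not described here.

```python
import sqlite3

# Hypothetical, simplified schema: the real Atlas tables and columns are not
# described in this document. Only three of the six stages are modeled.
SCHEMA = """
CREATE TABLE library (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE selection (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL UNIQUE,
    library_id INTEGER NOT NULL REFERENCES library(id)
);
CREATE TABLE clone (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    selection_id INTEGER NOT NULL REFERENCES selection(id)
);
"""

# Shared join path from clone back to its source library.
LINEAGE_JOIN = """
FROM clone AS c
JOIN selection AS s ON c.selection_id = s.id
JOIN library   AS l ON s.library_id = l.id
"""


def build_demo_db():
    """Create an in-memory database with a few illustrative records."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is enabled.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO library (id, name) VALUES (?, ?)",
        [(1, "LIB-A"), (2, "LIB-B")],
    )
    conn.executemany(
        "INSERT INTO selection (id, name, library_id) VALUES (?, ?, ?)",
        [(1, "SEL-1", 1), (2, "SEL-2", 1), (3, "SEL-3", 2)],
    )
    conn.executemany(
        "INSERT INTO clone (id, name, selection_id) VALUES (?, ?, ?)",
        [(1, "CL-001", 1), (2, "CL-002", 2), (3, "CL-003", 3)],
    )
    conn.commit()
    return conn


def clones_from_library(conn, library_name):
    """Forward trace: every clone derived from the given library."""
    rows = conn.execute(
        "SELECT c.name " + LINEAGE_JOIN + " WHERE l.name = ? ORDER BY c.name",
        (library_name,),
    ).fetchall()
    return [row[0] for row in rows]


def library_for_clone(conn, clone_name):
    """Backward trace: the source library of a clone, or None if unknown."""
    row = conn.execute(
        "SELECT l.name " + LINEAGE_JOIN + " WHERE c.name = ?",
        (clone_name,),
    ).fetchone()
    return row[0] if row else None
```

Because each record must reference an existing parent when it is created, the same joins answer both the forward and the backward question. In this sketch, attempting to insert a clone whose selection_id does not exist raises sqlite3.IntegrityError, which is one simple way to keep orphaned records out of the lineage.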
Why it matters
Schema and workflow design choices made in 2010 enabled machine learning applications that were not deployed until 2022, illustrating that data architecture decisions precede and constrain ML capability. The Atlas platform demonstrates that a domain-specific relational model, used consistently as an operational system over many years, produces the structured, linked, and complete data that modern ML requires. Each new Adimab program can be immediately queried in the context of all prior data, meaning every antibody program benefits from the accumulated knowledge of all programs that came before it.