Prediction of Delayed Retention of Antibodies in Hydrophobic Interaction Chromatography from Sequence Using Machine Learning

August 16, 2017

Tushar Jain, Todd Boland, Asparouh Lilov, Irina Burnina, Michael Brown, Yingda Xu, Maximiliano Vásquez

Bioinformatics, 33(23), 3758–3766, DOI: 10.1093/bioinformatics/btx519

Adimab scientists present a computational method that predicts antibody hydrophobicity directly from the amino acid sequence, allowing developability to be assessed earlier in therapeutic antibody discovery. Hydrophobicity is a critical biophysical property that can influence the manufacturability, long-term storage, and potentially the formulation of an antibody. Highly hydrophobic antibodies, characterized by delayed retention times in hydrophobic interaction chromatography (HIC), are typically associated with higher aggregation, which can reduce potency, induce immunogenicity, and accelerate serum clearance, all of which are undesirable for therapeutic drug development.

The central hypothesis of this study is that an antibody's hydrophobic character can be predicted directly from sequence data without constructing a full 3D structural model. Rather than mapping solvent-exposed residues through structural analysis, the authors propose that machine learning can be used to extract sufficient information from the amino acid sequence to accurately infer this key biophysical property.

To validate this hypothesis, the study pursued two main objectives:

  1. Sequence-based SASA prediction: Develop a machine learning model that predicts the solvent-accessible surface area (SASA) of each amino acid in the antibody variable region directly from sequence data, eliminating the need for an existing 3D structure.
  2. Correlation with experimental HIC data: Combine the predicted SASA values with a large experimental dataset of HIC measurements to generate an amino acid–specific propensity scale that enables reliable classification of antibodies prone to delayed HIC retention.

Methodology and technical approach 

The authors developed a two-step computational strategy to evaluate antibody hydrophobicity directly from sequence data in the variable region. The initial model was trained on a curated set of 902 antibody crystal structures from the Protein Data Bank, selected for kappa light chains and distinct complementarity-determining region (CDR) sequences. 

First, a random forest model was trained to predict the degree to which each amino acid in an antibody’s variable region is exposed to solvent, known as the solvent-accessible surface area (SASA). This property indicates how available a residue is to participate in hydrophobic interactions. The model learned from sequence patterns paired with the known crystal structures, so that the surface exposure of a new antibody can be predicted from its sequence alone.
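To make the idea of predicting exposure from local sequence context concrete, here is a minimal sketch of how each residue could be turned into a feature vector from a window of its neighbors. The one-hot encoding, the pad symbol, the `half_width` value, and the example sequence fragment are all illustrative assumptions, not details taken from the paper. Vectors like these could then be passed to a random forest regressor (for example, scikit-learn's `RandomForestRegressor`) trained against exposure values computed from crystal structures.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # placeholder for window positions beyond the sequence ends
ALPHABET = AMINO_ACIDS + PAD


def one_hot(residue):
    """Return a one-hot list over the 20 amino acids plus the pad symbol."""
    if residue not in ALPHABET:
        raise ValueError(f"unexpected residue: {residue!r}")
    return [1 if residue == symbol else 0 for symbol in ALPHABET]


def window_features(sequence, half_width=2):
    """Build one feature vector per residue from the residue and its neighbors.

    half_width is a hypothetical setting chosen for illustration only.
    """
    padded = PAD * half_width + sequence + PAD * half_width
    features = []
    for i in range(len(sequence)):
        # Window centered on residue i, padded at the sequence ends.
        window = padded[i : i + 2 * half_width + 1]
        vector = []
        for residue in window:
            vector.extend(one_hot(residue))
        features.append(vector)
    return features


# Illustrative fragment; each residue gets (2 * half_width + 1) * 21 features.
features = window_features("EVQLVESGG", half_width=2)
```

Each row of `features` describes one residue together with its neighbors, which is one simple way to give a per-residue model access to sequence context.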

Next, experimental HIC retention time data from a large panel of antibodies were combined with the predicted SASA values, and logistic regression was used to derive amino acid propensities for delayed retention.
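The second step can be sketched as follows, under stated assumptions. Here each antibody is summarized by the total exposure of each amino acid type, and a logistic regression is fit so that its coefficients act as a propensity scale. The feature definition, the plain gradient-descent fit, the hyperparameters, and the toy sequences, exposures, and labels are all hypothetical and are not the authors' data or implementation.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def exposure_by_residue_type(sequence, sasa):
    """Sum per-residue exposure values for each of the 20 amino acid types."""
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    for residue, area in zip(sequence, sasa):
        totals[residue] += area
    return [totals[aa] for aa in AMINO_ACIDS]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, learning_rate=0.1, epochs=2000, l2=0.01):
    """Fit logistic regression by batch gradient descent with L2 regularization."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, row)))
            error = p - label
            for j, x in enumerate(row):
                grad_w[j] += error * x
            grad_b += error
        for j in range(n_features):
            weights[j] -= learning_rate * (grad_w[j] / len(X) + l2 * weights[j])
        bias -= learning_rate * grad_b / len(X)
    return weights, bias


# Toy data (made up): sequence, relative exposures, label (1 = delayed retention).
TOY_DATA = [
    ("WFLG", [0.9, 0.8, 0.7, 0.5], 1),
    ("FWIS", [0.8, 0.9, 0.6, 0.4], 1),
    ("DKEG", [0.9, 0.8, 0.7, 0.5], 0),
    ("KDES", [0.8, 0.9, 0.6, 0.4], 0),
]

X = [exposure_by_residue_type(seq, sasa) for seq, sasa, _ in TOY_DATA]
y = [label for _, _, label in TOY_DATA]
weights, bias = fit_logistic(X, y)

# The fitted coefficients play the role of an amino acid propensity scale.
propensity = dict(zip(AMINO_ACIDS, weights))


def delayed_retention_probability(sequence, sasa):
    """Score a sequence with the fitted toy model."""
    features = exposure_by_residue_type(sequence, sasa)
    return sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
```

In this toy setup, residues that appear only in the delayed-retention examples receive positive coefficients and those that appear only in the other examples receive negative ones, which mirrors how a fitted scale can be read residue by residue.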

Together, these models form a sequence-based framework for predicting hydrophobicity-related developability risks in therapeutic antibodies.

Key findings and impact

This study demonstrated that antibody hydrophobicity can be predicted accurately from sequence information alone, without requiring atomic-level accuracy from structural modeling. The surface exposure model achieved strong predictive performance, particularly when local sequence context was included, and the follow-on model reliably identified antibodies with delayed HIC retention, a key indicator of aggregation risk. Importantly, the amino acid scores derived from the model aligned well with biophysical expectations: hydrophobic residues increased risk, while charged residues reduced it. The approach outperformed simpler prediction methods and worked well on clinical-stage antibodies.

Significant implications for therapeutic antibody development

  • Early-stage prioritization: The low computational expense and high accuracy of this sequence-based approach enable rapid evaluation of antibody hydrophobicity during the earliest phases of discovery. This makes it possible to screen large numbers of candidates and focus resources on those with the most promising developability profiles.
  • Rational engineering: The method provides residue-level insights into which specific amino acids contribute to an antibody's hydrophobicity, supporting a targeted mutagenesis strategy. Scientists can strategically modify specific amino acids to improve developability.
  • Optimized library design: The predictive framework can be applied upstream in the design of synthetic antibody libraries, enabling researchers to build in developability considerations from the outset. This helps generate libraries enriched for sequences with favorable biophysical characteristics.
  • Scalable pipeline: Eliminating the immediate need for time-consuming and resource-intensive 3D structure determination for initial screening allows for a faster, more scalable assessment workflow. Homology models or crystallography can then be reserved for later, more detailed structural analyses as needed.

By enabling early, rapid, and scalable assessment of hydrophobicity risk, this method supports better candidate selection, guides targeted sequence optimization, and can improve the design of antibody libraries, all without relying on structural models. This represents a powerful advance for streamlining therapeutic antibody development.

For more details, read the full article in Bioinformatics.