2025-09-02 · HuggingFace

SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence

research · infrastructure

Source: HuggingFace · Date: 2025-09-02 · URL: https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai

Summary

Dataset release: SAIR (Structurally Augmented IC50 Repository) from SandboxAQ contains 5.24M AI-generated 3D protein-ligand complexes paired with experimentally validated IC50 binding potency data, covering 1M+ unique computationally co-folded protein-ligand pairs. 97% of the structures passed PoseBusters structural validation. Generation took 130k GPU hours on 760 H100s and finished in 3 weeks against an original estimate of 3 months. 40%+ of the proteins have no existing PDB structure. The dataset is licensed CC BY 4.0 and available on HF. Deep-learned affinity models trained on similar data are claimed to run up to 1,000x faster than first-principles approaches.
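As a rough sanity check on the compute figures above, the sketch below derives two back-of-envelope numbers from the rounded values in the summary: the average GPU time per structure, and the wall-clock time the run would take if all 760 GPUs were busy continuously. The function names are illustrative, and the inputs are the rounded summary figures rather than exact counts.

```python
def gpu_seconds_per_structure(gpu_hours: float, n_structures: int) -> float:
    """Average GPU time spent per generated structure, in seconds."""
    return gpu_hours * 3600 / n_structures


def ideal_wall_clock_days(gpu_hours: float, n_gpus: int) -> float:
    """Wall-clock days if every GPU were busy for the whole run."""
    return gpu_hours / n_gpus / 24


# Rounded figures from the summary (treated as exact only for this estimate).
GPU_HOURS = 130_000
N_GPUS = 760
N_STRUCTURES = 5_240_000

print(f"{gpu_seconds_per_structure(GPU_HOURS, N_STRUCTURES):.1f} GPU-seconds per structure")
print(f"{ideal_wall_clock_days(GPU_HOURS, N_GPUS):.1f} days at full utilization")
```

With these inputs the estimate comes to roughly 89 GPU-seconds per structure and about 7 days of fully utilized cluster time. The reported 3-week span therefore presumably includes scheduling gaps, retries, or partial cluster use. The source does not break this down.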

Implications

Open-weights ecosystem health. Releasing the largest publicly available protein-ligand structure-potency dataset under CC BY 4.0 removes a significant data-access barrier for drug discovery ML. The 40%+ “dark proteome” coverage (proteins with no PDB structure) is the scientifically interesting part: these are the targets that structural biology has not yet characterized, and ML-generated structures at scale make them tractable for the first time.

HF as open-source ML hub. By releasing a pharmaceutical research dataset through HF, SandboxAQ continues the pattern of commercial labs distributing domain-specific scientific datasets through HF’s infrastructure. The Parquet format and standard hf_hub_download access make SAIR immediately usable in standard ML pipelines without domain-specific tooling.
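The access pattern described above can be sketched as follows, using `hf_hub_download` from `huggingface_hub` and `pandas.read_parquet`. The repository id `SandboxAQ/SAIR` and the file name `sair.parquet` are assumptions made for illustration and should be checked against the dataset card. No column names are assumed, since the summary does not list the schema.

```python
# Assumed locations: verify both against the dataset card before use.
REPO_ID = "SandboxAQ/SAIR"   # hypothetical dataset repository id
FILENAME = "sair.parquet"    # hypothetical Parquet file name


def load_sair_table(repo_id=REPO_ID, filename=FILENAME, columns=None):
    """Download one Parquet file from a HF dataset repo and load it.

    `columns` optionally restricts the read to a subset of columns,
    which keeps memory use down on a multi-million-row table.
    """
    # Imported lazily so that defining this function needs neither package.
    from huggingface_hub import hf_hub_download
    import pandas as pd

    # Fetches the file into the local HF cache and returns its path.
    local_path = hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        repo_type="dataset",
    )
    return pd.read_parquet(local_path, columns=columns)
```

Calling `load_sair_table()` would download the file on first use and read it from the local cache afterwards. A Parquet engine such as `pyarrow` must be installed for `read_parquet` to work.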
