SAIR: Accelerating Pharma R&D with AI-Powered Structural Intelligence
Source: HuggingFace
Date: 2025-09-02
URL: https://huggingface.co/blog/SandboxAQ/sair-data-accelerating-drug-discovery-with-ai
Summary
Dataset release: SAIR (Structurally Augmented IC50 Repository) from SandboxAQ. It contains 5.24M AI-generated 3D protein-ligand complexes, computationally co-folded for more than 1M unique protein-ligand pairs and paired with experimentally validated IC50 binding potency data. 97% of the complexes passed PoseBusters structural validation. Generation took 130k GPU hours on 760 H100s and finished in 3 weeks (originally estimated at 3 months). More than 40% of the proteins have no existing PDB structure. The dataset is released under CC BY 4.0 and is available on HF. Deep-learned affinity models trained on similar data are claimed to run up to 1,000x faster than first-principles approaches.
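The compute figures above imply a rough per-structure cost. The following is a back-of-the-envelope sketch that uses only the numbers quoted in this summary. The function names are illustrative, and the full-utilization figure is a lower bound on wall-clock time, not something the source reports.

```python
# Figures quoted in the summary (rounded as given there).
GPU_HOURS = 130_000
N_GPUS = 760
N_STRUCTURES = 5_240_000


def gpu_seconds_per_structure(gpu_hours: float, n_structures: int) -> float:
    """Average GPU time spent per generated complex, in seconds."""
    return gpu_hours * 3600 / n_structures


def full_utilization_days(gpu_hours: float, n_gpus: int) -> float:
    """Wall-clock days needed if every GPU were busy the whole time."""
    return gpu_hours / n_gpus / 24


print(round(gpu_seconds_per_structure(GPU_HOURS, N_STRUCTURES), 1))  # 89.3
print(round(full_utilization_days(GPU_HOURS, N_GPUS), 1))            # 7.1
```

That works out to roughly 89 GPU-seconds per complex. The 7-day lower bound sits well under the reported 3 weeks, which suggests the GPUs were not saturated for the whole run, though the source gives no utilization breakdown.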
Implications
Open-weights ecosystem health. As the largest publicly available protein-ligand structure-potency dataset, released under CC BY 4.0, SAIR removes a significant data access barrier for drug discovery ML. The scientifically interesting part is the 40%+ “dark proteome” coverage (proteins with no PDB structures): these are the targets that structural biology has not yet characterized, and ML-generated structures at scale make them tractable for the first time.
HF as open-source ML hub. By releasing a pharmaceutical research dataset through HF, SandboxAQ continues the pattern of commercial labs distributing domain-specific scientific datasets through HF’s infrastructure. The Parquet format and standard hf_hub_download access make SAIR immediately usable in standard ML pipelines without domain-specific tooling.
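The access path described above can be sketched in a few lines. This is a minimal sketch, not SandboxAQ's documented workflow: the default repo id `SandboxAQ/SAIR` and file name `sair.parquet` are assumptions that should be checked against the dataset card, and the helper names `download_sair_file` and `load_table` are hypothetical. The calls to `hf_hub_download` and `pandas.read_parquet` are the standard library entry points.

```python
def download_sair_file(repo_id: str = "SandboxAQ/SAIR",   # assumed repo id
                       filename: str = "sair.parquet") -> str:  # assumed file name
    """Fetch one file from a Hugging Face dataset repo; return its local cache path."""
    # Imported lazily so this module loads even without the Hub client installed.
    from huggingface_hub import hf_hub_download

    # repo_type="dataset" is required because hf_hub_download defaults to model repos.
    return hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")


def load_table(path: str):
    """Read a Parquet file into a pandas DataFrame."""
    import pandas as pd  # needs a Parquet engine such as pyarrow

    return pd.read_parquet(path)
```

With network access and both packages installed, `df = load_table(download_sair_file())` followed by `df.columns.tolist()` shows which columns the file actually provides. No column names are assumed here.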