2024-12-10 · HuggingFace

LeMaterial: an open source initiative to accelerate materials discovery and research

research

Source: HuggingFace · Date: 2024-12-10 · URL: https://huggingface.co/blog/lematerial

Summary

Dataset release: LeMat-Bulk, from Entalpic and Hugging Face (HF), unifies 6.7M materials science entries from Materials Project, Alexandria, and OQMD under a CC-BY-4.0 license. Each entry includes 7 material properties, and the release comes with a graph-hashing fingerprinting algorithm for deduplication and novel-material identification. Reported fingerprinting time versus Pymatgen's StructureMatcher: Carbon-24 (10k structures) in 100 seconds on 12 CPUs versus 17 hours on 64 CPUs; MPTS-52 (40k structures) in 330 seconds versus 4.9 hours. Roadmap: EquiformerV2 and FAENet models in v1.1, with surface datasets (OC20/OC22) to follow later.
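To put the quoted timings on a common scale, the arithmetic can be sketched as below. This is only a back-of-the-envelope calculation from the figures in the summary, not a benchmark. The function names are illustrative, and the CPU-normalized ratio assumes every core was fully used for the whole run, which the source does not state. CPU counts are quoted only for Carbon-24, so MPTS-52 gets a wall-clock ratio only.

```python
SECONDS_PER_HOUR = 3600


def wall_clock_speedup(fast_seconds, slow_hours):
    """Ratio of elapsed times, ignoring how many CPUs each run used."""
    return slow_hours * SECONDS_PER_HOUR / fast_seconds


def cpu_time_ratio(fast_seconds, fast_cpus, slow_hours, slow_cpus):
    """Ratio of CPU-seconds, assuming (hypothetically) full use of every core."""
    fast_cpu_seconds = fast_seconds * fast_cpus
    slow_cpu_seconds = slow_hours * SECONDS_PER_HOUR * slow_cpus
    return slow_cpu_seconds / fast_cpu_seconds


# Figures as quoted in the summary.
carbon24_wall = wall_clock_speedup(fast_seconds=100, slow_hours=17)
mpts52_wall = wall_clock_speedup(fast_seconds=330, slow_hours=4.9)
carbon24_cpu = cpu_time_ratio(100, 12, 17, 64)

print(f"Carbon-24 wall-clock speedup: {carbon24_wall:.0f}x")
print(f"MPTS-52 wall-clock speedup:   {mpts52_wall:.1f}x")
print(f"Carbon-24 CPU-time ratio:     {carbon24_cpu:.0f}x")
```

On these numbers the wall-clock gap is roughly 612x for Carbon-24 and roughly 53x for MPTS-52. The Carbon-24 gap widens further once the smaller CPU count is taken into account.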

Implications

Open-weights ecosystem health. Unifying 6.7M materials entries and pairing them with a fast deduplication algorithm addresses the most practical bottleneck in materials science ML: fragmented databases whose contents overlap but are inconsistently formatted. The fingerprinting approach (100 seconds versus 17 hours for duplicate detection on Carbon-24) makes dataset curation at scale tractable in a way that prior similarity-based methods were not.
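One way to see why hashing helps curation scale is the minimal sketch below. Each entry is fingerprinted once and grouped by hash, rather than compared against every other entry. The `toy_fingerprint` function is a hypothetical stand-in and is not the LeMaterial algorithm, which the source describes only as graph hashing. The entry fields and IDs are also invented for illustration.

```python
import hashlib
from collections import defaultdict


def toy_fingerprint(entry):
    """Hypothetical stand-in fingerprint: hashes formula and space group only.

    A real structure fingerprint would hash much richer information
    (the source mentions graph hashing); this only illustrates the idea.
    """
    key = f"{entry['formula']}|{entry['space_group']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def group_by_fingerprint(entries):
    """One pass over the data: entries with equal fingerprints share a bucket."""
    buckets = defaultdict(list)
    for entry in entries:
        buckets[toy_fingerprint(entry)].append(entry["id"])
    return buckets


def pairwise_comparisons(n):
    """Number of comparisons a naive all-pairs similarity check would need."""
    return n * (n - 1) // 2


# Invented entries standing in for records from different source databases.
entries = [
    {"id": "db-a-1", "formula": "NaCl", "space_group": 225},
    {"id": "db-b-7", "formula": "NaCl", "space_group": 225},
    {"id": "db-c-3", "formula": "C", "space_group": 227},
]

duplicates = [ids for ids in group_by_fingerprint(entries).values() if len(ids) > 1]
print(duplicates)
print(pairwise_comparisons(10_000))
```

Grouping needs one fingerprint per entry, whereas a naive all-pairs check over 10k structures needs about 50 million comparisons. Real similarity-based tools may prune that number, so the all-pairs count is an upper bound rather than a measurement.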

HF as open-source ML hub. LeMaterial follows the pattern of domain-specific scientific dataset releases through HF, alongside SAIR for drug discovery and food allergy datasets. Each scientific dataset hosted on HF extends HF's relevance beyond NLP/CV into the broader scientific ML community, and the planned EquiformerV2/FAENet model releases would turn LeMat-Bulk into a more complete benchmarking resource.
