2025-09-22 · HuggingFace

SyGra: The One-Stop Framework for Building Data for LLMs and SLMs

research

Source: HuggingFace Date: 2025-09-22 URL: https://huggingface.co/blog/ServiceNow-AI/sygra-data-gen-framework

Summary

Library release: SyGra (ServiceNow AI), a low-code/no-code Python framework for LLM/SLM dataset creation and transformation. It addresses nine common data pipeline challenges, including knowledge base to Q&A conversion, complex reasoning dataset generation, DPO preference pair creation, domain filtering, PDF/image-to-document conversion, cross-language translation, and quality filtering. Supports vLLM, TGI, Triton, and Ollama as inference backends. Open-sourced on GitHub with an accompanying arXiv paper.

Implications

Open-weights ecosystem health. A reusable data generation framework from ServiceNow AI lowers the barrier to building custom training datasets, which is often the limiting factor for domain-specific fine-tuning projects. The DPO preference pair generation capability is particularly useful, because creating pairwise preference data at scale has historically been expensive and manual.
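
As an illustration of what DPO preference data looks like, the sketch below turns several scored candidate answers to one prompt into prompt/chosen/rejected records. It is plain Python and does not use SyGra's API. The function name build_preference_pairs, the min_margin threshold, and the sample scores are all hypothetical, and the scores are assumed to come from some upstream judge or rubric.

```python
from itertools import combinations


def build_preference_pairs(prompt, scored_responses, min_margin=1.0):
    """Turn scored candidate responses into DPO-style preference records.

    scored_responses: list of (response_text, score) tuples; higher is better.
    min_margin: hypothetical threshold; pairs whose scores differ by less
                than this are skipped as too ambiguous.
    """
    pairs = []
    for (resp_a, score_a), (resp_b, score_b) in combinations(scored_responses, 2):
        if abs(score_a - score_b) < min_margin:
            continue  # scores too close to give a clear preference
        if score_a > score_b:
            chosen, rejected = resp_a, resp_b
        else:
            chosen, rejected = resp_b, resp_a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# Illustrative data only: the scores stand in for an assumed judge model.
candidates = [
    ("Restart the agent, then check the log.", 4.5),
    ("Try turning it off and on.", 2.0),
    ("Restart the agent and review the log file.", 4.0),
]
pairs = build_preference_pairs("How do I fix a stalled agent?", candidates)
for record in pairs:
    print(record)
```

The margin filter reflects one possible design choice: near-tied candidates are dropped so that each record carries a clear preference. A real pipeline would also need to generate the candidates and score them, which is the expensive part the paragraph above refers to.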

HF as open-source ML hub. Because SyGra supports TGI as a backend, HF’s inference infrastructure can serve as the generation engine for dataset pipelines, which extends HF’s stack from model serving to dataset construction. ServiceNow AI publishing on the HF blog continues the pattern of enterprise AI teams using HF as their open-source distribution channel.
