SyGra: The One-Stop Framework for Building Data for LLMs and SLMs
Source: HuggingFace
Date: 2025-09-22
URL: https://huggingface.co/blog/ServiceNow-AI/sygra-data-gen-framework
Summary
Library release: SyGra (ServiceNow AI), a low-code/no-code Python framework for creating and transforming datasets for LLMs and SLMs. It addresses nine common data pipeline challenges, including knowledge base to Q&A conversion, complex reasoning dataset generation, DPO preference pair creation, domain filtering, PDF/image-to-document conversion, cross-language translation, and quality filtering. It supports vLLM, TGI, Triton, and Ollama as inference backends. Open-sourced on GitHub, with an accompanying arXiv paper.
Implications
Open-weights ecosystem health. A reusable data generation framework from ServiceNow AI lowers the barrier to building custom training datasets, which is often the limiting factor in domain-specific fine-tuning projects. The DPO preference pair generation capability is particularly useful, because creating pairwise preference data at scale has historically been expensive and manual.
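To make the preference-pair idea concrete, here is a minimal sketch of turning scored candidate responses into a DPO-style record with `prompt`, `chosen`, and `rejected` fields. It does not use SyGra's API: `Candidate`, `build_preference_pair`, the `score` field, and the `min_margin` parameter are all hypothetical names chosen for illustration, and the scores are assumed to come from some external judge.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    """One generated response plus a quality score (hypothetical; e.g. from a judge model)."""
    text: str
    score: float


def build_preference_pair(
    prompt: str,
    candidates: list[Candidate],
    min_margin: float = 0.0,
) -> Optional[dict]:
    """Return a DPO-style record, or None if no usable pair exists.

    The highest-scored candidate becomes "chosen" and the lowest-scored
    becomes "rejected". Pairs whose score gap does not exceed min_margin
    are dropped, since near-ties carry little preference signal.
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    best, worst = ranked[0], ranked[-1]
    if best.score - worst.score <= min_margin:
        return None
    return {"prompt": prompt, "chosen": best.text, "rejected": worst.text}


example = build_preference_pair(
    "What is 2 + 2?",
    [Candidate("4", 0.9), Candidate("5", 0.1)],
)
```

The manual cost mentioned above sits mostly in producing the scores; once they exist, the pairing step itself is mechanical.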
HF as open-source ML hub. Because SyGra supports TGI as a backend, HF’s inference infrastructure can serve as the generation engine for dataset pipelines, which extends HF’s stack from model serving to dataset construction. ServiceNow AI publishing on the HF blog continues the pattern of enterprise AI teams using HF as their open-source distribution channel.