Building Zyda-2, a 5 Trillion Token High-Quality Dataset, with NVIDIA NeMo Curator
October 15, 2024
PALO ALTO, CALIFORNIA

Zyphra is excited to release Zyda-2, a 5-trillion-token dataset composed of filtered and cross-deduplicated DCLM, FineWeb-Edu, Zyda-1, and the Common Crawl portion of Dolma v1.7. Using NVIDIA NeMo Curator, we cut data processing time from 3 weeks to 2 days while reducing costs. Zyda-2 powers our Zamba2 series, pushing the boundaries of small-LLM performance and reinforcing Zyphra's position at the forefront of efficient, high-performance language model development.

Authors
Zyphra & NVIDIA
Collaborators
Daniel A. Roberts (Sequoia Capital & MIT), Andrey Gromov (Meta FAIR), Kushal Tirumala (Meta FAIR), and Hassan Shapourian (Cisco)