NVIDIA Contributes the Largest Open-Source Dataset in History

admin • January 6, 2026, 7:01 AM • Finance

Recently, NVIDIA announced the release of Cosmopedia, one of the largest open-source datasets in history. The dataset contains over 350 billion tokens and spans domains including science, technology, culture, and history; it is intended to advance the training of, and research on, large language models (LLMs). As part of its AI openness strategy, NVIDIA seeks to lower the barrier for both academia and industry to develop cutting-edge AI models by providing high-quality, diverse training data.

Cosmopedia draws on a wide range of sources, including Wikipedia, open-source textbooks, technical documentation, and programming tutorials, and has undergone rigorous cleaning and formatting to ensure accuracy and usability. Beyond its sheer scale, the dataset emphasizes educational value and knowledge density, making it well suited to training both general-purpose and domain-specific language models.

Alongside the dataset, NVIDIA also released several open-source models trained on Cosmopedia, including the Nemotron series, available free of charge to developers. The move is seen as a significant contribution to the current AI ecosystem, one that fosters transparency, reproducibility, and responsible AI development. With active participation from the open-source community, Cosmopedia is expected to accelerate global AI innovation and drive the evolution of next-generation intelligent systems.


Original article by admin. If you repost it, please credit the source: https://avine.cn/9394.html
