(no particular order):
- Dataformer: https://github.com/DataformerAI/dataformer
- Synthetic text dataset generation collection (davanstrien): https://huggingface.co/collections/davanstrien/synthetic-text-dataset-generation-6643aa29d216a196f31758a8
- Datasets and synthetic data creation libraries: https://github.com/davanstrien/awesome-synthetic-datasets
- Airoboros, a customizable implementation of the Self-Instruct paper: https://github.com/jondurbin/airoboros
- Textbook quality: https://github.com/VikParuchuri/textbook_quality
- Cosmo Chat: https://x.com/vanstriendaniel/status/1787868663871418799
- Evol Instruct: https://github.com/nlpxucan/WizardLM/tree/main/Evol_Instruct
- Pluto synthetic data generation: https://github.com/redotvideo/pluto
- Starcoder data cleaning: https://twitter.com/sivil_taram/status/1779413759423062114
- Better Synthetic Data by Retrieving and Transforming Existing Datasets: https://twitter.com/arankomatsuzaki/status/1782600350282715532
- Data quality metrics: https://www.linkedin.com/posts/swarooprm7_data-is-king-in-the-llm-world-i-am-activity-7191291753788772352-JUTm?utm_source=share&utm_medium=member_desktop
- Google survey of best practices for synthetic data: https://twitter.com/arankomatsuzaki/status/1778609441551622372
- Cosmopedia synthetic data generation: https://huggingface.co/blog/cosmopedia
- DEITA: https://arxiv.org/abs/2312.15685
- Superfiltering: https://twitter.com/zhoutianyi/status/1761040930465788202
- Alignment while being poor: https://twitter.com/swaroopnath6/status/1764924435667055032
- Argilla data collection: https://www.youtube.com/watch?v=lkddA2SIEFA
- https://www.linkedin.com/posts/kaustubh-dholé-3929b32a_naacl2024-naacl2024-ir-activity-7183010210918080512-eZNZ?utm_source=share&utm_medium=member_desktop
- Long is more:
  - Paper: https://arxiv.org/abs/2402.04833
  - Testing: https://twitter.com/_lewtun/status/1758520258132865210