(no particular order):

  1. Dataformer: https://github.com/DataformerAI/dataformer
  2. Synthetic text dataset generation (Hugging Face collection): https://huggingface.co/collections/davanstrien/synthetic-text-dataset-generation-6643aa29d216a196f31758a8
  3. Datasets and synthetic data creation libraries: https://github.com/davanstrien/awesome-synthetic-datasets
  4. Customizable implementation of the self-instruct paper: https://github.com/jondurbin/airoboros
  5. Textbook quality: https://github.com/VikParuchuri/textbook_quality
  6. Cosmo Chat: https://x.com/vanstriendaniel/status/1787868663871418799
  7. Evol Instruct: https://github.com/nlpxucan/WizardLM/tree/main/Evol_Instruct
  8. Pluto synthetic data generation: https://github.com/redotvideo/pluto
  9. Starcoder data cleaning: https://twitter.com/sivil_taram/status/1779413759423062114
  10. Better Synthetic Data by Retrieving and Transforming Existing Datasets: https://twitter.com/arankomatsuzaki/status/1782600350282715532
  11. Data quality metrics: https://www.linkedin.com/posts/swarooprm7_data-is-king-in-the-llm-world-i-am-activity-7191291753788772352-JUTm?utm_source=share&utm_medium=member_desktop
  12. Google Survey Best Practices for Synthetic Data: https://twitter.com/arankomatsuzaki/status/1778609441551622372
  13. Cosmopedia Synthetic Data Generation: https://huggingface.co/blog/cosmopedia
  14. DEITA: https://arxiv.org/abs/2312.15685
  15. Superfiltering: https://twitter.com/zhoutianyi/status/1761040930465788202
  16. Alignment while being poor: https://twitter.com/swaroopnath6/status/1764924435667055032
  17. Argilla data collection: https://www.youtube.com/watch?v=lkddA2SIEFA
  18. https://www.linkedin.com/posts/kaustubh-dholé-3929b32a_naacl2024-naacl2024-ir-activity-7183010210918080512-eZNZ?utm_source=share&utm_medium=member_desktop
  19. Long is more:
    1. Paper: https://arxiv.org/abs/2402.04833
    2. Testing: https://twitter.com/_lewtun/status/1758520258132865210