(no particular order):

  1. Dataformer: https://github.com/DataformerAI/dataformer-app
  2. Datasets and synthetic data creation libraries: https://github.com/davanstrien/awesome-synthetic-datasets
  3. Customizable implementation of the self-instruct paper: https://github.com/jondurbin/airoboros
  4. Textbook quality: https://github.com/VikParuchuri/textbook_quality
  5. Cosmo Chat: https://x.com/vanstriendaniel/status/1787868663871418799
  6. Evol Instruct: https://github.com/nlpxucan/WizardLM/tree/main/Evol_Instruct
  7. Pluto synthetic data generation: https://github.com/redotvideo/pluto
  8. Starcoder data cleaning: https://twitter.com/sivil_taram/status/1779413759423062114
  9. Better Synthetic Data by Retrieving and Transforming Existing Datasets: https://twitter.com/arankomatsuzaki/status/1782600350282715532
  10. Data quality metrics: https://www.linkedin.com/posts/swarooprm7_data-is-king-in-the-llm-world-i-am-activity-7191291753788772352-JUTm?utm_source=share&utm_medium=member_desktop
  11. Google survey on best practices for synthetic data: https://twitter.com/arankomatsuzaki/status/1778609441551622372
  12. Cosmopedia Synthetic Data Generation: https://huggingface.co/blog/cosmopedia
  13. DEITA: https://arxiv.org/abs/2312.15685
  14. Superfiltering: https://twitter.com/zhoutianyi/status/1761040930465788202
  15. Alignment while being poor: https://twitter.com/swaroopnath6/status/1764924435667055032
  16. Argilla data collection: https://www.youtube.com/watch?v=lkddA2SIEFA
  17. https://www.linkedin.com/posts/kaustubh-dholé-3929b32a_naacl2024-naacl2024-ir-activity-7183010210918080512-eZNZ?utm_source=share&utm_medium=member_desktop
  18. Long is more:
    1. Paper: https://arxiv.org/abs/2402.04833
    2. Testing: https://twitter.com/_lewtun/status/1758520258132865210
  19. Less: https://twitter.com/xiamengzhou/status/1757832742903943215