: A collection of over 2.1 million New York Times articles, updated daily. It is frequently used in research on distinguishing human-written from AI-generated text.
Emerging Standards
: A dataset for stuttering event detection containing 28k labeled clips from podcasts. It is often used to train models to identify blocks, prolongations, and repetitions in speech. It is available on GitHub through Apple's ML research.
If you are looking for "txt" files related to AI crawling, you might be interested in the proposal.
: A large-scale dataset containing approximately 92,000 computer science papers from 31 major conferences. It includes AI-generated summaries (GPT-3.5) designed for large-scale scientometric studies and automated literature reviews.