How to Scrape SoundCloud for AI Training: Step-By-Step Tutorial
SoundCloud is a mother lode for AI training data, with millions of audio tracks spanning every genre and style imaginable. In this guide, we’ll show you how to tap into that library using Node.js, with the help of proxies. You’ll get hands-on code examples and learn how to collect audio data for three key AI use cases: music generation, audio enhancement, and voice training.
Dominykas Niaura
Oct 08, 2025
10 min read

What is SoundCloud?
SoundCloud is a global platform where anyone can upload, share, and stream audio. Since launching in 2007, it’s grown into one of the world’s largest audio communities, hosting over 320 million tracks from artists, podcasters, and creators of all kinds.
What sets SoundCloud apart is its diversity. It’s not just major label music. You’ll find everything from bedroom demos and underground beats to polished albums and niche podcasts. For AI training, that variety is a huge advantage, offering a broad mix of styles, genres, and audio quality to work with.
Why scrape SoundCloud?
SoundCloud offers one of the richest pools of audio data for training machine learning models. Its unique mix of content and metadata makes it especially useful for AI development.
- Diverse audio content. From polished studio tracks to lo-fi bedroom recordings, SoundCloud spans every genre and style. This variety helps train AI models to handle real-world audio.
- Rich metadata. Tracks come with valuable context (play counts, likes, reposts, and user engagement), all of which can add depth to your datasets.
- Community curation. Users create playlists, charts, and collections that naturally organize content by genre, mood, or quality. These curated collections can serve as pre-filtered training datasets.
- Creative Commons tracks. Many uploads use open licenses, making them accessible for research and development.
How to train an AI, ML, or LLM model?
Training an AI model to understand or generate audio isn’t just about feeding it sound. It’s a multi-step process that starts with the right data and ends with careful model tuning. Here’s how it typically works:
- Data collection and preprocessing. First, you gather raw audio files and clean them up. That means converting formats, normalizing volume, trimming silence, and attaching metadata like genre or play counts for added context.
- Feature extraction. Audio has to be turned into something a model can understand. This might mean creating spectrograms, mel-frequency cepstral coefficients (MFCCs), or using raw waveforms, depending on your goal.
- Model architecture selection. The model you choose depends on the task. Music generation often uses transformer-based models (like OpenAI’s Jukebox or Google’s MusicLM), while audio enhancement or classification might rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs).
- Training and validation. You train the model on your dataset through multiple iterations, adjusting weights to improve performance. Validation on a separate test set helps make sure it generalizes well.
Popular platforms for building and training these models include TensorFlow, PyTorch, and JAX. For large-scale jobs, you might also use services like Google Colab, AWS SageMaker, or Paperspace for GPU access.
What you need for scraping SoundCloud
In this tutorial, we’ll be scraping SoundCloud artist names, track titles, and download URLs. To do this, we’ll use Node.js scripts powered by Puppeteer and proxy services to help us navigate SoundCloud’s dynamic, JavaScript-heavy interface.
Here’s what you’ll need to get started:
- Node.js environment. Make sure Node.js version 14 or higher is installed on your machine. (Quick note: JavaScript is the language, while Node.js is the environment that runs it outside your browser.) You’ll also use npm (Node Package Manager), which comes bundled with Node.js, to install required libraries.
- Puppeteer library. We’ll use Puppeteer to programmatically control a headless browser. It’s perfect for scraping sites like SoundCloud, which rely heavily on JavaScript to load content. Don’t worry – we’ll show you how to install and use it in the examples ahead.
- Basic browser inspection skills. You should know how to open your browser’s developer tools and inspect elements on the page. This helps identify which HTML tags and classes to target in your script.
- Proxies. SoundCloud actively limits automated access. If you're scraping more than just a few pages, a proxy service is crucial. Using residential rotating proxies can help you avoid IP bans and maintain a stable scraping session.
- Storage infrastructure. Audio files can be large, and training datasets often require thousands of tracks. Make sure you have enough local or cloud storage for the number of files you plan to collect.
Why you need proxies for scraping SoundCloud
Proxies are essential for keeping your scraping sessions smooth, anonymous, and uninterrupted. They route your requests through different IP addresses, helping you avoid detection, rate limits, and IP bans from platforms like SoundCloud.
Proxies also let you scale up running multiple sessions at once or accessing geo-restricted content. For this guide, we recommend using residential proxies for the best reliability, but datacenter, mobile, or static (ISP) proxies can also work depending on your goals and budget. Here’s how easy it is to get proxies at Decodo:
- Create a Decodo account on our dashboard.
- Find residential proxies by choosing Residential on the left panel.
- Choose a subscription, Pay As You Go plan, or opt for a 3-day free trial.
- In the Proxy setup tab, configure the location, session type, and protocol according to your needs.
- Copy your proxy address, port, username, and password for later use. Alternatively, you can click the download icon in the lower right corner of the table to download the proxy endpoints (10 by default).
Get residential proxies for SoundCloud
Claim your 3-day free trial of residential proxies and explore full features with unrestricted access.
How to run Node.js scripts
Once Node.js is installed, you'll need a way to write and run your scraping scripts. You can use any text editor with your computer's terminal, or choose an integrated development environment (IDE) like Visual Studio Code, which combines both editing and terminal functionality in one place.
Start by creating a new folder for your project and navigating to it in your terminal. You can do this in any of the following ways:
- Right-clicking the folder and selecting Open in Terminal (Windows/Linux)
- Choosing New Terminal at Folder (macOS)
- Manually running the cd command to switch directories
Next, install the required Puppeteer library by running this command in your project folder:
Copy one of the script examples from this guide into a .js file (for example, music-gen.js). Don’t forget to replace placeholder proxy credentials (YOUR_PROXY_USERNAME and YOUR_PROXY_PASSWORD) with your actual Decodo proxy details before running the script.
To run the script, use:
Your terminal or IDE will show live output as the script runs, including results and any errors you might need to troubleshoot.
1. Music generation AI training
Teaching AI to make music is no longer sci-fi. Models can now learn musical patterns, structures, and styles by analyzing existing tracks. This works because music follows both math and culture: it’s structured, yet expressive. That balance makes it ideal for training AI systems to generate coherent, listenable compositions.
Platforms like Suno AI and Udio already let millions create songs from text prompts. Others, like AIVA, Boomy, and Soundful, cater to creators needing royalty-free background music. On the cutting edge, research tools like Stable Audio by Stability AI and OpenAI’s Jukebox show just how deep this field can go.
What to scrape on SoundCloud for music generation training
When targeting SoundCloud for music generation training data, focus on curated collections that represent successful musical patterns within specific genres. Start with official music charts and trending playlists, as these contain tracks that have already proven popular with real audiences.
On the home page of SoundCloud, you’ll find the button Explore trending playlists, which will lead you to the discovery page that showcases playlists of trending, curated, up-and-coming, and other kinds of music. Select the one that matches your direction.

While many chart-topping tracks won't offer direct download options, the metadata and popularity metrics provide valuable insights into what makes music successful. You can often find ways to obtain the actual audio files through legitimate channels outside of SoundCloud once you've identified the most promising tracks.
Scraping SoundCloud playlists
Here’s a Node.js script that scrapes metadata from any public SoundCloud playlist:
When you run the script, it first shows how many tracks were initially found in the playlist. Then it attempts to reveal hidden tracks by clicking on their image icons. After that, it prints a final count of how many tracks were successfully processed, followed by a clean, numbered list that includes the artist name, track title, play count (or "N/A" if not available), and the full SoundCloud URL.
On trending playlists, SoundCloud hides part of the tracklist if you’re not logged in. But there’s a workaround: when you manually click on hidden tracks, they become revealed. The script replicates this by automatically clicking on where the track image icons would be to reveal as many tracks as possible. You can experiment with the delay time or try clicking other elements (like the track row itself) to see if it increases the number of tracks extracted.
SoundCloud also applies geo-based limitations to certain tracks. So your scraper’s results may vary depending on your proxy’s location. That’s not a bug – it can actually be useful if you want to compare regional differences in availability or popularity.

Next steps toward AI training
With this scraped metadata and URL collection, you can proceed to acquire the actual audio files through legitimate channels, since most trending tracks lack direct download options on SoundCloud. Use the artist and track information to locate purchasable versions on platforms like Beatport, Bandcamp, or streaming services that offer high-quality downloads. Then:
- Convert files to a consistent format
- Extract features like spectrograms, MIDI, or raw waveforms
- Organize your dataset using the scraped metadata
This helps your model learn from high-quality, relevant examples that reflect the musical style you’re aiming to replicate.
2. Audio enhancement AI training
Audio enhancement models are trained to clean up degraded recordings, which involves removing noise, fixing low-quality encoding, and restoring clarity to compressed or damaged files. This use case is especially relevant for content creators, podcasters, and musicians who regularly deal with imperfect audio.
The typical training method is called synthetic degradation: you take clean audio, deliberately degrade it (e.g., by lowering bitrate or adding background noise), then train the model to recover the original. Over time, the model learns what "good" audio sounds like and how to fix the bad.
From podcast noise removal to restoring old recordings, audio enhancement has gone mainstream. Tools like Adobe’s Enhance Speech, Krisp, and Descript offer real-time cleanup for creators, while NVIDIA RTX Voice shows how far the tech can go. Platforms like Auphonic and Cleanvoice even offer fully automated audio cleanup, aimed at non-technical users.
What to scrape on SoundCloud for audio enhancement training
For this use case, you’ll need actual downloadable audio files, not just metadata. Focus on Creative Commons tracks – these are free to use and often include download buttons. In the SoundCloud search bar, try terms like: "creative commons," "CC BY," "free download," or "royalty free".
To refine your results, add time filters to surface recent uploads. Once on a search results page, check a few tracks to make sure the Download file button is available (in the "three dots" dropdown next to the track).

Scraping SoundCloud search results pages
Here’s a Node.js script that scrapes SoundCloud search results and filters for tracks with downloadable audio:
When you run the search results script, it scrolls through the page automatically, loading more results as it goes. In the terminal, you’ll see updates like "Scroll attempt 1: Found 11 items" and "New content loaded: +10 items." This continues until it hits the configured limit or no new content appears.
After that, the script checks each track for download availability. This step may take a bit, depending on how many items were found.
Once it finishes, you’ll get a summary showing how many tracks were discovered and how many have download buttons. Since Creative Commons searches usually target freely available music, the success rate tends to be high. The final output is a clean, numbered list with artist names, track titles, and direct SoundCloud URLs.

Creating your training dataset
After scraping, manually download the audio files using SoundCloud's native Download file buttons. These clean tracks will serve as your "ground truth." Next, apply synthetic degradation techniques like:
- Bitrate reduction (e.g., 320kbps → 64kbps)
- Sample rate reduction (e.g., 44.1kHz → 22kHz)
- Noise injection
- Artifact simulation (e.g., compression glitches)
Pair each degraded version with its original, and label them by type and severity. This gives your model a wide range of examples to learn from.
By using a diverse set of Creative Commons tracks, your model will be exposed to different genres, vocal styles, and production levels, helping it generalize better to real-world audio issues.
3. Speech AI training
Speech AI models are built to understand, process, and generate human speech across different accents, languages, styles, and recording conditions. With voice interfaces now everywhere, from smartphones to customer service bots, there's growing demand for models that can handle natural, messy, real-world speech, not just clean, textbook samples.
Tools like Whisper (OpenAI) have set new standards for multilingual speech-to-text, while services like ElevenLabs offer voice cloning used by creators, streamers, and studios alike. Real-time applications such as Otter.ai, Rev, Alexa, and Google Assistant all rely on speech AI trained on diverse, representative voice data. The same goes for language learning apps, accessibility tools, and smart customer support systems.
What to scrape on SoundCloud for speech training
To train effective models, you’ll need high-quality, long-form spoken content featuring real people in real conversations. Some of the best sources on SoundCloud include:
- Podcasts and educational accounts. Think universities, media outlets, think tanks, or institutions uploading lectures, interviews, and panel talks.
- Interview formats. Interviews offer multiple speakers, natural conversation flow, and a range of tones and accents in one recording.
- Language learning channels. These often feature accented English and multilingual content, which is useful for training models on varied speech patterns.
- Audiobook or documentary creators. Their uploads usually provide clean, consistent solo voice recordings, ideal for voice modeling tasks.
- Long-form uploads. Look for hour-long episodes or sessions. These give you natural pauses, rhythm changes, and unscripted speech patterns.

Scraping SoundCloud profile pages
Here’s a Node.js script for scraping tracks from a specific SoundCloud profile, limited to those with enabled downloads:
When you run the profile scraping script, the terminal will show the scrolling progress as it loads tracks from the page. You’ll see updates like "Loaded 20 tracks (scroll attempt 1)" as it continues until it hits your set limit or no new content appears.
Once finished, the script displays how many total tracks were found and how many are downloadable. The results appear as a numbered list with the creator name, episode or lecture title, and full SoundCloud file download URL.

Preparing the training dataset
After downloading the audio files, you can segment the speech by speaker using speaker diarization tools (or manually, for higher precision). Then:
- Convert to a consistent format and sample rate
- Generate transcripts using Whisper or similar tools
- Tag by accent, gender, speaking style, or context (formal vs. conversational)
Podcasts are especially valuable because they contain natural speech: pauses, overlaps, informal language, and spontaneity. Educational content, on the other hand, provides clear articulation, making it useful for pronunciation learning or voice cloning models.
By combining both types, you’ll get a well-rounded dataset that helps your speech AI handle a wide range of human voices, just like it needs to in the real world.
Best practices for web scraping with Node.js
When scraping SoundCloud data, follow these established best practices to ensure more sustainable data collection:
- Launch reliably. Set realistic HTTP headers, such as User-Agent, Accept-Language, viewport, and generous timeouts. Prefer one browser with multiple pages over many browsers.
- Keep pages light. Intercept requests and abort images, media, fonts, beacons, and manifests to reduce noise and flakiness.
- Navigate with retries. Wrap navigation in a short retry with backoff, wait for result cards instead of fixed delays, then scroll until enough items load.
- Tame popups. If applicable, accept the cookie banner and close the login modal once per context before scraping.
- Use resilient selectors. Combine a few candidate selectors for result cards and avoid brittle single-class matches.
- Pace your actions. Add small random delays, limit concurrent navigations, and slow down when errors spike.
- Recover gracefully. Catch per-item failures, return partial data, and fall back to the track page if the menu path fails.
- Rotate signals when needed. Vary user agents and use quality proxies or sticky sessions if you start seeing many 403 responses.
- Save progress. Append results as you go so restarts resume cleanly, and log only the essentials to guide tuning.
Final thoughts
SoundCloud offers a huge pool of diverse audio that can power great AI work. With the basics in place you can build datasets for music generation, audio enhancement, and voice or speech training. That is only a starting point, and the same approach can support many other creative ideas.
Results come from the setup you choose and the care you put in. Puppeteer handles dynamic pages, proxies keep sessions stable at scale, and smart targeting keeps the data relevant. Log your runs, watch for interface changes, and refine as you go. Most of all, remember that data quality often matters more than quantity. A curated dataset of high-quality, relevant audio samples will typically produce better AI model performance than a larger collection of random tracks.
Get high-quality residential IPs
Choose a residential proxy plan from Decodo and enjoy industry-leading performance and reliability.
About the author

Dominykas Niaura
Technical Copywriter
Dominykas brings a unique blend of philosophical insight and technical expertise to his writing. Starting his career as a film critic and music industry copywriter, he's now an expert in making complex proxy and web scraping concepts accessible to everyone.
Connect with Dominykas via LinkedIn
All information on Decodo Blog is provided on an as is basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may belinked therein.

