
How to Scrape SoundCloud for AI Training: Step-By-Step Tutorial

SoundCloud is a mother lode for AI training data, with millions of audio tracks spanning every genre and style imaginable. In this guide, we’ll show you how to tap into that library using Node.js, with the help of proxies. You’ll get hands-on code examples and learn how to collect audio data for three key AI use cases: music generation, audio enhancement, and voice training.

What is SoundCloud?

SoundCloud is a global platform where anyone can upload, share, and stream audio. Founded in 2007, it has grown into one of the world's largest audio communities, hosting over 320 million tracks from artists, podcasters, and creators of all kinds.

What sets SoundCloud apart is its diversity. It’s not just major label music. You’ll find everything from bedroom demos and underground beats to polished albums and niche podcasts. For AI training, that variety is a huge advantage, offering a broad mix of styles, genres, and audio quality to work with.

Why scrape SoundCloud?

SoundCloud offers one of the richest pools of audio data for training machine learning models. Its unique mix of content and metadata makes it especially useful for AI development.

  • Diverse audio content. From polished studio tracks to lo-fi bedroom recordings, SoundCloud spans every genre and style. This variety helps train AI models to handle real-world audio.
  • Rich metadata. Tracks come with valuable context (play counts, likes, reposts, and user engagement), all of which can add depth to your datasets.
  • Community curation. Users create playlists, charts, and collections that naturally organize content by genre, mood, or quality. These curated collections can serve as pre-filtered training datasets.
  • Creative Commons tracks. Many uploads use open licenses, making them accessible for research and development.

How to train an AI, ML, or LLM model?

Training an AI model to understand or generate audio isn’t just about feeding it sound. It’s a multi-step process that starts with the right data and ends with careful model tuning. Here’s how it typically works:

  1. Data collection and preprocessing. First, you gather raw audio files and clean them up. That means converting formats, normalizing volume, trimming silence, and attaching metadata like genre or play counts for added context.
  2. Feature extraction. Audio has to be turned into something a model can understand. This might mean creating spectrograms, mel-frequency cepstral coefficients (MFCCs), or using raw waveforms, depending on your goal.
  3. Model architecture selection. The model you choose depends on the task. Music generation often uses transformer-based models (like OpenAI’s Jukebox or Google’s MusicLM), while audio enhancement or classification might rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs).
  4. Training and validation. You train the model on your dataset through multiple iterations, adjusting weights to improve performance. Validation on a separate test set helps make sure it generalizes well.

Popular platforms for building and training these models include TensorFlow, PyTorch, and JAX. For large-scale jobs, you might also use services like Google Colab, AWS SageMaker, or Paperspace for GPU access.

What you need for scraping SoundCloud

In this tutorial, we'll be scraping SoundCloud artist names, track titles, track URLs, and download availability. To do this, we’ll use Node.js scripts powered by Playwright and proxy services to help us navigate SoundCloud’s dynamic, JavaScript-heavy interface.

Here’s what you’ll need to get started:

  • Node.js environment. Make sure Node.js version 14 or higher is installed on your machine. (Quick note: JavaScript is the language, while Node.js is the environment that runs it outside your browser.) You’ll also use npm (Node Package Manager), which comes bundled with Node.js, to install required libraries.
  • Playwright library. We'll use Playwright to programmatically control a headless browser. It’s perfect for scraping sites like SoundCloud, which rely heavily on JavaScript to load content. Don’t worry – we’ll show you how to install and use it in the examples ahead.
  • Basic browser inspection skills. You should know how to open your browser’s developer tools and inspect elements on the page. This helps identify which HTML tags and classes to target in your script.
  • Proxies. SoundCloud actively limits automated access. If you're scraping more than just a few pages, a proxy service is crucial. Using residential rotating proxies can help you avoid IP bans and maintain a stable scraping session.
  • Storage infrastructure. Audio files can be large, and training datasets often require thousands of tracks. Make sure you have enough local or cloud storage for the number of files you plan to collect.
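
For a rough sense of scale: 10,000 MP3 tracks at around 8 MB each already adds up to roughly 80 GB, before counting any converted or degraded copies you create during preprocessing.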

Why you need proxies for scraping SoundCloud

Proxies are essential for keeping your scraping sessions smooth, anonymous, and uninterrupted. They route your requests through different IP addresses, helping you avoid detection, rate limits, and IP bans from platforms like SoundCloud.

Proxies also let you scale up by running multiple sessions at once or by accessing geo-restricted content. For this guide, we recommend residential proxies for the best reliability, but datacenter, mobile, or static (ISP) proxies can also work depending on your goals and budget. Here’s how easy it is to get proxies at Decodo:

  1. Create a Decodo account on our dashboard.
  2. Find residential proxies by choosing Residential on the left panel.
  3. Choose a subscription, Pay As You Go plan, or opt for a 3-day free trial.
  4. In the Proxy setup tab, configure the location, session type, and protocol according to your needs.
  5. Copy your proxy address, port, username, and password for later use. Alternatively, you can click the download icon in the lower right corner of the table to download the proxy endpoints (10 by default).
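
Once you have your credentials, it's worth a quick check that traffic actually goes through the proxy before running the full scripts. Here's a minimal sketch using the same Playwright setup as the examples below (httpbin.org/ip is just a public IP-echo service used here for illustration):

const { firefox } = require('playwright');

(async () => {
  // Launch Firefox through the proxy, exactly as the scraping scripts below do
  const browser = await firefox.launch({
    headless: true,
    proxy: {
      server: 'http://gate.decodo.com:7000',
      username: 'YOUR_PROXY_USERNAME',
      password: 'YOUR_PROXY_PASSWORD'
    }
  });
  const page = await browser.newPage();
  // httpbin echoes back the IP it sees - this should be a proxy IP, not your own
  await page.goto('https://httpbin.org/ip', { timeout: 30000 });
  console.log(await page.textContent('body'));
  await browser.close();
})();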

Get residential proxies for SoundCloud

Claim your 3-day free trial of residential proxies and explore full features with unrestricted access.

How to run Node.js scripts

Once Node.js is installed, you'll need a way to write and run your scraping scripts. You can use any text editor with your computer's terminal, or choose an integrated development environment (IDE) like Visual Studio Code, which combines both editing and terminal functionality in one place.

Start by creating a new folder for your project and navigating to it in your terminal. You can do this in any of the following ways:

  • Right-clicking the folder and selecting Open in Terminal (Windows/Linux)
  • Choosing New Terminal at Folder (macOS)
  • Manually running the cd command to switch directories

Next, install the Playwright library and download the Firefox browser it controls by running these commands in your project folder:

npm install playwright
npx playwright install firefox

Copy one of the script examples from this guide into a .js file (for example, music-gen.js). Don’t forget to replace placeholder proxy credentials (YOUR_PROXY_USERNAME and YOUR_PROXY_PASSWORD) with your actual Decodo proxy details, then update the target URL you want to scrape and any other placeholders such as max tracks, max scroll attempts, or similar limits before running the script.

To run the script, use:

node music-gen.js

Your terminal or IDE will show live output as the script runs, including results and any errors you might need to troubleshoot.

1. Music generation AI training

Teaching AI to make music is no longer sci-fi. Models can now learn musical patterns, structures, and styles by analyzing existing tracks. This works because music follows both math and culture: it’s structured, yet expressive. That balance makes it ideal for training AI systems to generate coherent, listenable compositions.

Platforms like Suno AI and Udio already let millions create songs from text prompts. Others, like AIVA, Boomy, and Soundful, cater to creators needing royalty-free background music. On the cutting edge, research tools like Stable Audio by Stability AI and OpenAI’s Jukebox show just how deep this field can go.

What to scrape on SoundCloud for music generation training

When targeting SoundCloud for music generation training data, focus on curated collections that represent successful musical patterns within specific genres. Start with official music charts and trending playlists, as these contain tracks that have already proven popular with real audiences.

On the home page of SoundCloud, you’ll find the button Explore trending playlists, which will lead you to the discovery page that showcases playlists of trending, curated, up-and-coming, and other kinds of music. Select the one that matches your direction.

While many chart-topping tracks won't offer direct download options, the metadata and popularity metrics provide valuable insights into what makes music successful. You can often find ways to obtain the actual audio files through legitimate channels outside of SoundCloud once you've identified the most promising tracks.

Scraping SoundCloud playlists

Here's a Node.js script that scrapes metadata from any public SoundCloud playlist:

const { firefox } = require('playwright');
const fs = require('fs');
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function scrapePlaylistTracks() {
let browser;
try {
browser = await firefox.launch({
headless: true,
proxy: {
server: 'http://gate.decodo.com:7000',
username: 'YOUR_PROXY_USERNAME',
password: 'YOUR_PROXY_PASSWORD'
}
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
viewport: { width: 1366, height: 768 }
});
const page = await context.newPage();
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media', 'font', 'manifest'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
// TARGET URL - Replace with the SoundCloud playlist you want to scrape
await page.goto('https://soundcloud.com/music-charts-uk/sets/hip-hop', {
waitUntil: 'domcontentloaded',
timeout: 45000
});
await delay(3000);
// Handle cookie consent
try {
await page.waitForSelector('#onetrust-accept-btn-handler', { timeout: 5000 });
await page.click('#onetrust-accept-btn-handler');
await delay(500);
} catch (e) {}
// Handle login modal
try {
await page.waitForSelector('.modal__closeButton', { timeout: 5000 });
await page.click('.modal__closeButton');
await delay(300);
} catch (e) {}
// Scroll to load all tracks
for (let i = 0; i < 5; i++) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await delay(3000);
}
const itemCount = await page.evaluate(() => {
const trackListContainer = document.querySelector('.trackList__list') ||
document.querySelector('ul') ||
document.querySelector('[class*="trackList"]');
return trackListContainer ? trackListContainer.querySelectorAll('li').length : 0;
});
console.log(`Found ${itemCount} playlist tracks`);
// Hover over track list items to reveal hidden track information
console.log('Revealing hidden tracks by hovering...');
const trackListItems = await page.$$('.trackList__list li');
for (let i = 0; i < trackListItems.length; i++) {
try {
await trackListItems[i].hover();
await delay(300);
} catch (e) {
// Continue on error
}
}
// Wait for content to load after hovering
await delay(1500);
// Extract playlist tracks
const tracks = await page.evaluate(() => {
const results = [];
// Find the main track list container
const trackListContainer = document.querySelector('.trackList__list') ||
document.querySelector('ul') ||
document.querySelector('[class*="trackList"]');
if (!trackListContainer) return [];
const trackItems = trackListContainer.querySelectorAll('li');
trackItems.forEach((trackItem) => {
// Get track rank/position
const rankEl = trackItem.querySelector('.trackItem__separator');
if (!rankEl) return;
const rank = rankEl.textContent.trim();
// Get artist and track title
const artistEl = trackItem.querySelector('.trackItem__username');
const titleEl = trackItem.querySelector('.trackItem__trackTitle');
if (!artistEl || !titleEl) return;
const artist = artistEl.textContent.trim();
const title = titleEl.textContent.trim();
const trackUrl = 'https://soundcloud.com' + titleEl.href.replace('https://soundcloud.com', '');
// Check if track is geo-blocked
const blockMsgEl = trackItem.querySelector('.trackItem__blockMsg');
if (blockMsgEl) {
results.push({
index: parseInt(rank),
artist,
title,
trackUrl,
status: `[${blockMsgEl.textContent.trim()}]`
});
return;
}
// Extract play count if available
const playCountEl = trackItem.querySelector('.trackItem__playCount');
let plays = 'N/A';
if (playCountEl) {
let playText = playCountEl.textContent.replace(/[\r\n\t]/g, ' ').replace(/\s+/g, ' ').trim();
const playMatch = playText.match(/(\d+(?:\.\d+)?[KMB])\s*$/);
plays = playMatch ? playMatch[1] : playText;
}
results.push({
index: parseInt(rank),
artist,
title,
trackUrl,
plays
});
});
// Sort tracks by rank to ensure proper order
return results.sort((a, b) => a.index - b.index);
});
console.log(`${tracks.length} playlist tracks:\n`);
tracks.forEach((track) => {
const playsDisplay = track.plays ? ` (${track.plays} plays)` : '';
const statusDisplay = track.status ? ` ${track.status}` : '';
console.log(`${track.index}. ${track.artist} - "${track.title}"${playsDisplay}${statusDisplay}`);
console.log(` ${track.trackUrl}\n`);
});
// Export to CSV
const csv = [
'Index,Artist,Title,URL,Plays,Status',
...tracks.map(t => {
const plays = t.plays || 'N/A';
const status = t.status || '';
return `${t.index},"${t.artist.replace(/"/g, '""')}","${t.title.replace(/"/g, '""')}",${t.trackUrl},${plays},"${status}"`;
})
].join('\n');
fs.writeFileSync('playlist-tracks.csv', csv);
console.log('Saved to playlist-tracks.csv');
return tracks;
} catch (error) {
console.error('Scraping failed:', error.message);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapePlaylistTracks();

On trending playlists, SoundCloud may hide parts of the tracklist or certain metadata when you’re not logged in. But there’s a workaround: when the script programmatically hovers over each track item, the page reveals track information that isn't visible by default.

When you run the script, it first scrolls through the playlist to ensure all tracks are loaded, then logs how many playlist items were found. Next, it hovers over each track in the list to surface hidden metadata. You can experiment with hover delays, scroll counts, or alternative interaction targets if you want to maximize how much data is exposed.

After that, the script extracts and prints a clean, numbered list containing the artist name, track title, play count (or “N/A” if unavailable), and the full SoundCloud URL. Tracks are sorted by their playlist position to preserve the original order, and all collected data is saved to a CSV file.

SoundCloud also applies geo-based restrictions to some tracks, so results can vary depending on the location of the proxy you're using. This isn't a bug. It can be useful if you want to compare regional availability or differences in popularity across markets.

Next steps toward AI training

With this scraped metadata and URL collection, you can proceed to acquire the actual audio files through legitimate channels, since most trending tracks lack direct download options on SoundCloud. Use the artist and track information to locate purchasable versions on platforms like Beatport, Bandcamp, or streaming services that offer high-quality downloads. Then:

  • Convert files to a consistent format
  • Extract features like spectrograms, MIDI, or raw waveforms
  • Organize your dataset using the scraped metadata

This helps your model learn from high-quality, relevant examples that reflect the musical style you’re aiming to replicate.
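
As a rough sketch of the conversion step, the snippet below batch-converts downloaded files to 44.1 kHz, 16-bit WAV. It assumes ffmpeg is installed and on your PATH, and that your purchased or downloaded files sit in a ./downloads folder (both folder names are just examples):

const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Example layout: raw downloads in ./downloads, converted training files in ./dataset
const inputDir = './downloads';
const outputDir = './dataset';
fs.mkdirSync(outputDir, { recursive: true });

for (const file of fs.readdirSync(inputDir)) {
  if (!/\.(mp3|wav|flac|m4a)$/i.test(file)) continue;
  const input = path.join(inputDir, file);
  const output = path.join(outputDir, path.parse(file).name + '.wav');
  // Normalize everything to 44.1 kHz, 16-bit stereo WAV so the dataset has one consistent format
  execFileSync('ffmpeg', ['-y', '-i', input, '-ar', '44100', '-ac', '2', '-sample_fmt', 's16', output]);
  console.log(`Converted ${file} -> ${path.basename(output)}`);
}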

2. Audio enhancement AI training

Audio enhancement models are trained to clean up degraded recordings, which involves removing noise, fixing low-quality encoding, and restoring clarity to compressed or damaged files. This use case is especially relevant for content creators, podcasters, and musicians who regularly deal with imperfect audio.

The typical training method is called synthetic degradation: you take clean audio, deliberately degrade it (e.g., by lowering bitrate or adding background noise), then train the model to recover the original. Over time, the model learns what "good" audio sounds like and how to fix the bad.

From podcast noise removal to restoring old recordings, audio enhancement has gone mainstream. Tools like Adobe’s Enhance Speech, Krisp, and Descript offer real-time cleanup for creators, while NVIDIA RTX Voice shows how far the tech can go. Platforms like Auphonic and Cleanvoice even offer fully automated audio cleanup, aimed at non-technical users.

What to scrape on SoundCloud for audio enhancement training

For this use case, you’ll need actual downloadable audio files, not just metadata. Focus on Creative Commons tracks – these are free to use and often include download buttons. In the SoundCloud search bar, try terms like: "creative commons," "CC BY," "free download," or "royalty free".

To refine your results, add time filters to surface recent uploads. Once on a search results page, check a few tracks to make sure the Download file button is available (in the "three dots" dropdown next to the track).

Scraping SoundCloud search results pages

Here’s a Node.js script that scrapes SoundCloud search results and filters for tracks with downloadable audio:

const { firefox } = require('playwright');
const fs = require('fs');
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function scrapeSearchResults() {
let browser;
try {
browser = await firefox.launch({
headless: true,
proxy: {
server: 'http://gate.decodo.com:7000',
username: 'YOUR_PROXY_USERNAME',
password: 'YOUR_PROXY_PASSWORD'
}
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
viewport: { width: 1366, height: 768 }
});
const page = await context.newPage();
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media', 'font', 'manifest'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
// TARGET URL - Replace with the SoundCloud search results page you want to scrape
await page.goto('https://soundcloud.com/search/sounds?q=creative%20commons&filter.created_at=last_year&filter.license=to_use_commercially', {
waitUntil: 'domcontentloaded',
timeout: 45000
});
await delay(3000);
// Handle cookie consent
try {
await page.waitForSelector('#onetrust-accept-btn-handler', { timeout: 5000 });
await page.click('#onetrust-accept-btn-handler');
await delay(500);
} catch (e) {}
// Handle login modal
try {
await page.waitForSelector('.modal__closeButton', { timeout: 5000 });
await page.click('.modal__closeButton');
await delay(300);
} catch (e) {}
// Wait for search results
await page.waitForSelector('.searchList__item', { timeout: 15000 });
// Infinite scrolling logic
console.log('Starting infinite scroll to load more results...');
const maxScrollAttempts = 5;
const targetResults = 500;
let previousItemCount = 0;
let scrollAttempts = 0;
let noNewContentCount = 0;
while (scrollAttempts < maxScrollAttempts) {
const currentItemCount = await page.evaluate(() =>
document.querySelectorAll('.searchList__item').length
);
console.log(`Scroll attempt ${scrollAttempts + 1}: Found ${currentItemCount} items`);
if (currentItemCount >= targetResults) {
console.log(`Reached target of ${targetResults} results. Stopping scroll.`);
break;
}
if (currentItemCount > previousItemCount) {
noNewContentCount = 0;
console.log(`New content loaded: +${currentItemCount - previousItemCount} items`);
} else {
noNewContentCount++;
console.log(`No new content loaded (attempt ${noNewContentCount})`);
if (noNewContentCount >= 3) {
console.log('No new content loaded after 3 attempts. Reached end of results.');
break;
}
}
previousItemCount = currentItemCount;
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
await delay(2000);
await page.evaluate(() => {
window.scrollBy(0, window.innerHeight);
});
await delay(1500);
try {
const loadMoreButton = await page.$('.searchLoadMore button, .loadMore button, button[data-testid="load-more"]');
if (loadMoreButton) {
console.log('Found "Load more" button, clicking...');
await loadMoreButton.click();
await delay(3000);
}
} catch (e) {}
scrollAttempts++;
}
const finalItemCount = await page.evaluate(() =>
document.querySelectorAll('.searchList__item').length
);
// Extract downloadable tracks
const tracks = await page.evaluate(async () => {
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
const results = [];
const searchItems = document.querySelectorAll('.searchList__item');
const maxResults = searchItems.length;
console.log(`Processing ${Math.min(maxResults, searchItems.length)} items for download availability...`);
for (let i = 0; i < Math.min(maxResults, searchItems.length); i++) {
const item = searchItems[i];
if ((i + 1) % 10 === 0) {
console.log(`Processed ${i + 1}/${Math.min(maxResults, searchItems.length)} items...`);
}
item.scrollIntoView({ block: 'center' });
await delay(100);
const artistEl = item.querySelector('.soundTitle__usernameText') ||
item.querySelector('.soundTitle .usernameText') ||
item.querySelector('.sc-text-body') ||
item.querySelector('a[href*="/"]') ||
item.querySelector('[class*="username"]');
const artist = artistEl ? artistEl.textContent.trim() : 'Unknown Artist';
let title = 'Unknown Title';
let trackUrl = '';
const titleSelectors = [
'.soundTitle__title > .sc-link-dark',
'.soundTitle__title span:not(.soundTitle__usernameText)',
'h3 a.sc-link-dark',
'h3 a',
'.soundTitle__title',
'[class*="title"] a',
'a[href*="/"]',
'.sc-link-dark'
];
for (const selector of titleSelectors) {
const titleEl = item.querySelector(selector);
if (titleEl && titleEl.textContent.trim()) {
title = titleEl.textContent.trim();
break;
}
}
const linkSelectors = [
'.soundTitle__title a[href]',
'h3 a[href]',
'a.sc-link-dark[href]',
'a[href*="/"]',
'[class*="title"] a[href]',
'a[href]'
];
for (const selector of linkSelectors) {
const linkEl = item.querySelector(selector);
if (linkEl) {
const href = linkEl.getAttribute('href') || '';
if (href) {
trackUrl = href.startsWith('http') ? href : `https://soundcloud.com${href}`;
break;
}
}
}
let hasDownload = false;
const moreSelectors = [
'button[aria-label="More"]',
'.soundActions .sc-button-more',
'.sc-button-more',
'button[title="More"]',
'button[class*="more"]',
'[data-testid="more-button"]',
'button[aria-label*="more"]',
'.moreButton'
];
let moreBtn = null;
for (const selector of moreSelectors) {
moreBtn = item.querySelector(selector);
if (moreBtn) break;
}
if (moreBtn) {
try {
moreBtn.scrollIntoView({ block: 'center' });
await delay(80);
moreBtn.click();
let menuFound = false;
for (let attempts = 0; attempts < 20; attempts++) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.some(m => m.querySelector('.sc-button-label'))) {
menuFound = true;
break;
}
await delay(100);
}
if (menuFound) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.length) {
const btnRect = moreBtn.getBoundingClientRect();
let bestMenu = null;
let bestDist = Infinity;
for (const menu of menus) {
const menuRect = menu.getBoundingClientRect();
const dx = Math.abs((menuRect.left + menuRect.width / 2) - (btnRect.left + btnRect.width / 2));
const dy = Math.abs((menuRect.top + menuRect.height / 2) - (btnRect.top + btnRect.height / 2));
const distance = Math.sqrt(dx * dx + dy * dy);
if (distance < bestDist) {
bestDist = distance;
bestMenu = menu;
}
}
if (bestMenu) {
const downloadBtn = bestMenu.querySelector('.sc-button-download, button[aria-label*="Download"]');
hasDownload = !!downloadBtn;
}
}
document.dispatchEvent(new KeyboardEvent('keydown', { key: 'Escape' }));
await delay(50);
}
} catch (e) {
console.log(`Error checking download for item ${i + 1}:`, e.message);
}
}
if (hasDownload && title !== 'Unknown Title' && trackUrl) {
results.push({
index: results.length + 1,
artist,
title,
trackUrl,
hasDownload: true
});
}
await delay(150);
}
return results;
});
console.log(`\nTotal items found: ${finalItemCount}`);
console.log(`\nFound ${tracks.length} downloadable tracks:\n`);
tracks.forEach((track) => {
console.log(`${track.index}. ${track.artist} - "${track.title}"`);
console.log(` ${track.trackUrl}\n`);
});
// Export to CSV
const csv = [
'Index,Artist,Title,URL',
...tracks.map(t => `${t.index},"${t.artist.replace(/"/g, '""')}","${t.title.replace(/"/g, '""')}",${t.trackUrl}`)
].join('\n');
fs.writeFileSync('search-results.csv', csv);
console.log('Saved to search-results.csv');
return tracks;
} catch (error) {
console.error('Scraping failed:', error.message);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapeSearchResults();

When you run the search results script, it automatically scrolls through the page to load more results. In the terminal, you'll see progress messages like "Scroll attempt 1: Found 11 items" and updates when new content is loaded. This continues until the script reaches the configured scroll limit, hits the target number of results, or detects that no new items are appearing.

Once scrolling is complete, the script goes through each loaded track and checks whether the Download file option is available. This step can take some time, depending on how many items were found and processed.

After the check finishes, the script prints a summary showing the total number of search results and how many tracks have download buttons. The final output is a clean, numbered list with artist names, track titles, and full SoundCloud URLs. All of this data is then saved to a CSV file on your computer.

Creating your training dataset

After scraping, manually download the audio files using SoundCloud's native Download file buttons. These clean tracks will serve as your "ground truth." Next, apply synthetic degradation techniques like:

  • Bitrate reduction (e.g., 320kbps → 64kbps)
  • Sample rate reduction (e.g., 44.1kHz → 22kHz)
  • Noise injection
  • Artifact simulation (e.g., compression glitches)

Pair each degraded version with its original, and label them by type and severity. This gives your model a wide range of examples to learn from.
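
Here's one hedged way to script that degradation in Node.js, again assuming ffmpeg is available and your clean WAV files live in a ./dataset folder. The settings shown (64 kbps MP3, 22.05 kHz downsampling) are only examples to adapt to your use case:

const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

const cleanDir = './dataset';      // ground-truth WAV files
const degradedDir = './degraded';  // synthetic low-quality versions
fs.mkdirSync(degradedDir, { recursive: true });

for (const file of fs.readdirSync(cleanDir)) {
  if (!file.toLowerCase().endsWith('.wav')) continue;
  const input = path.join(cleanDir, file);
  const base = path.parse(file).name;
  // A 64 kbps MP3 round trip simulates heavy compression artifacts
  execFileSync('ffmpeg', ['-y', '-i', input, '-b:a', '64k', path.join(degradedDir, `${base}_64k.mp3`)]);
  // Downsampling to 22.05 kHz simulates low-quality source recordings
  execFileSync('ffmpeg', ['-y', '-i', input, '-ar', '22050', path.join(degradedDir, `${base}_22k.wav`)]);
}
console.log('Degraded copies written - keep the clean originals as the training targets');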

By using a diverse set of Creative Commons tracks, your model will be exposed to different genres, vocal styles, and production levels, helping it generalize better to real-world audio issues.

3. Speech AI training

Speech AI models are built to understand, process, and generate human speech across different accents, languages, styles, and recording conditions. With voice interfaces now everywhere, from smartphones to customer service bots, there's growing demand for models that can handle natural, messy, real-world speech, not just clean, textbook samples.

Tools like Whisper (OpenAI) have set new standards for multilingual speech-to-text, while services like ElevenLabs offer voice cloning used by creators, streamers, and studios alike. Real-time applications such as Otter.ai, Rev, Alexa, and Google Assistant all rely on speech AI trained on diverse, representative voice data. The same goes for language learning apps, accessibility tools, and smart customer support systems.

What to scrape on SoundCloud for speech training

To train effective models, you’ll need high-quality, long-form spoken content featuring real people in real conversations. Some of the best sources on SoundCloud include:

  • Podcasts and educational accounts. Think universities, media outlets, think tanks, or institutions uploading lectures, interviews, and panel talks.
  • Interview formats. Interviews offer multiple speakers, natural conversation flow, and a range of tones and accents in one recording.
  • Language learning channels. These often feature accented English and multilingual content, which is useful for training models on varied speech patterns.
  • Audiobook or documentary creators. Their uploads usually provide clean, consistent solo voice recordings, ideal for voice modeling tasks.
  • Long-form uploads. Look for hour-long episodes or sessions. These give you natural pauses, rhythm changes, and unscripted speech patterns.

Scraping SoundCloud profile pages

Here’s a Node.js script for scraping tracks from a specific SoundCloud profile, limited to those with enabled downloads:

const { firefox } = require('playwright');
const fs = require('fs');
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function scrapeProfileTracks() {
let browser;
try {
browser = await firefox.launch({
headless: false,
proxy: {
server: 'http://gate.decodo.com:7000',
username: 'YOUR_PROXY_USERNAME',
password: 'YOUR_PROXY_PASSWORD'
}
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
viewport: { width: 1366, height: 768 }
});
const page = await context.newPage();
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media', 'font', 'manifest'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
// TARGET URL - Replace with the Soundcloud profile's tracks page you want to scrape
await page.goto('https://soundcloud.com/institute-of-ideas/tracks', {
waitUntil: 'domcontentloaded',
timeout: 45000
});
await delay(3000);
// Handle cookie consent
try {
await page.waitForSelector('#onetrust-accept-btn-handler', { timeout: 5000 });
await page.click('#onetrust-accept-btn-handler');
await delay(500);
} catch (e) {}
// Handle login modal
try {
await page.waitForSelector('.modal__closeButton', { timeout: 5000 });
await page.click('.modal__closeButton');
await delay(300);
} catch (e) {}
// Wait for track list to load
await page.waitForSelector('.soundList__item, .trackItem, [class*="sound"]', { timeout: 15000 });
const maxTracksToScrape = 50;
const maxScrollAttempts = 20;
// Infinite scroll to load tracks
console.log('Loading tracks through infinite scroll...');
let previousCount = 0;
let scrollAttempts = 0;
while (scrollAttempts < maxScrollAttempts) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await delay(1500);
const currentCount = await page.evaluate(() => {
const selectors = [
'.soundList__item',
'.trackItem',
'article[class*="sound"]',
'li[class*="sound"]'
];
for (const selector of selectors) {
const elements = document.querySelectorAll(selector);
if (elements.length > 0) return elements.length;
}
return 0;
});
console.log(`Loaded ${currentCount} tracks (scroll attempt ${scrollAttempts + 1})`);
if (currentCount >= maxTracksToScrape || currentCount === previousCount) {
break;
}
previousCount = currentCount;
scrollAttempts++;
}
const finalCount = await page.evaluate(() => {
const selectors = [
'.soundList__item',
'.trackItem',
'article[class*="sound"]',
'li[class*="sound"]'
];
for (const selector of selectors) {
const elements = document.querySelectorAll(selector);
if (elements.length > 0) return elements.length;
}
return 0;
});
console.log(`Found ${finalCount} total tracks`);
// Extract downloadable tracks
const tracks = await page.evaluate(async (maxTracks) => {
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
const results = [];
let trackItems = [];
const selectors = [
'.soundList__item',
'.trackItem',
'article[class*="sound"]',
'li[class*="sound"]'
];
for (const selector of selectors) {
trackItems = document.querySelectorAll(selector);
if (trackItems.length > 0) break;
}
const itemsToProcess = Math.min(maxTracks, trackItems.length);
for (let i = 0; i < itemsToProcess; i++) {
const item = trackItems[i];
item.scrollIntoView({ block: 'center' });
await delay(160);
let artist = 'Unknown Artist';
const artistSelectors = [
'.soundTitle__usernameText',
'.trackItem__username',
'.soundTitle .usernameText',
'a[href*="/"] .username'
];
for (const selector of artistSelectors) {
const artistEl = item.querySelector(selector);
if (artistEl && artistEl.textContent.trim()) {
artist = artistEl.textContent.trim();
break;
}
}
let title = 'Unknown Title';
let trackUrl = '';
const titleSelectors = [
'.soundTitle__title > .sc-link-dark',
'.trackItem__trackTitle',
'.soundTitle__title span:not(.soundTitle__usernameText)',
'h3 a.sc-link-dark',
'h3 a'
];
for (const selector of titleSelectors) {
const titleEl = item.querySelector(selector);
if (titleEl && titleEl.textContent.trim()) {
title = titleEl.textContent.trim();
break;
}
}
const linkSelectors = [
'.soundTitle__title a[href]',
'.trackItem__trackTitle[href]',
'h3 a[href]',
'a.sc-link-dark[href]'
];
for (const selector of linkSelectors) {
const linkEl = item.querySelector(selector);
if (linkEl) {
const href = linkEl.getAttribute('href') || '';
if (href) {
trackUrl = href.startsWith('http') ? href : `https://soundcloud.com${href}`;
break;
}
}
}
let hasDownload = false;
const moreSelectors = [
'button[aria-label="More"]',
'.soundActions .sc-button-more',
'.sc-button-more',
'button[title="More"]'
];
let moreBtn = null;
for (const selector of moreSelectors) {
moreBtn = item.querySelector(selector);
if (moreBtn) break;
}
if (moreBtn) {
try {
moreBtn.scrollIntoView({ block: 'center' });
await delay(80);
moreBtn.click();
let menuFound = false;
for (let attempts = 0; attempts < 20; attempts++) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.some(m => m.querySelector('.sc-button-label'))) {
menuFound = true;
break;
}
await delay(100);
}
if (menuFound) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.length) {
const btnRect = moreBtn.getBoundingClientRect();
let bestMenu = null;
let bestDist = Infinity;
for (const menu of menus) {
const menuRect = menu.getBoundingClientRect();
const dx = Math.abs((menuRect.left + menuRect.width / 2) - (btnRect.left + btnRect.width / 2));
const dy = Math.abs((menuRect.top + menuRect.height / 2) - (btnRect.top + btnRect.height / 2));
const distance = Math.sqrt(dx * dx + dy * dy);
if (distance < bestDist) {
bestDist = distance;
bestMenu = menu;
}
}
if (bestMenu) {
const downloadBtn = bestMenu.querySelector('.sc-button-download, button[aria-label*="Download"]');
hasDownload = !!downloadBtn;
}
}
document.dispatchEvent(new KeyboardEvent('keydown', { key: 'Escape' }));
}
} catch (e) {}
}
if (hasDownload && title !== 'Unknown Title' && trackUrl) {
results.push({
index: results.length + 1,
artist,
title,
trackUrl
});
}
await delay(200);
}
return results;
}, maxTracksToScrape);
console.log(`${tracks.length} downloadable tracks:\n`);
tracks.forEach((track) => {
console.log(`${track.index}. ${track.artist} - "${track.title}"`);
console.log(` ${track.trackUrl}\n`);
});
// Export to CSV
const csv = [
'Index,Artist,Title,URL',
...tracks.map(t => `${t.index},"${t.artist.replace(/"/g, '""')}","${t.title.replace(/"/g, '""')}",${t.trackUrl}`)
].join('\n');
fs.writeFileSync('profile-tracks.csv', csv);
console.log('Saved to profile-tracks.csv');
return tracks;
} catch (error) {
console.error('Scraping failed:', error.message);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapeProfileTracks();

When you run the profile scraping script, the terminal shows the scrolling progress as tracks are loaded from the page. You’ll see updates like "Loaded 20 tracks (scroll attempt 1)" as the script continues until it reaches your configured limit or no new content appears.

Once scrolling is complete, the script reports how many total tracks were found and then checks each one for download availability. After that, it outputs a numbered list containing the creator name, episode or lecture title, and the full SoundCloud URL for each downloadable track. All of this data is also saved to a CSV file.

Preparing the training dataset

After downloading the audio files, you can segment the speech by speaker using speaker diarization tools (or manually, for higher precision). Then:

  • Convert to a consistent format and sample rate
  • Generate transcripts using Whisper or similar tools (see the sketch after this list)
  • Tag by accent, gender, speaking style, or context (formal vs. conversational)
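
For the transcription step, here's a minimal sketch that shells out to OpenAI's open-source Whisper CLI (this assumes you've installed it separately, for example with pip install openai-whisper, and that your audio sits in a ./speech-dataset folder):

const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Assumes the open-source Whisper CLI is installed (e.g., pip install openai-whisper)
const audioDir = './speech-dataset';
const outDir = './transcripts';
fs.mkdirSync(outDir, { recursive: true });

for (const file of fs.readdirSync(audioDir)) {
  if (!/\.(wav|mp3|m4a)$/i.test(file)) continue;
  // Writes a <name>.txt transcript into ./transcripts; larger models are slower but more accurate
  execFileSync('whisper', [
    path.join(audioDir, file),
    '--model', 'small',
    '--output_format', 'txt',
    '--output_dir', outDir
  ], { stdio: 'inherit' });
}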

Podcasts are especially valuable because they contain natural speech: pauses, overlaps, informal language, and spontaneity. Educational content, on the other hand, provides clear articulation, making it useful for pronunciation learning or voice cloning models.

By combining both types, you’ll get a well-rounded dataset that helps your speech AI handle a wide range of human voices, just like it needs to in the real world.

Best practices for web scraping with Node.js

When scraping SoundCloud data, follow these established best practices to ensure more sustainable data collection:

  • Launch reliably. Set realistic User-Agent and Accept-Language headers, a believable viewport, and generous timeouts. Prefer one browser with multiple pages over many browsers.
  • Keep pages light. Intercept requests and abort images, media, fonts, beacons, and manifests to reduce noise and flakiness.
  • Navigate with retries. Wrap navigation in a short retry with backoff (see the sketch after this list), wait for result cards instead of fixed delays, then scroll until enough items load.
  • Tame popups. If applicable, accept the cookie banner and close the login modal once per context before scraping.
  • Use resilient selectors. Combine a few candidate selectors for result cards and avoid brittle single-class matches.
  • Pace your actions. Add small random delays, limit concurrent navigations, and slow down when errors spike.
  • Recover gracefully. Catch per-item failures, return partial data, and fall back to the track page if the menu path fails.
  • Rotate signals when needed. Vary user agents and use quality proxies or sticky sessions if you start seeing many 403 responses.
  • Save progress. Append results as you go so restarts resume cleanly, and log only the essentials to guide tuning.
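
As an example of the navigate-with-retries point, here's a small helper you could drop into any of the scripts in this guide:

// Retry page.goto() a few times with exponential backoff and jitter before giving up
async function gotoWithRetry(page, url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 45000 });
      return;
    } catch (error) {
      if (attempt === attempts) throw error;
      const backoff = 2000 * attempt + Math.random() * 1000; // exponential backoff with jitter
      console.log(`Navigation failed (${error.message}), retrying in ${Math.round(backoff)} ms...`);
      await new Promise(resolve => setTimeout(resolve, backoff));
    }
  }
}

// Usage: replace a direct page.goto(...) call with
// await gotoWithRetry(page, 'https://soundcloud.com/...');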

Recent VPN blocks on SoundCloud

In mid-December 2025, SoundCloud confirmed a security incident that exposed private data for roughly a fifth of users, followed by brief denial-of-service disruptions. As part of the response, many VPN and cloud IP ranges began returning 403 errors for a period, which multiple outlets and user reports attributed to tightened filtering during mitigation.

Access largely returned, but similar filters can be reapplied during future incidents. Proxies give finer control over the exit IP and session compared with VPNs. A VPN routes all traffic through one tunnel and one exit IP. A proxy routes only your scraper and lets you choose IPs, rotate or keep them sticky, and target cities or ASNs. Proxies can still be hit by broad filters, but high-quality pools usually give you more ways to adapt. Here's why proxies have the upper hand:

  • IP reputation and mix. Residential or ISP proxies typically resemble everyday users more than many VPN exits.
  • Rotation control. Rotate or keep sticky sessions per browser context to balance stability and freshness.
  • Targeted routing. City or ASN targeting helps sidestep ranges under heavier scrutiny.

Here's how to future-proof your scraping workflows:

  • Access check. Request a known public SoundCloud track page (for example, a popular artist’s single) first without a proxy, then with a residential or ISP proxy. For each path, send 3-5 GET requests, record the HTTP status and median TTFB (time to first byte), and pick the path that yields consistent 200s with lower latency and fewer 403s or timeouts.
  • Session handling. Use sticky proxy sessions per browser context for multi-step actions like opening the three-dot menu, then rotate to a fresh session between items. Reset a session after N failures or M minutes of use.
  • IP rotation. Cycle through diverse residential or ISP subnets. If 403s exceed a small threshold (for example, 3 in the last 20 requests), switch to a different subnet or provider pool automatically (see the sketch after this list).
  • Targeting. Prefer consumer ASNs and the cities you actually need. Deprioritize cloud ASNs or subnets that recently produced elevated 403s.
  • Pacing. Add exponential backoff with jitter after bursts of 403s or timeouts, and cap concurrent navigations per host.
  • Metrics and logs. Record exit IP, ASN, city, status codes, latency, and a simple selector check such as the presence of .moreActions. Use these signals to trigger rotation, session resets, or provider changes.
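
The 403-threshold idea from the IP rotation point can be sketched as a simple sliding window over recent response codes. How you create a fresh proxy session depends on your provider's setup, so the rotation itself is left as a hook here:

// Keep a sliding window of recent HTTP statuses and flag when 403s pile up
const recentStatuses = [];
const WINDOW_SIZE = 20;
const MAX_403S = 3;

function shouldRotate(status) {
  recentStatuses.push(status);
  if (recentStatuses.length > WINDOW_SIZE) recentStatuses.shift();
  const blocked = recentStatuses.filter(s => s === 403).length;
  return blocked >= MAX_403S; // true means: switch to a fresh proxy session
}

// Example wiring inside a Playwright script:
// page.on('response', (response) => {
//   if (shouldRotate(response.status())) {
//     console.log('Too many 403s - rotate the proxy session before continuing');
//   }
// });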

Final thoughts

SoundCloud offers a huge pool of diverse audio that can power great AI work. With the basics in place, you can build datasets for music generation, audio enhancement, and voice or speech training. That's only a starting point – the same approach can support many other creative ideas.

Results come from the setup you choose and the care you put in. Playwright handles dynamic pages, proxies keep sessions stable at scale, and smart targeting keeps the data relevant. Log your runs, watch for interface changes, and refine as you go. Most of all, remember that data quality often matters more than quantity. A curated dataset of high-quality, relevant audio samples will typically produce better AI model performance than a larger collection of random tracks.


Get high-quality residential IPs

Choose a residential proxy plan from Decodo and enjoy industry-leading performance and reliability.

About the author

Dominykas Niaura

Technical Copywriter

Dominykas brings a unique blend of philosophical insight and technical expertise to his writing. He started his career as a film critic and music industry copywriter, and he's now an expert in making complex proxy and web scraping concepts accessible to everyone.


Connect with Dominykas via LinkedIn

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What makes SoundCloud valuable for AI training?

SoundCloud's diverse content ecosystem provides access to a wide range of audio styles, production qualities, and genres that aren't available on traditional streaming platforms. This diversity is crucial for training robust AI models that can handle real-world audio variety.

How do I find tracks with download permissions?

On SoundCloud, search for Creative Commons licensed content, remix stems, or tracks explicitly marked as downloadable. Use search filters and specific keywords like "creative commons," "stems available," or "free download" to identify suitable content.

What audio quality should I target for AI training?

For most AI applications, you want the highest quality available. Look for tracks uploaded at 320kbps or higher when possible. For audio enhancement training specifically, you might intentionally collect lower-quality samples to use as training inputs.

How much data do I need for effective AI training?

The amount depends on your specific use case. Music generation models typically need thousands of tracks, while voice training might require hundreds of hours of diverse speech samples. Start with smaller datasets to validate your approach before scaling up.

Can I scrape private or premium content?

Focus only on SoundCloud's publicly available content. Accessing private tracks and premium content would require you to log into your account, but scraping non-publicly accessible content is not advised due to legal concerns. Private or premium content may also have additional usage restrictions that make them unsuitable for AI training datasets.
