
How to Scrape SoundCloud for AI Training: Step-By-Step Tutorial

SoundCloud is a mother lode for AI training data, with millions of audio tracks spanning every genre and style imaginable. In this guide, we’ll show you how to tap into that library using Node.js, with the help of proxies. You’ll get hands-on code examples and learn how to collect audio data for three key AI use cases: music generation, audio enhancement, and voice training.

What is SoundCloud?

SoundCloud is a global platform where anyone can upload, share, and stream audio. Founded in 2007, it has grown into one of the world's largest audio communities, hosting over 320 million tracks from artists, podcasters, and creators of all kinds.

What sets SoundCloud apart is its diversity. It’s not just major label music. You’ll find everything from bedroom demos and underground beats to polished albums and niche podcasts. For AI training, that variety is a huge advantage, offering a broad mix of styles, genres, and audio quality to work with.

Why scrape SoundCloud?

SoundCloud offers one of the richest pools of audio data for training machine learning models. Its unique mix of content and metadata makes it especially useful for AI development.

  • Diverse audio content. From polished studio tracks to lo-fi bedroom recordings, SoundCloud spans every genre and style. This variety helps train AI models to handle real-world audio.
  • Rich metadata. Tracks come with valuable context (play counts, likes, reposts, and user engagement), all of which can add depth to your datasets.
  • Community curation. Users create playlists, charts, and collections that naturally organize content by genre, mood, or quality. These curated collections can serve as pre-filtered training datasets.
  • Creative Commons tracks. Many uploads use open licenses, making them accessible for research and development.

How to train an AI, ML, or LLM model?

Training an AI model to understand or generate audio isn’t just about feeding it sound. It’s a multi-step process that starts with the right data and ends with careful model tuning. Here’s how it typically works:

  1. Data collection and preprocessing. First, you gather raw audio files and clean them up. That means converting formats, normalizing volume, trimming silence, and attaching metadata like genre or play counts for added context.
  2. Feature extraction. Audio has to be turned into something a model can understand. This might mean creating spectrograms, mel-frequency cepstral coefficients (MFCCs), or using raw waveforms, depending on your goal.
  3. Model architecture selection. The model you choose depends on the task. Music generation often uses transformer-based models (like OpenAI’s Jukebox or Google’s MusicLM), while audio enhancement or classification might rely on convolutional neural networks (CNNs) or recurrent neural networks (RNNs).
  4. Training and validation. You train the model on your dataset through multiple iterations, adjusting weights to improve performance. Validation on a separate test set helps make sure it generalizes well.

Popular platforms for building and training these models include TensorFlow, PyTorch, and JAX. For large-scale jobs, you might also use services like Google Colab, AWS SageMaker, or Paperspace for GPU access.

What you need for scraping SoundCloud

In this tutorial, we'll be scraping SoundCloud artist names, track titles, track URLs, and download availability. To do this, we’ll use Node.js scripts powered by Playwright and proxy services to help us navigate SoundCloud’s dynamic, JavaScript-heavy interface.

Here’s what you’ll need to get started:

  • Node.js environment. Make sure Node.js version 14 or higher is installed on your machine. (Quick note: JavaScript is the language, while Node.js is the environment that runs it outside your browser.) You’ll also use npm (Node Package Manager), which comes bundled with Node.js, to install required libraries.
  • Playwright library. We'll use Playwright to programmatically control a headless browser. It’s perfect for scraping sites like SoundCloud, which rely heavily on JavaScript to load content. Don’t worry – we’ll show you how to install and use it in the examples ahead.
  • Basic browser inspection skills. You should know how to open your browser’s developer tools and inspect elements on the page. This helps identify which HTML tags and classes to target in your script.
  • Proxies. SoundCloud actively limits automated access. If you're scraping more than just a few pages, a proxy service is crucial. Using residential rotating proxies can help you avoid IP bans and maintain a stable scraping session.
  • Storage infrastructure. Audio files can be large, and training datasets often require thousands of tracks. Make sure you have enough local or cloud storage for the number of files you plan to collect.
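
For a rough sense of scale: 10,000 MP3 tracks at around 8 MB each already adds up to roughly 80 GB, before counting any converted or degraded copies you create during preprocessing.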

Why you need proxies for scraping SoundCloud

Proxies are essential for keeping your scraping sessions smooth, anonymous, and uninterrupted. They route your requests through different IP addresses, helping you avoid detection, rate limits, and IP bans from platforms like SoundCloud.

Proxies also let you scale up by running multiple sessions at once or by accessing geo-restricted content. For this guide, we recommend residential proxies for the best reliability, but datacenter, mobile, or static (ISP) proxies can also work depending on your goals and budget. Here’s how easy it is to get proxies at Decodo:

  1. Create a Decodo account on our dashboard.
  2. Find residential proxies by choosing Residential on the left panel.
  3. Choose a subscription, Pay As You Go plan, or opt for a 3-day free trial.
  4. In the Proxy setup tab, configure the location, session type, and protocol according to your needs.
  5. Copy your proxy address, port, username, and password for later use. Alternatively, you can click the download icon in the lower right corner of the table to download the proxy endpoints (10 by default).
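
Once you have your credentials, it's worth a quick check that traffic actually goes through the proxy before running the full scripts. Here's a minimal sketch using the same Playwright setup as the examples below (httpbin.org/ip is just a public IP-echo service used here for illustration):

const { firefox } = require('playwright');

(async () => {
  // Launch Firefox through the proxy, exactly as the scraping scripts below do
  const browser = await firefox.launch({
    headless: true,
    proxy: {
      server: 'http://gate.decodo.com:7000',
      username: 'YOUR_PROXY_USERNAME',
      password: 'YOUR_PROXY_PASSWORD'
    }
  });
  const page = await browser.newPage();
  // httpbin echoes back the IP it sees - this should be a proxy IP, not your own
  await page.goto('https://httpbin.org/ip', { timeout: 30000 });
  console.log(await page.textContent('body'));
  await browser.close();
})();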

Get residential proxies for SoundCloud

Claim your 3-day free trial of residential proxies and explore full features with unrestricted access.

How to run Node.js scripts

Once Node.js is installed, you'll need a way to write and run your scraping scripts. You can use any text editor with your computer's terminal, or choose an integrated development environment (IDE) like Visual Studio Code, which combines both editing and terminal functionality in one place.

Start by creating a new folder for your project and navigating to it in your terminal. You can do this in any of the following ways:

  • Right-clicking the folder and selecting Open in Terminal (Windows/Linux)
  • Choosing New Terminal at Folder (macOS)
  • Manually running the cd command to switch directories

Next, install the Playwright library and download the Firefox browser it controls by running these commands in your project folder:

npm install playwright
npx playwright install firefox

Copy one of the script examples from this guide into a .js file (for example, music-gen.js). Don’t forget to replace placeholder proxy credentials (YOUR_PROXY_USERNAME and YOUR_PROXY_PASSWORD) with your actual Decodo proxy details, then update the target URL you want to scrape and any other placeholders such as max tracks, max scroll attempts, or similar limits before running the script.

To run the script, use:

node music-gen.js

Your terminal or IDE will show live output as the script runs, including results and any errors you might need to troubleshoot.

1. Music generation AI training

Teaching AI to make music is no longer sci-fi. Models can now learn musical patterns, structures, and styles by analyzing existing tracks. This works because music follows both math and culture: it’s structured, yet expressive. That balance makes it ideal for training AI systems to generate coherent, listenable compositions.

Platforms like Suno AI and Udio already let millions create songs from text prompts. Others, like AIVA, Boomy, and Soundful, cater to creators needing royalty-free background music. On the cutting edge, research tools like Stable Audio by Stability AI and OpenAI’s Jukebox show just how deep this field can go.

What to scrape on SoundCloud for music generation training

When targeting SoundCloud for music generation training data, focus on curated collections that represent successful musical patterns within specific genres. Start with official music charts and trending playlists, as these contain tracks that have already proven popular with real audiences.

On the home page of SoundCloud, you’ll find the button Explore trending playlists, which will lead you to the discovery page that showcases playlists of trending, curated, up-and-coming, and other kinds of music. Select the one that matches your direction.

While many chart-topping tracks won't offer direct download options, the metadata and popularity metrics provide valuable insights into what makes music successful. You can often find ways to obtain the actual audio files through legitimate channels outside of SoundCloud once you've identified the most promising tracks.

Scraping SoundCloud playlists

Here's a Node.js script that scrapes metadata from any public SoundCloud playlist:

const { firefox } = require('playwright');
const fs = require('fs');
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function scrapePlaylistTracks() {
let browser;
try {
browser = await firefox.launch({
headless: true,
proxy: {
server: 'http://gate.decodo.com:7000',
username: 'YOUR_PROXY_USERNAME',
password: 'YOUR_PROXY_PASSWORD'
}
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
viewport: { width: 1366, height: 768 }
});
const page = await context.newPage();
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media', 'font', 'manifest'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
// TARGET URL - Replace with the SoundCloud playlist you want to scrape
await page.goto('https://soundcloud.com/music-charts-uk/sets/hip-hop', {
waitUntil: 'domcontentloaded',
timeout: 45000
});
await delay(3000);
// Handle cookie consent
try {
await page.waitForSelector('#onetrust-accept-btn-handler', { timeout: 5000 });
await page.click('#onetrust-accept-btn-handler');
await delay(500);
} catch (e) {}
// Handle login modal
try {
await page.waitForSelector('.modal__closeButton', { timeout: 5000 });
await page.click('.modal__closeButton');
await delay(300);
} catch (e) {}
// Scroll to load all tracks
for (let i = 0; i < 5; i++) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await delay(3000);
}
const itemCount = await page.evaluate(() => {
const trackListContainer = document.querySelector('.trackList__list') ||
document.querySelector('ul') ||
document.querySelector('[class*="trackList"]');
return trackListContainer ? trackListContainer.querySelectorAll('li').length : 0;
});
console.log(`Found ${itemCount} playlist tracks`);
// Hover over track list items to reveal hidden track information
console.log('Revealing hidden tracks by hovering...');
const trackListItems = await page.$$('.trackList__list li');
for (let i = 0; i < trackListItems.length; i++) {
try {
await trackListItems[i].hover();
await delay(300);
} catch (e) {
// Continue on error
}
}
// Wait for content to load after hovering
await delay(1500);
// Extract playlist tracks
const tracks = await page.evaluate(() => {
const results = [];
// Find the main track list container
const trackListContainer = document.querySelector('.trackList__list') ||
document.querySelector('ul') ||
document.querySelector('[class*="trackList"]');
if (!trackListContainer) return [];
const trackItems = trackListContainer.querySelectorAll('li');
trackItems.forEach((trackItem) => {
// Get track rank/position
const rankEl = trackItem.querySelector('.trackItem__separator');
if (!rankEl) return;
const rank = rankEl.textContent.trim();
// Get artist and track title
const artistEl = trackItem.querySelector('.trackItem__username');
const titleEl = trackItem.querySelector('.trackItem__trackTitle');
if (!artistEl || !titleEl) return;
const artist = artistEl.textContent.trim();
const title = titleEl.textContent.trim();
const trackUrl = 'https://soundcloud.com' + titleEl.href.replace('https://soundcloud.com', '');
// Check if track is geo-blocked
const blockMsgEl = trackItem.querySelector('.trackItem__blockMsg');
if (blockMsgEl) {
results.push({
index: parseInt(rank),
artist,
title,
trackUrl,
status: `[${blockMsgEl.textContent.trim()}]`
});
return;
}
// Extract play count if available
const playCountEl = trackItem.querySelector('.trackItem__playCount');
let plays = 'N/A';
if (playCountEl) {
let playText = playCountEl.textContent.replace(/[\r\n\t]/g, ' ').replace(/\s+/g, ' ').trim();
const playMatch = playText.match(/(\d+(?:\.\d+)?[KMB])\s*$/);
plays = playMatch ? playMatch[1] : playText;
}
results.push({
index: parseInt(rank),
artist,
title,
trackUrl,
plays
});
});
// Sort tracks by rank to ensure proper order
return results.sort((a, b) => a.index - b.index);
});
console.log(`${tracks.length} playlist tracks:\n`);
tracks.forEach((track) => {
const playsDisplay = track.plays ? ` (${track.plays} plays)` : '';
const statusDisplay = track.status ? ` ${track.status}` : '';
console.log(`${track.index}. ${track.artist} - "${track.title}"${playsDisplay}${statusDisplay}`);
console.log(` ${track.trackUrl}\n`);
});
// Export to CSV
const csv = [
'Index,Artist,Title,URL,Plays,Status',
...tracks.map(t => {
const plays = t.plays || 'N/A';
const status = t.status || '';
return `${t.index},"${t.artist.replace(/"/g, '""')}","${t.title.replace(/"/g, '""')}",${t.trackUrl},${plays},"${status}"`;
})
].join('\n');
fs.writeFileSync('playlist-tracks.csv', csv);
console.log('Saved to playlist-tracks.csv');
return tracks;
} catch (error) {
console.error('Scraping failed:', error.message);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapePlaylistTracks();

On trending playlists, SoundCloud may hide parts of the tracklist or certain metadata when you’re not logged in. But there’s a workaround: when the script programmatically hovers over each track item, the page reveals track information that isn't visible by default.

When you run the script, it first scrolls through the playlist to ensure all tracks are loaded, then logs how many playlist items were found. Next, it hovers over each track in the list to surface hidden metadata. You can experiment with hover delays, scroll counts, or alternative interaction targets if you want to maximize how much data is exposed.

After that, the script extracts and prints a clean, numbered list containing the artist name, track title, play count (or “N/A” if unavailable), and the full SoundCloud URL. Tracks are sorted by their playlist position to preserve the original order, and all collected data is saved to a CSV file.

SoundCloud also applies geo-based restrictions to some tracks, so results can vary depending on the location of the proxy you're using. This isn't a bug. It can be useful if you want to compare regional availability or differences in popularity across markets.

Next steps toward AI training

With this scraped metadata and URL collection, you can proceed to acquire the actual audio files through legitimate channels, since most trending tracks lack direct download options on SoundCloud. Use the artist and track information to locate purchasable versions on platforms like Beatport, Bandcamp, or streaming services that offer high-quality downloads. Then:

  • Convert files to a consistent format
  • Extract features like spectrograms, MIDI, or raw waveforms
  • Organize your dataset using the scraped metadata

This helps your model learn from high-quality, relevant examples that reflect the musical style you’re aiming to replicate.
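
As a rough sketch of the conversion step, the snippet below batch-converts downloaded files to 44.1 kHz, 16-bit WAV. It assumes ffmpeg is installed and on your PATH, and that your purchased or downloaded files sit in a ./downloads folder (both folder names are just examples):

const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Example layout: raw downloads in ./downloads, converted training files in ./dataset
const inputDir = './downloads';
const outputDir = './dataset';
fs.mkdirSync(outputDir, { recursive: true });

for (const file of fs.readdirSync(inputDir)) {
  if (!/\.(mp3|wav|flac|m4a)$/i.test(file)) continue;
  const input = path.join(inputDir, file);
  const output = path.join(outputDir, path.parse(file).name + '.wav');
  // Normalize everything to 44.1 kHz, 16-bit stereo WAV so the dataset has one consistent format
  execFileSync('ffmpeg', ['-y', '-i', input, '-ar', '44100', '-ac', '2', '-sample_fmt', 's16', output]);
  console.log(`Converted ${file} -> ${path.basename(output)}`);
}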

2. Audio enhancement AI training

Audio enhancement models are trained to clean up degraded recordings, which involves removing noise, fixing low-quality encoding, and restoring clarity to compressed or damaged files. This use case is especially relevant for content creators, podcasters, and musicians who regularly deal with imperfect audio.

The typical training method is called synthetic degradation: you take clean audio, deliberately degrade it (e.g., by lowering bitrate or adding background noise), then train the model to recover the original. Over time, the model learns what "good" audio sounds like and how to fix the bad.

From podcast noise removal to restoring old recordings, audio enhancement has gone mainstream. Tools like Adobe’s Enhance Speech, Krisp, and Descript offer real-time cleanup for creators, while NVIDIA RTX Voice shows how far the tech can go. Platforms like Auphonic and Cleanvoice even offer fully automated audio cleanup, aimed at non-technical users.

What to scrape on SoundCloud for audio enhancement training

For this use case, you’ll need actual downloadable audio files, not just metadata. Focus on Creative Commons tracks – these are free to use and often include download buttons. In the SoundCloud search bar, try terms like: "creative commons," "CC BY," "free download," or "royalty free".

To refine your results, add time filters to surface recent uploads. Once on a search results page, check a few tracks to make sure the Download file button is available (in the "three dots" dropdown next to the track).

Scraping SoundCloud search results pages

Here’s a Node.js script that scrapes SoundCloud search results and filters for tracks with downloadable audio:

const { firefox } = require('playwright');
const fs = require('fs');
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function scrapeSearchResults() {
let browser;
try {
browser = await firefox.launch({
headless: true,
proxy: {
server: 'http://gate.decodo.com:7000',
username: 'YOUR_PROXY_USERNAME',
password: 'YOUR_PROXY_PASSWORD'
}
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
viewport: { width: 1366, height: 768 }
});
const page = await context.newPage();
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media', 'font', 'manifest'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
// TARGET URL - Replace with the SoundCloud search results page you want to scrape
await page.goto('https://soundcloud.com/search/sounds?q=creative%20commons&filter.created_at=last_year&filter.license=to_use_commercially', {
waitUntil: 'domcontentloaded',
timeout: 45000
});
await delay(3000);
// Handle cookie consent
try {
await page.waitForSelector('#onetrust-accept-btn-handler', { timeout: 5000 });
await page.click('#onetrust-accept-btn-handler');
await delay(500);
} catch (e) {}
// Handle login modal
try {
await page.waitForSelector('.modal__closeButton', { timeout: 5000 });
await page.click('.modal__closeButton');
await delay(300);
} catch (e) {}
// Wait for search results
await page.waitForSelector('.searchList__item', { timeout: 15000 });
// Infinite scrolling logic
console.log('Starting infinite scroll to load more results...');
const maxScrollAttempts = 5;
const targetResults = 500;
let previousItemCount = 0;
let scrollAttempts = 0;
let noNewContentCount = 0;
while (scrollAttempts < maxScrollAttempts) {
const currentItemCount = await page.evaluate(() =>
document.querySelectorAll('.searchList__item').length
);
console.log(`Scroll attempt ${scrollAttempts + 1}: Found ${currentItemCount} items`);
if (currentItemCount >= targetResults) {
console.log(`Reached target of ${targetResults} results. Stopping scroll.`);
break;
}
if (currentItemCount > previousItemCount) {
noNewContentCount = 0;
console.log(`New content loaded: +${currentItemCount - previousItemCount} items`);
} else {
noNewContentCount++;
console.log(`No new content loaded (attempt ${noNewContentCount})`);
if (noNewContentCount >= 3) {
console.log('No new content loaded after 3 attempts. Reached end of results.');
break;
}
}
previousItemCount = currentItemCount;
await page.evaluate(() => {
window.scrollTo(0, document.body.scrollHeight);
});
await delay(2000);
await page.evaluate(() => {
window.scrollBy(0, window.innerHeight);
});
await delay(1500);
try {
const loadMoreButton = await page.$('.searchLoadMore button, .loadMore button, button[data-testid="load-more"]');
if (loadMoreButton) {
console.log('Found "Load more" button, clicking...');
await loadMoreButton.click();
await delay(3000);
}
} catch (e) {}
scrollAttempts++;
}
const finalItemCount = await page.evaluate(() =>
document.querySelectorAll('.searchList__item').length
);
// Extract downloadable tracks
const tracks = await page.evaluate(async () => {
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
const results = [];
const searchItems = document.querySelectorAll('.searchList__item');
const maxResults = searchItems.length;
console.log(`Processing ${Math.min(maxResults, searchItems.length)} items for download availability...`);
for (let i = 0; i < Math.min(maxResults, searchItems.length); i++) {
const item = searchItems[i];
if ((i + 1) % 10 === 0) {
console.log(`Processed ${i + 1}/${Math.min(maxResults, searchItems.length)} items...`);
}
item.scrollIntoView({ block: 'center' });
await delay(100);
const artistEl = item.querySelector('.soundTitle__usernameText') ||
item.querySelector('.soundTitle .usernameText') ||
item.querySelector('.sc-text-body') ||
item.querySelector('a[href*="/"]') ||
item.querySelector('[class*="username"]');
const artist = artistEl ? artistEl.textContent.trim() : 'Unknown Artist';
let title = 'Unknown Title';
let trackUrl = '';
const titleSelectors = [
'.soundTitle__title > .sc-link-dark',
'.soundTitle__title span:not(.soundTitle__usernameText)',
'h3 a.sc-link-dark',
'h3 a',
'.soundTitle__title',
'[class*="title"] a',
'a[href*="/"]',
'.sc-link-dark'
];
for (const selector of titleSelectors) {
const titleEl = item.querySelector(selector);
if (titleEl && titleEl.textContent.trim()) {
title = titleEl.textContent.trim();
break;
}
}
const linkSelectors = [
'.soundTitle__title a[href]',
'h3 a[href]',
'a.sc-link-dark[href]',
'a[href*="/"]',
'[class*="title"] a[href]',
'a[href]'
];
for (const selector of linkSelectors) {
const linkEl = item.querySelector(selector);
if (linkEl) {
const href = linkEl.getAttribute('href') || '';
if (href) {
trackUrl = href.startsWith('http') ? href : `https://soundcloud.com${href}`;
break;
}
}
}
let hasDownload = false;
const moreSelectors = [
'button[aria-label="More"]',
'.soundActions .sc-button-more',
'.sc-button-more',
'button[title="More"]',
'button[class*="more"]',
'[data-testid="more-button"]',
'button[aria-label*="more"]',
'.moreButton'
];
let moreBtn = null;
for (const selector of moreSelectors) {
moreBtn = item.querySelector(selector);
if (moreBtn) break;
}
if (moreBtn) {
try {
moreBtn.scrollIntoView({ block: 'center' });
await delay(80);
moreBtn.click();
let menuFound = false;
for (let attempts = 0; attempts < 20; attempts++) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.some(m => m.querySelector('.sc-button-label'))) {
menuFound = true;
break;
}
await delay(100);
}
if (menuFound) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.length) {
const btnRect = moreBtn.getBoundingClientRect();
let bestMenu = null;
let bestDist = Infinity;
for (const menu of menus) {
const menuRect = menu.getBoundingClientRect();
const dx = Math.abs((menuRect.left + menuRect.width / 2) - (btnRect.left + btnRect.width / 2));
const dy = Math.abs((menuRect.top + menuRect.height / 2) - (btnRect.top + btnRect.height / 2));
const distance = Math.sqrt(dx * dx + dy * dy);
if (distance < bestDist) {
bestDist = distance;
bestMenu = menu;
}
}
if (bestMenu) {
const downloadBtn = bestMenu.querySelector('.sc-button-download, button[aria-label*="Download"]');
hasDownload = !!downloadBtn;
}
}
document.dispatchEvent(new KeyboardEvent('keydown', { key: 'Escape' }));
await delay(50);
}
} catch (e) {
console.log(`Error checking download for item ${i + 1}:`, e.message);
}
}
if (hasDownload && title !== 'Unknown Title' && trackUrl) {
results.push({
index: results.length + 1,
artist,
title,
trackUrl,
hasDownload: true
});
}
await delay(150);
}
return results;
});
console.log(`\nTotal items found: ${finalItemCount}`);
console.log(`\nFound ${tracks.length} downloadable tracks:\n`);
tracks.forEach((track) => {
console.log(`${track.index}. ${track.artist} - "${track.title}"`);
console.log(` ${track.trackUrl}\n`);
});
// Export to CSV
const csv = [
'Index,Artist,Title,URL',
...tracks.map(t => `${t.index},"${t.artist.replace(/"/g, '""')}","${t.title.replace(/"/g, '""')}",${t.trackUrl}`)
].join('\n');
fs.writeFileSync('search-results.csv', csv);
console.log('Saved to search-results.csv');
return tracks;
} catch (error) {
console.error('Scraping failed:', error.message);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapeSearchResults();

When you run the search results script, it automatically scrolls through the page to load more results. In the terminal, you'll see progress messages like "Scroll attempt 1: Found 11 items" and updates when new content is loaded. This continues until the script reaches the configured scroll limit, hits the target number of results, or detects that no new items are appearing.

Once scrolling is complete, the script goes through each loaded track and checks whether the Download file option is available. This step can take some time, depending on how many items were found and processed.

After the check finishes, the script prints a summary showing the total number of search results and how many tracks have download buttons. The final output is a clean, numbered list with artist names, track titles, and full SoundCloud URLs. All of this data is then saved to a CSV file on your computer.

Creating your training dataset

After scraping, manually download the audio files using SoundCloud's native Download file buttons. These clean tracks will serve as your "ground truth." Next, apply synthetic degradation techniques like:

  • Bitrate reduction (e.g., 320kbps → 64kbps)
  • Sample rate reduction (e.g., 44.1kHz → 22kHz)
  • Noise injection
  • Artifact simulation (e.g., compression glitches)

Pair each degraded version with its original, and label them by type and severity. This gives your model a wide range of examples to learn from.
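
Here's one hedged way to script that degradation in Node.js, again assuming ffmpeg is available and your clean WAV files live in a ./dataset folder. The settings shown (64 kbps MP3, 22.05 kHz downsampling) are only examples to adapt to your use case:

const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

const cleanDir = './dataset';      // ground-truth WAV files
const degradedDir = './degraded';  // synthetic low-quality versions
fs.mkdirSync(degradedDir, { recursive: true });

for (const file of fs.readdirSync(cleanDir)) {
  if (!file.toLowerCase().endsWith('.wav')) continue;
  const input = path.join(cleanDir, file);
  const base = path.parse(file).name;
  // A 64 kbps MP3 round trip simulates heavy compression artifacts
  execFileSync('ffmpeg', ['-y', '-i', input, '-b:a', '64k', path.join(degradedDir, `${base}_64k.mp3`)]);
  // Downsampling to 22.05 kHz simulates low-quality source recordings
  execFileSync('ffmpeg', ['-y', '-i', input, '-ar', '22050', path.join(degradedDir, `${base}_22k.wav`)]);
}
console.log('Degraded copies written - keep the clean originals as the training targets');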

By using a diverse set of Creative Commons tracks, your model will be exposed to different genres, vocal styles, and production levels, helping it generalize better to real-world audio issues.

3. Speech AI training

Speech AI models are built to understand, process, and generate human speech across different accents, languages, styles, and recording conditions. With voice interfaces now everywhere, from smartphones to customer service bots, there's growing demand for models that can handle natural, messy, real-world speech, not just clean, textbook samples.

Tools like Whisper (OpenAI) have set new standards for multilingual speech-to-text, while services like ElevenLabs offer voice cloning used by creators, streamers, and studios alike. Real-time applications such as Otter.ai, Rev, Alexa, and Google Assistant all rely on speech AI trained on diverse, representative voice data. The same goes for language learning apps, accessibility tools, and smart customer support systems.

What to scrape on SoundCloud for speech training

To train effective models, you’ll need high-quality, long-form spoken content featuring real people in real conversations. Some of the best sources on SoundCloud include:

  • Podcasts and educational accounts. Think universities, media outlets, think tanks, or institutions uploading lectures, interviews, and panel talks.
  • Interview formats. Interviews offer multiple speakers, natural conversation flow, and a range of tones and accents in one recording.
  • Language learning channels. These often feature accented English and multilingual content, which is useful for training models on varied speech patterns.
  • Audiobook or documentary creators. Their uploads usually provide clean, consistent solo voice recordings, ideal for voice modeling tasks.
  • Long-form uploads. Look for hour-long episodes or sessions. These give you natural pauses, rhythm changes, and unscripted speech patterns.

Scraping SoundCloud profile pages

Here’s a Node.js script for scraping tracks from a specific SoundCloud profile, limited to those with enabled downloads:

const { firefox } = require('playwright');
const fs = require('fs');
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
async function scrapeProfileTracks() {
let browser;
try {
browser = await firefox.launch({
headless: false,
proxy: {
server: 'http://gate.decodo.com:7000',
username: 'YOUR_PROXY_USERNAME',
password: 'YOUR_PROXY_PASSWORD'
}
});
const context = await browser.newContext({
userAgent: 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0',
viewport: { width: 1366, height: 768 }
});
const page = await context.newPage();
await page.route('**/*', (route) => {
const type = route.request().resourceType();
if (['image', 'media', 'font', 'manifest'].includes(type)) {
route.abort();
} else {
route.continue();
}
});
// TARGET URL - Replace with the Soundcloud profile's tracks page you want to scrape
await page.goto('https://soundcloud.com/institute-of-ideas/tracks', {
waitUntil: 'domcontentloaded',
timeout: 45000
});
await delay(3000);
// Handle cookie consent
try {
await page.waitForSelector('#onetrust-accept-btn-handler', { timeout: 5000 });
await page.click('#onetrust-accept-btn-handler');
await delay(500);
} catch (e) {}
// Handle login modal
try {
await page.waitForSelector('.modal__closeButton', { timeout: 5000 });
await page.click('.modal__closeButton');
await delay(300);
} catch (e) {}
// Wait for track list to load
await page.waitForSelector('.soundList__item, .trackItem, [class*="sound"]', { timeout: 15000 });
const maxTracksToScrape = 50;
const maxScrollAttempts = 20;
// Infinite scroll to load tracks
console.log('Loading tracks through infinite scroll...');
let previousCount = 0;
let scrollAttempts = 0;
while (scrollAttempts < maxScrollAttempts) {
await page.evaluate(() => window.scrollTo(0, document.body.scrollHeight));
await delay(1500);
const currentCount = await page.evaluate(() => {
const selectors = [
'.soundList__item',
'.trackItem',
'article[class*="sound"]',
'li[class*="sound"]'
];
for (const selector of selectors) {
const elements = document.querySelectorAll(selector);
if (elements.length > 0) return elements.length;
}
return 0;
});
console.log(`Loaded ${currentCount} tracks (scroll attempt ${scrollAttempts + 1})`);
if (currentCount >= maxTracksToScrape || currentCount === previousCount) {
break;
}
previousCount = currentCount;
scrollAttempts++;
}
const finalCount = await page.evaluate(() => {
const selectors = [
'.soundList__item',
'.trackItem',
'article[class*="sound"]',
'li[class*="sound"]'
];
for (const selector of selectors) {
const elements = document.querySelectorAll(selector);
if (elements.length > 0) return elements.length;
}
return 0;
});
console.log(`Found ${finalCount} total tracks`);
// Extract downloadable tracks
const tracks = await page.evaluate(async (maxTracks) => {
const delay = (ms) => new Promise(resolve => setTimeout(resolve, ms));
const results = [];
let trackItems = [];
const selectors = [
'.soundList__item',
'.trackItem',
'article[class*="sound"]',
'li[class*="sound"]'
];
for (const selector of selectors) {
trackItems = document.querySelectorAll(selector);
if (trackItems.length > 0) break;
}
const itemsToProcess = Math.min(maxTracks, trackItems.length);
for (let i = 0; i < itemsToProcess; i++) {
const item = trackItems[i];
item.scrollIntoView({ block: 'center' });
await delay(160);
let artist = 'Unknown Artist';
const artistSelectors = [
'.soundTitle__usernameText',
'.trackItem__username',
'.soundTitle .usernameText',
'a[href*="/"] .username'
];
for (const selector of artistSelectors) {
const artistEl = item.querySelector(selector);
if (artistEl && artistEl.textContent.trim()) {
artist = artistEl.textContent.trim();
break;
}
}
let title = 'Unknown Title';
let trackUrl = '';
const titleSelectors = [
'.soundTitle__title > .sc-link-dark',
'.trackItem__trackTitle',
'.soundTitle__title span:not(.soundTitle__usernameText)',
'h3 a.sc-link-dark',
'h3 a'
];
for (const selector of titleSelectors) {
const titleEl = item.querySelector(selector);
if (titleEl && titleEl.textContent.trim()) {
title = titleEl.textContent.trim();
break;
}
}
const linkSelectors = [
'.soundTitle__title a[href]',
'.trackItem__trackTitle[href]',
'h3 a[href]',
'a.sc-link-dark[href]'
];
for (const selector of linkSelectors) {
const linkEl = item.querySelector(selector);
if (linkEl) {
const href = linkEl.getAttribute('href') || '';
if (href) {
trackUrl = href.startsWith('http') ? href : `https://soundcloud.com${href}`;
break;
}
}
}
let hasDownload = false;
const moreSelectors = [
'button[aria-label="More"]',
'.soundActions .sc-button-more',
'.sc-button-more',
'button[title="More"]'
];
let moreBtn = null;
for (const selector of moreSelectors) {
moreBtn = item.querySelector(selector);
if (moreBtn) break;
}
if (moreBtn) {
try {
moreBtn.scrollIntoView({ block: 'center' });
await delay(80);
moreBtn.click();
let menuFound = false;
for (let attempts = 0; attempts < 20; attempts++) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.some(m => m.querySelector('.sc-button-label'))) {
menuFound = true;
break;
}
await delay(100);
}
if (menuFound) {
const menus = Array.from(document.querySelectorAll('.moreActions'));
if (menus.length) {
const btnRect = moreBtn.getBoundingClientRect();
let bestMenu = null;
let bestDist = Infinity;
for (const menu of menus) {
const menuRect = menu.getBoundingClientRect();
const dx = Math.abs((menuRect.left + menuRect.width / 2) - (btnRect.left + btnRect.width / 2));
const dy = Math.abs((menuRect.top + menuRect.height / 2) - (btnRect.top + btnRect.height / 2));
const distance = Math.sqrt(dx * dx + dy * dy);
if (distance < bestDist) {
bestDist = distance;
bestMenu = menu;
}
}
if (bestMenu) {
const downloadBtn = bestMenu.querySelector('.sc-button-download, button[aria-label*="Download"]');
hasDownload = !!downloadBtn;
}
}
document.dispatchEvent(new KeyboardEvent('keydown', { key: 'Escape' }));
}
} catch (e) {}
}
if (hasDownload && title !== 'Unknown Title' && trackUrl) {
results.push({
index: results.length + 1,
artist,
title,
trackUrl
});
}
await delay(200);
}
return results;
}, maxTracksToScrape);
console.log(`${tracks.length} downloadable tracks:\n`);
tracks.forEach((track) => {
console.log(`${track.index}. ${track.artist} - "${track.title}"`);
console.log(` ${track.trackUrl}\n`);
});
// Export to CSV
const csv = [
'Index,Artist,Title,URL',
...tracks.map(t => `${t.index},"${t.artist.replace(/"/g, '""')}","${t.title.replace(/"/g, '""')}",${t.trackUrl}`)
].join('\n');
fs.writeFileSync('profile-tracks.csv', csv);
console.log('Saved to profile-tracks.csv');
return tracks;
} catch (error) {
console.error('Scraping failed:', error.message);
} finally {
if (browser) {
await browser.close();
}
}
}
scrapeProfileTracks();

When you run the profile scraping script, the terminal shows the scrolling progress as tracks are loaded from the page. You’ll see updates like "Loaded 20 tracks (scroll attempt 1)" as the script continues until it reaches your configured limit or no new content appears.

Once scrolling is complete, the script reports how many total tracks were found and then checks each one for download availability. After that, it outputs a numbered list containing the creator name, episode or lecture title, and the full SoundCloud URL for each downloadable track. All of this data is also saved to a CSV file.

Preparing the training dataset

After downloading the audio files, you can segment the speech by speaker using speaker diarization tools (or manually, for higher precision). Then:

  • Convert to a consistent format and sample rate
  • Generate transcripts using Whisper or similar tools (see the sketch after this list)
  • Tag by accent, gender, speaking style, or context (formal vs. conversational)
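
For the transcription step, here's a minimal sketch that shells out to OpenAI's open-source Whisper CLI (this assumes you've installed it separately, for example with pip install openai-whisper, and that your audio sits in a ./speech-dataset folder):

const { execFileSync } = require('child_process');
const fs = require('fs');
const path = require('path');

// Assumes the open-source Whisper CLI is installed (e.g., pip install openai-whisper)
const audioDir = './speech-dataset';
const outDir = './transcripts';
fs.mkdirSync(outDir, { recursive: true });

for (const file of fs.readdirSync(audioDir)) {
  if (!/\.(wav|mp3|m4a)$/i.test(file)) continue;
  // Writes a <name>.txt transcript into ./transcripts; larger models are slower but more accurate
  execFileSync('whisper', [
    path.join(audioDir, file),
    '--model', 'small',
    '--output_format', 'txt',
    '--output_dir', outDir
  ], { stdio: 'inherit' });
}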

Podcasts are especially valuable because they contain natural speech: pauses, overlaps, informal language, and spontaneity. Educational content, on the other hand, provides clear articulation, making it useful for pronunciation learning or voice cloning models.

By combining both types, you’ll get a well-rounded dataset that helps your speech AI handle a wide range of human voices, just like it needs to in the real world.

Best practices for web scraping with Node.js

When scraping SoundCloud data, follow these established best practices to ensure more sustainable data collection:

  • Launch reliably. Set realistic User-Agent and Accept-Language headers, a believable viewport, and generous timeouts. Prefer one browser with multiple pages over many browsers.
  • Keep pages light. Intercept requests and abort images, media, fonts, beacons, and manifests to reduce noise and flakiness.
  • Navigate with retries. Wrap navigation in a short retry with backoff (see the sketch after this list), wait for result cards instead of fixed delays, then scroll until enough items load.
  • Tame popups. If applicable, accept the cookie banner and close the login modal once per context before scraping.
  • Use resilient selectors. Combine a few candidate selectors for result cards and avoid brittle single-class matches.
  • Pace your actions. Add small random delays, limit concurrent navigations, and slow down when errors spike.
  • Recover gracefully. Catch per-item failures, return partial data, and fall back to the track page if the menu path fails.
  • Rotate signals when needed. Vary user agents and use quality proxies or sticky sessions if you start seeing many 403 responses.
  • Save progress. Append results as you go so restarts resume cleanly, and log only the essentials to guide tuning.
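
As an example of the navigate-with-retries point, here's a small helper you could drop into any of the scripts in this guide:

// Retry page.goto() a few times with exponential backoff and jitter before giving up
async function gotoWithRetry(page, url, attempts = 3) {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded', timeout: 45000 });
      return;
    } catch (error) {
      if (attempt === attempts) throw error;
      const backoff = 2000 * attempt + Math.random() * 1000; // exponential backoff with jitter
      console.log(`Navigation failed (${error.message}), retrying in ${Math.round(backoff)} ms...`);
      await new Promise(resolve => setTimeout(resolve, backoff));
    }
  }
}

// Usage: replace a direct page.goto(...) call with
// await gotoWithRetry(page, 'https://soundcloud.com/...');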

Recent VPN blocks on SoundCloud

In mid-December 2025, SoundCloud confirmed a security incident that exposed private data for roughly a fifth of users, followed by brief denial-of-service disruptions. As part of the response, many VPN and cloud IP ranges began returning 403 errors for a period, which multiple outlets and user reports attributed to tightened filtering during mitigation.

Access largely returned, but similar filters can be reapplied during future incidents. Proxies give finer control over the exit IP and session compared with VPNs. A VPN routes all traffic through one tunnel and one exit IP. A proxy routes only your scraper and lets you choose IPs, rotate or keep them sticky, and target cities or ASNs. Proxies can still be hit by broad filters, but high-quality pools usually give you more ways to adapt. Here's why proxies have the upper hand:

  • IP reputation and mix. Residential or ISP proxies typically resemble everyday users more than many VPN exits.
  • Rotation control. Rotate or keep sticky sessions per browser context to balance stability and freshness.
  • Targeted routing. City or ASN targeting helps sidestep ranges under heavier scrutiny.

Here's how to future-proof your scraping workflows:

  • Access check. Request a known public SoundCloud track page (for example, a popular artist’s single) first without a proxy, then with a residential or ISP proxy. For each path, send 3-5 GET requests, record the HTTP status and median TTFB (time to first byte), and pick the path that yields consistent 200s with lower latency and fewer 403s or timeouts.
  • Session handling. Use sticky proxy sessions per browser context for multi-step actions like opening the three-dot menu, then rotate to a fresh session between items. Reset a session after N failures or M minutes of use.
  • IP rotation. Cycle through diverse residential or ISP subnets. If 403s exceed a small threshold (for example, 3 in the last 20 requests), switch to a different subnet or provider pool automatically (see the sketch after this list).
  • Targeting. Prefer consumer ASNs and the cities you actually need. Deprioritize cloud ASNs or subnets that recently produced elevated 403s.
  • Pacing. Add exponential backoff with jitter after bursts of 403s or timeouts, and cap concurrent navigations per host.
  • Metrics and logs. Record exit IP, ASN, city, status codes, latency, and a simple selector check such as the presence of .moreActions. Use these signals to trigger rotation, session resets, or provider changes.
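
The 403-threshold idea from the IP rotation point can be sketched as a simple sliding window over recent response codes. How you create a fresh proxy session depends on your provider's setup, so the rotation itself is left as a hook here:

// Keep a sliding window of recent HTTP statuses and flag when 403s pile up
const recentStatuses = [];
const WINDOW_SIZE = 20;
const MAX_403S = 3;

function shouldRotate(status) {
  recentStatuses.push(status);
  if (recentStatuses.length > WINDOW_SIZE) recentStatuses.shift();
  const blocked = recentStatuses.filter(s => s === 403).length;
  return blocked >= MAX_403S; // true means: switch to a fresh proxy session
}

// Example wiring inside a Playwright script:
// page.on('response', (response) => {
//   if (shouldRotate(response.status())) {
//     console.log('Too many 403s - rotate the proxy session before continuing');
//   }
// });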

Final thoughts

SoundCloud offers a huge pool of diverse audio that can power great AI work. With the basics in place, you can build datasets for music generation, audio enhancement, and voice or speech training. That's only a starting point – the same approach can support many other creative ideas.

Results come from the setup you choose and the care you put in. Playwright handles dynamic pages, proxies keep sessions stable at scale, and smart targeting keeps the data relevant. Log your runs, watch for interface changes, and refine as you go. Most of all, remember that data quality often matters more than quantity. A curated dataset of high-quality, relevant audio samples will typically produce better AI model performance than a larger collection of random tracks.


Get high-quality residential IPs

Choose a residential proxy plan from Decodo and enjoy industry-leading performance and reliability.

About the author

Dominykas Niaura

Technical Copywriter

Dominykas brings a unique blend of philosophical insight and technical expertise to his writing. He started his career as a film critic and music industry copywriter, and he's now an expert in making complex proxy and web scraping concepts accessible to everyone.


Connect with Dominykas via LinkedIn

All information on Decodo Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Decodo Blog or any third-party websites that may be linked therein.

Frequently asked questions

What makes SoundCloud valuable for AI training?

SoundCloud's diverse content ecosystem provides access to a wide range of audio styles, production qualities, and genres that aren't available on traditional streaming platforms. This diversity is crucial for training robust AI models that can handle real-world audio variety.

How do I find tracks with download permissions?

On SoundCloud, search for Creative Commons licensed content, remix stems, or tracks explicitly marked as downloadable. Use search filters and specific keywords like "creative commons," "stems available," or "free download" to identify suitable content.

What audio quality should I target for AI training?

For most AI applications, you want the highest quality available. Look for tracks uploaded at 320kbps or higher when possible. For audio enhancement training specifically, you might intentionally collect lower-quality samples to use as training inputs.

How much data do I need for effective AI training?

The amount depends on your specific use case. Music generation models typically need thousands of tracks, while voice training might require hundreds of hours of diverse speech samples. Start with smaller datasets to validate your approach before scaling up.

Can I scrape private or premium content?

Focus only on SoundCloud's publicly available content. Accessing private tracks and premium content would require you to log into your account, but scraping non-publicly accessible content is not advised due to legal concerns. Private or premium content may also have additional usage restrictions that make them unsuitable for AI training datasets.
