Date: 2025-04-15

Scraping dynamic content

Many modern websites use JavaScript to load content dynamically, meaning traditional scraping methods may not get all the data. To handle this, you need a way to render the page like a real browser before extracting the content.
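To make the problem concrete, here is a minimal sketch of what a plain HTML parser sees on a page that fills in its content with a script. The inline page and the ".quote-list" class are invented for illustration, and html_elements is the current rvest name for the older html_nodes.

```r
library(rvest)

# Hypothetical page: the quote only appears after the script runs in a browser
html <- '<html><body>
<div class="quote-list"></div>
<script>
  var el = document.createElement("span");
  el.className = "text";
  el.textContent = "Loaded by JavaScript";
  document.querySelector(".quote-list").appendChild(el);
</script>
</body></html>'

# read_html parses the markup but never executes the script
page <- read_html(html)
found <- page %>%
  html_elements(".quote-list .text") %>%
  html_text(trim = TRUE)

# No matching nodes exist in the unrendered HTML
print(length(found))
```

The selector matches nothing because the span is created only when a browser runs the script.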

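R has several packages for this kind of work. The tidyverse collection installs rvest, which downloads and parses HTML. Keep in mind that rvest's read_html parses the HTML the server returns and does not execute JavaScript; the walkthrough below works because the practice site delivers its quotes in the initial HTML. For pages that only fill in content inside the browser, you'd need a headless-browser tool such as RSelenium or, in recent rvest versions, read_html_live. Here's how to build the script: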

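1. Install and load the packages. Install tidyverse, which also installs rvest, by running the following commands in the RStudio console. Then load both, since library(tidyverse) does not attach rvest on its own:

install.packages("tidyverse")  # also installs rvest
library(tidyverse); library(rvest)  # pipe and bind_rows, plus the HTML functions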

2. Define the target website. In this scenario, use the "Quotes to Scrape" site – another sandbox for practicing web scraping.

base_url <- "https://quotes.toscrape.com"

3. Create a scraping function. Define a new function, scrape_page, that will hold the logic of how to scrape a page.

scrape_page <- function(base_url) {

4. Read the HTML contents. Get the HTML contents of the URL.

  page <- read_html(base_url)

5. Extract quotes and authors. Select and extract the quotes using CSS selectors. Like before, we find the required HTML nodes by inspecting the page via developer tools. The same process is repeated for quote authors.

  quotes <- page %>% html_nodes(".quote .text") %>% html_text(trim = TRUE)
  authors <- page %>% html_nodes(".quote .author") %>% html_text(trim = TRUE)
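The selector logic can be tried offline before pointing it at a live site. The sketch below runs the same two selectors against a small inline HTML sample; the markup is a simplified assumption about the page structure, and html_elements is the current rvest name for html_nodes.

```r
library(rvest)

# Simplified, made-up markup that mimics two quote blocks
sample_html <- '
<div class="quote">
  <span class="text">  First sample quote.  </span>
  <small class="author">Author One</small>
</div>
<div class="quote">
  <span class="text">Second sample quote.</span>
  <small class="author">Author Two</small>
</div>'

page <- read_html(sample_html)

# One entry per matching node, with surrounding whitespace trimmed
quotes <- page %>% html_elements(".quote .text") %>% html_text(trim = TRUE)
authors <- page %>% html_elements(".quote .author") %>% html_text(trim = TRUE)

# Both vectors should line up one-to-one before they go into a data frame
stopifnot(length(quotes) == length(authors))
print(quotes)
print(authors)
```

Checking that both vectors have the same length is a cheap way to catch a selector that silently misses some nodes.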

6. Extract tags. Repeat the process for tags. However, since each quote can have multiple tags, we use the gsub function to remove the "Tags:" label and to replace the whitespace between tags with commas. As a result, we'll get all tags from the same quote in one table row.

  tags_per_quote <- page %>%
    html_nodes(".quote .tags") %>%
    html_text(trim = TRUE) %>%
    gsub("^Tags:\\s*", "", .) %>%  # drop the leading label
    gsub("\\s+", ", ", .)          # commas between tags
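An alternative way to get one tag string per quote is to read each tag link separately and join the results. This is only a sketch: the inline markup and the ".tag" class are assumptions about the page structure, so check them in developer tools first.

```r
library(rvest)

# Made-up markup: two quotes, the first with two tags
sample_html <- '
<div class="quote">
  <div class="tags">Tags:
    <a class="tag">life</a>
    <a class="tag">love</a>
  </div>
</div>
<div class="quote">
  <div class="tags">Tags:
    <a class="tag">humor</a>
  </div>
</div>'

page <- read_html(sample_html)

# For every quote block, collect its tag links and join them with commas
tags_per_quote <- page %>%
  html_elements(".quote") %>%
  vapply(function(quote_node) {
    quote_node %>%
      html_elements(".tag") %>%
      html_text(trim = TRUE) %>%
      paste(collapse = ", ")
  }, character(1))

print(tags_per_quote)
```

Because the tags are read node by node, this approach does not depend on how whitespace is laid out inside the tag block.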

7. Create a data frame. Combine the three vectors into a data frame with one row per quote. Finally, the function ends by returning the data frame as the result.

  all_quotes <- data.frame(Quote = quotes, Author = authors, Tags = tags_per_quote)
  return(all_quotes)
}

8. Implement pagination. Since many sites split their content across several pages, add code to go through multiple pages and extract the data from each. The next_page object starts the pagination process at the first page, and page_count is a counter that keeps track of the page number.

next_page <- "/"
page_count <- 1

9. Initialize the data frame. It has the same columns as the one in step 7, but it lives outside the function, starts empty, and declares every column as character type.

all_quotes <- data.frame(
  Quote = character(),
  Author = character(),
  Tags = character(),
  stringsAsFactors = FALSE
)

10. Set a loop. Loop through pages as long as there's a next page and the page count is less than or equal to 10. You can scrape more or fewer pages by adjusting the number. Note that html_attr returns NA, not NULL, when the "Next" link is missing, so the condition checks for NA.

while (!is.na(next_page) && page_count <= 10) {

11. Combine the base URL with the relative path. To create a full URL, you'll need to combine the base URL with the path that defines which page should be shown. Currently, it will be the first page, as next_page has the "/" value. It’ll change with step 13.

  current_url <- paste0(base_url, next_page)
  message("Scraping page ", page_count, ": ", current_url)
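To see how the relative path turns into a full URL, here is a small base-R sketch. The "/page/2/" path is an assumed example of what a "Next" link might contain.

```r
base_url <- "https://quotes.toscrape.com"

# First iteration: next_page starts as "/"
first_url <- paste0(base_url, "/")

# Later iteration: assumed href taken from the "Next" link
second_url <- paste0(base_url, "/page/2/")

print(first_url)
print(second_url)
```

paste0 joins its arguments without any separator, so the base URL should not end with a slash if the relative paths already start with one.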

12. Repeat the scraping function. Call the previously established scrape_page function to extract data from the current page. Then, append the newly scraped data to the main data frame.

  page_data <- scrape_page(current_url)
  all_quotes <- bind_rows(all_quotes, page_data)
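The accumulation pattern can be checked without any network access. In this sketch, page_1 and page_2 are made-up stand-ins for what scrape_page might return on two consecutive pages.

```r
library(dplyr)

# Empty, typed data frame that collects all results
all_quotes <- data.frame(
  Quote = character(),
  Author = character(),
  Tags = character(),
  stringsAsFactors = FALSE
)

# Hypothetical per-page results
page_1 <- data.frame(
  Quote = c("q1", "q2"), Author = c("a1", "a2"), Tags = c("t1", "t2"),
  stringsAsFactors = FALSE
)
page_2 <- data.frame(
  Quote = "q3", Author = "a3", Tags = "t3",
  stringsAsFactors = FALSE
)

# Append each page's rows to the running result
for (page_data in list(page_1, page_2)) {
  all_quotes <- bind_rows(all_quotes, page_data)
}

print(nrow(all_quotes))
```

bind_rows matches columns by name, which is why the empty data frame uses the same column names as the one returned by the function.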

13. Find the next page link. The next page link can be found by targeting the HTML node associated with the "Next" button and getting the href value. On the last page there is no such link, so html_attr returns NA. The page counter is also incremented, and the loop ends here.

  next_page <- read_html(current_url) %>%
    html_node(".pager .next a") %>%
    html_attr("href")
  page_count <- page_count + 1
}
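The stopping behavior of the loop can be sketched offline. Here, fake_site is a hypothetical stand-in for the website that maps relative paths to HTML snippets, and the pager markup is an assumption; html_element is the current rvest name for html_node.

```r
library(rvest)

# Hypothetical two-page site: only the first page has a "Next" link
fake_site <- list(
  "/" = '<ul class="pager"><li class="next"><a href="/page/2/">Next</a></li></ul>',
  "/page/2/" = '<ul class="pager"><li class="previous"><a href="/">Previous</a></li></ul>'
)

next_page <- "/"
visited <- character()

# Stop when there is no next link (NA) or after 10 pages
while (!is.na(next_page) && length(visited) < 10) {
  visited <- c(visited, next_page)
  page <- read_html(fake_site[[next_page]])
  # Returns NA when the selector matches nothing
  next_page <- page %>%
    html_element(".pager .next a") %>%
    html_attr("href")
}

print(visited)
print(next_page)
```

Testing for NA rather than NULL matters whenever the page limit is higher than the number of pages the site actually has.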

14. Print and save the results. As the last step, print the results in the console and save them to a CSV file.

print(all_quotes)
write.csv(all_quotes, "quotes.csv", row.names = FALSE)
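As a quick sanity check on the export, the sketch below writes a small made-up data frame to a temporary file and reads it back. It uses only base R, and the sample rows are invented for illustration.

```r
# Made-up results with comma-separated tags in a single cell
all_quotes <- data.frame(
  Quote = c("A sample quote, with a comma.", "Another sample quote."),
  Author = c("Author One", "Author Two"),
  Tags = c("life, love", "humor"),
  stringsAsFactors = FALSE
)

# Write to a temporary file instead of the working directory
out_file <- tempfile(fileext = ".csv")
write.csv(all_quotes, out_file, row.names = FALSE)

# Read it back and confirm that the comma-separated cells survived
restored <- read.csv(out_file, stringsAsFactors = FALSE)
print(restored)
```

write.csv quotes character fields by default, so commas inside a quote or a tag list stay within their own cell.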

The full script: