Have you ever wanted to use Python to extract a nice, clean list of URLs from a website's sitemap? In this guide, I'll explain a quick and easy way to do it using Python and Google Colab.
If you just want to test out the code, you can make a copy of this Google Colab file and try it out for yourself.
For this task, we will use the following libraries:
Beautiful Soup 4 (BS4) is a Python library used for web scraping and parsing HTML or XML documents, which we use here to extract URLs from the robots.txt file and sitemaps. To learn more about Beautiful Soup 4, you can visit the official website here.
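As a quick illustration of the parsing step (separate from the main script), here's Beautiful Soup pulling the <loc> values out of a tiny, made-up sitemap fragment. Note that the 'xml' parser relies on the lxml package, which is typically already available in Google Colab:

from bs4 import BeautifulSoup

# A tiny, made-up sitemap fragment, just to demonstrate parsing
sample_xml = """
<urlset>
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>
"""

soup = BeautifulSoup(sample_xml, 'xml')
print([loc.text for loc in soup.find_all('loc')])
# ['https://example.com/page-1', 'https://example.com/page-2']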
The Requests library is a powerful Python module used for making HTTP requests, and it serves a crucial role in this code by fetching the robots.txt file and sitemaps from the provided URLs. To learn more about the Requests library, you can visit the official documentation here.
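For instance, a minimal Requests call looks like this (the URL is just an illustrative placeholder):

import requests

response = requests.get('https://example.com/robots.txt')  # placeholder URL
print(response.status_code)   # 200 means the request succeeded
print(response.text[:200])    # first 200 characters of the response body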
Pandas isn't strictly required for this task, as we could simply output all the URLs to a list. However, you'll probably want to do further analysis on the URLs once you've extracted them, so having them in a Pandas DataFrame will be useful. To learn more about Pandas, you can visit the official documentation here.
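As a small example of the kind of follow-up analysis a DataFrame makes easy, here's a sketch that filters a couple of placeholder URLs down to just the blog section:

import pandas as pd

# Placeholder URLs standing in for the extracted list
df = pd.DataFrame(['https://example.com/', 'https://example.com/blog/post-1/'], columns=['URL'])
blog_urls = df[df['URL'].str.contains('/blog/')]  # keep only URLs under /blog/
print(blog_urls)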
Let's break down the code step by step:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extract_urls_from_sitemap(url):
    # Fetch the sitemap and parse it as XML
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'xml')
    urls = []
    if soup.find('sitemapindex'):
        # Sitemap index: recurse into each nested sitemap
        sitemaps = soup.find_all('sitemap')
        for sitemap in sitemaps:
            sitemap_url = sitemap.find('loc').text
            print(f'found nested sitemap: {sitemap_url}. Adding URLs to list')
            urls.extend(extract_urls_from_sitemap(sitemap_url))
    elif soup.find('urlset'):
        # Regular sitemap: collect every <loc> entry
        print('no nested indexes found in the main sitemap file. Adding URLs to list')
        locs = soup.find_all('loc')
        urls = [loc.text for loc in locs]
    return urls
The extract_urls_from_sitemap function takes a URL of a sitemap as input and recursively extracts all the URLs present in the sitemap. It makes use of the requests.get function to retrieve the sitemap's content and the BeautifulSoup library to parse the XML structure. The function checks if the sitemap is a sitemap index (contains nested sitemaps) or a regular sitemap (contains URL entries). It then proceeds accordingly, extracting the URLs and appending them to the urls list.
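To make the two branches concrete, here are minimal, made-up examples of the two XML shapes the function checks for, following the standard sitemap protocol:

from bs4 import BeautifulSoup

# A sitemap index points at other sitemap files (URLs are made up)
index_xml = """
<sitemapindex>
  <sitemap><loc>https://example.com/post-sitemap.xml</loc></sitemap>
  <sitemap><loc>https://example.com/page-sitemap.xml</loc></sitemap>
</sitemapindex>
"""

# A regular sitemap lists the page URLs themselves (URLs are made up)
urlset_xml = """
<urlset>
  <url><loc>https://example.com/hello-world/</loc></url>
</urlset>
"""

print(bool(BeautifulSoup(index_xml, 'xml').find('sitemapindex')))  # True, so the function recurses
print(bool(BeautifulSoup(urlset_xml, 'xml').find('urlset')))       # True, so the function collects <loc> values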
def create_dataframe(urls):
    # Put the extracted URLs into a single-column DataFrame
    df = pd.DataFrame(urls, columns=['URL'])
    return df
The create_dataframe function takes a list of URLs as input and creates a Pandas DataFrame to store the URLs. Each URL is placed in a separate row under the 'URL' column.
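For example, with a couple of placeholder URLs:

example_urls = ['https://example.com/', 'https://example.com/about/']  # placeholders
df = create_dataframe(example_urls)
print(df)  # a two-row DataFrame with a single 'URL' column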
def get_urls_from_sitemap(url):
    # Fetch robots.txt and read it as plain text
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_urls = []
    # robots.txt can declare more than one sitemap, so collect URLs from each
    for line in soup.text.split('\n'):
        if line.startswith('Sitemap:'):
            sitemap_url = line.split(': ')[1]
            all_urls.extend(extract_urls_from_sitemap(sitemap_url))
    df = create_dataframe(all_urls)
    return df
robots_txt_url = 'https://sheetsociety.com/robots.txt'
urls_df = get_urls_from_sitemap(robots_txt_url)
The get_urls_from_sitemap function takes the URL of the website's robots.txt file as input. It retrieves the file, uses Beautiful Soup to read its text, and looks for lines starting with 'Sitemap:' to find the declared sitemap URL(s). It then calls the extract_urls_from_sitemap function on each one to obtain all the URLs. Finally, it creates a DataFrame using the create_dataframe function, storing the extracted URLs.
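Once urls_df exists, you can inspect it or save it for later work; the filename below is just an example:

print(len(urls_df), 'URLs extracted')
print(urls_df.head())
urls_df.to_csv('sitemap_urls.csv', index=False)  # example output filename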
To execute the code and extract URLs from a sitemap, follow these steps:
1. Install any missing libraries with !pip install [the name of the library that's missing]
2. Replace the robots_txt_url variable with the URL of the robots.txt file containing the desired sitemap URL

In this article, we have explored a Python code snippet that extracts URLs from website sitemaps, even in cases with nested indexes. By leveraging the power of Beautiful Soup, Requests, and Pandas, SEO specialists can easily retrieve URLs from sitemaps for further analysis, optimisation, and other SEO-related tasks. Google Colab provides a convenient environment to run this code and obtain the desired URL DataFrame.