Have you ever wanted to use Python to extract a clean list of URLs from a website's sitemap? In this guide, I'll explain a quick and easy way to do it using Python and Google Colab.
If you just want to test out the code, you can make a copy of this Google Colab file and try it out for yourself.
For this task, we will use the following libraries:
Beautiful Soup 4 (BS4) is a Python library used for web scraping and parsing HTML or XML documents, which we use here to extract URLs from the robots.txt file and sitemaps. To learn more about Beautiful Soup 4, you can visit the official website here.
The Requests library is a Python module for making HTTP requests. In this code it fetches the robots.txt file and the sitemaps from the provided URLs. To learn more about the Requests library, you can visit the official documentation here.
Pandas isn't strictly required for this task, as we could simply output all the URLs to a list. However, I expect you'll want to do further analysis on the URLs once you've extracted them, so having them in a Pandas DataFrame will be useful. To learn more about Pandas, you can visit the official documentation here.
Let's break down the code step by step:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extract_urls_from_sitemap(url):
    # Download the sitemap; the 'xml' parser needs the lxml package installed
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'xml')
    urls = []
    if soup.find('sitemapindex'):
        # A sitemap index lists other sitemaps, so recurse into each one
        sitemaps = soup.find_all('sitemap')
        for sitemap in sitemaps:
            sitemap_url = sitemap.find('loc').text.strip()
            print(f'found sitemap index: {sitemap_url}. Adding URLs to list')
            urls.extend(extract_urls_from_sitemap(sitemap_url))
    elif soup.find('urlset'):
        # A regular sitemap holds the page URLs in its <loc> tags
        print('no nested indexes found in the main sitemap file. Adding URLs to list')
        locs = soup.find_all('loc')
        urls = [loc.text.strip() for loc in locs]
    return urls
The extract_urls_from_sitemap function takes the URL of a sitemap as input and recursively extracts all the URLs present in the sitemap. It uses the requests.get function to retrieve the sitemap's content and the BeautifulSoup library to parse the XML structure. The function checks whether the sitemap is a sitemap index (contains nested sitemaps) or a regular sitemap (contains URL entries). It then proceeds accordingly, extracting the URLs and appending them to the urls list.
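To see the recursion without touching the network, here is a minimal offline sketch. It is not the code from this guide: I've swapped Beautiful Soup for the standard library's xml.etree.ElementTree so it runs with no extra installs, and the FAKE_SITE dictionary, the fetch helper, and the example.com URLs are all made up for illustration.

```python
import xml.etree.ElementTree as ET

NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'

# Hypothetical sitemap files keyed by URL, standing in for real HTTP responses
FAKE_SITE = {
    'https://example.com/sitemap.xml': f'''<sitemapindex xmlns="{NS}">
  <sitemap><loc>https://example.com/pages.xml</loc></sitemap>
  <sitemap><loc>https://example.com/posts.xml</loc></sitemap>
</sitemapindex>''',
    'https://example.com/pages.xml': f'''<urlset xmlns="{NS}">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>''',
    'https://example.com/posts.xml': f'''<urlset xmlns="{NS}">
  <url><loc>https://example.com/blog/hello</loc></url>
</urlset>''',
}


def fetch(url):
    # Stand-in for requests.get(url).text (assumption: no network access here)
    return FAKE_SITE[url]


def local_name(tag):
    # ElementTree reports tags as '{namespace}name'; keep only the name part
    return tag.rsplit('}', 1)[-1]


def collect_urls(sitemap_url):
    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter() if local_name(el.tag) == 'loc']
    if local_name(root.tag) == 'sitemapindex':
        # Each <loc> in an index points at another sitemap, so recurse into it
        urls = []
        for child_url in locs:
            urls.extend(collect_urls(child_url))
        return urls
    # In a regular sitemap, each <loc> is a page URL
    return locs


print(collect_urls('https://example.com/sitemap.xml'))
# → ['https://example.com/', 'https://example.com/about', 'https://example.com/blog/hello']
```

The index file contributes no page URLs of its own; everything in the result comes from the two nested sitemaps it points to.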
def create_dataframe(urls):
    df = pd.DataFrame(urls, columns=['URL'])
    return df
The create_dataframe function takes a list of URLs as input and creates a Pandas DataFrame to store the URLs. Each URL is placed in a separate row under the 'URL' column.
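As an idea of the kind of further analysis a DataFrame makes easy, here is a small sketch that removes duplicate URLs and counts pages per top-level site section. The URL list, the 'Section' column, and the first_path_segment helper are my own illustrative assumptions, not part of the code above.

```python
from urllib.parse import urlparse

import pandas as pd

# Hypothetical URLs, as they might come back from a sitemap
urls = [
    'https://example.com/',
    'https://example.com/blog/first-post',
    'https://example.com/blog/second-post',
    'https://example.com/products/widget',
    'https://example.com/blog/first-post',  # duplicate entry
]

# Same shape as the DataFrame built by create_dataframe, minus duplicates
df = pd.DataFrame(urls, columns=['URL']).drop_duplicates().reset_index(drop=True)


def first_path_segment(url):
    # '/blog/first-post' -> 'blog'; the homepage has no segment at all
    parts = [part for part in urlparse(url).path.split('/') if part]
    return parts[0] if parts else '(root)'


df['Section'] = df['URL'].apply(first_path_segment)

# Number of unique URLs in each section
section_counts = df['Section'].value_counts()
print(section_counts)
```

From here you could filter to a single section, export to CSV, or join the URLs against crawl or analytics data.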
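def get_urls_from_sitemap(url):
    # Fetch robots.txt and read its contents as plain text
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_urls = []
    for line in soup.text.split('\n'):
        # robots.txt may declare more than one sitemap, so handle each line
        if line.startswith('Sitemap:'):
            # Split on the first colon only, so the colon in 'https://' is kept
            sitemap_url = line.split(':', 1)[1].strip()
            all_urls.extend(extract_urls_from_sitemap(sitemap_url))
    df = create_dataframe(all_urls)
    return df

robots_txt_url = 'https://sheetsociety.com/robots.txt'
urls_df = get_urls_from_sitemap(robots_txt_url)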
The get_urls_from_sitemap function takes the URL of the website's robots.txt file as input. It retrieves the file, reads it with Beautiful Soup, and picks out the sitemap URL from the line that starts with Sitemap:. It then calls the extract_urls_from_sitemap function to obtain all the URLs from the sitemap. Finally, it creates a DataFrame using the create_dataframe function, storing the extracted URLs.
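If you want to check the robots.txt step on its own, the sketch below pulls the sitemap URLs out of a block of text with plain string handling. The find_sitemap_urls name and the sample robots.txt content are hypothetical; the sketch also assumes you may meet lowercase "sitemap:" lines or lines with no space after the colon, which some sites use.

```python
def find_sitemap_urls(robots_text):
    """Return every sitemap URL declared in the text of a robots.txt file."""
    sitemap_urls = []
    for line in robots_text.splitlines():
        # Split on the first colon only, so 'https://...' stays in one piece
        field, _, value = line.partition(':')
        if field.strip().lower() == 'sitemap' and value.strip():
            sitemap_urls.append(value.strip())
    return sitemap_urls


# Hypothetical robots.txt content, used here instead of a live request
sample_robots_txt = """User-agent: *
Disallow: /admin/
Sitemap: https://example.com/sitemap.xml
sitemap:https://example.com/blog/sitemap.xml
"""

print(find_sitemap_urls(sample_robots_txt))
# → ['https://example.com/sitemap.xml', 'https://example.com/blog/sitemap.xml']
```

Returning a list rather than a single URL means a site with several sitemaps, or none at all, is handled without special cases.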
To execute the code and extract URLs from a sitemap, follow these steps:
If a library is missing from your environment, install it with !pip install [the name of the library that's missing]
Replace the value of the robots_txt_url variable with the URL of the robots.txt file containing the desired sitemap URL

In this article, we have explored a Python code snippet that extracts URLs from website sitemaps, even in cases with nested indexes. By leveraging the power of Beautiful Soup, Requests, and Pandas, SEO specialists can easily retrieve URLs from sitemaps for further analysis, optimisation, and other SEO-related tasks. Google Colab provides a convenient environment to run this code and obtain the desired URL DataFrame.