How to convert any Sitemap to RSS Feed

Since you’re reading this tutorial on how to convert a Sitemap to RSS, it’s safe to assume that you already know about Sitemaps and RSS feeds. Just in case, here’s a quick primer:

A sitemap is an index of the webpages on a website. It mainly exists to help web crawlers (think Google’s search bots) crawl the site effectively and discover new (or old) content, which helps with search engine ranking and discoverability.

An RSS feed is a standard XML file that contains a web-feed of any website. Why standard XML? So that any application can easily read it to display in any format it wants. Why web-feed? So that an application (or an end user) can keep track of updates to a website. E.g. keeping track of a news site using an RSS feed.

Why would you want to convert a Sitemap to RSS feed?

There could be many reasons. Maybe you’re developing a website from the ground up. Maybe you want to keep track of a competitor’s website.

But won’t that competitor website have their own RSS feed?

Mostly, yes. But since the world has moved on from using RSS readers (notable exceptions: Feedly users) to news aggregators, many website admins have started to disable their RSS feeds. This is to discourage mindless scraping of their original content and republishing it without permission.

So, let’s get started.

You will need the following:

  1. Any server with PHP enabled. I’m using an Ubuntu $5 server at DigitalOcean (you get $100 in credit over 60 days if you sign up using the referral link)
  2. MySQL (or a database of your choice)
  3. Scheduling mechanism. I’m using cron.

Step 1: Setting up the database

  1. Create a database named ‘test’
  2. Create a table named ‘article_counter’ with a counter column.
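
CREATE DATABASE test;
USE test;

-- article_id identifies the tracked sitemap; last_known_url_count holds the <loc> count from the previous run
CREATE TABLE article_counter (article_id INT, last_known_url_count INT);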
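
To show how the counter table gets used, here is a minimal Python sketch of reading and saving the last known count. It is only an illustration under a few assumptions: it uses Python's built-in sqlite3 module as a stand-in for MySQL, it adds a primary key on article_id (not in the table definition above) so a row can be overwritten, and the helper names open_counter_db, get_last_count and save_count are made up for this example.

```python
import sqlite3


def open_counter_db(path=":memory:"):
    """Open the database and make sure the article_counter table exists."""
    conn = sqlite3.connect(path)
    # Assumption: article_id is a primary key so one row per sitemap can be replaced.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS article_counter ("
        "article_id INTEGER PRIMARY KEY, "
        "last_known_url_count INTEGER NOT NULL)"
    )
    return conn


def get_last_count(conn, article_id=1):
    """Return the saved <loc> count, or 0 if this sitemap has never been seen."""
    row = conn.execute(
        "SELECT last_known_url_count FROM article_counter WHERE article_id = ?",
        (article_id,),
    ).fetchone()
    return row[0] if row else 0


def save_count(conn, count, article_id=1):
    """Insert or overwrite the saved count for this sitemap."""
    conn.execute(
        "INSERT OR REPLACE INTO article_counter (article_id, last_known_url_count) "
        "VALUES (?, ?)",
        (article_id, count),
    )
    conn.commit()
```

With MySQL the same idea applies, but the connection code and the upsert syntax would differ.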

Step 2: Find the target sitemap

Keep in mind that not every website follows a standard Sitemap structure, so the second step is to find the Sitemap and understand how it is organized. Tip: one way to find it is to append sitemap.xml to the website address. E.g.

website-name.com/sitemap.xml

This will usually take you to the actual sitemap. It might instead redirect you to sitemap_index.xml, which lists a bunch of child sitemaps. Now find out which sitemap you want to convert into RSS. Its URL might contain month names, numbers, dates etc. We will have to turn those into variables in our code.
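
As a rough illustration of this step, the Python sketch below lists the child sitemaps found in a sitemap index and builds a month-based sitemap URL from a date. The URL pattern sitemap-YYYY-MM.xml is purely hypothetical (every site names its child sitemaps differently), and the function names are invented for this example. It assumes the index uses the standard sitemaps.org XML namespace.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace used by standard sitemap and sitemap index files.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def child_sitemaps(index_xml):
    """Return the child sitemap URLs listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def monthly_sitemap_url(base, day):
    """Build a sitemap URL for the month of `day`.

    Hypothetical pattern: <base>/sitemap-YYYY-MM.xml. Adjust it to whatever
    the target site actually uses.
    """
    return f"{base}/sitemap-{day:%Y-%m}.xml"
```

For example, monthly_sitemap_url("https://website-name.com", date.today()) would give the URL for the current month under that assumed pattern.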

Step 3: Convert Sitemap to RSS pseudocode

You can use my pseudocode below as a starting point:

run every 5 minutes {
    form sitemap link (as described in step 2 above)
    open the link
    count the number of <loc> nodes (new_loc_count)
    compare with the last saved loc count (last_known_loc_count)
    if (new_loc_count > last_known_loc_count) {
        // find how many new loc nodes were added
        total_new_articles = new_loc_count - last_known_loc_count
        for i from 1 to total_new_articles {
            extract the new loc text (which is actually the url)
            add to rss feed
        }
    }
    save new_loc_count as last_known_loc_count against month in DB
}
Here, we’re counting the number of <loc> nodes in the XML file. The <loc> nodes contain the URLs.
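
The counting logic could look roughly like the Python sketch below. It is not the PHP script mentioned in this post, just one possible translation of the pseudocode. Two assumptions are baked in: the sitemap uses the standard sitemaps.org namespace, and new URLs are appended at the end of the file (if the site puts new entries first, take them from the start of the list instead). The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.request import urlopen

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch(url):
    """Download the sitemap and return its raw bytes."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def extract_locs(sitemap_xml):
    """Return the text of every <loc> node, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def new_urls(sitemap_xml, last_known_loc_count):
    """Return (list of new URLs, new_loc_count) for one run of the job."""
    locs = extract_locs(sitemap_xml)
    new_loc_count = len(locs)
    if new_loc_count <= last_known_loc_count:
        return [], new_loc_count
    total_new_articles = new_loc_count - last_known_loc_count
    # Assumption: the newest entries sit at the end of the sitemap.
    return locs[-total_new_articles:], new_loc_count
```

Each run would call new_urls(fetch(sitemap_url), last_known_loc_count), add the returned URLs to the feed, and then save the returned count back to the database.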

Step 4: Write the code and enjoy!

You’ve got the database structure and the logic. Now it’s time to get your hands dirty and start coding. I wrote mine in plain PHP, scheduled a cron job to run it every 5 minutes, and plugged the output RSS feed into IFTTT, which notifies me on Slack every time the website I’m tracking publishes a new article.
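
For the last piece, turning the new URLs into a feed, here is a minimal Python sketch that builds an RSS 2.0 document. It is an illustration rather than the PHP script described above. Since a sitemap URL alone carries no headline, the sketch reuses the URL as the item title, and it stamps each item with the time it was discovered rather than the real publication time. The function name build_rss and the channel description text are assumptions.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_rss(urls, feed_title, feed_link):
    """Return an RSS 2.0 document (as a string) with one <item> per URL."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "New URLs found in the sitemap"
    for url in urls:
        item = ET.SubElement(channel, "item")
        # Assumption: no headline is available, so the URL doubles as the title.
        ET.SubElement(item, "title").text = url
        ET.SubElement(item, "link").text = url
        ET.SubElement(item, "guid").text = url
        # Discovery time in RFC 822 style, not the article's real publish date.
        ET.SubElement(item, "pubDate").text = formatdate(usegmt=True)
    return ET.tostring(rss, encoding="unicode")
```

Writing the returned string to a file served by your web server gives IFTTT (or any feed reader) something to poll. A cron entry along the lines of `*/5 * * * * python3 /path/to/your_script.py` would run it every 5 minutes; the script path here is a placeholder.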

About Tonmoy Goswami

Founder, Storypick.
Read • Travel • Create • Experience⚡
