The ReRoto Transfer System– Why I’m Staying Up Until 3am On A Wednesday.

William McGonagle
6 min read · Feb 22, 2024

Warning: Unlike my other posts, this will be very technical. So please, put your thinking caps on for the next 6.22 minutes.

To keep things simple… I am building a tool for Georgetown Disruptive Tech right now called ReRoto. The tool allows newspapers (mostly at the collegiate level) to build and manage all of their operations. But to market this to those papers, we need a simple way for them to transfer all of their content over. So, without further ado, I introduce to you the Carrier-Agnostic Transfer System (or CATS for short).

What does the system do?

Well, that’s simple! You just input your current website domain and click transfer– then, all of your content is available on ReRoto. Before we can do that, we must understand who we would transfer content from…

Understanding the Competition

The lovely Chief Sales Officer at GDT, Felix Dosmond, made an excellent writeup about our current competition.

State News Works

Most definitely our closest competitor, State News Works (or SNWorks for short) is developed and run by students at Michigan State University. It is an excellent piece of software and offers great features for its price tag compared to the competition. I love the work that the Michigan State students have done, but SNWorks is our biggest rival in the collegiate ‘News Management System’ space. That said, I hope that our competitive relationship does more to advance both teams, like GDT and Hoyadevs.

SNWorks does not offer any sort of API (that is, a way for my servers to ask their servers for content). Fortunately, search engines expect large sites to publish sitemaps, which tell crawlers where every article lives. So, I can use these sitemaps to get a list of the articles and then scan through each of them. Here is a snippet from one such sitemap:

<urlset xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.dukechronicle.com/article/2022/05/editors-note-2022-student-commencement-speaker-time-on-the-chronicle</loc>
<lastmod>2022-06-21</lastmod>
</url>
<url>
<loc>https://www.dukechronicle.com/article/2021/06/duke-university-chronicle-no-print-paper-upcoming-year-digital-news</loc>
<lastmod>2021-06-02</lastmod>
</url>
<url>
<loc>https://www.dukechronicle.com/article/2020/08/duke-university-editors-note-printing-one-day-a-week</loc>
<lastmod>2020-08-04</lastmod>
</url>
<url>
<loc>https://www.dukechronicle.com/article/2020/06/chronicle-leadership-we-stand-with-black-students</loc>
<lastmod>2020-06-08</lastmod>
</url>
</urlset>
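
As a minimal sketch of that discovery step — assuming the sitemap lives at /sitemap.xml and that article URLs contain /article/, which matches the Duke Chronicle example above but may vary per site:

// Fetch a site's sitemap and pull out every <loc> entry, keeping only
// article pages. The sitemap path and the "/article/" filter are
// assumptions based on the example above, not guarantees.
async function listArticleUrls(domain: string): Promise<string[]> {
  const res = await fetch(`https://${domain}/sitemap.xml`);
  const xml = await res.text();

  // Grab the contents of every <loc>…</loc> tag.
  const urls = [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1]);

  return urls.filter((url) => url.includes("/article/"));
}

// Usage: const urls = await listArticleUrls("www.dukechronicle.com");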

I can then go through each of these pages and grab the article content from them, since they use semantic HTML (great for accessibility). I am not sure whether every SNWorks site uses semantic HTML or if it is just the case on the pages I have seen, but it is an assumption I will make for now.
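
To give a flavor of what that scraping could look like — the selectors here are my own guesses, and cheerio is just one convenient HTML parser, not necessarily what ReRoto will end up using:

import * as cheerio from "cheerio";

// Rough sketch of scraping one article page. The selectors are assumptions;
// the point is that semantic tags (<article>, <h1>, <time>) give us stable
// hooks to pull content out of the page.
async function scrapeArticle(url: string) {
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);

  return {
    url,
    title: $("article h1").first().text().trim(),
    publishedAt: $("article time").first().attr("datetime") ?? null,
    // Collect every paragraph inside the article body.
    body: $("article p")
      .map((_, el) => $(el).text().trim())
      .get()
      .join("\n\n"),
  };
}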

SNWorks doesn’t expose information about the authors like their emails, biographies, or photos, so we will leave those fields empty for now. They do, however, include meta tags for the site title, logo, and description, so we can grab those from the home page.
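
The site-level metadata can come straight from those meta tags. A small sketch — exactly which tags are present will vary, so the og:* names below are assumptions:

import * as cheerio from "cheerio";

// Pull the site title, description, and logo from the home page's meta tags.
async function scrapeSiteMeta(domain: string) {
  const html = await (await fetch(`https://${domain}/`)).text();
  const $ = cheerio.load(html);

  return {
    title:
      $('meta[property="og:site_name"]').attr("content") ?? $("title").text(),
    description: $('meta[name="description"]').attr("content") ?? null,
    logo: $('meta[property="og:image"]').attr("content") ?? null,
  };
}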

Figuring out how to replicate the layout of the page is a big hurdle to get over but we can figure that out later. We can steal the color scheme data for now though– hehe.

Also, I noticed that SNWorks offers readers the ability to listen to their articles, which is awesome for accessibility. Don’t worry, we’ll add that soon because it is such a nice feature, and I love SNWorks for things like that.

Squarespace

Not many people know this about Squarespace, but it is used to host many collegiate newspapers– like Georgetown’s Caravel. The nice thing about Squarespace is that it offers a public API for website content… meaning we can just ask Squarespace nicely for its articles, and it gives them in a standardized way. Yay!

This API exposes article content, authors, site metadata, and so on: pretty much everything needed to transfer a site over. The only thing it is missing is the blog layout, but in-depth layout customization doesn’t really exist on Squarespace anyway.
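
For instance, here is a sketch of pulling a blog collection through the public JSON view. The ‘/news’ slug and the exact field names are assumptions on my part that would need checking against the actual site:

// Fetch a Squarespace blog collection via its ?format=json view.
async function fetchSquarespacePosts(domain: string, collection = "news") {
  const res = await fetch(`https://${domain}/${collection}?format=json`);
  const data = await res.json();

  // Each item should carry a title, an HTML body, an author, and a publish date.
  return (data.items ?? []).map((item: any) => ({
    title: item.title,
    html: item.body,
    author: item.author?.displayName ?? null,
    publishedAt: item.publishOn ? new Date(item.publishOn) : null,
  }));
}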

WordPress

I thought WordPress would be the most difficult; it was not. I did a good amount of research into WXR (WordPress eXtended RSS), a loosely standardized export format built on RSS for moving articles between sites. Then I followed Occam’s Razor and looked up ‘WordPress article API’… shockingly, WordPress also offers a REST API similar to Squarespace’s.

So… just like Squarespace, I can grab article content and metadata. But, as I was testing this on The Hoya (which is a whole lesson on clients failing to communicate), I noticed that the author and template data are missing. So, I will have to account for that.
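
Concretely, the core REST API exposes posts at /wp-json/wp/v2/posts, so the fetch step is roughly the sketch below. Whether a given paper leaves that endpoint enabled is another question:

// Page through every post exposed by the WordPress REST API.
async function fetchWordPressPosts(domain: string) {
  const posts: { title: string; html: string; date: string }[] = [];

  // The API is paginated; 100 is the maximum per_page value.
  for (let page = 1; ; page++) {
    const res = await fetch(
      `https://${domain}/wp-json/wp/v2/posts?per_page=100&page=${page}`
    );
    if (!res.ok) break; // WordPress returns an error once we run past the last page.

    const batch = await res.json();
    if (batch.length === 0) break;

    for (const post of batch) {
      posts.push({
        title: post.title.rendered,
        html: post.content.rendered,
        date: post.date,
      });
    }
  }

  return posts;
}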

What is the plan?

So we know where we can source the data, but most of it needs to be cleaned, and even then there are massive holes in it. To account for this, we’re going to do my favorite thing: abstract pipelining!

Note: I felt so cool using Excalidraw– you can thank theprimeagen for inspiring the program architecture below.

Domain: Just enter the website’s domain– that is the only data that our ‘program’ needs to run.

Quick Scan: We need to figure out what software the site is using; this is as simple as checking the HTTP server headers or looking for the ‘SNWorks’ logo in the homepage HTML. This is technically part of the Abstract Scanner class so that it is easier to extend later.

Abstract Scanner: We generalize two functions for this class: ‘scan’ and ‘check’. The check function takes in an HTTP response and returns a boolean on whether or not the page matches. The scan function takes in the domain and returns the “unpatched” data.
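
In TypeScript terms, the interface looks roughly like this. The class and field names are placeholders of mine, not ReRoto’s actual code:

// Every supported platform gets its own scanner with the same two methods.
abstract class Scanner {
  // check: given the homepage response, does this scanner handle the site?
  abstract check(response: Response): Promise<boolean>;

  // scan: given the domain, return the raw, "unpatched" site data.
  abstract scan(domain: string): Promise<UnpatchedSite>;
}

// Placeholder shape for the unpatched data.
interface UnpatchedSite {
  title: string | null;
  description: string | null;
  articles: { title: string; html: string; author?: string }[];
}

class SquarespaceScanner extends Scanner {
  async check(response: Response): Promise<boolean> {
    // Squarespace tends to identify itself in the Server header.
    return (response.headers.get("server") ?? "").includes("Squarespace");
  }

  async scan(domain: string): Promise<UnpatchedSite> {
    // ...call the ?format=json endpoint from the earlier sketch...
    throw new Error("left out of this sketch");
  }
}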

Patch Pipeline: To make sure that the data has no holes in it, we have to go through it with several rounds of editing. This editing is interactive: the user performing the transfer is given an interface where they can make changes. Once complete, this pipeline outputs the final data.
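
One way to model those rounds, purely illustratively (reusing the UnpatchedSite shape from the scanner sketch; the names are mine):

// Each patch step inspects the data, reports the holes it finds, and merges
// the user's fixes back in before handing off to the next step.
interface PatchStep {
  name: string;
  // Return the fields that still need human input (e.g. missing author bios).
  findHoles(data: UnpatchedSite): string[];
  // Merge the user's answers back into the data.
  apply(data: UnpatchedSite, fixes: Record<string, unknown>): UnpatchedSite;
}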

Query Generator: “The data isn’t final yet!” says the query generator. Right now, it is stored as JSON, but we need to get it into the database. The lazy part of me wants to loop through each article and upload it individually– the issue with that is ensuring ACID (atomicity, consistency, isolation, and durability).

In layman’s terms: if one of the articles isn’t transferred properly, then the whole transfer should be restarted. The cleanest way to ensure this is to generate a single SQL query that includes all of the transfer content. Then, if an error occurs because of a database issue, we can just re-run the query.
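
A sketch of what that generator could produce: one parameterized, multi-row INSERT (Postgres-style placeholders; the table and column names are stand-ins, not ReRoto’s real schema):

// Fold every article into a single multi-row INSERT so the transfer is
// all-or-nothing: if any row fails, no rows are written.
function buildTransferQuery(
  siteId: string,
  articles: { title: string; html: string }[]
) {
  const values: unknown[] = [];
  const rows = articles.map((article, i) => {
    values.push(siteId, article.title, article.html);
    const base = i * 3;
    return `($${base + 1}, $${base + 2}, $${base + 3})`;
  });

  return {
    text: `INSERT INTO articles (site_id, title, body) VALUES ${rows.join(", ")}`,
    values,
  };
}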

A Tricky Situation…

It is no secret that ReRoto is built on Next.js and Vercel. I’m not ashamed because it has been so easy to quickly add new features while keeping the codebase small and simple. That said, there are drawbacks…

  • The vendor lock-in and costly pricing are annoying, but not the end of the world. We will likely switch the news frontend side to AWS soon anyways to speed things up, which will also reduce costs.
  • My issue lies in the restrictions– you can’t run a serverless function for more than 300s (5 mins) on Vercel. That might sound like a while, but a site transfer takes much, much longer. You can run an edge function indefinitely, but an edge function cannot be larger than 4MB.

To get around this, we can split the program up into steps: run the ‘check’ as a separate request from the ‘scan’, and do the patching stage on the client side.

The biggest change will be having the scan request run as an edge function because then the request will not time out. This does have the disadvantage of forcing all of the scan code to be less than 4MB.
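
In Next.js terms (assuming the App Router), that is roughly a route handler opting into the Edge runtime; the route path and response shape below are placeholders:

// app/api/transfer/scan/route.ts
export const runtime = "edge"; // run this handler on the Edge runtime

export async function POST(request: Request): Promise<Response> {
  const { domain } = await request.json();

  // ...pick a scanner via check(), then run its scan() and return the results...
  return Response.json({ domain, status: "scanned" });
}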

I’m going to spend the next few hours implementing this for Squarespace and WordPress. And tomorrow, I will finish up SNWorks, because it’ll take far more time and I do need to sleep eventually.

Once that’s done, I’ll write a little post-mortem to see how the new system actually performs. Until then, stay classy!

Cheers,
WMG
