Broken Link Checking in the Modern Age

Submitted by Kevin on

We all know and dread that inevitable website management task: going through the site and checking for broken links. It's a two-part process: identifying the links that are broken, then actually fixing them. As a technical developer, I've mainly focused on identifying broken links, though I do fix them when it's obvious what needs to be done. Our campus has had a lot of sites rebuilt in the move from Drupal 7 to Drupal 9, so the fix is often just finding the new URL for the old page and making the substitution.

Over the years, I've done what I'm sure many of you have done: gone through a site page by page, checking each link by hand. It's tedious, and there have never been many good tools available to streamline the process. Fortunately, I'm pretty good with repetitive tasks and know how to find little shortcuts to speed them up, but I'd still prefer to avoid them altogether.

There are a few modules available for Drupal that promise to help you find broken links, but I'm always leery of anything that works in the background where I can't easily tell what it's actually doing. I like having some level of control over the process, and since a large number of our sites are on shared campus web hosting, I really don't want to add more stress to those servers by asking them to regularly test hundreds of links on each of our sites.

A few years ago I started building my own web crawler that lives on its own virtual machine, mainly as a way to catalog the pages on a site. Over time I adapted it to test the links it found and catalog the HTTP result codes. This may sound like an easy task, but trust me, it's not. When crawling a CMS-based site (Drupal, WordPress, etc.) there are a lot of gotchas to watch for, such as the curse of infinite recursion. I've specifically geared my crawler for both Drupal and WordPress with a blacklist of paths not to follow, plus a separate blacklist of external URLs to avoid because of oddities that can produce false positives. All of this I developed through extensive trial and error.
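
My crawler isn't public, so here's just a minimal sketch of the kind of bookkeeping I'm describing: a path blacklist plus a visited set so the crawl can't recurse forever. The blacklist entries and function name are hypothetical, not the actual lists my crawler uses.

```python
from urllib.parse import urljoin, urlparse, urldefrag

# Paths that tend to recurse or multiply endlessly on Drupal/WordPress sites
# (hypothetical list; the real lists were built up through trial and error).
PATH_BLACKLIST = ("/user/", "/admin/", "/search/", "/calendar/",
                  "/wp-admin/", "/wp-login.php", "/feed/", "/tag/")

def should_follow(base_url, href, visited):
    """Decide whether a discovered href is worth adding to the crawl queue."""
    url, _ = urldefrag(urljoin(base_url, href))    # resolve and drop #fragment
    parts = urlparse(url)
    if parts.netloc != urlparse(base_url).netloc:  # only crawl the site itself
        return False
    if any(parts.path.startswith(p) for p in PATH_BLACKLIST):
        return False
    if url in visited:                             # the actual recursion guard
        return False
    visited.add(url)
    return True
```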

After a while, I got my crawler to a point where it was helpful, but the output was still difficult to parse, so the project fell by the wayside. Every now and then I'd run it against a site and try to fix the obvious issues, but it was far from an ideal solution.

Recently, our college web content manager expressed interest in streamlining this process. I learned that she had started taking time each month to try to go through our major sites by hand, but that is of course very tedious, and it's far too easy to miss links or make mistakes. Encouraged that someone else felt the issue needed to be tackled (which is not to say that previous content managers didn't care - I think they simply felt they had no time to deal with broken links), I returned to my web crawler project and started tinkering with it some more.

Taking feedback from our content manager, I reworked the output of the crawler into a "broken links by page" report, which is much easier to use. I used simple statistics to pull out the links that were likely in the page header/menu/footer areas (as opposed to the unique page content) and listed them together in their own section of the report. I then tackled the stickiest issue: making it easy for the content manager to access this report.
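
To give a rough idea (this is not my actual code, and the data shapes and threshold are hypothetical), the header/menu/footer grouping boils down to counting how many pages each broken link appears on:

```python
import json
from collections import Counter

def build_report(broken_by_page, site_wide_ratio=0.5):
    """broken_by_page maps a page URL to the broken links found on that page."""
    counts = Counter(link for links in broken_by_page.values() for link in set(links))
    total = max(len(broken_by_page), 1)
    # A link that appears on a large share of pages is almost certainly in a
    # shared region (header, menu, footer), so list it once, not per page.
    site_wide = {link for link, n in counts.items() if n / total >= site_wide_ratio}
    return {
        "site_wide": sorted(site_wide),
        "pages": {
            page: sorted(set(links) - site_wide)
            for page, links in broken_by_page.items()
            if set(links) - site_wide
        },
    }

# Example: json.dump(build_report(results), open("broken-links.json", "w"), indent=2)
```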

To bring the report into Drupal, I wrote a module that adds an administrative report page. It reads the report file and shows it as an expandable tree, with one branch for each page that has at least one broken link and shortcut links on each branch to view or edit the page. The content manager still has to search through the page content to find the location of each link, but given the complexities of Drupal (particularly with Layout Builder pages), I don't think there is a better solution that doesn't put the link-checking load onto the website's own web server.

The last step was to automate running the checks and pushing the reports to a place where the websites could read them. This is all done via cron jobs on the web crawler's server, using SCP and SSH keys to push the report files up to the filesystem of each website. At the moment there is no option to manually request an updated report, but I can trigger one from the back end if needed, and the system maintains a one-month cache of good links. That way, if I do re-scan a site, it won't bother testing all of the links that scanned clean last time; it only re-tests the links that were found to be broken. This speeds up re-scans significantly.
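
The good-link cache is conceptually simple. Here's a minimal sketch, with a hypothetical file name and a roughly one-month window:

```python
import json
import time

CACHE_FILE = "good-links.json"    # hypothetical name; maps URL -> last good check
MAX_AGE = 30 * 24 * 3600          # roughly one month, in seconds

def load_cache():
    try:
        with open(CACHE_FILE) as fh:
            cache = json.load(fh)
    except FileNotFoundError:
        cache = {}
    now = time.time()
    # Expire old entries so every link gets re-tested at least once a month.
    return {url: ts for url, ts in cache.items() if now - ts < MAX_AGE}

def needs_check(url, cache):
    """Skip links that checked out fine within the cache window."""
    return url not in cache

def record_good(url, cache):
    cache[url] = time.time()

def save_cache(cache):
    with open(CACHE_FILE, "w") as fh:
        json.dump(cache, fh)
```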

Now our content manager can go to a site on her standard monthly schedule and a fresh report will have been run a day or two earlier. She simply works through the report, using the quick links to jump to each page that needs attention and make the necessary updates.

There are still issues that I'd like to address, but they are hard problems:

  • A lot of information-heavy sites (think journal / magazine / newspaper sites) use high-end filters to detect web crawlers and block them. It's understandable that they don't want crawlers screen-scraping all of their content, but it's frustrating when you just want to make sure a link is still valid. There are guides out there for making your crawler look more like a real browser by sending specific HTTP headers (a rough sketch of that approach follows this list), but I still can't get that to work with some of the journal sites. As best I can figure, they don't treat a client as a normal web browser if it doesn't support compression, and adding compression support to a hand-written crawler is more work than I've wanted to do so far.
     
  • Some links are not broken links but bad links, like a mailto: URL entered without the "mailto:" prefix, or even a regular site URL entered without the "http:" or "https:" prefix. The challenge is that any URL without a scheme is technically a relative path, resolved against the protocol and host of the current page, so it's really difficult to tell programmatically whether something is a genuine local URL or a bad external URL. I may eventually add some regular-expression checks for URLs that look like complete email addresses or start with something that looks like a valid hostname (see the sketch after this list), but there's never going to be a perfect way to catch every case of a missing prefix.
     
  • It would be ideal to identify where links are located on a page, as I've seen some accessibility scanners do, but that requires a huge amount of additional coding, and there's still too much risk that page oddities would break the link-highlighting code. In particular, you have to think about URLs in non-anchor tags, like image (IMG) tags, IFRAME tags, VIDEO tags, etc., not to mention embedded JavaScript.
     
  • There's a whole other realm of issues around how WYSIWYG editors manage HTML content. I've seen cases where what looks like a single link has been broken by the WYSIWYG editor into two directly adjacent links, likely because of how someone edited the visible text, perhaps by pasting extra text into the middle of a link. On screen this looks like one link, so a content editor can be fooled by hovering over one end and seeing the link look correct, when in reality the other end points to some other location. This can make it very hard to track down a broken link without switching to HTML source mode, which someone like me can read fluently but many content managers these days would not understand at all.
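
On the crawler-blocking issue in the first item, the sketch below shows the general browser-imitation approach, assuming a crawler built on Python's urllib; the header values are illustrative, and in my experience this still doesn't get past every journal site:

```python
import gzip
import urllib.request
from urllib.error import HTTPError

# Header values are illustrative; real browsers send more than this.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip",     # advertise compression support
}

def check_link(url, timeout=15):
    """Return the HTTP status code for a link, handling gzip-encoded bodies."""
    req = urllib.request.Request(url, headers=BROWSER_HEADERS)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                gzip.decompress(body)    # we advertised gzip, so we must handle it
            return resp.status
    except HTTPError as err:
        return err.code                  # 403, 404, etc. still give us a status
```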

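And for the missing-prefix problem in the second item, the kind of regular-expression heuristics I have in mind would look something like this (the patterns are illustrative and deliberately incomplete):

```python
import re

# Heuristics for hrefs that are probably a bare email address or a bare
# hostname missing its "mailto:" or "https://" prefix.
EMAIL_LIKE = re.compile(r"^[^/\s@]+@[^/\s@]+\.[a-zA-Z]{2,}$")
HOST_LIKE = re.compile(r"^(www\.[^\s/]+|[a-z0-9.-]+\.(com|org|net|edu|gov))(/|$)", re.I)

def suggest_prefix(href):
    """Return a suggested fix for a suspicious scheme-less href, or None."""
    if "://" in href or href.startswith(("mailto:", "tel:", "#", "/", "?")):
        return None                      # absolute, anchored, or genuinely local
    if EMAIL_LIKE.match(href):
        return "mailto:" + href          # e.g. href="jane@example.edu"
    if HOST_LIKE.match(href):
        return "https://" + href         # e.g. href="www.example.edu/contact"
    return None
```
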
In spite of these issues, I think my system is a big step forward, and in theory it could be expanded to handle our WordPress sites. The site crawler already works properly for them - it's just a matter of writing a plugin that displays the report within the WordPress admin interface so it's easy to access.