
Website architecture is key to creating better web archives


Optimizing a website for thorough and efficient web archiving starts with the technological foundation of the site itself. It is also affected by the directions the site gives, or does not give, to web crawlers.

One more aspect of website architecture also goes a long way toward creating a comprehensive and easy-to-browse web archive: the way the content itself is presented. Creating URLs that both machines and humans can read easily, and publishing each piece of content at its own unique web address, are best practices for optimizing a website for both search engine and web archiving crawlers.

Give URLs meaning

Consider these two fictional URLs:


The first URL tells us essentially nothing. But even without any content, it is possible to discern something about the nature of the content that might have been associated with the second URL. For web archivists trying to restore the integrity of old, incomplete websites, the second URL is an important piece of the puzzle.

Indeed, an incomplete puzzle is what many websites become—to one degree or another—with the passage of time. One reason the advice to maintain stable URLs has survived since the earliest days of the web is that it is easier said than done. The evolution of the medium and the natural pace of content rotation are constantly tugging at websites’ link hierarchies. Sections are renamed and pieces of content are recategorized. URL redirection is effective at mitigating the effects of these changes, but staying on top of URL redirection requires strong organization and consistent support from all the personnel who manage a site. Staffs turn over; people leave organizations or move on to new responsibilities.
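One way to keep redirection manageable as staff change is to hold every retired URL in a single lookup table and resolve it to its current address. The following is a minimal sketch of that idea, not a description of any particular server's redirect mechanism; the function name `resolve_redirect`, the `max_hops` limit, and the example paths are all hypothetical.

```python
from typing import Dict


def resolve_redirect(path: str, redirects: Dict[str, str], max_hops: int = 10) -> str:
    """Follow a chain of redirects until reaching a path with no further redirect.

    `redirects` maps each retired path to the path that replaced it.
    Raises ValueError on a loop or on a chain longer than `max_hops`.
    """
    seen = {path}
    for _ in range(max_hops):
        if path not in redirects:
            # No further redirect recorded: this is the current address.
            return path
        path = redirects[path]
        if path in seen:
            raise ValueError(f"redirect loop detected at {path}")
        seen.add(path)
    raise ValueError(f"more than {max_hops} redirect hops")


# Hypothetical example: a section renamed twice over the years.
redirects = {
    "/news/old-story": "/press/old-story",
    "/press/old-story": "/archive/2014/old-story",
}
current = resolve_redirect("/news/old-story", redirects)
```

Keeping the table in one place means a renamed section needs only one new entry, and a periodic check for loops or over-long chains can catch mistakes before they become dead links.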

Inevitably, the organization of any large and sophisticated website will incrementally degrade over time; some links and/or the content attached to them are lost or orphaned, leaving only a dead link to an obsolete URL. Identifying, recovering, and contextualizing this material is part of the web archivist’s job. Website administrators can help future archivists maintain their archived site’s integrity by ensuring that URLs convey some degree of information about their associated resources. Ideally, a URL should include at least a truncated version of the content’s name and, if possible, a publication date.
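The recommendation above, a truncated content name plus a publication date, can be sketched in a few lines. This is only an illustration under stated assumptions: the `/YYYY/MM/slug` layout, the six-word limit, the helper names `slugify` and `descriptive_path`, and the example date are hypothetical choices, not a convention prescribed here.

```python
import re
import unicodedata
from datetime import date


def slugify(title: str, max_words: int = 6) -> str:
    """Reduce a title to a short, lowercase, hyphen-separated slug."""
    # Strip accents so the slug stays plain ASCII.
    ascii_title = (
        unicodedata.normalize("NFKD", title)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # Keep only runs of letters and digits, then truncate to max_words.
    words = re.findall(r"[a-z0-9]+", ascii_title.lower())
    return "-".join(words[:max_words])


def descriptive_path(title: str, published: date) -> str:
    """Build a path carrying the publication year, month, and a title slug."""
    return f"/{published:%Y/%m}/{slugify(title)}"


# Hypothetical publication date, used only for illustration.
path = descriptive_path(
    "Website architecture is key to creating better web archives",
    date(2015, 3, 9),
)
```

Even if the page behind such a path is later lost, the path alone tells an archivist roughly what the content was and when it was published.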

Give separate pieces of content separate addresses

Some content management systems and website infrastructure models “wrap” all the content on a website so that it all appears under the root URL, e.g., http://website, or perhaps under the root URL and a small collection of subdirectory URLs, e.g., http://website/red, http://website/blue, and so forth. Within that structure, the website dynamically presents all its content according to a script or in response to user input.

While these methods can sometimes contribute to a unique or enhanced experience for website visitors, they make it much more difficult for either an archiving crawler or a conventional search engine crawler to determine when it has finished indexing a site. The resulting archived version of the site may contain significant errors in its appearance or omit sections of content entirely.

If design decisions require a content element's true URL to be hidden from a visitor's view, a workaround is to place the true URL in a link element with rel="canonical" inside the document's head, which allows the archiving crawler to index the site accurately.
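To illustrate what a crawler can do with that hint, here is a rough sketch that reads the canonical URL out of a page using Python's standard `html.parser` module. It is not how any particular archiving crawler works; the class name `CanonicalFinder`, the helper `find_canonical`, and the example address are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class CanonicalFinder(HTMLParser):
    """Record the href of the first <link rel="canonical"> encountered."""

    def __init__(self) -> None:
        super().__init__()
        self.canonical: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, in any letter case.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens:
            self.canonical = attr_map.get("href")


def find_canonical(html: str) -> Optional[str]:
    """Return the canonical URL declared in the page, or None if absent."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical


# Hypothetical page served under a wrapped URL such as http://website/blue.
page = (
    "<html><head>"
    '<link rel="canonical" href="http://website/blue/annual-report">'
    "</head><body></body></html>"
)
true_url = find_canonical(page)
```

With the true address recorded this way, each piece of content remains individually addressable to a crawler even when visitors only ever see the wrapped URL.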

