Web Archiving Technologies Explained


Since 1996, the non-profit Internet Archive and its popular “Wayback Machine” have invited millions of online visitors to browse the web as it once was. What most visitors to the Wayback Machine may not realize is that all the core technologies that power the world’s most famous web archive are free—and can be set up by anyone who wants to launch their own web archiving project.

All of the technologies and tools described here were developed in whole or in part by the non-profit Internet Archive. They are open source and freely available to use or modify by anyone who wants to download them. The expertise to implement that technology is often in short supply, however. That is why Archive-It™, the premium web archiving service operated by the Internet Archive, offers its partners web archiving services supported by the same web developers and archivists who helped develop these technologies.

When we talk about the Internet Archive “crawling” websites to look for content, we really are talking about Heritrix. Continuously developed by the Internet Archive and some of its partners since 2003, Heritrix is the archiving web crawler software at the front lines of the Internet Archive’s web archiving activities. Because it is open source and freely available, Heritrix is also used by various other public and private organizations around the world for their web archiving projects.

One of the most important features of Heritrix is its ability to adapt its collection routines so that it does not disrupt a website’s normal service to its visitors. It measures out its requests to the targeted website, ensuring that the website’s bandwidth and other resources are not strained by the Heritrix collection processes.
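
As a rough illustration of what “measuring out” requests means in practice, the Python sketch below fetches a short list of pages with a fixed pause between requests. The URLs and delay are placeholders, not Heritrix’s actual configuration; Heritrix itself is a configurable Java application, and this is only meant to show the idea of a polite crawl.

```python
# A minimal sketch of "polite" crawling, using hypothetical example URLs.
# The point is simply to space out requests so the target server's
# bandwidth and other resources are not strained.
import time
import urllib.request

urls = [
    "https://example.org/",
    "https://example.org/about",
    "https://example.org/contact",
]

CRAWL_DELAY_SECONDS = 2.0  # pause between requests to the same host

for url in urls:
    with urllib.request.urlopen(url) as response:
        body = response.read()
        print(f"fetched {url}: {len(body)} bytes")
    time.sleep(CRAWL_DELAY_SECONDS)  # wait before making the next request
```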

However, those collection processes do not always catch everything on a web server. Dynamically rendered content, for example, can be missed by conventional crawling techniques. Consider a site like Facebook. When a Facebook user logs in and scrolls down their “wall” or feed, the Facebook website uses an algorithm to decide in real time what content to show that user as they continue scrolling. This is an example of dynamically rendered content. For websites that employ dynamic content rendering, Heritrix needs to be able to interact with their web servers appropriately.
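
To see why a conventional fetch misses this kind of content, consider the sketch below. It downloads a page’s initial HTML and checks for a marker string that would only be inserted later by the page’s JavaScript as the visitor scrolls. The URL and marker are hypothetical and stand in for any dynamically rendered feed.

```python
# A minimal sketch, with a hypothetical URL and marker string, showing why a
# plain HTTP fetch can miss dynamically rendered content: the server returns
# only the initial HTML, and items loaded later by JavaScript never appear in it.
import urllib.request

url = "https://example.org/feed"   # hypothetical page with a dynamic feed
marker = "story-card"              # hypothetical class name used for feed items

with urllib.request.urlopen(url) as response:
    html = response.read().decode("utf-8", errors="replace")

if marker in html:
    print("Feed items are present in the initial HTML response.")
else:
    print("Feed items are missing; they are rendered client-side after page load.")
```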

If Heritrix is the brain of the crawling technology that the Internet Archive uses, then Umbra is its eyes. Umbra acts as an intermediary between Heritrix and the web servers. Its job is to allow Heritrix to interact with a website by mimicking natural user behavior, thus ensuring it “sees” content that is delivered by a web server only in response to a visitor’s activities. 
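
The sketch below illustrates that general idea of mimicking a visitor, using the third-party Playwright browser automation package to scroll a page so that lazily loaded content is rendered before capture. It is only a conceptual stand-in under those assumptions; Umbra’s own implementation is a separate Internet Archive project and works differently under the hood.

```python
# A conceptual sketch of browser-assisted crawling, assuming the third-party
# Playwright package is installed. It scrolls a (hypothetical) feed page the
# way a visitor would, so content loaded in response to scrolling is rendered
# before the page is captured.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.org/feed")  # hypothetical page with an infinite feed

    # Scroll several times, pausing so the page's JavaScript can load more items.
    for _ in range(5):
        page.mouse.wheel(0, 2000)      # scroll down by 2000 pixels
        page.wait_for_timeout(1000)    # wait one second for new content to render

    rendered_html = page.content()     # HTML after dynamic content has loaded
    print(len(rendered_html), "bytes of rendered HTML captured")
    browser.close()
```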

Working through Umbra, Heritrix is capable of collecting online material to create a practical, usable archival copy of any website it crawls. Collection is just the first step, though. What happens to a website once the collection has taken place? How is the actual archive created? We will explore the key stages of that process in a future post.


