Browsertrix Crawler

Initially known simply as Browsertrix, Browsertrix Crawler is the revamped version of what used to be a system of multiple browser-based crawlers that worked together to capture complex web content, such as social media. Browsertrix Crawler is built by the team behind the online web recording service Conifer and the desktop app ArchiveWeb.page (formerly known as Webrecorder and Webrecorder Player respectively) and uses the Chrome and Chromium browsers to interact with web pages and record the content in WARC files.

This tool is more modular than its predecessor, meaning that its different components (e.g. the crawling browsers, the web archive viewer, the crawling behaviors and configurations) can now more easily be deployed separately and in various combinations. All of the components are bundled together in a single Docker container, which makes the tool more compact and likely easier to install and maintain. Docker, the containerization technology that allows packages of software to be used and transferred between computer environments without requiring users to install all the necessary dependencies themselves, gives Browsertrix Crawler its flexibility as a capturing tool meant for larger-scale crawling. Compared to ArchiveWeb.page, for example, which needs more explicit guidance from a user, Browsertrix Crawler in principle requires much less human intervention in order to perform crawls.

Specifying scope of the capture

Browsertrix Crawler is a command-line tool that is meant to be used via the terminal. That might mean some extra learning effort for those unfamiliar with such software, but it also means great flexibility in the capturing options available. There are two ways to configure a crawl: either directly, by including various options in the crawl-initiating command, or by writing out a series of parameters in a YAML file to be applied to the crawl or crawls. The benefit of using YAML files is of course that they can be saved, edited, and reused. A sketch of both approaches is given below.
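
As a rough illustration, a crawl started directly from the command line could look something like the following. The command and flag names follow the crawler's documentation, but exact names and mount paths may differ between versions, and example.com and my-crawl are placeholders.

  # Start a crawl of a single URL, writing output under ./crawls/
  docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ --collection my-crawl --generateWACZ

The same crawl expressed as a reusable YAML configuration (saved e.g. as crawl-config.yaml and mounted into the container):

  # crawl-config.yaml
  seeds:
    - https://example.com/
  collection: my-crawl
  generateWACZ: true

  # Run the crawl using the config file instead of individual flags
  docker run -v $PWD/crawl-config.yaml:/app/crawl-config.yaml -v $PWD/crawls:/crawls/ \
      -it webrecorder/browsertrix-crawler crawl --config /app/crawl-config.yaml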

To control what the tool will capture, one or multiple URLs can be used as seeds from which the browser-based crawler will start the capture. Using seed lists, one can specify the URLs to be crawled and then choose whether only the specified seeds should be captured, whether all URLs belonging to the same domain or sub-domain as the seeds should be included as well (useful if, say, a company website has different language versions), or set any kind of custom scoping rules. Setting the scope of the crawl like this allows the user to control the content and size of the files they will end up with. This can be done by specifying prefixes to be included, e.g. anything that starts with https://twitter.com/besttwitteraccount/photos, or regular expressions that include or exclude URLs based on patterns of characters in them, e.g. any URL that contains the characters "abc". A sketch of such a scoped configuration is shown below.
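
Sticking with the hypothetical besttwitteraccount example above, scoped seeds could be written in the YAML configuration roughly as follows. The scopeType, include and exclude options are based on the crawler's documented seed settings; exact names and accepted values may vary between versions.

  seeds:
    - url: https://twitter.com/besttwitteraccount/photos
      scopeType: prefix          # only URLs starting with the seed prefix
      exclude:
        - abc                    # skip any in-scope URL containing "abc"
    - url: https://example.com/
      scopeType: domain          # also include sub-domains, e.g. language versions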

Additionally, it is also possible to include or exclude parts of a page in the crawl, i.e. page resources. Page resources are essentially separate pieces of a page which have their own dedicated URL, for example trackers, widgets that display a music playlist, ads, etc. Being able to block or include them is especially useful in cases where the user of the tool wants to capture multiple pages that include unwanted features, without having to exclude those pages as a whole. A sketch of such a rule is given below.
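
Recent versions of the crawler expose this through block rules in the YAML configuration. The snippet below is only a sketch assuming that option, with placeholder patterns; the exact field names and matching behavior may differ depending on the version used.

  blockRules:
    - url: googleanalytics\.com     # block any request whose URL matches this pattern
      type: block
    - url: youtube\.com/embed       # block this resource only when it is loaded
      inFrameUrl: twitter\.com      # inside a frame on the (example) site given here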

Defining the extent of the content that can be said to belong to a website or a social media page can be tricky, because in order to preserve the context and provenance of the material, the archivist might feel they need to include everything that appears on the page: images, videos, links to other websites, embedded content, etc. Putting aside the technical issues this might cause, it also raises the conceptual question of how much of this content "belongs" to the page in question, and how far we should go when scoping our crawls. The answer is not easy to determine and will probably depend on many situational factors, but the fact that the tool gives us the ability to choose what to include in and exclude from a social media crawl is very important. It must be noted, however, that there is a learning curve in applying scoping rules, especially for a person only now familiarizing themselves with these technologies (see Brozzler for more information on this issue).

Capturing password-protected content

Browsertrix Crawler is able to capture password-protected websites, which makes it suitable for crawling logged-in versions of social media pages. This is done by creating dedicated browser profiles for the browser that Browsertrix Crawler uses for crawling. The user provides the username and password while the profile is being created, and the resulting logged-in profile can then be used by the browser to log in and capture content as needed. Different browser profiles can be made for different login sessions, for example different accounts on the same social media platform, or accounts on different platforms and websites. A sketch of this workflow is given below.
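
As a sketch of how this works in practice (the command and flag names follow the crawler's documentation, but exact ports, paths and options may vary between versions, and the Twitter login URL and profile filename are just examples):

  # Create a profile by logging in interactively through the crawler's browser
  docker run -p 6080:6080 -p 9223:9223 -v $PWD/crawls/profiles:/crawls/profiles/ \
      -it webrecorder/browsertrix-crawler create-login-profile \
      --url "https://twitter.com/login" --filename /crawls/profiles/twitter-profile.tar.gz

  # Use the saved profile in a later crawl of the logged-in pages
  docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://twitter.com/besttwitteraccount \
      --profile /crawls/profiles/twitter-profile.tar.gz --collection twitter-loggedin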

Importantly, the WARC files that result from such a crawl do not retain the credentials used to log into the platform, as the logged-in profile only stores cookies.

Using multiple browser profiles

The profiles can be saved and reused, and together with the behaviors functionality that the tool offers, they make for a very flexible way of capturing social media. Behaviors are essentially pre-determined sequences of actions that Browsertrix Crawler takes in order to interact efficiently with specific websites, e.g. auto-scrolling a social media feed to load all of its posts, clicking on all the videos on a page to play and then capture them, etc. There are already pre-made behaviors for different social media platforms and types of website, but users can also create and save their own behaviors. The sketch below shows how behaviors are switched on for a crawl.
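
The built-in behaviors are enabled per crawl via a flag. The behavior names below (autoscroll, autoplay, autofetch, siteSpecific) are the ones documented for the crawler, though the available set and the defaults may differ between versions; the seed URL and profile are again the hypothetical examples from earlier.

  # Enable background behaviors, including the site-specific ones for social media
  docker run -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://twitter.com/besttwitteraccount \
      --profile /crawls/profiles/twitter-profile.tar.gz \
      --behaviors autoscroll,autoplay,autofetch,siteSpecific \
      --behaviorTimeout 300   # give each page up to 300 seconds of behavior time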

Viewing captures

Another advantage of Browsertrix Crawler is that it allows the user to view the capture in real time, as it is happening, via screencasting, as well as take screenshots of the captured pages for later review. After a crawl is completed, pywb, the web archive replay tool that Browsertrix Crawler builds on, can be used to view the captured content. A sketch of both is given below.
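
As a sketch, screencasting and screenshots are enabled on the crawl command, and the resulting WARCs can afterwards be loaded into pywb for replay. The crawler flags follow its documentation (screenshot support was added in later versions), the pywb commands follow pywb's own documentation, and the output path under crawls/collections/ may vary by version.

  # Watch the crawl live at http://localhost:9037 and save a screenshot of each page
  docker run -p 9037:9037 -v $PWD/crawls:/crawls/ -it webrecorder/browsertrix-crawler crawl \
      --url https://example.com/ --collection my-crawl \
      --screencastPort 9037 --screenshot view

  # Afterwards, replay the captured WARCs with pywb
  pip install pywb
  wb-manager init my-crawl
  wb-manager add my-crawl crawls/collections/my-crawl/archive/*.warc.gz
  wayback   # then browse to http://localhost:8080/my-crawl/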
