Challenges in archiving social media
Understanding what could possibly be considered social media content on the one hand, and also "archiv...
For those looking for large-scale harvesting solutions, Brozzler, like Browsertrix, is an interesting choice. Brozzler was developed and is still being maintained by the Internet Archive, and it is already used by organizations such as the Portuguese Web Archive.
Brozzler is a browser-based crawler that uses Chrome or Chromium to access web content and writes what it harvests to WARC files. It belongs to the newer generation of capture tools that leverage browser technologies to interact with pages and overcome the difficulties that dynamic content poses for traditional crawlers. It is enhanced with youtube-dl, a video download tool that can extract media from crawled content. To display captured content, Brozzler uses a custom version of the pywb web archive replay tool.
For Brozzler to work, a database must be deployed to store and manage the crawl data that make harvest configuration and replay possible. Brozzler uses RethinkDB for this purpose. Using another database is possible in principle, but would probably require some tinkering to find out exactly how Brozzler interacts with RethinkDB and to replicate that behaviour with the alternative.
Brozzler also comes with a simple GUI that offers an overview of running crawl jobs and can be useful for monitoring concurrent captures. The actual crawl configuration, though, is done in YAML, a data serialization language that is often used for configuration files. A number of examples are available in the Brozzler GitHub repository, but testing on various social media platforms makes it clear that the YAML files need more specific scoping rules than just a seed list to harvest the content successfully. A simple website capture, by contrast, may need little more than a seed URL and a few job-level settings.
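A minimal sketch of such a job file is given here, assuming the job-level keys described in the Brozzler job configuration documentation (`id`, `time_limit`, `ignore_robots`, `seeds`). The job id, the seed URL, and the time-limit value are hypothetical placeholders, not settings taken from a real crawl.

```yaml
# Hypothetical job file for a simple website capture.
id: simple_website_capture      # placeholder job identifier
time_limit: 3600                # stop the crawl after this many seconds (assumed value)
ignore_robots: true             # do not let robots.txt block the crawl
seeds:
  - url: https://example.org/   # placeholder seed URL
```

A file like this is normally submitted from the command line, for instance with the `brozzler-new-job` command described in the Brozzler README.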
A fictional website of this kind is simple enough to warrant only a time-limit setting and a request to ignore the robots.txt policy if it attempts to block the crawl.
However, a lot of social media content is unfortunately not that simple to capture. For example, capturing Facebook with Brozzler may repeatedly produce WARCs that contain only the rudimentary user interface of the page, but no actual content. One possible solution is to add scoping rules, e.g., to exclude the domain “facebook.com” from being crawled and to block URLs that contain specific strings of characters. These settings are based on the instructions that the Archive-It team provides in its Help Center for users of the paid Archive-It service. Archive-It itself uses Brozzler to archive dynamic websites such as social media, so the tips in the Help Center can come in handy when using the standalone version of Brozzler. Please note, however, that those instructions are written for Archive-It users, who configure the crawler through a GUI. To configure the free and open-source Brozzler, the same rules must be written in a YAML file and applied via the command line.
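The scoping rules described above could be sketched as follows, assuming the `scope` and `blocks` keys and the `domain` and `substring` rule conditions described in the Brozzler job configuration documentation. The job id, the seed URL, and the substring value are hypothetical placeholders; the actual strings to block should be taken from the Archive-It guidance for the platform in question, and the effect of each rule should be checked with a short test crawl.

```yaml
# Hypothetical job file for a Facebook page capture with scoping rules.
id: facebook_page_capture                        # placeholder job identifier
time_limit: 3600                                 # seconds (assumed value)
ignore_robots: true
seeds:
  - url: https://www.facebook.com/examplepage/   # placeholder seed URL
scope:
  blocks:
    # Do not follow links that lead to other URLs on the facebook.com domain.
    - domain: facebook.com
    # Do not follow URLs containing this string (placeholder value).
    - substring: example-string-to-block
```

Each entry under `blocks` is a separate rule, so a URL is excluded when it matches any one of them; conditions listed together within a single entry must all match.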
One of the most notable missing features in Brozzler, as in Browsertrix, is the native capability to schedule crawls for the future, either as one-time or as recurring jobs. Scheduled crawling could generate a significant amount of harvested data that requires (temporary) storage and possibly appraisal, and it brings the risk of capturing redundant content. Even so, it could greatly benefit organizations that would like to automate their social media archiving workflows. Although it is not available natively, scheduling could be achieved by controlling the browser remotely, e.g., with a tool such as Puppeteer, and scheduling jobs to run through it.