Crocoite
Note: According to its GitHub page, this tool is not in active development anymore at the time of this...
The WARC format is accepted widely enough to be considered one of the default formats for storing content captured from the web. It succeeded the ARC format as the main file format used by the Internet Archive, and it is maintained by the International Internet Preservation Consortium (IIPC).
The rationale behind the WARC format is that a single file format for web archiving should hold not only the archived resources themselves but also metadata about those resources and about the capture. The WARC is thus an aggregate format: it combines all the segments of a crawled website with the HTTP requests and responses exchanged during the crawl (The WARC Format 1.1). For this reason, WARCs are mainly used by crawler-based tools that capture social media content as it appears in the browser.
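To make the idea of an aggregate format more concrete, the following is a minimal sketch in Python (standard library only) of how records can be laid out in a WARC file: a version line, named header fields, a blank line, the content block, and a closing pair of line breaks. The function names `build_warc_record` and `read_warc_records`, the URL `https://example.org/`, and the sample HTTP messages are hypothetical and chosen for illustration. The sketch ignores compression and most optional fields, so it should not be read as a complete implementation of the standard.

```python
import uuid
from datetime import datetime, timezone
from typing import Dict, Iterator, Tuple


def build_warc_record(warc_type: str, target_uri: str,
                      content_type: str, block: bytes) -> bytes:
    """Serialize one simplified WARC/1.1 record (illustrative helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.1",                                    # version line
        f"WARC-Type: {warc_type}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"WARC-Date: {now}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(block)}",
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record is closed by two CRLFs after its content block.
    return head + block + b"\r\n\r\n"


def read_warc_records(data: bytes) -> Iterator[Tuple[str, Dict[str, str], bytes]]:
    """Yield (version, header fields, content block) for each record in turn."""
    pos = 0
    while pos < len(data):
        head_end = data.index(b"\r\n\r\n", pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        fields = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")   # split on the first colon only
            fields[name.strip()] = value.strip()
        start = head_end + 4
        length = int(fields["Content-Length"])
        yield lines[0], fields, data[start:start + length]
        pos = start + length + 4                   # skip the closing CRLF CRLF


# Hypothetical capture of one page: the HTTP request and its response.
request = b"GET / HTTP/1.1\r\nHost: example.org\r\n\r\n"
response = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"

warc = (
    build_warc_record("request", "https://example.org/",
                      "application/http; msgtype=request", request)
    + build_warc_record("response", "https://example.org/",
                        "application/http; msgtype=response", response)
)

records = list(read_warc_records(warc))
for version, fields, block in records:
    print(version, fields["WARC-Type"], fields["WARC-Target-URI"], len(block))
```

Both the request and the response end up in the same file, each wrapped in its own record with capture metadata, which is the sense in which the WARC aggregates resources and crawl information.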
Even though the WARC is popular, WARCs created by different software and systems can differ, for example when tools apply the format standard inconsistently. This is one reason why validating the WARC file format can be important in a social media preservation workflow (Veenendaal 2020).
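As a rough illustration of what such validation can involve, the sketch below checks a single uncompressed record for the four header fields that the WARC standard requires on every record (`WARC-Record-ID`, `Content-Length`, `WARC-Date`, `WARC-Type`) and compares the declared `Content-Length` with the actual content block. The function name `check_record` and the sample records are hypothetical. A real validator covers far more rules (field syntax, requirements specific to each record type, compression), so this is only a sketch of the idea and not a substitute for a dedicated validation tool.

```python
from typing import List

# Header fields that are mandatory on every WARC record.
MANDATORY_FIELDS = ("WARC-Record-ID", "Content-Length", "WARC-Date", "WARC-Type")


def check_record(raw: bytes) -> List[str]:
    """Return a list of problems found in one uncompressed WARC record."""
    head, sep, rest = raw.partition(b"\r\n\r\n")
    if not sep:
        return ["no blank line separating header from content block"]

    problems = []
    lines = head.decode("utf-8", errors="replace").split("\r\n")
    if not lines[0].startswith("WARC/"):
        problems.append(f"unexpected version line: {lines[0]!r}")

    # Field names are case-insensitive, so compare them in lower case.
    fields = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        fields[name.strip().lower()] = value.strip()

    for name in MANDATORY_FIELDS:
        if name.lower() not in fields:
            problems.append(f"missing mandatory field: {name}")

    declared = fields.get("content-length")
    if declared is not None:
        if not declared.isdigit():
            problems.append("Content-Length is not a non-negative integer")
        elif rest[int(declared):] != b"\r\n\r\n":
            # After the declared number of bytes, only the closing
            # CRLF CRLF of the record should remain.
            problems.append("content block does not match Content-Length")
    return problems


# Hypothetical sample records for illustration.
good = (
    b"WARC/1.1\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>\r\n"
    b"WARC-Date: 2020-01-01T00:00:00Z\r\n"
    b"WARC-Target-URI: https://example.org/hello.txt\r\n"
    b"Content-Type: text/plain\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello\r\n\r\n"
)
bad = (
    b"WARC/1.1\r\n"
    b"WARC-Type: resource\r\n"
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000002>\r\n"
    b"Content-Length: 10\r\n"   # wrong length, and WARC-Date is missing
    b"\r\n"
    b"hello\r\n\r\n"
)

good_problems = check_record(good)
bad_problems = check_record(bad)
print(good_problems)
print(bad_problems)
```

Running checks of this kind on files produced by different capture tools is one way to surface the inconsistencies described above before the files enter long-term storage.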
Nevertheless, one of the most important things to note about WARCs is that the access they enable is meant to reproduce the experience of browsing the original website. The assumption is that the collection user will navigate from page to page and from website to website, consuming the content much as users of the page’s live version would. This assumption is reflected in the interfaces of WARC replay tools, the most characteristic of which is the Wayback Machine of the Internet Archive: a search bar for URLs, followed by the captured versions of the requested URL arranged chronologically. While useful for those who want to go through a relatively small number of resources, or to perform a close reading of the content, such access interfaces do not make it easy to discover and manage the vast volumes of data that are often contained in web and social media archival collections.
This is why, in the last few years and especially in social media archiving, the WARC itself and access methods based on browsing single websites have been complemented by alternative preservation methods that use social media APIs to pull textual, structured data.