Crocoite

Zefi Kavvadia, January 2022

Note: According to its GitHub page, this tool is no longer in active development at the time of this writing (January 2022). However, it is still available for download and it still functions as expected. In a way, crocoite is a good example of a tool arising from the open-source community that could prove problematic to use in a professional setting because of its lack of ongoing support.

As browser-based crawling becomes central to the practice of archiving the dynamic web, including social media, interest in using headless browsers to crawl and capture online content is growing accordingly. Headless browsers are in essence browsers stripped of their GUI: they can perform all the functions of a regular browser, but they do so in the background, without displaying anything to the user. Headless browsers are often used when testing a page, to make sure all the interactions run smoothly without using up a lot of system resources (which GUIs often do). Their flexibility and speed have made them attractive to the web and social media archiving community, and more and more tools, e.g. Brozzler, are experimenting with them.
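To make the idea concrete, here is a minimal sketch of driving headless Chrome from Python via Selenium. This assumes a local Chrome installation and the Selenium bindings; it is an illustration of headless browsing in general, not of how crocoite itself controls the browser.

    # Minimal sketch: run Chrome without a GUI and load a page.
    # Assumes a local Chrome installation and the selenium package.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument('--headless')  # no window is ever drawn
    driver = webdriver.Chrome(options=options)
    driver.get('https://example.com/')
    print(driver.title)  # the page was fetched and fully rendered
    driver.quit()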

Headless browser for capturing

Crocoite is one such tool; unlike others in this list, it relies exclusively on a headless browser for all its operations. Using Chrome in headless mode, crocoite is able to fetch JavaScript-heavy online content, such as social media, and store it in WARCs. The tool is operated exclusively via the CLI, and specifically on a Linux machine (testing on Mac was not successful). Crawl configuration is not extremely granular, but it is still useful: it allows a quick capture of a single seed, or more detailed instructions to follow and capture links from that seed (see the tool's GitHub repository for more detailed instructions on forming capture commands).
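As a hedged illustration only (treat the exact entry-point name and argument order as an assumption, and check the repository README for the current syntax), a single-seed capture to a compressed WARC looked roughly like:

    crocoite-grab http://example.com/ example.com.warc.gz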

According to the developer, crocoite is able to archive the dynamic web so successfully because it bases its operation on picking up the network traffic between the headless browser and the page, and using it to reconstruct the URLs to be captured. This means that, in essence, what crocoite captures is not necessarily what the website's server sent to the client/browser: it is a reconstruction based on the data that crocoite picks up by listening to network events. The reconstruction will more often than not be accurate, at least at the level of an end user browsing an archived page, and for most collection users and archivists it will probably be undetectable as well, unless perhaps they forensically examine the harvested files and compare them against the live website traffic.
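A minimal sketch of this idea, not crocoite's actual implementation, can be put together with the pyppeteer and warcio libraries: listen for the network responses the browser reports while a page loads, then rebuild each observed exchange as a WARC response record. (Crocoite itself talks to the Chrome DevTools Protocol directly; pyppeteer merely wraps that protocol here.)

    # Sketch only: reconstruct WARC records from what the headless
    # browser reports over its network events.
    import asyncio
    from io import BytesIO

    from pyppeteer import launch
    from warcio.statusandheaders import StatusAndHeaders
    from warcio.warcwriter import WARCWriter

    async def capture(url, warc_path):
        browser = await launch(headless=True)
        page = await browser.newPage()

        responses = []
        # Collect every network response observed while the page loads.
        page.on('response', lambda response: responses.append(response))
        await page.goto(url, waitUntil='networkidle0')

        with open(warc_path, 'wb') as out:
            writer = WARCWriter(out, gzip=True)
            for response in responses:
                try:
                    body = await response.buffer()
                except Exception:
                    continue  # body no longer retrievable, e.g. redirects
                # Rebuild the HTTP exchange from the browser's view of it.
                http_headers = StatusAndHeaders(
                    str(response.status), list(response.headers.items()),
                    protocol='HTTP/1.1')
                writer.write_record(writer.create_warc_record(
                    response.url, 'response',
                    payload=BytesIO(body), http_headers=http_headers))

        await browser.close()

    asyncio.run(capture('https://example.com/', 'example.com.warc.gz'))

Note how each record is assembled from the browser's report of the exchange, not from bytes read directly off the wire; that is precisely the sense in which such a capture is a reconstruction rather than an original.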

Nevertheless, it does underline the fact that what we archive when we archive the web is almost never an “original”: it is rather a reconstruction of elements from the content as it was at the time of capture, combined with materials necessarily introduced during the archiving process to make it possible (Brügger 2011, p. 32). This is the case not only with crocoite but with practically any tool we use to capture and reproduce online digital content: strictly speaking, even the automated “behaviors” we must use to programmatically trigger content that requires interaction to load are in a way an intervention, slightly altering the captured content to make it replayable. It is useful to keep this in mind as we begin capturing with any tool.
