It is safe to say that most tools that output structured data are not the easiest or most intuitive to use. One notable exception is TAGS (Twitter Archiving Google Sheet). TAGS is, in essence, an app built on Google Sheets that uses the Twitter API to fetch structured data based on queries the user enters in the spreadsheet. TAGS makes use of an already authenticated Twitter API app for its operation, but you are also able to use your own Twitter API credentials.
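To give an idea of what happens behind the spreadsheet, the sketch below builds the kind of search request a tool like TAGS issues against the Twitter API. The endpoint and parameter names follow the historical v1.1 search API and are illustrative assumptions; API versions and access rules have changed over time, so check the current documentation before relying on them.

```python
# Illustrative sketch: constructing a Twitter v1.1 search request URL,
# the kind of call TAGS makes on each run. Parameter names are assumptions
# based on the historical v1.1 API, not a guaranteed current interface.
from urllib.parse import urlencode

def build_search_url(query, count=100, since_id=None):
    """Return a search URL; since_id limits results to tweets newer than the last run."""
    params = {"q": query, "count": count, "result_type": "recent"}
    if since_id is not None:
        params["since_id"] = since_id  # incremental fetching, as TAGS does per run
    return "https://api.twitter.com/1.1/search/tweets.json?" + urlencode(params)
```

The `since_id` pattern is what makes spreadsheet-based archiving practical: each scheduled run only fetches tweets it has not stored yet.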
Note: According to its GitHub page, this tool is no longer in active development at the time of this writing (January 2022). However, it is still available for download and still functions as expected. In a way, crocoite is a good example of a tool arising from the open-source community that could prove problematic to use in a professional setting because of a lack of ongoing support. As browser-based crawling seems to be becoming central in the field of web archiving, this lack of support is unfortunate.
Munin (Munin-Indexer) uses Docker to wrap different scraping and archiving tools together, offering a scraping solution for Facebook, Instagram, and VKontakte. It indexes and scrapes posts, then crawls and captures them, and finally uses pywb to display them.

Suitable for public social media content

The important thing to note about Munin is that it can only archive public posts, i.e., only posts that do not sit behind a log-in. Consequently, content that requires authentication falls outside its scope.
Initially known as Browsertrix, Browsertrix Crawler is the latest, revamped version of what used to be a system of multiple browser-based crawlers working together to capture complex web content, such as social media. Browsertrix Crawler is built by the team behind the online web recording service Conifer and the desktop app ArchiveWeb.page (formerly known as Webrecorder and Webrecorder Player, respectively) and uses the Chrome and Chromium browsers to capture dynamic web content.
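Since Browsertrix Crawler is distributed as a Docker image, a crawl is typically started with a single `docker run` invocation. The sketch below assembles such a command; the image name and flags reflect the project's documented conventions at the time of writing, but versions change, so treat them as assumptions to verify against the current README.

```python
# Sketch: assembling a Browsertrix Crawler invocation via Docker.
# Image name and flags are based on the project's documentation and may
# differ between releases -- verify against the current README before use.
def browsertrix_command(url, collection="my-crawl", workers=2):
    """Return the docker argv for a single-site Browsertrix crawl."""
    return [
        "docker", "run",
        "-v", "./crawls:/crawls/",            # output directory on the host
        "webrecorder/browsertrix-crawler", "crawl",
        "--url", url,
        "--collection", collection,
        "--workers", str(workers),            # parallel browser workers
        "--generateWACZ",                     # package the capture as a WACZ file
    ]
```

A list such as this can be passed directly to `subprocess.run()`, or joined into a shell one-liner for documentation and reproducibility.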
For those looking for large-scale harvesting solutions, Brozzler, like Browsertrix, is an interesting choice. Brozzler was developed and is still maintained by the Internet Archive, and it is already used by organizations such as the Portuguese Web Archive. It is a browser-based crawler that uses Chrome or Chromium to access web content and harvest it into a WARC file. Brozzler is one of the newer-generation capture tools that leverage an actual browser to render and capture dynamic content.
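Because WARC files come up repeatedly in this survey, it may help to see what a WARC record actually looks like. The sketch below hand-builds one minimal `response` record for illustration only; real tools such as Brozzler use dedicated libraries and write many additional headers, so this is a simplified view of the format, not a production serializer.

```python
# Illustrative sketch of the WARC record structure that browser-based
# crawlers write. Minimal on purpose: real records carry more headers
# (digests, concurrent-record links, etc.).
import uuid
from datetime import datetime, timezone

def warc_response_record(target_uri: str, http_payload: bytes) -> bytes:
    """Serialize one minimal WARC 'response' record."""
    headers = [
        "WARC/1.1",
        "WARC-Type: response",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {datetime.now(timezone.utc).strftime('%Y-%m-%dT%H:%M:%SZ')}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        "Content-Type: application/http; msgtype=response",
        f"Content-Length: {len(http_payload)}",
    ]
    # Header block, blank line, captured HTTP payload, record separator.
    return ("\r\n".join(headers) + "\r\n\r\n").encode() + http_payload + b"\r\n\r\n"
```

A WARC file is simply a concatenation of such records, which is why crawlers can append captures continuously and replay tools like pywb can index them by target URI.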
The tools are divided based on the type of output they produce, and each page includes a presentation of features as well as a more detailed review based on personal experience with implementing and using them.
Part of the research into social media archiving tools performed by NDE/IISG had to do with finding out what the requirements would be to consider these tools suitable for different kinds of usage and users. In these pages, more general quality attributes for social media archiving software are presented that could be relevant for any type of tool, free and open-source or not. Additionally, the functional requirements that were first drafted during earlier projects are presented.
This wiki page includes the tool surveys that NDE/IISG performed between 2020 and 2021. The aim is to keep these pages as up-to-date as possible and to encourage other organizations involved with web archiving in the Netherlands to contribute content they consider useful.
Instead of focusing on the "look and feel" of the material, i.e., its visual form and its multimedia affordances, the "structured data" approach focuses on informational qualities and the raw data derived from the captured social media content. The output is structured textual data, usually in tabular form. Social media platforms make structured data derived from their websites available via API services, i.e., specific interfaces created to give programmatic access to their data.
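The step from an API response to tabular output can be sketched concretely. The snippet below flattens a nested, Twitter-style JSON payload into CSV rows; the field names (`author`, `public_metrics`, and so on) mimic a typical platform response and are illustrative assumptions, not an exact platform schema.

```python
# Sketch: flattening nested "structured data" from a social media API
# into tabular (CSV) rows. Field names are illustrative assumptions
# modeled on a Twitter-style payload, not an exact platform schema.
import csv
import io
import json

SAMPLE = json.loads("""
{"data": [
  {"id": "1", "text": "hello world", "author": {"username": "alice"},
   "public_metrics": {"like_count": 3}},
  {"id": "2", "text": "archiving!", "author": {"username": "bob"},
   "public_metrics": {"like_count": 7}}
]}
""")

def to_csv(payload) -> str:
    """Flatten each post into one row: id, username, text, likes."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "username", "text", "likes"])
    for post in payload["data"]:
        writer.writerow([post["id"], post["author"]["username"],
                         post["text"], post["public_metrics"]["like_count"]])
    return buf.getvalue()
```

This flattening is exactly what gets lost and gained in the structured-data approach: the table is easy to query and analyze, but nested context (threads, media, reply chains) must be either flattened into extra columns or dropped.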
This method is based on common web archiving practices, which makes sense, as social media archiving can be seen as an offshoot of web archiving. The most common method of web archiving, namely web crawling or web harvesting, attempts to preserve the so-called “look and feel” of online content, meaning the layout, structure, and style of a website, as well as its navigational features, like buttons and menus. However, this has proven to be relatively difficult for dynamic, interactive content such as social media.
Social media appear in many shapes and forms, are aimed at many different audiences, and serve many different purposes. This variety also affects how we can consider what a social media archival collection is and how it is created. In this section, two broad approaches to social media archiving will be presented, i.e., "look and feel" and "structured data". These are not mutually exclusive, and in many cases it could be preferable or advisable to combine them.
Understanding what could possibly be considered social media content on the one hand, and "archive-worthy" social media content on the other, is important for tool selection and assessment, but also for acknowledging that at times the solution might not lie with tools per se.

The nature of the content is significant for tool selection

For example, capturing Instagram content without capturing its media will result in a collection that misses much of what makes that content meaningful.
Social media are identified with the rise of Web 2.0 and the era of increasing online interactivity, personalization, mobile devices, and cloud computing. A broad definition like the one proposed by Treem et al. (2016, p. 770) is useful here; they see social media as technologies that “create a way for individuals to maintain current relationships, to create new connections, to create and share their own content, and, in some degree, to make their […]”.
The quality attributes mentioned below refer to the features of the tools as software itself, i.e., the ways in which a tool achieves what it was designed to do. These attributes affect the quality of the user experience a tool offers as well as its sustainability (Chung et al. 2000). The requirements listed below were taken from the practice of software testing and software selection. While most of the academic and professional literature on software testing …
The list of functional requirements below is based on an earlier project carried out within IISG, which focused on workflows for acquiring and preserving born-digital materials in the broad sense, and on another project that looked into web archiving tools and workflows specifically. For this NDE/IISG research on social media archiving tools, we were particularly interested in testing tools and their outcomes with a focus on how they can be used in practice.
In the NDE podcast series ‘Paulus en De Nijs op reis’, network editor Ronald de Nijs and journalist Kirsten Paulus travel across the Netherlands. They speak with twelve heritage professionals who have inspiring stories about digital heritage. In the tenth episode, Migiza Victoriashoop, advisor for digital information provision at the Waterlands Archief, talks about her successes, missteps, and dreams. Listen via SoundCloud or Spotify.
This coming Thursday (27-01), the KP Dienstverlening is organizing a (digital) exchange hour. Nadine Groffen and Annet Waalkens of the NA will discuss how they organize Openbaarheidsdag each year. This topic may also be of interest to members of the KP Informatierecht. More information and the meeting link can be found via /calendar_events/357