DPC: Digital Preservation Workflow Webinars and COW-a-thon, 8th and 10th of March 2022

Free to attend and open to all, the Digital Preservation Workflow Webinar series will showcase just some of the digital preservation workflow processes developed and implemented by DPC Member institutions.

Webinar programme
Episode 1 // Workflow presentations
Tuesday 8th March 2022, 1330–1430 UTC
‘Integrating digital archiving and e-thesis submission’ ...
On 21 January, the Netwerk Digitaal Erfgoed, Podiumkunst.net, and the Netwerk Archieven Design en Digitale Cultuur (NADD) kicked off 2022 with the Digitaal Erfgoed New Year's event. The online event, presented from the Bibliotheek Utrecht and featuring several premieres, can be watched back on YouTube.
If you work at an archival institution that is looking for a solution for registering/documenting restoration work and material care (in Mais Flexis), we would like to get in touch with you.
The turn to more data-intensive access methods for web and social media archives, as indicated by the use of big data and digital humanities methods to analyze social media content, calls for capturing social media in formats appropriate for these activities. Collections made up of structured data, usually in formats like JSON, CSV, and XLSX, are more amenable to computational methods such as network analysis, topic modelling, and many other visu...
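The point about amenability can be made concrete with a short sketch. The posts below are invented sample data, not output from any particular platform or API, but they show how little code separates a structured JSON export from a network-analysis input:

```python
import json
from collections import Counter

# A hypothetical structured-data export: one JSON object per post,
# with the kind of fields a social media API would typically return.
raw = """[
  {"author": "alice", "text": "Great talk!", "mentions": ["bob"]},
  {"author": "bob", "text": "Thanks both", "mentions": ["alice", "carol"]},
  {"author": "carol", "text": "Agreed", "mentions": ["alice"]}
]"""

posts = json.loads(raw)

# Because each post is already a structured record, a weighted edge list
# of who mentions whom (the input to a network analysis) is a few lines away.
edges = Counter(
    (post["author"], mention)
    for post in posts
    for mention in post["mentions"]
)

for (source, target), weight in sorted(edges.items()):
    print(f"{source} -> {target}: {weight}")
```

Doing the same against a rendered "look and feel" capture would first require extracting these fields from HTML, which is exactly the work a structured-data collection makes unnecessary.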
The WARC format is widely enough adopted to be considered one of the default formats for storing content captured from the web. It succeeded the ARC format as the main file format used by the Internet Archive, and it is maintained by the International Internet Preservation Consortium (IIPC). The rationale behind the WARC format is that a single file format for web archiving should preferably be able to hold not only the archived resources...
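To illustrate that rationale, the sketch below hand-builds and parses one minimal WARC record using only the standard library. The record is a simplified stand-in for illustration; in practice, archives should be read with a dedicated library such as warcio rather than parsed by hand:

```python
# One WARC record: a header block of named fields, a blank line,
# a payload of Content-Length bytes, and a trailing blank line.
# Different WARC-Type values (response, request, metadata, ...) let
# one file format hold the resources and the context of their capture.
record = (
    b"WARC/1.1\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"WARC-Date: 2022-01-01T00:00:00Z\r\n"
    b"Content-Length: 13\r\n"
    b"\r\n"
    b"Hello, world!"
    b"\r\n\r\n"
)

# Split the header block from the payload at the first blank line.
head, _, rest = record.partition(b"\r\n\r\n")
lines = head.decode("utf-8").split("\r\n")
version = lines[0]
headers = dict(line.split(": ", 1) for line in lines[1:])
payload = rest[: int(headers["Content-Length"])]

print(version)                  # e.g. WARC/1.1
print(headers["WARC-Type"])     # e.g. response
print(payload.decode("utf-8"))
```

Because every record carries its own type and length, a single WARC file can interleave archived resources with requests, metadata, and revisit records.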
The two general approaches to social media archiving presented here ("look and feel" and "structured data") also have implications for file format selection, which by extension has implications for preservation and collection quality. The choices made when capturing and preserving, part of which is selecting appropriate formats according to one's purpose, will affect the possible uses the collection can be put to and, by extension, the types...
This is a list of tools that capture social media content in the form of structured data, focusing on the information included (e.g., text, URLs, number of posts) rather than on the visual features of the content.
This is a list of sources referenced in this wiki. Care has been taken to include every source; however, additions and corrections for anything that might have been accidentally overlooked are of course welcome!
It is safe to say that most of the tools that output structured data are not the easiest or most intuitive to use. One notable exception is TAGS (Twitter Archiving Google Sheet). TAGS is in essence an app built on Google Sheets that uses the Twitter API to fetch structured data based on queries the user enters in the spreadsheet. TAGS makes use of an already authenticated Twitter API app for its operation, but you are able to use your ow...
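A rough Python analogue of what TAGS does behind the spreadsheet is: take the JSON a search API returns and flatten it into tabular rows. The response shape below is a simplified, invented stand-in, not the exact Twitter API payload, and the flattening here is not TAGS itself, only a sketch of the same idea:

```python
import csv
import io
import json

# A hypothetical search-API response (simplified stand-in, not the
# real Twitter API schema).
api_response = json.loads("""{
  "data": [
    {"id": "1", "author_id": "100", "created_at": "2022-01-05T10:00:00Z",
     "text": "Testing #webarchiving"},
    {"id": "2", "author_id": "101", "created_at": "2022-01-05T11:30:00Z",
     "text": "WARC or JSON? Why not both"}
  ]
}""")

# Flatten the JSON objects into spreadsheet-style rows, one per post.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "author_id", "created_at", "text"])
writer.writeheader()
writer.writerows(api_response["data"])

print(out.getvalue())
```

The resulting CSV is exactly the kind of structured collection discussed above: immediately usable in a spreadsheet and straightforward to feed into computational analysis.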
Munin (Munin-Indexer) uses Docker to wrap different scraping and archiving tools together, offering a scraping solution for Facebook, Instagram, and VKontakte. It indexes and scrapes posts, then crawls and captures them, and finally uses pywb to display them.

Suitable for public social media content

The important thing to note about Munin is that it can only archive public posts, i.e., only posts that do not sit behind a log-in. Conseq...
Note: According to its GitHub page, this tool is no longer in active development at the time of this writing (January 2022). However, it is still available for download and still functions as expected. In a way, crocoite is a good example of a tool arising from the open-source community that could prove problematic to use in a professional setting because of a lack of ongoing support. As browser-based crawling seems to become central in th...
Initially known as Browsertrix, the Browsertrix Crawler is the latest and revamped version of what used to be a system of multiple browser-based crawlers that worked together to capture complex web content, such as social media. Browsertrix Crawler is built by the team behind the online web recording service Conifer and the desktop app ArchiveWeb.page (formerly known as Webrecorder and Webrecorder Player respectively) and uses the Chrome and C...
For those looking for large-scale harvesting solutions, Brozzler, like Browsertrix, is an interesting choice. Brozzler was developed and is still being maintained by the Internet Archive, and it is already used by organizations such as the Portuguese Web Archive. It is a browser-based crawler which uses Chrome or Chromium to access web content and harvest it in a WARC file. Brozzler is one of the newer-generation capturing tools which leverag...