Browse contributions

  • Jan 2022
  • Zefi Kavvadia
  • 190

The turn toward more data-intensive methods of accessing web and social media archives, as indicated by the use of big data and digital humanities methods to analyze social media content, calls for capturing social media in formats appropriate for these activities. Collections made up of structured data, usually in formats like JSON, CSV, and XLSX, are more amenable to computational methods such as network analysis, topic modelling, and many other visu...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 218

The WARC format is accepted widely enough to be considered one of the default formats for storing captured content from the web. It succeeded its predecessor, the ARC format, as the main file format in use by the Internet Archive, and is maintained by the International Internet Preservation Consortium (IIPC). The rationale behind the WARC format is that one file format for web archiving should preferably be able to hold not only the archived resources...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 206

The two general approaches to social media archiving presented here ("look and feel" and "structured data") also have implications for file format selection, which in turn has implications for preservation and collection quality. The choices made when capturing and preserving, part of which is selecting appropriate formats according to one's purpose, will affect the possible uses the collection can be put to, and by extension, the types...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 143

This is a list of tools that capture social media content in the form of structured data, focusing on the information included (e.g. text, URLs, number of posts) and not on the visual features of the content.

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 226

This is a list of tools that capture the experience of browsing social media, i.e. its visual features, media, and so on.

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 182

Summary

This is a list of sources referenced in this wiki. Care has been taken to include every source; however, additions and corrections for anything that might have been accidentally overlooked are of course welcome!
Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 265

It is safe to say that most of the tools that output structured data are not the easiest or most intuitive to use. One notable exception is TAGS (Twitter Archiving Google Sheet). TAGS is in essence an app built on Google Sheets that uses the Twitter API to fetch structured data based on queries the user enters in the spreadsheet. TAGS makes use of an already authenticated Twitter API app for its operation, but you are able to use your ow...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 222

Note: According to its GitHub page, this tool is no longer in active development at the time of writing (January 2022). However, it is still available for download and still functions as expected. In a way, crocoite is a good example of a tool arising from the open-source community that could prove problematic to use in a professional setting because of a lack of ongoing support. As browser-based crawling seems to become central in th...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 204

Munin (Munin-Indexer) uses Docker to wrap different scraping and archiving tools together and offer a scraping solution for Facebook, Instagram, and VKontakte. It indexes and scrapes posts, then crawls and captures them, and finally uses pywb to display them.

Suitable for public social media content

The important thing to note about Munin is that it is only able to archive public posts, i.e., only posts that do not sit behind a log-in. Conseq...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 253

Initially known as Browsertrix, Browsertrix Crawler is the latest, revamped version of what used to be a system of multiple browser-based crawlers that worked together to capture complex web content, such as social media. Browsertrix Crawler is built by the team behind the online web recording service Conifer and the desktop app ArchiveWeb.page (formerly known as Webrecorder and Webrecorder Player respectively) and uses the Chrome and C...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 358

For those looking for large-scale harvesting solutions, Brozzler, like Browsertrix, is an interesting choice. Brozzler was developed and is still maintained by the Internet Archive, and it is already used by organizations such as the Portuguese Web Archive. It is a browser-based crawler that uses Chrome or Chromium to access web content and harvest it into a WARC file. Brozzler is one of the newer-generation capturing tools which leverag...

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 175

The tools are divided based on the type of output they produce. Each page includes a presentation of the tool's features as well as a more detailed review based on personal experience with implementing and using it.

Particuliere Websites en SoMe
  • Jan 2022
  • Zefi Kavvadia
  • 158

Part of the research into social media archiving tools performed by NDE/IISG concerned finding out what requirements these tools would have to meet to be considered suitable for different kinds of usage and users. These pages present more general quality attributes for social media archiving software that could be relevant for any type of tool, free and open-source or not. Additionally, the functional requirements that were first ...

Particuliere Websites en SoMe