Challenges in archiving social media
Understanding what could possibly be considered social media content on the one hand, and also "archiv...
The turn to more data-intensive methods of accessing web and social media archives, evident in the use of big data and digital humanities methods to analyze social media content, calls for capturing social media in formats suited to these activities.
Collections made up of structured data, usually in formats such as JSON, CSV, and XLSX, are more amenable to computational methods such as network analysis and topic modelling, as well as many other visualization and analysis methods. Indeed, critics of WARC-based collections claim that a lot of extra work is needed to make WARC data machine-actionable (Wang and Xie 2020); collections in JSON or CSV could potentially lower the threshold of that effort.
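As a rough illustration of why structured collections lower that threshold, the following minimal sketch turns a handful of JSON records into a weighted mention network and writes it out as a CSV edge list, using only the Python standard library. The sample records and the field names `id`, `author`, and `mentions` are assumptions made for this example; they do not reflect any platform's actual export schema.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical sample records; the field names "id", "author", and "mentions"
# are assumptions for illustration, not any platform's actual schema.
SAMPLE_JSON = """
[
  {"id": "1", "author": "alice", "mentions": ["bob", "carol"]},
  {"id": "2", "author": "bob", "mentions": ["alice"]},
  {"id": "3", "author": "alice", "mentions": ["bob"]}
]
"""


def mention_edges(posts):
    """Count directed (author, mentioned account) pairs across all posts."""
    edges = Counter()
    for post in posts:
        for target in post.get("mentions", []):
            edges[(post["author"], target)] += 1
    return edges


def edges_to_csv(edges):
    """Serialize the weighted edge list as CSV text, sorted for stable output."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])
    return buffer.getvalue()


posts = json.loads(SAMPLE_JSON)
edges = mention_edges(posts)
print(edges_to_csv(edges))
```

An edge list of this kind can be loaded directly into common network analysis tools, whereas a WARC capture of the same posts would first need the relevant fields extracted from the archived pages.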
However, there is a catch that concerns the integrity and reliability of social media archives. Such structured data, if they derive from social media APIs, as they usually do, are difficult to control in terms of provenance, because the platforms themselves are not transparent about their data exchange and publishing policies. If we do not know how the platforms select the data they give us, we cannot make claims about its completeness and integrity. Moreover, Twitter requires us to delete data we have captured if a user deletes it from their live profile. This requirement puts the reliability of collections at risk and creates an extra task for archivists, who must monitor social media websites for changes.
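The deletion task can be pictured with a minimal sketch, given here under several assumptions. It presumes the archive already holds a set of record identifiers known to have been deleted on the live platform; how that set is obtained is outside its scope. The function name `apply_deletions`, the record fields, and the date are hypothetical. The sketch drops the affected records and keeps only a count-based log entry, so that the collection documents that a removal took place without retaining the deleted content.

```python
from datetime import date


def apply_deletions(records, deleted_ids, on_date):
    """Drop records whose id is in deleted_ids.

    Returns the remaining records and a log entry that records only counts,
    not the removed content or its identifiers.
    """
    deleted = set(deleted_ids)
    kept = [record for record in records if record["id"] not in deleted]
    log_entry = {
        "date": on_date.isoformat(),
        "removed_count": len(records) - len(kept),
        "remaining_count": len(kept),
    }
    return kept, log_entry


# Hypothetical archived records and a hypothetical set of deleted identifiers.
archive = [
    {"id": "1", "text": "first post"},
    {"id": "2", "text": "second post"},
    {"id": "3", "text": "third post"},
]
kept, log_entry = apply_deletions(archive, {"2"}, date(2021, 1, 15))
print(log_entry)
```

Whether such a log is sufficient documentation, or is itself permitted under a platform's terms, is a policy question that the sketch does not settle.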