Archiving the Web: How to Support Research of Future Heritage?

Post date: Apr 20, 2013 2:14:20 PM

Impressions from NWO-CATCH meeting 

Archiving the Web: How to Support Research of Future Heritage?

Hosted by the WebART project, at the KB, the National Library of the Netherlands

April 19 2013

On 19 April 2013 the WebART project hosted a CATCH-NWO meeting at the National Library of the Netherlands. The CATCH program (Continuous Access To Cultural Heritage), sponsor of the WebART project and of the event, is the program of the Dutch science foundation (NWO) that brings IT researchers and heritage managers together to work on making heritage available digitally. The program encourages collaboration, innovation and the transfer of knowledge.

The meeting was dedicated to supporting research with Web archives. We were delighted to have Mrs. Helen Hockx-Yu, Dr. Bernhard Rieder and Mr. Bill LeFurgy as keynote speakers.

The day started with a pre-meeting with Mrs. Helen Hockx-Yu, head of the UK Web Archive, one of the world's leading Web archiving institutions, renowned for its innovative approach to making Web archives accessible for use and research. In an informal setting, KB staff involved in the Dutch Web archiving initiative, researchers from the WebART project and other interested scholars from the Digital Methods Initiative at the University of Amsterdam openly discussed the challenges and barriers that both the UK and the Dutch Web archives face in making their data and collections accessible, and the types of research that can be done with them.

After lunch, during which the WebART team presented a poster and a demo of WebARTist, our new search tool for temporal archived Web data, the CATCH meeting officially opened with a fascinating keynote lecture by Mrs. Hockx-Yu. She presented the history of the British Library's Web archiving initiative and the milestone it recently reached: the transition from an open-access, selective archiving approach to full-domain harvesting of the .uk domain. Mrs. Hockx-Yu then described the challenges of accessing Web archive data for research. Of the 29 national Web archives that are members of the IIPC, only 19 offer some form of access to their collections, and that access is often available only in the library's reading room and only with permission. Where online access to Web archives does exist, usage is very low.

With little evidence of scholarly use of Web archives, Hockx-Yu raised a chicken-and-egg problem: limited access reduces the use of Web archives in research, and with so little use we cannot really know what the research community needs.

Since most Web archives currently offer only URL-based access to their data, Hockx-Yu said that full-text search is the next challenge on the road map for most national Web archives.

Mrs. Hockx-Yu described the features and tools developed by the UK Web Archive, which, as of April 2013 (before the transition to full-domain harvesting), consists of 35 TB of data and 4 million hosts. The UK Web Archive has a search interface serving special collections of curated Websites that are available online. Among other tools, it also supports visual browsing, RSS feeds of the latest instances added to the archive, N-gram search and other visualization tools. She also described their recent experience with TwitterVane, a tool recently developed by IIPC that uses references on Twitter to recommend Websites for archiving.

One of the strongest claims made by Hockx-Yu was her description of the national turn of Web archives. According to her, the many Web archiving initiatives did not archive The Web, but rather deconstructed it and divided it up by country, flattened it (archived Web pages are stripped of their 'pretext', or contextual elements such as hyperlinks), and only allowed a document-based approach to it. The way forward, claims Hockx-Yu, is to move beyond the document approach to a big-data approach. Using tools such as Memento, which allows consulting archived versions of a URL across the Web, is one way of stitching the Web back together.
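To give an idea of how Memento lets a client ask for archived versions of a URL, here is a minimal Python sketch. It only builds the Accept-Datetime request header and reads a Link response header in the style of the Memento protocol; it makes no network call. The archive host archive.example and the sample header are hypothetical, and the parser is deliberately simplified (it assumes the rel attribute comes first and an optional datetime attribute follows directly).

```python
import re
from datetime import datetime, timezone
from email.utils import format_datetime

# Simplified pattern for one Link header entry: <uri>; rel="..."; datetime="..."
LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"(?:\s*;\s*datetime="([^"]+)")?')


def accept_datetime_headers(when):
    """Return request headers asking for the capture closest to `when`."""
    as_utc = when.astimezone(timezone.utc)
    return {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}


def parse_mementos(link_header):
    """Return (uri, datetime string) pairs for entries whose rel includes 'memento'."""
    mementos = []
    for uri, rel, captured in LINK_RE.findall(link_header):
        if "memento" in rel.split():
            mementos.append((uri, captured))
    return mementos


# Hypothetical Link header, as an archive might return it for http://example.org/
SAMPLE_LINK_HEADER = (
    '<http://example.org/>; rel="original", '
    '<http://archive.example/timegate/http://example.org/>; rel="timegate", '
    '<http://archive.example/20130101000000/http://example.org/>; '
    'rel="first memento"; datetime="Tue, 01 Jan 2013 00:00:00 GMT", '
    '<http://archive.example/20130419120000/http://example.org/>; '
    'rel="memento"; datetime="Fri, 19 Apr 2013 12:00:00 GMT"'
)

headers = accept_datetime_headers(datetime(2013, 4, 19, 12, 0, tzinfo=timezone.utc))
mementos = parse_mementos(SAMPLE_LINK_HEADER)
```

Because the same request can be sent to any archive that speaks the protocol, a client does not need to know in advance which archive holds a capture of the page.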

The second keynote lecture, by Dr. Bernhard Rieder from the Media Studies department of the University of Amsterdam, introduced innovative tools for analyzing temporal Web collections from the new media researcher's perspective. Dr. Rieder explained the methodological challenges in creating collections of social media data for research. The platforms' APIs allow access to some (structured) data and preclude access to other data. This poses a methodological problem. To answer a research question, is it better to obtain the largest sample possible, or rather a smaller and more precise sample of the data? And against which baseline could the obtained data be compared? Dr. Rieder advocates sampling a random 1% of all tweets and studying its characteristics as a baseline for comparison. He showed recent findings from the 2013 Digital Methods Initiative's Winter School, where together with Dr. Carolin Gerlitz he collected the random 1% sample of all tweets that the Twitter API allows, over the course of 24 hours. Dr. Rieder's keynote lecture ended with a new media researcher's "wish list" for Web archives. The wish list included moral, political and legal support for creating large collections of Web data; fast, search-based access to a continuous 1% sample with translated URLs and click data, as well as comprehensive statistics; and an easy, non-bureaucratic way to submit data collections, without having to use standardized formats.
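The baseline idea can be illustrated with a small Python sketch. This is not the procedure used at the Winter School: the tweet stream is synthetic, the 15% hashtag rate is an arbitrary assumption, and the function names are made up for illustration. It draws a random sample of roughly 1% of a stream and measures one characteristic (the share of tweets with a hashtag), which can then be compared with the same measure in a topical collection.

```python
import random


def bernoulli_sample(items, rate, seed=0):
    """Keep each item independently with probability `rate`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < rate]


def share(items, predicate):
    """Fraction of items for which `predicate` is true (0.0 for an empty list)."""
    items = list(items)
    if not items:
        return 0.0
    return sum(1 for item in items if predicate(item)) / len(items)


def has_hashtag(text):
    return "#" in text


def synthetic_stream(n, hashtag_rate=0.15, seed=1):
    """Made-up stand-in for a tweet stream; the hashtag rate is arbitrary."""
    rng = random.Random(seed)
    return [
        ("#tag " if rng.random() < hashtag_rate else "") + f"tweet {i}"
        for i in range(n)
    ]


stream = synthetic_stream(100_000)
baseline = bernoulli_sample(stream, rate=0.01)
baseline_share = share(baseline, has_hashtag)

# A topical collection built by querying for a hashtag contains hashtags by
# construction, so its share only becomes meaningful next to the baseline.
topical = [tweet for tweet in stream if "#tag" in tweet][:500]
topical_share = share(topical, has_hashtag)

print(f"baseline: {len(baseline)} tweets, {baseline_share:.1%} with a hashtag")
print(f"topical:  {len(topical)} tweets, {topical_share:.1%} with a hashtag")
```

The point of the comparison is that a figure from a query-based collection says little on its own; the random sample shows what is typical for the platform as a whole.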

Dr. Rieder's wish list did not fall on deaf ears. We saw the next keynote speaker, Mr. Bill LeFurgy, head of the digital initiatives program at the US Library of Congress, who is among other things involved in archiving the Twitter collection, scribbling down the requests before going on stage. Mr. LeFurgy's talk was dedicated to branding the stewardship of big ("really big") digital collections. He described ways in which heritage institutions (such as galleries, libraries, archives and museums) can brand the added value of stewarding digital collections, Web archives among them, to increase the interest of stakeholders, researchers and the public in their importance.

People love libraries more than other governmental institutions, claims Mr. LeFurgy, but they do not always understand the importance of libraries' work in preserving and serving digital data. As an example, Mr. LeFurgy quoted a UK magazine that covered the UK Web Archive's transition to full-domain harvesting by asking people on the street about the claim, made by Web archives, that archiving the Web is important to prevent the loss of future digital cultural heritage. The people on the street, so it seems, were rather puzzled by this claim. "The Internet archives itself, doesn't it? It's called Google", was their answer. To change public misconceptions of digital stewardship initiatives, claims Mr. LeFurgy, heritage institutions should proactively engage with researchers and mainstream media.

Researchers' engagement with Web archives was also the topic of the meeting's closing panel. Mr. Ewoud Sanders, NRC journalist, historian and researcher, chaired the panel. The panelists were Hockx-Yu and LeFurgy, as well as Prof. Arjen de Vries (CWI/WebART) and Prof. Richard Rogers (UvA/WebART). Mr. Sanders asked each of the panelists to make a statement about the current barriers to scholarly use of Web archives, and about how to engage researchers.

The panelists' answers to these questions added up to a cumulative list of barriers:

The main barrier, claims Prof. De Vries, is still the legal restrictions that only allow on-site access to Web archives, together with the governmental institutions' lack of "risk appetite" (a term previously mentioned by Hockx-Yu) to change this situation. Common Crawl, the not-for-profit Web crawl available for research, shows that another approach is possible: it has no access restrictions, although it may be less well curated than institutional Web archives. Another barrier, this one technical, explains Prof. De Vries, is that unlike search engines, Web archives unfortunately miss out on the large amounts of usage data necessary to improve the algorithms and the relevance of search results.

Hockx-Yu added to De Vries' list the risky dependency on a common set of tools used by institutions to archive the Web. Web archiving at scale has been taking place for 10 years, but the development of tools has not kept pace. Another barrier to scholarly use of Web archives, according to Hockx-Yu, is the monolithic, document-based access to Web archives (which only allows searching for a URL and browsing its previous captures). The biggest barrier, however, is the pace of technological developments, with which institutional Web archives cannot keep up.

LeFurgy's view on the barriers to advancing scholarly use of Web archives focused on social and economic reasons. From a social point of view, he discussed the difficulty of making the case for the value of Web archives and of the institutions that hold them, which are run on tax money. Although Web archives hold many opportunities for research - in the fields of digital humanities, journalism and local history - the assumption that "if you build it they will come" is not always true, and more research engagement and PR are necessary.

From an economic perspective, archiving initiatives need sustained funding that allows for the long-term stewardship of digital data and for hiring staff.

Prof. Rogers added to the list of barriers by asking what it is that we are archiving, and what the status of the Web is more generally. Initially, the barriers to archiving the Web concerned its ephemerality. Subsequently, concerns were raised about studying static archived Web pages as compared to the dynamic live Web. Now, however, the concerns focus on the decline of the informational and open Web, something that people such as Tim Berners-Lee worry about. The Web, claims Prof. Rogers, may be in decline, and that may change the way we think about Web archives in the near future. The new Web that is rising (the one of corporate social media platforms and mobile applications) is currently not archived. If we archive the new Web through an API - because this is what the platforms allow us to do - how does that change our object of study? In addition, one should consider the status of Web archives as a media source vis-à-vis other media sources. When writing history, are Web archives good source material?

All panelists agreed that in order to stay relevant, Web archives should proactively engage and work with researchers. Engagement can be improved by looking at current scholarly practices. One way to explore access to archived Web collections outside the reading rooms (from a big-data approach) is to allow online access to aggregated metadata, rather than to the archived files.
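As a rough illustration of what "aggregated metadata" could mean in practice, the Python sketch below reduces a list of capture records to counts per host and year, without exposing any archived content. The record layout (a URL plus a 14-digit timestamp string) and the sample records are assumptions made for this example, not the format of any particular archive.

```python
from collections import Counter
from urllib.parse import urlsplit


def aggregate_captures(captures):
    """Count captures per (host, year) from (url, timestamp) records.

    Timestamps are assumed to be strings that start with a four-digit year,
    for example "20130419120000".
    """
    counts = Counter()
    for url, timestamp in captures:
        host = urlsplit(url).hostname or ""
        year = timestamp[:4]
        counts[(host, year)] += 1
    return counts


# Hypothetical capture records, for illustration only
captures = [
    ("http://www.example.nl/", "20120301101500"),
    ("http://www.example.nl/nieuws", "20130419120000"),
    ("http://www.example.nl/contact", "20130420090000"),
    ("http://blog.example.org/post/1", "20130102030405"),
]

counts = aggregate_captures(captures)
for (host, year), n in sorted(counts.items()):
    print(host, year, n)
```

Counts like these can show how a collection grows over time and which sites it covers, while the archived files themselves stay in the reading room.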

The WebART project thanks NWO-CATCH, the KB staff, keynote speakers, panelists, chairs and audience for contributing to the meeting's success.