The rapidly growing number of events devoted to Web archiving* shows that interest in this important topic keeps increasing. This year, researchers in the WebART team presented at a number of international conferences, including the 'Web Archives as Scholarly Sources' conference, and our work was also published in several journal papers. This post summarizes a few of the highlights.
First of all, we have extended last year's paper "Finding Pages On the Unarchived Web" to a full journal paper for the International Journal on Digital Libraries, which has been published as Open Access. In this paper, using the link structure of the web pages in the archive, we uncovered and reconstructed unarchived pages referenced in the Dutch Web archive. The results of this work show that a substantial number of pages can be found, a set almost the same size as the actual archived contents. The journal paper further extends this work, and shows that creating site summaries (i.e. combinations of the 'anchor text' for whole websites) can enhance the retrieval effectiveness of unarchived content.
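The idea of recovering unarchived pages from link evidence can be illustrated with a minimal sketch. It is not the method from the paper: the function name `uncover_unarchived`, the tuple layout of `links` and the example URLs are all assumptions made for illustration. It collects the anchor text of links whose target is missing from the archive, both per page and per host (a rough stand-in for a 'site summary').

```python
from collections import defaultdict
from urllib.parse import urlsplit


def uncover_unarchived(links, archived_urls):
    """Group anchor text by unarchived link target and by target host.

    links: iterable of (source_url, target_url, anchor_text) tuples
           extracted from archived pages (hypothetical input format).
    archived_urls: set of URLs that are present in the archive.
    """
    page_anchors = defaultdict(list)  # unarchived URL -> anchor texts
    site_summary = defaultdict(list)  # host -> anchor texts of its unarchived pages
    for _source, target, anchor in links:
        if target in archived_urls:
            continue  # only targets missing from the archive are of interest
        text = anchor.strip()
        if not text:
            continue  # links without anchor text carry no description
        page_anchors[target].append(text)
        site_summary[urlsplit(target).netloc].append(text)
    return dict(page_anchors), dict(site_summary)


# Made-up example data: b.nl was never crawled, but archived pages link to it.
links = [
    ("http://a.nl/", "http://b.nl/nieuws", "laatste nieuws"),
    ("http://a.nl/over", "http://b.nl/contact", "contact b"),
    ("http://c.nl/", "http://b.nl/nieuws", "nieuws van b"),
    ("http://c.nl/", "http://a.nl/over", "over a"),
]
archived = {"http://a.nl/", "http://a.nl/over", "http://c.nl/"}
pages, sites = uncover_unarchived(links, archived)
```

In this toy example, `pages` holds a small textual representation for each of the two unarchived pages, while `sites["b.nl"]` pools all three anchor texts into one site-level description that a search index could use.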
Furthermore, we investigated whether temporal anchor text can be used as a proxy for user queries; this work was presented at the Fifth International Workshop on Semantic Digital Archives.
As allowing access for research use is an important aim of the WebART project, we have presented our work at the Web Archives as Scholarly Sources international conference in Denmark. This conference brought together a large number of researchers and practitioners, evidencing the increased interest in Web archives for research purposes. The conference presentation summarized the approach to move beyond solely URL-based and keyword search in Web archives, and to move towards 'research engines' supporting the whole research process. Based on a literature survey, different needs of researchers in the research phases of corpus creation, analysis and dissemination were identified, demonstrating limitations of current Web archive access tools. Moreover, solutions for supporting these research phases developed in WebART were presented at the conference. A paper presented at the European Conference on Information Literacy further looked at the theoretical underpinnings of providing stage-based search support.
Related to Information Retrieval, Contextual Suggestions and archived Web data, a paper was published in the Information Retrieval Journal. The effectiveness of Information Retrieval systems is usually measured using test collections, which contain collections of web pages that are indexed by experimental systems. In the area of Contextual Suggestions, this paper looks at the balance between reproducibility and representativeness when building these test collections, a topic of key importance for Information Retrieval research.
Finally, in collaboration with Spinque, work has been carried out towards supporting search strategies in the Web archive. An experimental prototype has been created, which allows researchers to search Dutch news data, while being able to precisely customize their search engine via visual 'building blocks'.
This makes it possible to answer novel research questions. To take a hypothetical example, a researcher looking at rivalry between neighbouring countries could define a corpus with all news items of a specific news website, and select only the articles in the category 'sports' which mention neighbouring countries, just by connecting a small number of visual building blocks. This offers new, fluid ways of supporting the research process.
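The hypothetical example above can be sketched in code as a chain of small, composable filters. This is only an illustration of the 'building blocks' idea, not the prototype's actual interface: the `Article` fields, the block names (`from_site`, `in_category`, `mentions_any`, `connect`) and the sample data are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Article:
    site: str
    category: str
    text: str


def from_site(site):
    """Block: keep articles published on the given news website."""
    return lambda a: a.site == site


def in_category(category):
    """Block: keep articles filed under the given category."""
    return lambda a: a.category == category


def mentions_any(terms):
    """Block: keep articles whose text mentions at least one term."""
    lowered = [t.lower() for t in terms]
    return lambda a: any(t in a.text.lower() for t in lowered)


def connect(*blocks):
    """Chain blocks: an article is selected only if every block accepts it."""
    return lambda articles: [a for a in articles if all(b(a) for b in blocks)]


# Made-up corpus for illustration.
corpus = [
    Article("nieuws.example", "sports", "Oranje beats Germany in Rotterdam"),
    Article("nieuws.example", "politics", "Talks with Belgium continue"),
    Article("nieuws.example", "sports", "Ajax wins the cup"),
    Article("other.example", "sports", "Belgium qualifies"),
]

strategy = connect(
    from_site("nieuws.example"),
    in_category("sports"),
    mentions_any(["Germany", "Belgium"]),
)
selected = strategy(corpus)
```

Because each block is an independent predicate, swapping the category or the list of countries changes the search strategy without touching the other blocks, which is roughly the flexibility the visual interface aims for.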
* For instance the Web Archives as Scholarly Sources conference and Web Archiving 2015: Capture, Curate, Analyze, but also regular conferences including the topic of Web archiving, such as iPres 2015, TPDL 2015 and JCDL 2015.
 Huurdeman, H. C., Kamps, J., Samar, T., Vries, A. P. de, Ben-David, A., & Rogers, R. A. (2015). Lost but not forgotten: finding pages on the unarchived web. International Journal on Digital Libraries, 1–19.
 Thaer Samar and Arjen P. de Vries. Temporal Anchor Text as Proxy for Real User Queries (2015). Proceedings of the Fifth International Workshop on Semantic Digital Archives, co-located with TPDL 2015, Poznań, Poland, September 14-18, 2015. Slides.
 Hugo C. Huurdeman (2015). Towards Research Engines: Supporting Search Stages in Web archives. Paper presented at Web Archives as Scholarly Sources conference, Aarhus, Denmark.
 Hugo C. Huurdeman and Jaap Kamps (forthcoming). Supporting the Process: Adapting Search Systems to Search Stages. European Conference on Information Literacy (ECIL), Tallinn, Estonia, October 2015.
 Thaer Samar, Alejandro Bellogín and Arjen P. de Vries (forthcoming). The Strange Case of Reproducibility vs. Representativeness in Contextual Suggestion Test Collections. Information Retrieval Journal.
On the 30th of October, 2014, the NCDD Web archiving in the Netherlands (Webarchivering in Nederland) symposium took place. The main aims of the symposium were to provide insights into web archiving practices in the Netherlands, to provide practical examples of the use of Web archives, and to potentially foster collaboration in the Dutch Web archiving field.
More than 130 participants saw Helen Hockx-Yu, head of Web archiving at the British Library, opening the conference with her inspirational keynote Working together to archive the UK Web. She provided perspectives on Web archiving on a national scale, including the challenges and opportunities. Helen also presented the British Library's ongoing initiatives of actively supporting scholars to use Web archives for their research, and the impact this had on archiving practices.
Websites including audiovisual content are often very difficult to harvest. Julia Vytopil (Beeld en Geluid) and Chloé Martin (Internet Memory Research) highlighted these aspects in their presentations. Julia discussed ongoing efforts of B&G to archive broadcasters' and related websites. She explained that broadcasters' websites serve as an important "context" collection for their audiovisual archives. However, there are challenges in terms of privacy, and severe technical limitations in crawling the audiovisual content of these sites.
In the afternoon, the focus moved to the actual users of Web archives. Wim de Bie, a famous Dutch comedian, writer and singer, and GertJan Kuiper (VPRO Digitaal) presented "Bieslog: from digital pioneers to archiving puzzle". As it turned out, the award-winning blogging efforts by Wim de Bie were ultimately lost. They gave a lively account of the efforts to retrieve and recover the blog's contents, which succeeded only partially and highlighted the importance of archiving efforts.
Next, Hugo Huurdeman of WebART took the stage and presented ways to enable scholarly research using Web archives. Hugo described current limitations in analytical Web archive access, and potential ways to overcome these issues, based on intense collaboration between new media researchers and system designers in a Living Lab.
Finally, all speakers, plus Marcel Privé (archiefweb.eu) and Tjarda de Haan (Re:DDS), joined for the panel discussion. Diverse propositions were discussed, such as the potential enforcement of 'archivability' for websites in the public sector, and the (potential) importance of archiving social media. A lively discussion followed, and the inspirational afternoon finished with a plea for collaboration, enthusiastically received by the panelists and audience.
2014 has been an eventful year for WebART: several papers were accepted at major conferences, and we have presented at a multitude of events. This post summarizes some of our work until September 2014.
In May 2014, we presented work at the IIPC General Assembly: "Scholarly use and issues of web archives". Anat Ben-David presented joint work on Search as Research, and the methodological implications of using searchable Web archives for research. A paper on this topic is published in a special issue of the Alexandria journal on web archiving.
Secondly, we have explored the contents of the Dutch Web archive, and moved from studying just the archived pages to studying the unarchived Web. By using the archive's contents and link structure, we categorized the archived contents, but also provided estimates of the size of the unarchived parts of the Dutch Web. In addition, we created representations of unarchived contents using aggregated anchor text (text describing the links in the archive). This work has been documented in a short paper for the SIGIR 2014 conference, and a full paper at the IEEE/ACM Joint Conference on Digital Libraries 2014. At the latter conference, our paper was shortlisted for the best paper award.
 Anat Ben-David, Hugo C. Huurdeman. Web Archive Search as Research: Methodological and Theoretical Implications. Alexandria Journal, Volume 25, No. 1 (2014). Manchester University Press. In press.
 Thaer Samar, Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Arjen P. de Vries. Uncovering the Unarchived Web. In Proceedings of the 37th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York NY, 2014.
 Hugo C. Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar, and Arjen P. de Vries. Finding pages on the unarchived web. In DL'14: Proceedings of the Digital Library Conference. ACM Press, New York NY, 2014. Nominated best paper award.
 Hannes Mühleisen, T. Samar, J. Lin, A.J. de Vries. Column Stores as an IR Prototyping Tool. Advances in Information Retrieval - 36th European Conference on IR Research, ECIR 2014, Amsterdam, The Netherlands, April 13-16, 2014.
 A. Bellogin Kouki, T. Samar, Arjen P. de Vries, A. Said. Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track. Proceedings of European Conference on Information Retrieval (ECIR 2014), Lecture Notes in Computer Science, 2014.
 Hugo C. Huurdeman and Jaap Kamps. From multistage information-seeking models to multistage search systems. In IIiX'14: Proceedings of the Fifth Information Interaction in Context Conference. ACM Press, New York NY, 2014.
September 11, 2013 (9:30-16:00)
The field of Web Archiving is at a turning point. In the early years of Web Archiving, the single URL was the dominant unit for preservation and access. Access tools such as the Internet Archive’s Wayback Machine reflect this notion, as they allow consultation, or browsing, of one URL at a time. The single URL access point to Web archives has also shaped early Web archive research methods.
In recent years, however, the single URL approach to Web archiving is being gradually replaced by a big data approach. Several web archives and research initiatives are already engaged in developing advanced search interfaces, access to aggregated metadata, visualisation tools, annotation and enrichment features of future web archives.
On September 11 2013, researchers from the Digital Methods Initiative and from the WebART project at the University of Amsterdam convened to discuss the theoretical and methodological implications of searching, mining and visualizing the archived Web.
The first part of the day included several talks that addressed web archive research from various perspectives.
Anat Ben-David (WebART) discussed the past and future of Web archive research and “search as research” methods for social research of archived web data.
Hugo Huurdeman (WebART) introduced WebARTist as a “(Re)Search Engine” and the idea of building a web archive search system that supports scholarly research as a “tool-maker’s tool”.
Tjarda de Haan (Amsterdam Museum) described the preservation project of the Amsterdam Digital City: “Re:DDS”
Finally, Jules Mataly (MA graduate, University of Amsterdam) presented his MA thesis “The Three Truths of Margaret Thatcher: Creating and Analysing Archival Artefacts” as an example of a cross-archival search and collection critique.
The second part of the day included a hands-on session with WebARTist, the Web Archives Temporal Information Search System, which the WebART project developed during its pilot year (2012-2013) to offer new possibilities for exploration, extraction, analysis and visualization of archived web data.
The day concluded with an evaluation of WebARTist and recommendations for its future development to fit researchers’ needs. The general response to WebARTist was positive, and researchers indicated that the system provided new ways to explore web archives, substantially augmenting the existing “Wayback Machine” interface to the Dutch Web archive. WebARTist “supports the shift to studying web archives through queries”, as one participant noted. Another participant indicated that the system allows one to “look at ‘data’ rather than single sites”, and that it allows one to be “reflexive about collection policies”. The system could also advance the types of research questions that can be answered using Web archives: “It made it possible to build new research questions that go beyond the web site history approach. It also offers hope that web archiving is evolving in a more creative field of research.”
 The event was held on September 11 as a tribute to the September 11 Web Archive collection, curated by Kirsten Foot and Steve Schneider, in collaboration with the Internet Archive and the Library of Congress. The September 11 Web Archive has pioneered social research of archived Web collections.
It has been quiet on our news page for a short while, but this certainly does not imply that nothing happened. In this post we summarize the events that took place in the months following the NWO-CATCH meeting hosted by WebART.
IIPC General Assembly 2013 (25-26 April, 2013)
On the 25th of April, 2013 we headed for Ljubljana, Slovenia and presented at the symposium “Scholarly Access to Web Archives: Progress, Requirements and Challenges”. This day, organized by the International Internet Preservation Consortium (IIPC), was dedicated to scholarly access to Web archives in a broad sense, and included presentations by Niels Brügger, Meghan Dougherty, Helen Hockx-Yu and Julien Masanès. Our presentation “WebART: Facilitating Scholarly Use Of Web Archives” discussed the potential of scholarly research using Web archives, as well as current barriers to success, based on the experiences gained during the pilot project of WebART.
ACM Web Science 2013 (2-4 May, 2013)
The experience of applying the tools developed by the WebART project to a research setting was also documented in a full paper for the ACM Web Science 2013 conference. The “Palais des Congres” in Paris, France was the venue for the conference in early May, which included a keynote by Vint Cerf, the “godfather of the internet”, and interesting presentations by academics from a wide range of disciplines. Our paper, “Sprint Methods for Web Archive Research”, was presented as a poster at the conference, resulting in a useful variety of (interdisciplinary) discussions about Web archiving and our project.
Scholarly use of Web archives: studying Israeli politics on the Web (28-30 May, 2013)
An indication that the tools and methods developed by the WebART project are not only useful in the context of the Dutch Web, but also in other settings, was provided during the workshop “Scholarly Use of Web Archives: Studying Israeli Politics on the Web”. This event consisted of a conference and workshop, organized by the program for Science, Technology and Society at Bar-Ilan University, the National Library of Israel, and the Digital Humanities Incubator. It addressed new methods for scholarly use of Web archives, while paying attention to special collections of election campaigns. A group of 12 participants from different disciplines made use of the WebART search tools, applied to Israeli Web archives of the elections in 2009 and 2013. The participants of the workshop provided ample feedback on the pilot system of WebART, and new inspiration for the creation of the next generation of Web archive search systems. Talks from this conference are available here.
Impressions from NWO-CATCH meeting
Archiving the Web: How to Support Research of Future Heritage?
Hosted by the WebART project, at the KB, the National Library of the Netherlands
April 19 2013
On 19 April 2013 the WebART project hosted a CATCH-NWO meeting at the National Library of the Netherlands. The CATCH program (Continuous Access To Cultural Heritage), sponsor of the WebART project and of the event, is the Dutch science foundation (NWO) program that brings IT researchers and heritage managers together to work on making heritage available digitally. The program encourages collaboration, innovation and the transfer of knowledge.
The meeting's theme was dedicated to supporting research of Web archives. We were delighted to have Mrs. Helen Hockx-Yu, Bernhard Rieder and Bill LeFurgy as keynote speakers.
The day started with a pre-meeting with Mrs. Helen Hockx-Yu, head of the UK Web Archive, one of the world's leading Web archiving institutions renowned for its innovative approach to making Web archives accessible for use and research. In an informal setting, the staff of the KB involved in the Dutch Web archiving initiative, researchers from the WebART project and other interested scholars from the Digital Methods Initiative at the University of Amsterdam, openly discussed the challenges and barriers both Web archives face in making their data and collections accessible, and the types of research that can be done with them.
After lunch, during which the WebART team presented a poster and a demo of WebARTist, our new search tool for temporal archived Web data, the CATCH meeting officially opened with a fascinating keynote lecture by Mrs. Hockx-Yu. She presented the history of the British Library's Web archiving initiative, and the milestone recently reached with the transition from an open-access selective archiving approach to full-domain harvesting of the .uk domain. Mrs. Hockx-Yu described the challenges of accessing Web archive data for research. Out of the 29 national Web archives that are members of the IIPC consortium, only 19 offer some form of access to their collections, which is often available only in libraries' reading rooms and only with permission. In other cases, where there is online access to Web archives, the usage rate is very low.
With little evidence of scholarly use of Web archives, Hockx-Yu raised a chicken-and-egg problem: limited access reduces the use of Web archives in research, and as a result we cannot really know the needs of the research community.
Since most Web archives currently allow only URL-based access to their data, Hockx-Yu said that full-text search is the next challenge on the road map for most national Web archives.
Mrs. Hockx-Yu described the features and tools developed by the UK Web Archive, which consists, as of April 2013 (before the transition to full-domain harvesting), of 35 TB of data and 4 million hosts. The UK Web Archive has a search interface serving special collections of websites that are curated and available online. Among other tools, the UK Web Archive also supports visual browsing, RSS feeds of the latest instances added to the archive, N-gram search and other visualization tools. She also described their recent experience with TwitterVane, a tool recently developed by IIPC to curate recommended websites for archiving from Twitter references.
One of the strongest claims made by Hockx-Yu was her description of the national turn of Web archives. According to her, the many Web archiving initiatives did not archive The Web, but rather de-constructed and appropriated it to countries, flattened it (archived Web pages are stripped of their 'pretext', or contextual elements such as hyperlinks), and only allowed a document-based approach to it. The way forward, claims Hockx-Yu, is to move beyond the document approach to a big-data approach. Using tools such as Memento, which allows consulting archived versions of a URL across the Web, is one way of stitching the Web back together.
The second keynote lecture, by Dr. Bernhard Rieder from the Media Studies department of the University of Amsterdam, introduced innovative tools for analyzing temporal Web collections from the new media researcher's perspective. Dr. Rieder explained the methodological challenges in creating collections of social media data for research. The platforms' APIs allow and preclude access to different (structured) data. This poses a methodological problem. To answer a research question, is it better to obtain the largest sample possible, or rather a smaller and more precise sample of the data? And against which baseline could the obtained data be compared? Dr. Rieder advocates a method for sampling a random 1% of all tweets and studying its characteristics as a baseline for comparison. He showed recent findings from the 2013 Digital Methods Initiative's Winter School, where, together with Dr. Carolin Gerlitz, he collected the random 1% sample of all tweets that the Twitter API makes available, over the course of 24 hours. Dr. Rieder's keynote lecture ended with a new media researcher's "wish list" for Web archives. The wish list included moral, political and legal support for creating large collections of Web data; fast, search-based access to a continuous 1% sample with translated URLs and click data, as well as comprehensive statistics; and an easy, non-bureaucratic way to submit data collections, without having to use standardized formats.
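The baseline idea can be made concrete with a small sketch. It does not reproduce the Winter School study: the real sample came from the Twitter API, whereas here simple random sampling over a synthetic stream stands in for it, and the names `sample_fraction`, `share` and `has_hashtag` as well as all data are assumptions for illustration.

```python
import random


def sample_fraction(items, fraction=0.01, seed=None):
    """Keep each item independently with probability `fraction`."""
    rng = random.Random(seed)
    return [item for item in items if rng.random() < fraction]


def share(tweets, predicate):
    """Fraction of tweets that satisfy `predicate` (0.0 for an empty list)."""
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if predicate(t)) / len(tweets)


def has_hashtag(tweet):
    return "#" in tweet


# Synthetic stream in which every fourth tweet carries a hashtag.
stream = [f"tweet {i} #tag" if i % 4 == 0 else f"tweet {i}" for i in range(100_000)]

# Roughly 1% of the stream serves as the baseline.
baseline = sample_fraction(stream, fraction=0.01, seed=42)

# A made-up topical collection, e.g. gathered through keyword queries.
topical = [
    "#election debate tonight",
    "watching the debate",
    "#election results",
    "#vote now",
]

baseline_share = share(baseline, has_hashtag)
topical_share = share(topical, has_hashtag)
```

Comparing `topical_share` with `baseline_share` indicates whether a property such as hashtag use is characteristic of the topical collection or simply typical of the platform as a whole.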
Dr. Rieder's wish list did not fall on deaf ears. We saw the next keynote speaker, Mr. Bill LeFurgy, head of the digital initiatives program at the US Library of Congress and involved, among other things, in archiving the Twitter collection, scribbling down the requests before going on stage. Mr. LeFurgy's talk was dedicated to branding stewardship of big ("--really big") digital collections. He described the ways in which heritage institutions (such as galleries, libraries, archives and museums) can brand the added value of stewarding digital collections, Web archives among them, to increase stakeholders', researchers' and the public's interest in their importance.
People love libraries more than other governmental institutions, claims Mr. LeFurgy, but they do not always understand the importance of libraries preserving and serving digital data. As an example, Mr. LeFurgy quoted from a UK magazine that covered the UK Web Archive's transition to full-domain harvesting by asking people on the street about the claim made by Web archives that archiving the Web is important to prevent the loss of future digital cultural heritage. The people on the street, so it seems, were rather puzzled by this claim. "The Internet archives itself, doesn't it? It's called Google", was their answer. To change public misconceptions of digital stewardship initiatives, claims Mr. LeFurgy, heritage institutions should proactively engage with researchers and mainstream media.
Researchers' engagement with Web Archives was also the topic of the meeting's closing panel. Mr. Ewoud Sanders, NRC journalist, historian and researcher, chaired the panel. The panelists were Hockx-Yu and LeFurgy, as well as Prof. Arjen de Vries (CWI/WebART) and Prof. Richard Rogers (UvA/WebART). Mr. Sanders had asked each of the panelists to make a statement about their views of the current barriers to facilitating scholarly use of Web archives, and how to engage researchers.
The panelists’ answers to these questions turned into a cumulative list of barriers:
The main barrier, claims Prof. De Vries, is still the legal restrictions that only allow on-site access to Web archives, and the governmental institutions' lack of a "risk appetite" (a term previously mentioned by Hockx-Yu) to change this situation. Common Crawl, the not-for-profit Web crawl available for research, illustrates the alternative: it has no access restrictions, although it may be less well curated than institutional Web archives. Another, technical barrier, explains Prof. De Vries, is that unlike search engines, Web archives unfortunately miss out on the large amounts of usage data necessary to improve the algorithms and the relevance of search results.
Hockx-Yu added to De Vries' list the risky dependency on a common set of tools used by institutions to archive the Web. Web archiving at scale has been taking place for 10 years, but the development of tools has not kept pace. Another barrier to scholarly use of Web archives, according to Hockx-Yu, is the monolithic document access to Web archives (which only allow URL search and browsing of a URL's previous captures). The biggest barrier, however, is the pace of technological developments, with which institutional Web archives cannot keep up.
LeFurgy's view on the barriers to advancing scholarly use of Web archives focused on social and economic reasons. From a social point of view, he discussed the barriers to making the case for the value of Web archives and the institutions that hold them, which are run on tax money. Although Web archives hold many opportunities for research in the fields of digital humanities, journalism and local history, the assumption that "if you build it they will come" is not always true, and more research engagement and PR is necessary.
From an economic perspective, archiving initiatives need sustained funding that allows for the long-term stewardship of digital data and the hiring of staff.
Prof. Rogers added to the list of barriers by asking what it is that we are archiving, and what the status of the Web is more generally. Initially, the barriers to archiving the Web concerned its ephemerality. Subsequently, concerns were raised about studying static archived Web pages compared to the dynamic live Web. Currently, however, the concerns focus on the decline of the informational and open Web, something people such as Tim Berners-Lee are concerned about. The Web, claims Prof. Rogers, may be in decline, and that may change the way we will think about Web archives in the near future. The new Web that is rising (the one of corporate, social media platforms and mobile applications) is currently not archived. If we archive the new Web through an API, because this is what the platforms allow us to do, how does that change our object of study? In addition, one should consider the status of Web archives as a media source vis-à-vis other media sources. When writing history, are Web archives good source material?
All panelists agreed that in order to stay relevant, Web archives should proactively engage and work with researchers. Engagement can be improved by looking at current scholarly practices. One way to explore access to archived Web collections outside the reading rooms (from a big data approach) is to allow online access to aggregated metadata, rather than to the archived files.
The WebART project thanks NWO-CATCH, the KB staff, keynote speakers, panelists, chairs and audience for contributing to the meeting's success.
NWO-CATCH Meeting, hosted by WebART.
April 19 2013, Aula, National Library of the Netherlands (KB), the Hague
The web has become the central medium of our time -- all our traditional media have become digital and even our own lives are increasingly taking place 'on the web'. Preservation and archiving practices haven't kept pace, resulting in our future heritage being lost to posterity, rapidly and indefinitely. Web Archives constantly struggle with the challenges of preserving an ephemeral medium, and have to make crucial decisions on selection policies, storage constraints, and the desired frequency of crawling and harvesting. The evolution of Web-based technologies and services -- such as dynamic content, social media, RSS feeds, Tweets, Mobile Applications and APIs -- creates new challenges. The demands of researchers using the web archive also evolve rapidly, requiring novel access tools for exploring the existing archived Web strata, or new types of data that are currently not preserved by Web archives. What is the best way forward to make both ends meet?
- Helen Hockx-Yu, Head of Web Archiving, British Library.
- Bill LeFurgy, Digital Initiatives Program Manager, Library of Congress.
- Dr. Bernhard Rieder, Assistant Professor in New Media, Media Studies, University of Amsterdam.
Keynote abstracts and speaker biographies are available here.
This meeting is open to any interested party, and there is no registration fee. If you would like to attend this meeting, please register via: http://www.nwo.nl/en/forms/catch.
Slides from lectures presented to students in the MA program of Heritage and Digital Culture, University of Amsterdam.
The WebART project will host the upcoming CATCH meeting at the National Library of the Netherlands.
Archiving the Web: How to Support Research of Future Heritage?
National Library of the Netherlands, the Hague, April 19 2013, 12:00-18:00