Results

This is a brief summary of the results of the WebART project up to 2016. Related research papers can be found on the publications page.

(Re)search methods

Modern web archive research methods are complex and rely on extensive tooling, suggesting a “search as research” paradigm in which analysis is cast as complex search over rich web data. Such methods open up a variety of analytical practices, such as re-assembling existing collections around a theme or an event, studying archival artifacts, and scaling the unit of analysis from the single URL to the full archive by generating aggregate views and summaries. This leads to the co-design of research methods and complex search tools.

This research stream has resulted in the creation of WebARTist, a novel Web archive access system, and in various publications, for instance in the Alexandria journal and the proceedings of the Web Science conference.

Show what’s there and what’s not there

Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the web are unarchived and therefore lost to posterity. We propose an approach to uncover unarchived web pages and websites, and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. Experiments with this approach on the Dutch web archive demonstrated the usefulness of page and host-level representations of unarchived content.
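The general idea of describing unarchived pages through links and anchor text can be illustrated with a minimal Python sketch. This is not the project's implementation: the function name describe_unarchived, the (source, target, anchor text) input format, and the choice to build the host-level view by pooling the anchor texts of all unarchived pages on a host are assumptions made purely for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def describe_unarchived(archived_urls, links):
    """Collect anchor-text descriptions for link targets missing from the archive.

    archived_urls: set of URLs that were crawled (assumed input).
    links: iterable of (source_url, target_url, anchor_text) tuples
           extracted from the crawled pages (hypothetical format).
    Returns (page_repr, host_repr): dicts mapping an unarchived URL, or the
    host of an unarchived URL, to the anchor texts that point at it.
    """
    page_repr = defaultdict(list)
    host_repr = defaultdict(list)
    for _source, target, anchor in links:
        if target in archived_urls:
            continue  # the target was crawled, so there is nothing to reconstruct
        text = anchor.strip()
        if not text:
            continue  # links without anchor text carry no description
        page_repr[target].append(text)
        # Assumption: the host-level view pools the text of all unarchived pages on that host.
        host_repr[urlsplit(target).netloc].append(text)
    return dict(page_repr), dict(host_repr)
```

Calling describe_unarchived with the set of crawled URLs and the extracted link tuples returns two dictionaries, one keyed by unarchived URL and one keyed by host, which could then be indexed or searched like ordinary page descriptions.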

This stream has led to papers at the SIGIR 2014 conference and the DL 2014 conference, a paper in the International Journal on Digital Libraries, and a paper for the TPDL 2016 conference. In addition, work has been carried out on search tools that visualize the incompleteness of the web archive, thus providing context to researchers who use the archive.

Supporting complex search tasks

Models of information seeking describe fundamentally different macro-level stages. Current search systems usually do not provide support for these stages, but provide a static set of features predominantly focused on supporting micro-level search interactions. We investigate the utility of micro-level search user interface (SUI) features at different macro-level stages of complex tasks and identify significant differences in the utility of SUI features between stages. This suggests novel adaptive interfaces that track user task progress and show SUI features at the right times.

This research stream has resulted in full papers at conferences such as Web archives as Scholarly Sources, IIiX 2014, ECIL 2015 and CHIIR 2016, as well as a book chapter.

Web scale and beyond

The web is the largest corpus ever, dwarfing the classical large collections of knowledge. Web archives, however, are orders of magnitude larger still. Embracing distributed Hadoop-based solutions from the start is crucial, but it also complicates many processing steps and makes it challenging to reach the performance needed for tools to operate in interactive systems, rather than as batch jobs that take considerable time to complete.

This research stream has, among other things, led to papers for ECIR and TREC, and to journal papers in the Information Retrieval Journal and the International Journal on Digital Libraries.

Valorisation

As part of the valorisation of the WebART project, Spinque developed LuceneByStrategy, an open-source implementation of their Search by Strategy paradigm. More information and the source code are available on GitHub.