André Mourão, from the Arquivo.pt team, explains how the image search of this FCCN Unit service works, which allows users to search for images from the Web's past.
On 24 March, Arquivo.pt launched the Dionisius project. Can you tell us what this initiative consists of?
At Arquivo.pt, we release new versions of our portal on a periodic schedule.
Each release groups all the improvements made, usually around one central goal.
Dionisius had a special impact because it launched the new version of the image search, the result of years of work. We went from a prototype with 22 million searchable images to a system that provides more than 1.8 billion images, while maintaining the responsiveness and ease of use of our web portal.
Since then, as you say, 1.8 billion images from the Web's past have become searchable on Arquivo.pt. How do you classify this result?
This process went quite well and far exceeded our most optimistic expectations. We processed more than 8 billion pages, for a total of 520 TB of archived data, covering the period from 1992 to 2019.
As of May 2020, our forecast was to find 18 times as many images; the end result was an 81-fold increase in the number of searchable images.
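As a quick consistency check on these figures, the short Python sketch below multiplies the prototype size by both factors. It assumes (the interview does not say so explicitly) that both the 18-fold forecast and the 81-fold result are measured against the 22-million-image prototype.

```python
# Figures quoted in the interview.
PROTOTYPE_IMAGES = 22_000_000
FORECAST_FACTOR = 18   # expected growth
ACTUAL_FACTOR = 81     # achieved growth

# Assumption: both factors are relative to the prototype size.
forecast_images = PROTOTYPE_IMAGES * FORECAST_FACTOR
actual_images = PROTOTYPE_IMAGES * ACTUAL_FACTOR

print(f"forecast: {forecast_images / 1e6:.0f} million images")
print(f"actual:   {actual_images / 1e9:.2f} billion images")
```

Under that assumption the forecast corresponds to 396 million images, and the achieved factor gives about 1.78 billion, in line with the "more than 1.8 billion" figure once rounding of the 81-fold factor is taken into account.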
This solution is described as "an innovative system" by Arquivo.pt. How does this version innovate in relation to what has been done by other web archives?
Apart from the scale, the greatest innovation of the Arquivo.pt image search is its focus on extracting, for each image, the relevant information from the pages where it appears. For every image on every page, we extract a textual caption: the portion of the page text that is closest to the image.
This is especially relevant on pages that have many images, as it allows users to find the specific image that illustrates their search.
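The idea of pairing each image with the nearest page text can be sketched in a few lines of Python. This is only an illustration and not the Arquivo.pt implementation: "closest" is assumed here to mean nearest in document order, ties are broken in favour of the text that follows the image, and the class name, function name and sample page are all invented for the example.

```python
from html.parser import HTMLParser


class PageScanner(HTMLParser):
    """Collects images and non-empty text chunks in document order."""

    def __init__(self):
        super().__init__()
        self.items = []        # ("img", src) or ("text", string)
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.items.append(("img", src))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._skip_depth == 0:
            self.items.append(("text", text))


def nearest_text_captions(html):
    """Return (image src, nearest text chunk) pairs for one page."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    items = scanner.items
    text_positions = [i for i, (kind, _) in enumerate(items) if kind == "text"]
    pairs = []
    for i, (kind, value) in enumerate(items):
        if kind != "img" or not text_positions:
            continue
        # Assumption: distance is counted in document-order steps;
        # on a tie, text after the image wins over text before it.
        closest = min(text_positions, key=lambda pos: (abs(pos - i), pos < i))
        pairs.append((value, items[closest][1]))
    return pairs


# Invented sample page with two images and unrelated text in between.
PAGE = """
<h1>Weekend gallery</h1>
<img src="harbour.jpg"><p>The harbour at sunset</p>
<p>Opening hours and ticket prices.</p>
<img src="market.jpg" alt=""><p>Saturday market stalls</p>
"""

captions = nearest_text_captions(PAGE)
for src, caption in captions:
    print(src, "->", caption)
```

On a page with several images, each image ends up with its own text rather than the page title, which is what lets a query match one specific image instead of the whole page.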
Other features worth highlighting are the automatic classification of content that is potentially offensive to users, advanced search with multiple content filters, and automated access through APIs, which allows the data collected by Arquivo.pt to be used in innovative projects (https://arquivo.pt/apis).
What kind of added value and potential does this new functionality represent for the Arquivo.pt user?
External studies show that around a quarter of general searches on the web are for images. In the case of Arquivo.pt, image searches represent around one fifth of all searches performed. Searching archived data offers a different perspective from general-purpose image search engines such as Google Images, which focus on images of the present, especially popular and recent content.
Arquivo.pt allows a retrospective search with a special focus on time. Old versions of images and pages are available for consultation, making it possible to see how pages and images have evolved. Our search returns more impartial results, as we are not focused on popularity metrics. We also allow finer-grained filtering of search results (for example, by date, website or file type).
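To make the filtering options concrete, here is a minimal Python sketch that filters a small in-memory list of image records by date range, website and file type. It does not call the Arquivo.pt API: the `ArchivedImage` record, the `filter_images` function and the sample data are all hypothetical and only illustrate the kind of filtering described.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass(frozen=True)
class ArchivedImage:
    """Hypothetical record for one archived image capture."""
    url: str
    captured: date
    mime_type: str


def filter_images(images, start=None, end=None, site=None, mime_type=None):
    """Keep images matching every filter that is not None."""
    results = []
    for image in images:
        if start is not None and image.captured < start:
            continue
        if end is not None and image.captured > end:
            continue
        if site is not None and urlparse(image.url).hostname != site:
            continue
        if mime_type is not None and image.mime_type != mime_type:
            continue
        results.append(image)
    # Oldest first, so the evolution over time is easy to follow.
    return sorted(results, key=lambda image: image.captured)


# Invented sample data using reserved example domains.
images = [
    ArchivedImage("http://example.org/logo.gif", date(1999, 5, 1), "image/gif"),
    ArchivedImage("http://example.org/logo.png", date(2008, 3, 15), "image/png"),
    ArchivedImage("http://example.com/banner.png", date(2008, 7, 2), "image/png"),
    ArchivedImage("http://example.org/photo.jpg", date(2015, 11, 30), "image/jpeg"),
]

decade = filter_images(
    images, start=date(2000, 1, 1), end=date(2010, 12, 31), site="example.org"
)
pngs = filter_images(images, mime_type="image/png")

print([image.url for image in decade])
print([image.url for image in pngs])
```

Sorting by capture date rather than by any popularity score mirrors the time-focused, popularity-neutral view described above.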
Arquivo.pt has already given rise to many projects with the potential for a positive impact on society. In the specific case of image search, a scientific article was recently published by Ricardo Campos and co-authors, in which the image search API is used to find images that illustrate the results of the temporal division of a news story.
To give a personal example, I found many records of book reviews written by my great-aunt at the Calouste Gulbenkian Foundation. These records were scanned from the originals from the 1960s and 1970s and placed on the Gulbenkian website, and they are now available for consultation and research at Arquivo.pt.
Is there anything you would like to add?
Arquivo.pt will host an online session in which I will explain how we made these 1.8 billion images searchable. The event will take place on 23 April at 3 pm (with free pre-registration).
I would like to emphasize that our web portal and APIs are open source and free to access, and that they are available for personal use or for research projects without prior registration.
Finally, I would also like to mention the Arquivo.pt Prize (https://arquivo.pt/premio2021), now in its 4th edition, which rewards innovative works based on historical information preserved by Arquivo.pt with up to 10,000€. The works may focus on themes from any area (e.g. Education, History, Sociology, Communication, Health, Information Technology), and applications are open until 4 May 2021.