In 2021, almost all information is produced, disseminated and consumed in digital format. This is the information that governs every second of our societies. However, when we look at the various existing digital preservation initiatives, they focus on digitising holdings that were not born in digital format, for example by digitising printed books. Most digital information published online is lost.
Recent events have reinforced this trend. About a year and a half after the start of the Covid-19 pandemic, we can see that it has generated an immense amount of information (and misinformation) that has been produced, replicated, adulterated and disseminated, and which, whether we agree with it or not, has influenced social, political and economic action.
It should be noted that many of the great torrents of information that currently influence political decision-making in democracies are informal, conveyed through online channels such as social networks. This information is as quick to appear and influence as it is to disappear and become inaccessible (except for the multinational companies that own the platforms and are aware of the value of preserving this information). Without the memory of the first online information (and misinformation) that proliferated about SARS-CoV-2, what lessons can governments and citizens learn? What history can be written about other contemporary events without the memory of what was online?
There is official digital information that is carefully preserved. Examples are the publications of the Electronic Official Gazette (Diário da República Eletrónico) or the Authentic Digital Objects preserved in the RODA of the Directorate General of Books, Archives and Libraries. However, these official communications document the effect of events and are hardly sufficient, by themselves, to analyse the causes of a phenomenon, drawing lessons to better react to similar situations in the future.
Online archives vs. archives from online
What is meant by an "online archive"? The examples above are excellent "online archives" that may continue to evolve through the conventional adaptation of legislation and technology. After all, the importance of preserving the Law Decrees of a Republic is incontestable. My concern is mainly with "archives from online", that is, archives of information born online, since there is not yet an established awareness of the need for them, be it at academic, governmental or individual level.
That "Information is power" is an accepted truth. Modern organisations communicate strategically, sharing information through their online channels such as websites or social media. But how many organisations are aware of the value of preserving their online information? How many are aware of the risk of losing that information? How many teachers of various scientific areas alert their students to the importance of preserving online information or to the impacts of losing it? If information is power, then losing information is losing power.
It is technologically impossible to preserve all online information. But it is absurd not to be aware that we have to preserve some of the information online for short, medium and long-term access (and to act accordingly). After the arrival of the Information Age, which solved the problem of access to information, archives have to help combat the current Disinformation Age. The role of online archives is crucial in this fight because analysing a piece of information from various sources over time helps to identify inconsistencies or to attribute credibility. The greater the volume of information, the more possibilities there are to assess the veracity of a piece of information.
The advantage of online archives is that information, once born digital and quickly available, can be processed automatically and in multiple ways. But archiving from online requires the creation of a new type of institution, because it is a task with very specific challenges that require experts and adequate resources.
The cost of not preserving digital information born online will be Dantesque for future generations because it will be impossible for them to learn from the mistakes of the past. In this sense, the main challenge for archives from online is to make the world aware that they are needed today.
Archiving online: difficult but not impossible
Technically, most of the content we consume online is served via the HTTP (or HTTPS) protocol, i.e. it is web content. However, about 80% of the content available on the web is changed or disappears after just one year.
The Internet Archive is a US non-profit organisation that archives web content worldwide. However, it is difficult for a single organisation to make an exhaustive archive of all published content because the web is constantly changing and much information disappears before it can be archived.
Furthermore, the documentation of historical events of national relevance for a given country is not a priority for the Internet Archive, and much of the information published, for example, on the Portuguese web is irremediably lost. This problem is also felt by other national communities: there are already at least 93 web archive initiatives spread around the world.
In Portugal, Arquivo.pt is an example of an online archive that allows searching and accessing web pages archived since 1996. This is a public service managed by the Foundation for Science and Technology that is accessible to any citizen. Arquivo.pt stands out for providing a search service on pages and images from the past. A kind of Google, but for the Web's past.
The system that supports Arquivo.pt periodically collects and stores information published on the Web. It then processes this information to make it searchable and accessible. This preservation process is carried out automatically through a large-scale distributed computer system. The search and access service can be used automatically through Application Programming Interfaces (APIs) to develop innovative applications that take advantage of the archived information.
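As an illustration of the automatic use through APIs mentioned above, here is a minimal Python sketch of a full-text query against an archive. The endpoint `https://arquivo.pt/textsearch`, the parameters `q`, `from`, `to` and `maxItems`, and the response fields `response_items`, `tstamp`, `title` and `linkToArchive` are assumptions made for this example and should be checked against the official Arquivo.pt API documentation. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint of the full-text search API; verify it in the official docs.
TEXT_SEARCH_ENDPOINT = "https://arquivo.pt/textsearch"


def build_search_url(query, date_from=None, date_to=None, max_items=10):
    """Build a search URL; dates are assumed to be YYYYMMDD strings."""
    params = {"q": query, "maxItems": max_items}
    if date_from:
        params["from"] = date_from
    if date_to:
        params["to"] = date_to
    return TEXT_SEARCH_ENDPOINT + "?" + urlencode(params)


def extract_items(payload):
    """Return (timestamp, title, archived link) tuples from a decoded response.

    The field names are assumptions; missing fields come back as None.
    """
    items = payload.get("response_items", [])
    return [
        (item.get("tstamp"), item.get("title"), item.get("linkToArchive"))
        for item in items
    ]


def search(query, **kwargs):
    """Run the query over the network and return the extracted tuples."""
    with urlopen(build_search_url(query, **kwargs), timeout=30) as response:
        return extract_items(json.load(response))


if __name__ == "__main__":
    # Example: pages mentioning SARS-CoV-2 that were archived during 2020.
    results = search("SARS-CoV-2", date_from="20200101", date_to="20201231")
    for tstamp, title, link in results:
        print(tstamp, title, link)
```

Separating the URL construction and the response parsing from the network call keeps the two assumption-laden parts (parameter names and field names) easy to adjust and to check without contacting the service.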
Arquivo.pt provides a free preservation service to web authors and, at the same time, a valuable research resource that has already been used by researchers, for example, to automatically measure the accessibility of the Portuguese web for people with disabilities. The Arquivo.pt Award annually distinguishes works that use the information preserved by Arquivo.pt. The ten works awarded to date are concrete examples showing that the social and scientific potential of online archives is immense and has only just begun to be tapped.
Arquivo.pt holds more than 10 billion archived files (700 TB). However, the biggest challenge is not the disk space to store this information. The challenge is to keep this information searchable and accessible in a timely manner, which these days means providing answers to users within seconds and suitable for any device. The second challenge is to recruit and train specialised human resources. How to archive online is not yet taught in universities and so a permanent effort is needed to train the new team members.
The third challenge, and for me the most unexpected, is the difficulty in making the existence of the service known. I hope I have been able to make the case so far that online archives are necessary. Arquivo.pt has been publicly available since 2010. How long have you known about it? We live in an attention economy. Human attention has become a scarce commodity, for which the world's most powerful companies compete fiercely with each other using almost unlimited resources and ethically questionable strategies. In the online world, which is Arquivo.pt's home, this will be the great short-term challenge: to capture attention so that this public service may be useful to more people.