WEBVTT

00:00.000 --> 00:11.280
Hello, I am Dorothée Benhamou and I am working at the National Library of France as part

00:11.280 --> 00:18.280
of the web archiving team, and this presentation is about how we manage the tension between

00:18.280 --> 00:23.920
hosting large amounts of copyrighted data with limited access and meeting researchers'

00:23.920 --> 00:32.040
needs to explore these data with open source software.

00:32.040 --> 00:37.560
So since the early 2000s, the National Library of France has been collecting the

00:37.560 --> 00:46.360
French web, that is to say that we regularly collect samples of the cultural production

00:46.360 --> 00:52.400
that are online and which are, as you know, very ephemeral, so this allows researchers

00:52.400 --> 01:00.320
to study the political, social, scientific debates that either primarily take place

01:00.320 --> 01:07.360
on the web or at least find significant echoes there and to study as well the major

01:07.360 --> 01:12.800
transformations that the web has brought into every aspect of our lives.

01:12.800 --> 01:22.360
So the BnF is authorized to do this under a law from 2006. And what does it mean to actually

01:22.360 --> 01:29.320
collect the web? We use software, web crawlers, bots, to do so.

01:29.320 --> 01:34.800
We are using open source tools that we develop together with other institutions and

01:34.800 --> 01:40.920
organizations that are part of the International Internet Preservation Consortium, the

01:40.920 --> 01:50.640
IIPC, and so you might be familiar with the Internet Archive's Wayback Machine, so

01:50.640 --> 01:56.840
the Internet Archive is part of the IIPC too, and we use the same kind of tool for accessing

01:56.840 --> 02:01.680
and browsing past versions of the web.

02:01.680 --> 02:06.600
The main difference is that we focus on the French web so we have more comprehensive collections

02:06.600 --> 02:11.240
of websites hosted or produced in France.

02:11.240 --> 02:17.240
So what is the scope of the collections of web archives that are available?

02:17.240 --> 02:24.760
So obviously we cannot preserve everything so we try to collect representative samples

02:24.760 --> 02:30.280
regularly of the French web and our harvesting model is a mixed one combining two types

02:30.280 --> 02:32.080
of crawls.

02:32.080 --> 02:38.280
Once a year we run a national domain crawl, or broad crawl. For this crawl there is no selection

02:38.280 --> 02:39.280
process.

02:39.280 --> 02:46.840
We collaborate with various organizations to gather lists of domain names that are hosted

02:46.920 --> 02:55.120
in France, and for each of these domains we collect around 2,000 URLs.

02:55.120 --> 03:01.520
So last year it represented almost 6 million domain names. And complementary to this,

03:01.520 --> 03:08.040
we also have thematic or curated crawls, so these are thematic selections made by a network

03:08.040 --> 03:15.640
of librarians mostly but also researchers and associations within their fields of expertise.

03:15.720 --> 03:22.880
We have selected websites that are harvested more in depth and more frequently, so frequencies

03:22.880 --> 03:26.280
ranging from daily to annually.

03:26.280 --> 03:32.880
They cover all disciplines: literature, sciences, history. And we also have thematic crawls, for

03:32.880 --> 03:39.560
example for the COVID-19 pandemic, or for the European Games, or for the

03:39.560 --> 03:40.560
elections.

03:40.640 --> 03:49.120
So to develop this example for each major French election we collect the electoral debate

03:49.120 --> 03:50.120
around the election.

03:50.120 --> 03:58.680
It means we select websites, blogs, social networks when possible and diverse online content

03:58.680 --> 04:07.160
from political parties, from unions, candidates, associations, scientists, humorists, pollsters

04:07.240 --> 04:14.280
and from any citizen expressing themselves on the internet about the election, for example.

04:14.280 --> 04:20.320
So how do we make this data more open to scientific research?

04:20.320 --> 04:28.920
So this is great material for researchers, but there are several limitations on those collections.

04:28.920 --> 04:37.920
So the first one is that due to legal restrictions, due to privacy and copyright concerns,

04:37.920 --> 04:42.800
the web archives can only be accessed on-site at the BnF in the research reading rooms

04:42.800 --> 04:49.120
or in a network of 20 regional libraries or institutions that are specifically listed in a

04:49.120 --> 04:50.120
decree.

04:50.120 --> 04:55.760
So this is why they can be considered closed data. Another major challenge is that they

04:56.160 --> 05:03.080
are massive data, so we try to provide specific tools and services to allow researchers to explore those

05:03.080 --> 05:10.320
data, such as PANDORÆ within the BnF DataLab. And finally, they are digital artefacts,

05:10.320 --> 05:17.400
so they are traces or records of what was on the web at a given period, and if researchers

05:17.400 --> 05:22.520
want to use them for their research they have to understand how they were constituted

05:22.600 --> 05:29.720
the intellectual and technical choices that were made and have to deal with the biases and

05:29.720 --> 05:35.560
they have to deal with the incompleteness, the multiple versions of the same page, maybe time

05:35.560 --> 05:39.880
inconsistencies, etc.

05:39.880 --> 05:47.320
So making them more open for research remains a major challenge and this is one of the

05:47.400 --> 06:00.680
issues that PANDORÆ is addressing, and I will let Guillaume explain why.

06:00.680 --> 06:07.080
All right so hi everyone I'm just going to try to stick that right here if it works.

06:07.080 --> 06:14.760
My name is Guillaume Levrier. I'm a political scientist, I'm a researcher, and as such, well, again,

06:14.840 --> 06:20.200
we've talked about this earlier today already for me the practice of doing research is about

06:20.200 --> 06:25.960
building a method that serves an epistemological goal trying to build something that we try to call

06:25.960 --> 06:32.680
scientific knowledge. The thing is, today that label is very rarely enforced. What we call

06:33.480 --> 06:39.480
scientific knowledge in the common speech is usually things that you find in articles that are

06:39.480 --> 06:46.760
in peer-reviewed journals, but what that means de facto is often left unsaid, as in the reviewers

06:46.760 --> 06:52.120
that are supposed to vet the research usually don't have access to raw data or to the method.

06:52.760 --> 06:59.400
So it becomes more and again it depends on the discipline but it becomes more about the

06:59.400 --> 07:06.440
plausibility of a narrative rather than the actual work. Hence the need again this is why we're here

07:06.600 --> 07:13.240
for tools that are free, open source, with intelligible source code, whose execution can be decentralized

07:13.240 --> 07:18.040
when it's possible and whose outputs are in the control of the user and then that's how we connect

07:18.040 --> 07:24.680
with the previous presentation. We hope that having both the process and the tools available

07:24.680 --> 07:33.320
over a long period of time gives us a better chance of being able to reproduce the work,

07:33.880 --> 07:39.800
and hence the very relevant question we had on the hardware: it's true that sometimes the hardware

07:39.800 --> 07:44.680
can be a big issue, but it's the best we can do given the means that we have.

07:46.120 --> 07:50.520
But as researchers we're always trying to reach all available data sources for one's research,

07:50.520 --> 07:55.080
which collides a little bit with the idea of being reproducible and accessible to most

07:55.800 --> 08:00.520
which is why I started building PANDORÆ, which is software that does

08:01.240 --> 08:06.920
three things or has many processes that can be abstracted into three things harvesting data,

08:06.920 --> 08:13.080
standardizing data and exploring this data. So basically how it works is that it has a first

08:13.080 --> 08:18.600
process called flux that is basically connected to different types of APIs that it calls in a way

08:18.600 --> 08:25.960
that is respectful of the API in order not to drown it. Then so the data it queries has to be

08:26.520 --> 08:32.120
abstractable as documents to be poured into Zotero, which is useful because researchers usually,

08:32.120 --> 08:36.520
even if they're not very good at using computers, they know how to use Zotero, and

08:36.520 --> 08:43.400
Zotero is both a web service, it's desktop software, it's an SQL database behind it, so it provides

08:43.400 --> 08:50.520
a lot of tools and all documents can have attachments as well and notes so it's quite a powerful

08:50.600 --> 08:54.600
open source software, and I think everyone in the scientific community is thankful for Zotero.

08:55.320 --> 09:01.960
And Types is a series of dataviz systems that are calibrated to explore

09:01.960 --> 09:09.400
corpora of documents, and that rely on D3.js. What does this mean in the context of the

09:09.400 --> 09:16.040
web archives at the National Library of France, so the BnF? It means that we need to first

09:16.040 --> 09:24.040
identify, when you're on site, when you have access to the archives that, again, are

09:24.040 --> 09:30.280
protected you need to check whether there is something in the archive that's interesting to you

09:30.280 --> 09:37.000
and then you do the full circle of querying the data, abstracting it into Zotero, which is tricky

09:37.000 --> 09:43.160
because you cannot get the data out then visualizing the corpus and then having the opportunity

09:43.160 --> 09:47.320
of looking at each capture of web pages that you want to explore into the web archive browser.

09:48.120 --> 09:54.200
So this is the tension that we had, that Dorothée mentioned earlier: it's open source

09:54.200 --> 09:59.000
but it's closed data of course we're not the first people to have that problem but we have to

09:59.000 --> 10:05.560
negotiate the specific constraints that we have, both in terms of law in France and in terms of

10:05.560 --> 10:12.120
technical capacity from that institution to be able to provide access to such data so quickly this is

10:12.120 --> 10:17.000
what it looks like: it's a form, it's a search engine. You check that you have a number of results

10:17.000 --> 10:21.800
that's interesting to you here I was looking for final in the 2002 election in France

10:22.600 --> 10:28.680
and once you are in that software, in PANDORÆ, it detects that you are within a network that is

10:28.680 --> 10:36.280
authorized, gives you the opportunity to do the same request on the same data set, tells you that you have

10:36.280 --> 10:41.400
(so this is a different request actually, it's on Dolly the little sheep) it tells you that you have

10:42.280 --> 10:48.760
a number of captures that are relevant to your request over time, in the collection that's

10:48.760 --> 10:55.640
relevant and gives you an example of the kind of websites that contain those terms because those

10:55.640 --> 11:02.280
are full-text-indexed collections. And then the solution that we found is that we send only the

11:02.280 --> 11:08.840
metadata to Zotero, so we did not upload the actual content of the fields, but we

11:08.920 --> 11:15.400
remapped the documents and selected a series of metadata not all of them that were relevant to

11:17.160 --> 11:22.680
the corpus. And so here you have an example: you have the website title, the host, the date,

11:23.480 --> 11:29.800
and we used to pass a JSON object, a stringified JSON object, as the short title, because it's

11:29.800 --> 11:34.760
common to all Zotero documents, but now we cannot do that anymore, so now

11:34.760 --> 11:41.480
we upload notes instead. And then we send it back to PANDORÆ, where you can see it as a corpus.

11:42.520 --> 11:48.200
Here you have in blue the captures, and the links towards documents that are different. And so when you click

11:48.200 --> 11:57.560
a link, if you are within the BnF, it sends a query and gives you access to the full content. So it tells you

11:57.560 --> 12:03.240
how many captures there are over time, and you can look for a word within the content, so it tells you

12:03.240 --> 12:08.200
and it gives you a short excerpt, so you can have an idea of how relevant the term you're looking for

12:08.200 --> 12:14.040
is in the context of that page. So the way we found to conceptualize this kind of relationship

12:14.040 --> 12:21.240
that enables us to build open software with, in that instance, closed data, is the one-way mirror model.

12:21.240 --> 12:28.200
So in this model we consider the web archive as a bit like a suspect in an interrogation room,

12:29.160 --> 12:33.480
and they have a one-way mirror on their side, so they only see themselves, and you can ask them

12:33.480 --> 12:40.280
questions. So we built that model, or metaphor, on three properties. One is that you can harvest the

12:40.280 --> 12:45.000
data when you're in the room, but you cannot have write access to it, so you can talk to the person

12:45.000 --> 12:50.520
but you cannot change what the person knows. You can take parts of each record, so here the

12:50.520 --> 12:55.720
metadata, and not the whole thing but only some fields that we determined, but you cannot send requests

12:55.720 --> 13:01.320
when you're away: you have to be in here to ask the questions. And you can come back for a highlight

13:01.320 --> 13:06.840
on a specific piece, so that refers to that part of the exploration when you click on a node and

13:06.840 --> 13:13.880
send your request for information on a specific node, but you cannot take the whole body out, so you cannot

13:13.880 --> 13:19.720
ask for the whole corpus to be extracted, because there would be a risk that then you would just put

13:19.800 --> 13:26.440
it on a USB key and walk home with it. So that's how we built our model. Thank you for your

13:26.440 --> 13:45.080
attention. We'll be taking questions. Yep? Hi, so how much of this is opposed by rights holders on

13:45.160 --> 13:52.200
a continual basis? So are you always fighting the problem that you have to be careful as to what

13:52.200 --> 13:59.880
people can take out of the closed room? Yes, yes. Oh yes, sorry, the question, I should know that. So the

13:59.880 --> 14:05.960
question is how do we manage the fact that everything is under the rights of the creators of the

14:05.960 --> 14:12.280
authors, and how much we have to fight that. So there is a context that is still to be, I guess, proven

14:12.280 --> 14:19.640
by practice, but basically the idea is that you're not justified in taking anything out

14:19.640 --> 14:27.640
of there, except if you're a researcher and you need to take an excerpt to show in good faith that

14:27.640 --> 14:33.480
the argument actually refers back to something that is empirically in the database. So if someone wants

14:33.480 --> 14:38.200
to if a researcher reads your paper and wants to know more or to check that what you're quoting is

14:38.200 --> 14:43.800
actually accurate and you have that number of documents that are in the family of the phenomenon that

14:43.800 --> 14:50.680
you're trying to describe, they can come to the BnF and use, hopefully, the same tool on the same data

14:50.680 --> 14:56.680
and find the same results. So it's extremely coercive in a way: there's no way of taking out the data,

14:56.680 --> 15:01.800
you're just supposed to be able to quote it, and that's the liberty I took as a researcher

15:01.800 --> 15:07.480
to just show you those little sentences. Also, I think this is still on the live web, so you might

15:07.560 --> 15:13.560
find it. But so yeah, it's very narrow, and it's a problem for research, but for now

15:13.560 --> 15:23.160
that's how it is. I think maybe time for one more question? Otherwise, let's thank the speakers.