WEBVTT

00:00.000 --> 00:10.000
Okay, hello everyone, my name is Hugo, and I will finish the

00:10.000 --> 00:16.000
CERN talks. So my colleagues talked about how we store data on disk, how we store

00:16.000 --> 00:20.000
data on tapes, but there is also how we distribute the data, because

00:20.000 --> 00:24.000
CERN is not the only place where we store data. And this presentation is about

00:24.000 --> 00:28.000
a project, an open-source project called Rucio. It's software written

00:28.000 --> 00:32.000
in Python, and before entering into this project in detail,

00:32.000 --> 00:36.000
I would like to set the scene a little bit. This is probably the most beautiful

00:36.000 --> 00:40.000
slide that you're going to see in my whole presentation. At

00:40.000 --> 00:44.000
CERN, as my colleague mentioned, we have all these different open-source

00:44.000 --> 00:48.000
systems that we develop, or that we contribute to upstream, like for example

00:48.000 --> 00:54.000
Ceph. So we will focus on the last one, which is Rucio,

00:54.000 --> 01:00.000
in this presentation. This is a plot that I presented a few weeks ago, just to

01:00.000 --> 01:06.000
show you a little bit the scale of the data we have to manage. So in the last two years,

01:06.000 --> 01:10.000
we actually had to store more data than in the last seven

01:10.000 --> 01:16.000
years of the lab. And as you can see, this curve is getting more aggressive, and in the next five

01:16.000 --> 01:20.000
years, it's going to be more and more aggressive. Which means that on a constant

01:20.000 --> 01:26.000
budget, we have to be very creative about how to tame that data deluge.

01:26.000 --> 01:34.000
Okay, so who knows about WLCG? Okay, one person, two, three persons,

01:34.000 --> 01:38.000
four persons. Okay, so it's not very well known, which is okay. It was developed

01:38.000 --> 01:40.000
to handle the LHC program, but actually this is the largest

01:40.000 --> 01:44.000
computing grid in the world. CERN sits in the middle,

01:44.000 --> 01:48.000
so this is like an onion. All the data produced by the LHC machines

01:48.000 --> 01:52.000
is stored at CERN, which is what we call the Tier-0.

01:52.000 --> 02:00.000
And then we have to distribute the data to around 170 different data centers across 42 countries.

02:00.000 --> 02:06.000
And this infrastructure is used by more than 42,000 people to actually do analysis of this data.

02:06.000 --> 02:10.000
We have physicists around the world, not only at CERN.

02:10.000 --> 02:16.000
So how do we distribute this data? How can we access the data that is originally stored

02:16.000 --> 02:20.000
at CERN from other places? So if you run a storage system,

02:20.000 --> 02:24.000
sometime in your career, you will get a user asking you, okay,

02:24.000 --> 02:28.000
my analysis at CERN is working very fast. I'm very close to the data.

02:28.000 --> 02:32.000
Can you actually make your storage system faster? Because I'm connecting

02:32.000 --> 02:38.000
from the US, and the performance is not great. I get this request very often,

02:38.000 --> 02:42.000
which means, in our storage group, this is what the user actually means:

02:42.000 --> 02:48.000
okay, how can you optimize a local POSIX workflow? The physicist will write some script

02:48.000 --> 02:52.000
that is basically using some system calls on a link which goes from the US

02:52.000 --> 02:56.000
to Geneva, over the transatlantic fibers, over the wide area network,

02:56.000 --> 03:00.000
using a few protocols. And the question is, okay, how can we actually optimize that part?

03:00.000 --> 03:06.000
And the answer is, we can do some tricks, but we are bound by the speed of light,

03:06.000 --> 03:12.000
and we are not really going to beat it. So inside CERN, and this is just an example,

03:12.000 --> 03:16.000
a real example, actually, of the latencies that we have. So inside the data center,

03:16.000 --> 03:22.000
we have 0.25 milliseconds between two machines. So if you have a network file system,

03:22.000 --> 03:26.000
this latency, let's say for the analysis use cases that we have,

03:26.000 --> 03:30.000
is barely noticeable. But once you go to the US,

03:30.000 --> 03:34.000
or to other continents, you have much bigger latency.

03:34.000 --> 03:38.000
This is for example a ping I did this morning. It's around 125 milliseconds.

03:38.000 --> 03:42.000
So this means for every operation that you have to do over the network,

03:42.000 --> 03:46.000
the operation is 400 times slower, which is not nice.

03:46.000 --> 03:50.000
So what do we do actually? How can we optimize this?

03:50.000 --> 03:56.000
The answer is, we cannot. So we actually have to copy the data from one place to another,

03:56.000 --> 04:00.000
so the data is analyzed locally on the remote system.

04:00.000 --> 04:04.000
This is an actual example I did this morning, okay.

04:04.000 --> 04:08.000
I just straced a stupid program to count some lines. And just, you know,

04:08.000 --> 04:10.000
this stupid program, it has around 125

04:10.000 --> 04:14.000
stat system calls and 400 reads, just to count how many times

04:14.000 --> 04:18.000
the word "your" appears in the text. Actually, it was a Russian text,

04:18.000 --> 04:22.000
so there are not many: zero. Just to show you that, you know,

04:22.000 --> 04:26.000
the physics analysis is much more complicated than that. So there are a lot more

04:26.000 --> 04:30.000
system calls over the local kernel mount or the FUSE mount.

04:30.000 --> 04:34.000
And when you have these wide area network links, the performance

04:34.000 --> 04:38.000
rapidly decreases. So yeah, the answer is,

04:38.000 --> 04:42.000
when you really have to have local I/O performance,

04:42.000 --> 04:46.000
you actually need to ship the data to some place and then do local operations on

04:46.000 --> 04:50.000
the metal, on top of a local file system, or where you have a fast network

04:50.000 --> 04:54.000
interconnect inside your data center.
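
NOTE
Editor's sketch, not part of the talk: a rough back-of-the-envelope model of why remote POSIX I/O hurts. It uses the figures quoted by the speaker (0.25 ms inside the data center, about 125 ms to the remote site, about 125 stat calls and 400 reads) and assumes, as a simplification, that every call costs exactly one network round trip. The helper name total_wait_ms is hypothetical.
```python
# Figures quoted in the talk; the one-round-trip-per-call model is an assumption.
LAN_RTT_MS = 0.25   # between two machines inside the data center
WAN_RTT_MS = 125.0  # ping measured by the speaker from the remote site
STAT_CALLS = 125    # stat system calls of the toy word-count program
READ_CALLS = 400    # read system calls of the same program
def total_wait_ms(round_trips: int, rtt_ms: float) -> float:
    """Time spent purely waiting on the network for the given number of calls."""
    return round_trips * rtt_ms
calls = STAT_CALLS + READ_CALLS
lan_ms = total_wait_ms(calls, LAN_RTT_MS)
wan_ms = total_wait_ms(calls, WAN_RTT_MS)
print(f"{calls} calls: {lan_ms:.2f} ms locally, {wan_ms / 1000:.1f} s over the WAN")
```
Under these assumptions even the toy program goes from a fraction of a second to about a minute of pure waiting, which is the motivation for shipping the data instead.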

04:54.000 --> 04:58.000
So this means that you actually need to have a way, you know, a mechanism

04:58.000 --> 05:02.000
to move data from CERN to all these 170 different data centers

05:02.000 --> 05:06.000
that are distributed around the globe. So how do we do it?

05:06.000 --> 05:10.000
You know, this is the typical question when you have a student coming to my office

05:10.000 --> 05:14.000
on the first day of the job: how can you transfer this data?

05:14.000 --> 05:16.000
Okay, let's scp some data.

05:16.000 --> 05:20.000
Okay, you have a few megabytes, not a problem. When you have to transfer like

05:20.000 --> 05:24.000
billions of files containing around one petabyte of data,

05:24.000 --> 05:28.000
you can still do it that way, but there are many things that will go

05:28.000 --> 05:32.000
wrong, right? And these are a few of the things that we have to deal with on a daily

05:32.000 --> 05:36.000
basis. Okay, we have network failures. Sometimes the network is not reliable.

05:36.000 --> 05:40.000
The server has some issues. So in the end, you may basically fail

05:40.000 --> 05:44.000
some transfers. We have a lot of different countries that

05:44.000 --> 05:48.000
we have to transfer the data to. Sometimes you have restrictions between them,

05:48.000 --> 05:52.000
so, for example, you may not transfer data to a Chinese computing center

05:52.000 --> 05:54.000
from the US, so you have to pass through Europe, etc.

05:54.000 --> 05:58.000
So there is a lot of logic involved in how to basically ship the data from one place to another.

05:58.000 --> 06:02.000
And also the typical question, you know, once we transfer: are we confident

06:02.000 --> 06:06.000
that we transferred all the data and that we have not corrupted any of this data, right?

06:06.000 --> 06:10.000
So in the end, we cannot just use scp or whatever tool.

06:10.000 --> 06:14.000
We have to have a robust system that has some state,

06:14.000 --> 06:18.000
that understands the data that we are transferring, and that does some analysis

06:18.000 --> 06:22.000
of this data to see if the data has been corrupted.
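
NOTE
Editor's sketch, not part of the talk: one common way to check that a transfer did not corrupt data is to compare a checksum recorded at the source with one computed on the bytes that arrived. The talk does not say which check is used; Adler-32 from Python's zlib module is an assumption here, and the helper names adler32_hex and transfer_is_intact are hypothetical.
```python
import zlib
def adler32_hex(data: bytes) -> str:
    """Adler-32 of a byte string, formatted as 8 hex digits."""
    return f"{zlib.adler32(data) & 0xFFFFFFFF:08x}"
def transfer_is_intact(source_checksum: str, received: bytes) -> bool:
    """Compare the checksum recorded at the source with the bytes that arrived."""
    return adler32_hex(received) == source_checksum
original = b"collision event payload"
recorded = adler32_hex(original)        # stored in the catalogue before the transfer
corrupted = b"collision event paylaod"  # two bytes swapped in flight
print(transfer_is_intact(recorded, original))
print(transfer_is_intact(recorded, corrupted))
```
A stateful transfer system can keep the recorded checksum next to each file and retry any transfer whose check fails.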

06:22.000 --> 06:26.000
So this is what I'm going to talk about. This project is called Rucio,

06:26.000 --> 06:30.000
and it actually solves many of these problems. So Rucio is an open source project

06:30.000 --> 06:34.000
that started in the ATLAS experiment at CERN, so it has over 10 years of

06:34.000 --> 06:38.000
operational experience. And the ATLAS experiment decided to open it up for the community

06:38.000 --> 06:42.000
a few years ago, and now it's used in many scientific communities that I will

06:42.000 --> 06:46.000
mention later. So Rucio is free and open source, it's on GitHub, it has

06:46.000 --> 06:50.000
a permissive license, Apache v2, and it actually provides a scientific data

06:50.000 --> 06:54.000
management platform. So you have data that you need to distribute to many

06:54.000 --> 06:58.000
different sites. You register this data in Rucio. You don't need to run

06:58.000 --> 07:02.000
a Rucio node at your computer center. This is a nice part about Rucio.

07:02.000 --> 07:06.000
It's middleware that runs in some place, and it just has a catalogue of the data,

07:06.000 --> 07:10.000
of where the data is sitting. It has pointers to all these data centers, which means that

07:10.000 --> 07:14.000
it is location-aware and heterogeneous. Because imagine you have 170

07:14.000 --> 07:18.000
data centers in different countries. You cannot just say, okay, you all use

07:18.000 --> 07:22.000
the same technologies, so that, you know, life is great. Every country

07:22.000 --> 07:26.000
will have their own technologies that they would like to use, etc.

07:26.000 --> 07:28.000
So you need to have something that actually understands these

07:28.000 --> 07:32.000
heterogeneous technologies. So what are the

07:32.000 --> 07:36.000
functionalities that Rucio provides? I will not go into all of them,

07:36.000 --> 07:40.000
but a few of them are: it's horizontally scalable.

07:40.000 --> 07:44.000
The namespace is a SQL database of your choice. You can use PostgreSQL,

07:44.000 --> 07:48.000
SQLite, or Oracle. It transfers data between different

07:48.000 --> 07:54.000
facilities. These facilities can be tapes, or HPC or cloud

07:54.000 --> 07:58.000
systems as well. And it actually provides different

07:58.000 --> 08:02.000
authentication mechanisms. So the user can authenticate with

08:02.000 --> 08:06.000
OpenID tokens, or with X.509, like TLS client certificates.

08:06.000 --> 08:12.000
This is a high-level overview of the architecture. So at the

08:12.000 --> 08:16.000
top, we have the clients. We have CLI clients, Python SDKs,

08:16.000 --> 08:20.000
a REST API, and a web interface. So you can actually use

08:20.000 --> 08:24.000
these interfaces to access the system. And then there are different

08:24.000 --> 08:26.000
components. The most important ones are the daemons.

08:26.000 --> 08:30.000
Daemons are like independent processes. These are Python

08:30.000 --> 08:32.000
processes, multi-threaded, and what they do is

08:32.000 --> 08:36.000
they connect to the database, they obtain some job, they do some processing,

08:36.000 --> 08:38.000
and then they store it back in the database, so other daemons

08:38.000 --> 08:42.000
can work on that. And at the end, we have different

08:42.000 --> 08:46.000
transfer protocols, like S3, or like pure HTTP with

08:46.000 --> 08:48.000
WebDAV, and some other ones that you probably don't

08:48.000 --> 08:52.000
know about, like Globus, which are more oriented to

08:52.000 --> 08:58.000
scientific data. Yeah, and that's basically it.

08:58.000 --> 09:02.000
Yeah, but this is the most important thing about Rucio: it

09:02.000 --> 09:06.000
is a system that is declarative. You don't tell Rucio how

09:06.000 --> 09:10.000
you want to transfer the data. You basically express how

09:10.000 --> 09:12.000
you want to see the data in different data centers with

09:12.000 --> 09:14.000
a declarative language. So for example, you can say

09:14.000 --> 09:18.000
I want three copies of this data on three continents, and at

09:18.000 --> 09:22.000
least one on a tape system. You can say with a rule, I want one

09:22.000 --> 09:26.000
copy of this file anywhere, but on a fast disk. And you can

09:26.000 --> 09:30.000
create statements about, let's say, what the capabilities of each

09:30.000 --> 09:34.000
data center are. This is what we call the replication rules.

09:34.000 --> 09:38.000
So when there is a need to copy the data to some place, you just create a

09:38.000 --> 09:42.000
replication rule. This is one simple command in the CLI, and then there are

09:42.000 --> 09:46.000
thousands, if not millions, of transfers that are created in the back of the

09:46.000 --> 09:50.000
system to transfer this data, and to verify that the data is actually

09:50.000 --> 09:54.000
correct, and not corrupt, etc. Or you can do it automatically.

09:54.000 --> 09:58.000
For instance, you can say: for all the data that comes from this

09:58.000 --> 10:02.000
computer center, or that has this metadata, because you can attach

10:02.000 --> 10:04.000
metadata to every file in the catalog, you can create

10:04.000 --> 10:08.000
transfers. So for example, you have a data workflow that

10:08.000 --> 10:10.000
is writing into the system. You can trigger a

10:10.000 --> 10:12.000
replication rule automatically through a

10:12.000 --> 10:16.000
subscription, so then the data is copied to other systems.
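
NOTE
Editor's sketch, not part of the talk: a toy illustration of the declarative idea, where a rule states the desired end state and the system derives which copies are still missing. This is not Rucio's rule language or API; Site, Rule, missing_transfers and the site names are all hypothetical.
```python
from dataclasses import dataclass
@dataclass(frozen=True)
class Site:
    name: str
    media: str  # "disk" or "tape"
@dataclass(frozen=True)
class Rule:
    copies: int  # how many replicas the user wants
    media: str   # on which kind of storage
def missing_transfers(rule: Rule, sites: list[Site], holders: set[str]) -> list[str]:
    """Return the site names that still need a copy for the rule to be satisfied."""
    eligible = [s for s in sites if s.media == rule.media]
    have = [s for s in eligible if s.name in holders]
    need = rule.copies - len(have)
    candidates = [s.name for s in eligible if s.name not in holders]
    return candidates[:max(need, 0)]
sites = [Site("SITE_A", "disk"), Site("SITE_B", "tape"), Site("SITE_C", "disk"), Site("SITE_D", "disk")]
rule = Rule(copies=2, media="disk")
print(missing_transfers(rule, sites, holders={"SITE_A"}))
```
The user only states "two copies on disk"; which transfers to schedule is worked out from the current state, and an already satisfied rule produces no transfers at all.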

10:16.000 --> 10:20.000
Yeah, this is just a concept in Rucio. It's what we call

10:20.000 --> 10:24.000
the Rucio namespace. The Rucio namespace consists of files,

10:24.000 --> 10:28.000
datasets, and containers. Containers we will skip here,

10:28.000 --> 10:32.000
but I can tell you more about this afterwards. But yeah,

10:32.000 --> 10:36.000
files and datasets, this is our logical abstraction of a file and a

10:36.000 --> 10:40.000
directory in a normal file system. And this is how we

10:40.000 --> 10:44.000
identify basically all the data that is catalogued in Rucio: we have a scope,

10:44.000 --> 10:48.000
and then we have a name. And this is unique. You cannot change it.

10:48.000 --> 10:52.000
Once you basically mint it, it's like, yeah,

10:52.000 --> 10:56.000
you cannot replace it. Let's say, it's persistent.
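
NOTE
Editor's sketch, not part of the talk: an identifier made of a scope and a name can be modelled as an immutable pair. The colon-separated spelling "scope:name", the example identifier, and the names DID and parse_did are assumptions for illustration.
```python
from typing import NamedTuple
class DID(NamedTuple):
    """Immutable (scope, name) pair; a tuple cannot be modified after creation."""
    scope: str
    name: str
def parse_did(text: str) -> DID:
    """Split 'scope:name' at the first colon; both parts must be non-empty."""
    scope, sep, name = text.partition(":")
    if not sep or not scope or not name:
        raise ValueError(f"not a scope:name identifier: {text!r}")
    return DID(scope, name)
did = parse_did("user.alice:run42.events.root")
print(did.scope, did.name)
```
Using an immutable type mirrors the point made in the talk: once an identifier is minted, it is not changed in place.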

10:58.000 --> 11:02.000
What I mentioned before also is that in Rucio you can express

11:02.000 --> 11:06.000
different queries with metadata. So when you register the data, you can

11:06.000 --> 11:10.000
actually add, like, a JSON payload describing which metadata

11:10.000 --> 11:14.000
you want. This is arbitrary, and then you can query Rucio based on that metadata.

11:14.000 --> 11:16.000
So it's quite flexible in that regard, and this interface is actually

11:16.000 --> 11:20.000
being enhanced right now by different projects.

11:20.000 --> 11:24.000
On the operations side, what is the day-to-day of operating this

11:24.000 --> 11:28.000
service? The objective is actually to minimize the

11:28.000 --> 11:32.000
operations done by humans, because in the end, this is one of the

11:32.000 --> 11:36.000
most costly factors in every organization. So to optimize

11:36.000 --> 11:40.000
the human cost, not having a person that has to do it

11:40.000 --> 11:44.000
but a system that does it automatically for you is really great.

11:44.000 --> 11:48.000
And moreover, it also means that, without a system like that,

11:48.000 --> 11:52.000
you have to understand all the protocols, and you know, if

11:52.000 --> 11:54.000
Amazon decides to change the SDK and then breaks

11:54.000 --> 11:57.000
compatibility with, you know, on-premise

11:57.000 --> 12:00.000
S3 clusters, you have to follow up on this. But in this case,

12:00.000 --> 12:03.000
it's the Rucio server that will handle these

12:03.000 --> 12:07.000
upgrades for you. And the last thing is,

12:07.000 --> 12:10.000
you know, how the transfers actually move from one place to another,

12:10.000 --> 12:14.000
and this is a great topic. Actually, Rucio does not

12:14.000 --> 12:18.000
proxy the data from one place to another. We actually,

12:18.000 --> 12:22.000
this community, for WLCG, created an

12:22.000 --> 12:26.000
extension of WebDAV, which is what we call the third-party

12:26.000 --> 12:28.000
copy mechanism, and this is a peer-to-peer copy.

12:28.000 --> 12:32.000
You have a client that issues an HTTP

12:32.000 --> 12:34.000
COPY command to server A, and there is a

12:34.000 --> 12:37.000
header, which is called Source, pointing to the file

12:37.000 --> 12:41.000
data. And then server A will use an HTTP

12:41.000 --> 12:44.000
GET to get the data, while the client gets some

12:44.000 --> 12:47.000
performance metrics out of server A. And this is how we

12:47.000 --> 12:50.000
work, because you don't want to have one single

12:50.000 --> 12:52.000
system, even if it is distributed and can

12:52.000 --> 12:55.000
horizontally scale, proxying all the data through one single

12:55.000 --> 12:58.000
place for 170 computer centers. So this is peer-to-peer,

12:58.000 --> 13:02.000
and this is just simply an extension of WebDAV.
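
NOTE
Editor's sketch, not part of the talk: the shape of the third-party copy request described above, where the client asks server A to pull the file itself. Nothing is sent over the network; the function only builds a description of the request. The URLs and the name build_tpc_request are hypothetical, and real deployments add further headers (for example for authentication) that are not shown.
```python
def build_tpc_request(destination_url: str, source_url: str) -> dict:
    """Describe the COPY request a client would send to the destination (server A)."""
    return {
        "method": "COPY",                   # HTTP COPY, as defined by WebDAV
        "url": destination_url,             # where the file should end up
        "headers": {"Source": source_url},  # where server A should GET it from
    }
req = build_tpc_request(
    "https://server-a.example.org/store/file.root",
    "https://server-b.example.org/store/file.root",
)
print(req["method"], req["url"])
print(req["headers"])
```
The client never touches the file bytes; it only tells server A where to fetch them, which is why no central system has to proxy the data.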

13:02.000 --> 13:06.000
And that's it. Just to mention that this project is

13:06.000 --> 13:10.000
used in many scientific organizations, and

13:10.000 --> 13:13.000
some of them are physics ones. And we are looking forward,

13:13.000 --> 13:16.000
actually, to see if there is interest in other

13:16.000 --> 13:18.000
communities. So if you are looking, or you have the same

13:18.000 --> 13:21.000
needs for transferring big data sets across different

13:21.000 --> 13:24.000
countries, or different institutions, this could be

13:24.000 --> 13:28.000
a nice project to contribute to. Yeah, so that's

13:28.000 --> 13:30.000
it from my side. Thank you.

13:37.000 --> 13:43.000
I'm fast. Any questions?

13:43.000 --> 13:46.000
Yeah? You said, once you created it, you

13:46.000 --> 13:51.000
cannot change the name? Yeah. So what happens if you

13:51.000 --> 13:55.000
do a typo? Yeah. So basically, the question is,

13:55.000 --> 13:59.000
what happens if we need to change an identifier? So in

13:59.000 --> 14:01.000
Rucio, there is a concept that when you create an

14:01.000 --> 14:04.000
identifier, you keep it open until you are very sure

14:04.000 --> 14:06.000
that this is the state that you want, and then you

14:06.000 --> 14:08.000
have like a double commit phase. You can say, okay,

14:08.000 --> 14:11.000
I now close this DID. And then if you

14:11.000 --> 14:14.000
really mistyped it, then what you do, actually,

14:14.000 --> 14:18.000
my colleague from ATLAS has a good answer.

14:18.000 --> 14:23.000
He was with us in the same team.

14:23.000 --> 14:27.000
So you can grab him if you want something specific,

14:27.000 --> 14:29.000
like, you know, from an operations perspective, if it

14:29.000 --> 14:32.000
happens in real life, what are the tricks that you

14:32.000 --> 14:39.000
can do. Okay. Any other questions? Yeah?

14:39.000 --> 14:47.000
Yeah. Yeah. So the question is, can we add metadata

14:47.000 --> 14:50.000
to the files and datasets? Yes, absolutely.

14:50.000 --> 14:54.000
You can use key-value pairs when you describe

14:54.000 --> 14:57.000
the data, and you can also put like a JSON file,

14:57.000 --> 15:01.000
and then basically you can issue queries on this metadata.

15:01.000 --> 15:08.000
Okay. One more question?

15:08.000 --> 15:14.000
Yeah. So basically, when you,

15:14.000 --> 15:18.000
sorry, the question is, if you want to download the data,

15:28.000 --> 15:32.000
is it proxied through

15:32.000 --> 15:34.000
Rucio, or is it basically obtained

15:34.000 --> 15:37.000
directly from the storage systems? And the answer is,

15:37.000 --> 15:40.000
you can get it directly from the storage systems.

15:40.000 --> 15:42.000
So you ask Rucio where the data is sitting, and then

15:42.000 --> 15:45.000
Rucio will do the magic to find the best copy

15:45.000 --> 15:48.000
available to you. Actually, it has some logic,

15:48.000 --> 15:51.000
using basically GeoIP, to find the computer

15:51.000 --> 15:54.000
center closest to you, so it gives you the closest one,

15:54.000 --> 15:56.000
and then you download directly from the site.

15:56.000 --> 15:58.000
And you don't have to deal with authentication

15:58.000 --> 16:03.000
directly; this is handled behind your back.

16:03.000 --> 16:05.000
Yeah.

16:05.000 --> 16:11.000
You mentioned that the data placement is based on rules,

16:11.000 --> 16:13.000
on the Rucio rules, right?

16:13.000 --> 16:17.000
Are these rules defined at the creation of the data,

16:17.000 --> 16:21.000
or by data consumers, and how do they,

16:21.000 --> 16:25.000
like the dynamic report properly by a business in the US,

16:25.000 --> 16:28.000
and try to access some data that is not in Rousseau,

16:28.000 --> 16:31.000
whether the data can do a bit change or,

16:31.000 --> 16:34.000
like, some part of the data is in Rousseau,

16:34.000 --> 16:38.000
whether it's not for the next instance.

16:38.000 --> 16:40.000
Yeah. So the question is,

16:40.000 --> 16:42.000
how is, let's say, the creation of the rules done,

16:42.000 --> 16:44.000
and who defines the rules.

16:44.000 --> 16:46.000
So it depends on the community.

16:46.000 --> 16:49.000
This will be my generic answer, because every community has

16:49.000 --> 16:50.000
different needs.

16:50.000 --> 16:53.000
But for example, what I see on a daily basis

16:53.000 --> 16:55.000
is that there are some system administrators

16:55.000 --> 16:58.000
that run the Rucio, let's say, instance of this project,

16:58.000 --> 17:01.000
and the users get some quota,

17:01.000 --> 17:04.000
so let's say they can transfer up to 50 terabytes across,

17:04.000 --> 17:06.000
let's say, all the computer centers,

17:06.000 --> 17:08.000
but if the request is more than 50 terabytes,

17:08.000 --> 17:10.000
it needs an approval from someone else.

17:10.000 --> 17:12.000
And Rucio has this approval workflow built in.

17:12.000 --> 17:15.000
So you can define the threshold, so the rule can be approved.
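
NOTE
Editor's sketch, not part of the talk: the quota check described in the answer, using the 50 terabyte figure mentioned by the speaker as the default threshold. The name needs_approval and the use of decimal terabytes are assumptions; this is not Rucio's actual approval code.
```python
TB = 10**12  # decimal terabyte, an assumption about the unit
def needs_approval(requested_bytes: int, quota_bytes: int = 50 * TB) -> bool:
    """A request above the user's quota must be approved by someone else."""
    return requested_bytes > quota_bytes
print(needs_approval(20 * TB))
print(needs_approval(80 * TB))
```
Requests at or below the threshold go through directly; only larger ones wait for an approver.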

17:19.000 --> 17:25.000
Okay, so that's it, thanks.