WEBVTT

00:00.000 --> 00:12.720
Okay, so I think we are good to start. Please welcome Damien Clochard.

00:12.720 --> 00:18.400
He will talk about PostgreSQL Anonymizer and privacy.

00:18.400 --> 00:22.560
This is a very timely topic, so yes, welcome, and thank you.

00:22.560 --> 00:32.560
Thanks a lot. Can you hear me?

00:32.560 --> 00:35.960
It's all right, let's go.

00:35.960 --> 00:43.840
So my name is Damien Clochard, and I'm one of the co-founders of Dalibo, which is a French PostgreSQL

00:43.840 --> 00:44.840
company, basically.

00:44.840 --> 00:50.120
And I'm also an active member of the French PostgreSQL community. If you've done some PostgreSQL

00:50.120 --> 00:55.040
in France, you probably know me for that.

00:55.040 --> 00:56.880
So what brings me here?

00:56.880 --> 01:00.080
Why am I here in front of you?

01:00.080 --> 01:04.040
Like five years, seven years ago, actually.

01:04.040 --> 01:08.760
Someone asked me to anonymize a database, and I said, well, easy, quick.

01:08.760 --> 01:10.920
I'm going to do it for next week.

01:10.920 --> 01:16.560
So I just wrote a bunch of PL/pgSQL scripts to do it.

01:16.560 --> 01:22.400
I just named it PostgreSQL Anonymizer, because that was the topic.

01:22.400 --> 01:27.560
And somehow I'm still there, seven years later, still working on it.

01:27.560 --> 01:32.520
The funny thing is, I never actually anonymized the database, in fact, but I built a tool

01:32.520 --> 01:35.440
an anonymizer, to do it.

01:35.440 --> 01:38.280
So this is my story.

01:38.280 --> 01:43.160
And I'm here to talk about privacy and how you can protect it.

01:43.160 --> 01:51.720
So I'm going to go a little bit on the definition of what privacy is before going to a

01:51.720 --> 01:59.160
more concrete example, and yeah, the principles of data protection and how to apply them

01:59.160 --> 02:05.160
with PostgreSQL Anonymizer, because that's the thing I built, but those principles could

02:05.160 --> 02:09.720
be used with other tools, too, okay?

02:10.680 --> 02:18.520
So the paradox of privacy, so yeah, basically, it's a strange concept, basically, privacy,

02:18.520 --> 02:26.520
because compared to other concepts like private property, private property is quite

02:26.520 --> 02:34.800
well defined, I think we all know that this computer is mine, but this is not mine, okay?

02:34.800 --> 02:42.000
This is very clear, but privacy is not so clear, actually.

02:42.000 --> 02:49.880
So yeah, we could each write a 45-minute essay about what privacy is, and I'm pretty sure we would all

02:49.880 --> 02:55.560
write different things about what is private, what is intimate for you.

02:55.560 --> 03:01.280
And actually, it's different between people, it's different between

03:01.360 --> 03:05.640
eras and also it's different between regions, okay?

03:05.640 --> 03:13.360
Someone in Asia may think privacy means something else, all right?

03:13.360 --> 03:21.800
So yeah, and basically, it's quite new, it's new in the history of humanity, it's a new

03:21.800 --> 03:31.240
thing, because basically, 200 years ago, everybody slept in the same bedroom, all right?

03:31.240 --> 03:36.800
So there was no intimacy in the sense that we talk about it now, you know, in the villages

03:36.800 --> 03:41.320
everybody knew everything about everyone.

03:41.320 --> 03:49.560
If you had a secret, maybe the only way to talk about it was to go to the local priest.

03:49.560 --> 03:53.960
And even that, do you remember that thing?

03:53.960 --> 04:01.160
So that was, yeah, 40, 50 years ago, trying to explain that to a teenager now, you know?

04:01.160 --> 04:08.160
We had, every house had this book, and in this book, there were the names and the phone

04:08.160 --> 04:13.040
numbers of everyone in the area, okay?

04:13.040 --> 04:15.840
Asked two pieces of that.

04:15.840 --> 04:22.200
I mean, I could give you a piece of paper now in this room, ask you for your phone numbers

04:22.200 --> 04:24.040
and your names.

04:24.040 --> 04:29.040
Would you give me your names and your phone numbers?

04:29.040 --> 04:34.120
Yeah, maybe some of you, but probably not all.

04:34.120 --> 04:42.640
So you see, it's relatively new, it's evolving, and basically, this is just the beginning.

04:42.640 --> 04:51.480
As our lives are getting more and more digitized, in return, we want more

04:51.480 --> 04:52.480
and more privacy.

04:52.720 --> 05:00.880
It's never going to stop. I would say that the grandchildren of our

05:00.880 --> 05:06.920
grandchildren will look at us just like we look at those poor guys.

05:06.920 --> 05:11.840
They will see privacy as a basic human right, actually.

05:11.840 --> 05:20.120
Okay, so, but not everyone agrees with me about that, and there are some guys, some people,

05:20.160 --> 05:25.160
like this one, Eric Schmidt, who would say something like, if you have something that you

05:25.160 --> 05:31.040
don't want anyone to know, maybe you shouldn't be doing it in the first place.

05:31.040 --> 05:32.040
Right?

05:32.040 --> 05:39.560
Yes, no comment. Well, this guy is the CEO of Google, was the CEO, he is not the CEO at

05:39.560 --> 05:47.240
this time, but he was, and this is the stupidest argument about privacy

05:47.320 --> 05:52.280
that you can ever bring into a discussion, right?

05:52.280 --> 06:01.000
We all have the right to privacy, it's not, yeah, you can publish anything you want, but

06:01.000 --> 06:03.640
you also have the right to hide things.

06:03.640 --> 06:05.640
All right?

06:05.640 --> 06:12.840
So, yeah, of course, big tech companies, they don't want this definition

06:12.920 --> 06:19.320
of privacy, because basically, their business model is collecting and selling your data.

06:19.320 --> 06:25.760
So, of course, they want to redefine what the meaning of privacy is, right?

06:25.760 --> 06:35.800
And so, there's a market for personal data that is blooming currently, so at this point

06:35.880 --> 06:46.040
right now, in 2025, the data broker industry is worth 2,000 billion, yeah, and this

06:46.040 --> 06:50.600
is just ungary, okay?

06:50.600 --> 06:56.040
And you've got these five companies. Do you know any of these companies? Does anyone here know

06:56.040 --> 06:57.040
any of them?

06:57.040 --> 06:58.040
Oh, yes.

06:58.040 --> 07:01.520
Do some of you work for them?

07:02.480 --> 07:10.960
I can ask, just asking, just asking. Well, these guys do know about you. You're probably

07:10.960 --> 07:18.960
in one of the five of them. They sell your data on a daily basis, and they make a lot,

07:18.960 --> 07:26.560
a lot of money, and you probably never gave your consent for that, all right?

07:26.560 --> 07:36.480
And so, the thing is, that's just the top five. There are about 200 of them, okay?

07:36.480 --> 07:41.720
And the top five are each bigger than Red Hat on their own.

07:41.720 --> 07:51.400
So, yeah, in theory, legally, you can send a letter to each one of them and ask for them

07:51.400 --> 07:56.040
to remove your data, and probably, probably, they will do it.

07:56.960 --> 08:03.640
But then next month, you will have to do it again, and you will have to do it for all the 200

08:03.640 --> 08:07.640
data broker companies, and then you will have to search, to look for new ones

08:07.640 --> 08:10.440
that will appear every day.

08:10.440 --> 08:18.600
So, of course, capitalism has a response for that, so you can hire companies that will

08:18.600 --> 08:25.000
write these letters for you, and it will only cost you 10 dollars a month.

08:25.320 --> 08:30.920
So, this is a great example of how capitalism will sell you the disease, and then sell

08:30.920 --> 08:32.520
you the cure.

08:32.520 --> 08:39.160
All right, so the black market is rising, because these guys need data, it's

08:39.160 --> 08:43.760
hard to get, and it's cheaper on the black market, basically.

08:43.760 --> 08:52.280
All right, of course, I have no way to prove that they're taking your data from the black market.

08:52.360 --> 08:56.200
They can sue me for saying that, but I'll take the risk.

08:58.200 --> 09:05.960
So, yeah, this market is absolutely not regulated, and GDPR does not change anything about that.

09:07.080 --> 09:11.880
And so, logically, the number of data breaches is just exploding.

09:13.080 --> 09:20.520
Last year, the number of notifications people received went up by three hundred percent.

09:21.080 --> 09:25.240
All right, you know those notifications, where you receive an email, when they say,

09:25.240 --> 09:33.240
oh, we value your data privacy, and we may have lost some of your data, but yeah, please stay with us.

09:33.240 --> 09:36.280
All right, we've all had this email, I guess.

09:39.080 --> 09:46.200
So, this is a battle, basically. People want more privacy and tech companies want more data.

09:47.160 --> 09:52.520
All right, and so, if you are a Postgres DBA, or any kind of DBA, actually,

09:53.320 --> 09:56.440
you're right in the middle of this right now.

09:57.640 --> 10:08.520
So, you're a target for data leaks, and you have responsibility for the data you hold.

10:10.200 --> 10:11.720
All right, so we're going to war.

10:12.040 --> 10:20.040
We're going to try to apply... you have to have a method if you want to win this war.

10:21.880 --> 10:27.880
So, we're going to check six basic principles for this battle.

10:28.520 --> 10:33.160
So, first one is privacy by design, which basically means

10:34.520 --> 10:40.200
your masking policy should be written by the application developers.

10:40.280 --> 10:43.000
It's something you should do at the start of the project.

10:43.560 --> 10:48.120
You don't do that at the end of the project, once the database is in production, it's too late.

10:48.920 --> 10:52.120
You need to think about it right now, right?

10:53.160 --> 11:00.200
When you add a column, when you add a new feature, just ask yourself, okay, what impact does it have

11:00.200 --> 11:02.600
on the privacy of our users?

11:03.880 --> 11:04.200
Right.

11:05.160 --> 11:07.320
Then you need role separation.

11:07.320 --> 11:15.240
As a DBA, you're going to give access to a lot of people to the data.

11:15.240 --> 11:24.040
So, don't give them all just one role. Separate the roles and give different rules to different roles.

11:26.200 --> 11:27.800
Attack surface reduction, it's a quick introduction.

11:27.800 --> 11:32.200
I'm going to say AASF for now, but

11:32.600 --> 11:36.120
this is a basic principle in security.

11:36.120 --> 11:37.640
It's not just for databases.

11:37.640 --> 11:46.280
It's just the idea that you should reduce the places where your data is, okay?

11:47.560 --> 11:50.680
Data minimization is basically the opposite of big data.

11:51.240 --> 11:52.920
Big data is completely dead.

11:52.920 --> 11:55.560
Big data is officially illegal.

11:55.560 --> 12:01.800
You should not collect data if you don't have a real usage for it, right?

12:02.440 --> 12:07.880
Which means, I don't know, maybe in your forms,

12:07.880 --> 12:11.960
you collect the birth date of your users.

12:11.960 --> 12:13.720
What do you use that for?

12:13.720 --> 12:18.680
Is it useful to have the birth date of every user you have?

12:18.680 --> 12:19.720
Probably not.

12:19.720 --> 12:21.080
Maybe, I don't know.

12:21.080 --> 12:22.920
But you have to think about it.

12:22.920 --> 12:28.280
And you can't have this philosophy of, let's collect everything and maybe in

12:28.280 --> 12:32.120
two years or four years, we will find something useful to do with this.

12:32.120 --> 12:33.080
It's not possible.

12:35.720 --> 12:40.680
Risk evaluation, like I said, the concept itself of privacy is always evolving.

12:41.720 --> 12:44.040
The technologies are evolving.

12:44.040 --> 12:47.240
You have new kinds of attacks, new ways to...

12:48.120 --> 12:51.840
So you need to constantly

12:53.320 --> 12:56.480
re-evaluate your policy.

12:57.720 --> 13:01.240
And the last one is privacy by default, which is pretty simple.

13:01.240 --> 13:04.280
If you don't know whether a column contains

13:04.280 --> 13:09.720
private data, you should treat it as if it contains private data, okay?

13:09.720 --> 13:11.240
If you don't know, you know.

13:12.360 --> 13:15.640
Okay, so again, six principles.

13:15.640 --> 13:18.240
I'm going to show you how you can implement them

13:18.240 --> 13:20.360
with PostgreSQL Anonymizer.

13:20.360 --> 13:25.120
But again, you could do that with any other tools.

13:25.120 --> 13:26.040
So what is this?

13:26.040 --> 13:28.840
This is an open source Postgres extension.

13:31.920 --> 13:34.200
And it's been in production for the last five years,

13:34.200 --> 13:37.840
I guess. It's written in Rust and PL/pgSQL.

13:37.840 --> 13:41.760
I did a talk yesterday about Rust and Rust extensions

13:41.760 --> 13:42.680
in Postgres.

13:42.680 --> 13:44.880
It should be on the FOSDEM website

13:44.880 --> 13:49.880
in a few hours, I guess.

13:49.880 --> 13:54.280
You can install it through RPM, DEB, Docker, Ansible,

13:54.280 --> 13:55.080
whatever.

13:55.080 --> 14:00.080
It's also available on most cloud platforms,

14:00.080 --> 14:04.480
such as Google Cloud, Azure, Crunchy Data, et cetera.

14:04.480 --> 14:07.560
And we do have a lot of experimental tutorials.

14:07.560 --> 14:11.280
So DGFiP, for the non-French people,

14:11.280 --> 14:13.320
It's the Ministry of Finance.

14:13.320 --> 14:15.680
And ANR

14:15.680 --> 14:22.040
is the French National Research Agency.

14:22.040 --> 14:24.680
Well, they both gave us money

14:24.680 --> 14:26.240
to develop this technology.

14:29.160 --> 14:29.800
So what is it?

14:29.800 --> 14:31.960
It's a masking engine.

14:31.960 --> 14:36.400
So you have different ways to mask your data.

14:36.400 --> 14:40.400
You can do static masking, dynamic masking, dumps, et cetera.

14:40.400 --> 14:45.160
I'm just going to talk about the first three.

14:45.160 --> 14:46.960
Because that's enough for you to understand,

14:46.960 --> 14:52.920
but there are a lot of ways to mask data, actually.

14:52.920 --> 14:54.520
And it's also a masking toolbox.

14:54.520 --> 15:00.560
So it's, at the same time, about what you want to mask

15:00.560 --> 15:03.320
and how you want to mask the data.

15:03.320 --> 15:06.800
And there's a lot of different ways to mask the data,

15:06.800 --> 15:09.000
depending on what you want to do.

15:09.000 --> 15:10.680
You've got pseudonymization.

15:10.680 --> 15:11.560
You have noise.

15:11.560 --> 15:13.880
You have fake data.

15:13.880 --> 15:17.640
You have partial destruction, generalization.

15:17.640 --> 15:21.640
You can also manipulate images.

15:21.640 --> 15:24.600
So again, I'm not going to go into each one of them,

15:24.600 --> 15:28.360
but you get the idea.

15:28.360 --> 15:29.320
OK, let's go.

15:29.320 --> 15:33.320
So I have installed the extension.

15:33.320 --> 15:34.760
It's a binary extension.

15:34.760 --> 15:36.520
I need to load it into my database.

15:36.520 --> 15:40.280
So the best way to do it would be to use session

15:40.280 --> 15:42.120
preload libraries.

15:42.120 --> 15:45.160
All right, pretty simple.

15:45.160 --> 15:47.560
And then connect to the database again

15:47.560 --> 15:50.600
and just create the extension.

15:50.600 --> 15:51.800
And that's it.

15:51.800 --> 15:54.600
Let's go.

15:54.600 --> 15:55.240
Let's go.

15:55.240 --> 15:57.720
And let's implement privacy by design.

15:57.720 --> 16:01.960
So again, privacy by design is a declarative approach

16:01.960 --> 16:04.600
of anonymization.

16:04.600 --> 16:09.480
We are going to write our masking rules inside the database

16:09.480 --> 16:10.520
model.

16:10.520 --> 16:11.640
All right?

16:11.640 --> 16:15.880
And but how can we do that?

16:15.880 --> 16:20.520
How can we add metadata inside the tables

16:20.520 --> 16:23.000
and inside the database model?

16:23.000 --> 16:25.800
Well, there is this thing called security label.

16:25.800 --> 16:28.520
That probably most of you don't know.

16:28.520 --> 16:30.280
It's a feature of PostgreSQL.

16:30.280 --> 16:32.440
It's SQL.

16:32.440 --> 16:36.600
And you can, with this, you can attach label metadata

16:36.600 --> 16:39.080
on objects in your database.

16:39.080 --> 16:41.080
All right.

16:41.080 --> 16:42.680
So let's get to an example.

16:42.680 --> 16:43.960
So I've got a table,

16:43.960 --> 16:47.480
people, and it has an ID, first name, last name, and phone number.

16:47.480 --> 16:49.000
All right.

16:49.000 --> 16:50.200
Let's go.

16:50.200 --> 16:56.760
So what I want to do is add, in my database model,

16:56.760 --> 17:01.720
a line that is just going to say how this

17:01.800 --> 17:04.920
column, so in this case, the column is last name.

17:04.920 --> 17:08.920
And this is how I'm going to replace it for the masked user.

17:08.920 --> 17:10.360
So I'm going to replace it.

17:10.360 --> 17:14.120
I'm going to mask it with the function dummy last name,

17:14.120 --> 17:18.120
and the function dummy last name returns a generic last name,

17:18.120 --> 17:20.040
any kind of last name.

17:20.040 --> 17:28.280
So this is just like putting a constraint on your column.

17:28.280 --> 17:31.080
You would say that last name for example,

17:31.080 --> 17:35.240
is not null, or maybe there's a, I don't know,

17:35.240 --> 17:39.320
some kind of check on this.

17:39.320 --> 17:41.800
And it's quite the same thing, actually.

17:41.800 --> 17:43.960
You're just saying, oh, this is the column.

17:43.960 --> 17:52.200
This is how I'm going to transform this column when I need to.

17:52.200 --> 17:57.080
So you can also destroy the data.

17:57.080 --> 18:00.680
Here, I'm replacing the data with a function,

18:00.680 --> 18:03.800
but I can also just replace it with a value,

18:03.800 --> 18:05.640
with a static value.

18:05.640 --> 18:13.240
And so if there's just one thing to keep from this talk,

18:13.240 --> 18:17.480
it is that destruction is the best anonymization.

18:17.480 --> 18:21.640
Yeah, no one can steal something you have destroyed.

18:21.640 --> 18:25.880
So again, with data minimization, we'll talk about it later.

18:25.880 --> 18:30.360
But yeah, if you want to be sure that some data will not leak,

18:30.440 --> 18:34.920
just replace it with a static value.

18:34.920 --> 18:38.920
But a third example is, for example,

18:38.920 --> 18:42.600
we want the phone to be partially destroyed.

18:42.600 --> 18:49.720
So we're going to just destroy the digits in the middle of the phone number.
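
NOTE
A minimal Python sketch of the partial destruction idea described above:
keep a few leading and trailing characters and overwrite the middle.
The name partial_mask, its parameters, and the sample phone number are
hypothetical, chosen for illustration; this is not the extension's own code.
```python
def partial_mask(value: str, keep_start: int, padding: str, keep_end: int) -> str:
    """Keep keep_start leading and keep_end trailing characters; replace the rest with padding."""
    if keep_start < 0 or keep_end < 0:
        raise ValueError("keep_start and keep_end must be non-negative")
    # If the kept parts would cover the whole value, nothing would be hidden,
    # so return only the padding rather than leak the original.
    if keep_start + keep_end >= len(value):
        return padding
    suffix = value[len(value) - keep_end:] if keep_end else ""
    return value[:keep_start] + padding + suffix
# Example with a made-up phone number.
masked_phone = partial_mask("0609110911", 2, "******", 2)
print(masked_phone)
```
The result is deterministic: the same input always gives the same masked
value, which is why a partially masked column looks stable across reads.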

18:49.720 --> 18:57.560
And then I'm going to apply those rules on the table people.

18:57.560 --> 18:59.080
And here I am.

18:59.080 --> 19:02.120
So the table has been statically masked.

19:02.120 --> 19:06.440
It's done now, forever.

19:06.440 --> 19:11.720
And as you see, the ID has disappeared, it's null.

19:11.720 --> 19:13.480
The first name is the same.

19:13.480 --> 19:15.480
I didn't mask it.

19:15.480 --> 19:24.680
The last name is a fake last name, but it's as schema model.

19:24.680 --> 19:26.760
OK, let's go with role separation.

19:26.760 --> 19:30.920
Again, you're going to have some roles that will be masked.

19:30.920 --> 19:36.520
So the masking rules will be applied automatically

19:36.520 --> 19:38.680
for these people.

19:38.680 --> 19:42.280
And by definition, a masked role will be read-only.

19:42.280 --> 19:45.880
But while the masked role will be read-only,

19:45.880 --> 19:52.120
the other roles will be able to read the data and write the data.

19:52.120 --> 19:53.320
So let's go.

19:53.320 --> 19:56.360
We're going to create a new role, which is skynet.

19:56.360 --> 19:57.880
And it's allowed to connect.

19:57.880 --> 20:02.440
We're going to activate transparent dynamic masking for him.

20:02.440 --> 20:05.800
And we're going to say, again, with the security label,

20:05.800 --> 20:08.200
that this role is masked.

20:08.200 --> 20:12.520
And we're going to go, actually, I'm going to give it

20:12.520 --> 20:16.760
the pg_read_all_data privilege,

20:16.760 --> 20:20.360
because it's easier. I could do things

20:20.360 --> 20:26.360
a little more subtle, but let's go with this.

20:26.360 --> 20:29.800
So now, when skynet connects,

20:29.800 --> 20:31.880
it will connect to the people's table.

20:31.880 --> 20:37.480
It will try to read the people's table, and it will be masked.

20:37.480 --> 20:41.000
And as you see, the last name has changed, because every time

20:41.000 --> 20:46.440
it will query the data, the function will be called again.

20:46.440 --> 20:49.800
So we will see, for the fake data generator,

20:49.800 --> 20:53.960
we will see different fake data every time.

20:53.960 --> 20:59.240
But for the phone, it will be always the same result.

20:59.240 --> 21:03.240
But now, if I connect back as a normal user, as postgres,

21:03.240 --> 21:04.840
sorry.

21:04.840 --> 21:07.000
The data is actually changed for the people,

21:07.000 --> 21:08.920
because it's a viewer.

21:08.920 --> 21:11.640
It's called masked.

21:11.640 --> 21:14.600
So we can re-changer that in the row,

21:14.600 --> 21:18.760
which is when you read it, it just changes on the fly when you read it.

21:18.760 --> 21:22.600
Yes, sorry.

21:22.600 --> 21:26.680
So the question is, is the data changed in a view

21:26.680 --> 21:28.520
or change on the fly?

21:28.520 --> 21:30.520
It's changed on the fly, actually.

21:30.520 --> 21:35.800
Basically, what happens is that the select clause

21:35.800 --> 21:38.760
will be intercepted.

21:38.760 --> 21:43.720
And the extension will rewrite the query

21:43.720 --> 21:46.760
to display this.

21:46.760 --> 21:50.520
But actually, the masked role doesn't even know

21:50.520 --> 21:51.480
it is masked.

21:51.480 --> 21:53.480
All right?

21:53.480 --> 21:54.920
What if you have it in the index?

21:54.920 --> 21:57.000
Will the mask be available?

21:57.000 --> 21:58.920
No, if you...

21:58.920 --> 22:00.280
Can you repeat the question?

22:00.280 --> 22:01.960
Sorry.

22:01.960 --> 22:04.040
What about the index?

22:04.040 --> 22:06.040
Yes, the index is not masked.

22:06.040 --> 22:10.200
So it's still useful in that case.

22:10.200 --> 22:11.960
So basically, this is it.

22:11.960 --> 22:16.600
You have masking rules written by the Postgres administrator.

22:16.600 --> 22:20.360
Some guy, a regular user, can read and write the data.

22:20.360 --> 22:24.200
And the masked role can only read the masked data.

22:24.200 --> 22:26.200
All right, let's go.

22:26.200 --> 22:28.920
We're going to talk about attack surface reduction.

22:28.920 --> 22:34.040
OK, this one, I think it's going to be a clear example

22:34.040 --> 22:35.240
for you all.

22:35.240 --> 22:37.320
Let's say we have a production database

22:37.320 --> 22:40.840
with a user, the DBA, and you've got a developer

22:40.840 --> 22:43.960
and a data scientist, and they want the data.

22:43.960 --> 22:48.040
The developer wants to run some tests with realistic data.

22:48.040 --> 22:51.720
And the data scientists need to, I don't know,

22:51.720 --> 22:57.080
run a weekly reporting big query about the stats,

22:57.080 --> 22:59.640
about the company and everything.

22:59.640 --> 23:00.520
All right.

23:00.520 --> 23:04.920
So the worst-case scenario is this.

23:04.920 --> 23:08.120
You're going to send the real data to both of them,

23:08.120 --> 23:13.320
and they are going to have a copy of the data on their desktop.

23:13.320 --> 23:18.520
Maybe one of them has a Windows laptop, all right?

23:18.520 --> 23:21.560
Just saying.

23:21.560 --> 23:23.320
And so yeah, your attack surface...

23:23.320 --> 23:26.920
The attack surface is everyone.

23:26.920 --> 23:31.640
So if I want to steal your data, I'm not going to target this,

23:31.640 --> 23:35.960
because this is very hard, because you're a very good DBA,

23:35.960 --> 23:38.840
and you have protected this area very well.

23:38.840 --> 23:42.200
I'm going to attack this guy, or maybe this guy.

23:42.200 --> 23:45.800
Maybe I'm just going to steal this laptop on the bus.

23:45.800 --> 23:50.440
Or maybe I'm just going to put some kind of trojan

23:50.440 --> 23:55.960
horse on this guy's PC, all right?

23:55.960 --> 23:57.640
So we don't want that.

23:57.640 --> 24:00.520
This is the worst scenario.

24:00.520 --> 24:03.640
So what most people would do is this,

24:03.640 --> 24:08.120
is extract the data, transform it, and then push it

24:08.120 --> 24:11.880
to these two environments.

24:11.880 --> 24:12.880
This is nice.

24:12.880 --> 24:17.320
This is good, I'm not judging this.

24:17.320 --> 24:18.520
It's correct.

24:18.520 --> 24:22.120
But as you see, the attack surface is not

24:22.120 --> 24:23.520
reduced that much.

24:23.520 --> 24:28.760
So those two guys are not an attack vector now,

24:28.760 --> 24:33.440
but you've got a new guy in the loop in the pipeline.

24:33.440 --> 24:38.560
And this guy is also a vector of attack, right?

24:38.560 --> 24:41.880
So in some ways, you've reduced the surface,

24:41.880 --> 24:47.000
but the whole pipeline is a bit more complicated.

24:47.000 --> 24:49.520
And of course, in the age of AI,

24:49.520 --> 24:50.880
So you have this now.

24:50.880 --> 24:55.280
A lot of new startups saying, oh, just send us your data,

24:55.280 --> 24:59.080
and we have a new AI thing that will analyze this.

24:59.080 --> 25:02.840
So this is the worst idea in the entire history

25:02.840 --> 25:08.240
of data privacy, first because there's someone

25:08.240 --> 25:12.040
in this cloud, which will be basically

25:12.040 --> 25:15.240
able to see the logs of this AI thing.

25:15.240 --> 25:19.440
And you don't even know how the AI has been trained.

25:19.440 --> 25:22.400
You don't even know what they do with it.

25:22.400 --> 25:24.240
So yeah, this is a nightmare.

25:27.120 --> 25:32.240
So what we want to do is this.

25:32.240 --> 25:35.080
We reduce the surface of attack to this,

25:35.080 --> 25:38.240
which is, as I said earlier, very secure,

25:38.240 --> 25:40.520
because you're a very good DBA.

25:40.520 --> 25:46.080
And we're going to either push the data with an anonymized

25:46.080 --> 25:51.000
dump, which we'll push directly to this environment.

25:51.000 --> 25:54.880
Or we're just going to give access to this guy.

25:54.880 --> 25:57.040
So it doesn't have a copy anymore.

25:57.040 --> 26:00.000
It's just allowed to connect to the production.

26:00.000 --> 26:01.800
But it is masked.

26:01.800 --> 26:06.240
So he won't see any personal data.

26:06.240 --> 26:08.680
All right.

26:08.680 --> 26:10.320
So let's do it.

26:10.320 --> 26:15.080
We can do this pg_dump thing, the anonymous dump thing.

26:15.080 --> 26:18.120
Well, again, we're going to create a new user,

26:18.120 --> 26:22.200
a new role, which will be used just for the dumps.

26:22.200 --> 26:25.880
And we will activate transparent dynamic masking.

26:25.880 --> 26:29.800
We're going to say it's masked, and let's go.

26:29.800 --> 26:33.120
Now I'm just going to use pg_dump with this user.

26:33.120 --> 26:35.880
And I'm going to get a masked dump.

26:35.880 --> 26:38.680
And now with this anonymous dump,

26:38.680 --> 26:42.280
I can share it everywhere on my network.

26:42.280 --> 26:46.240
I can send it by email to someone.

26:46.240 --> 26:50.800
It's completely out of the attack surface,

26:50.800 --> 26:54.880
because there's no personal data in the dump.

26:54.880 --> 26:57.920
Of course, if you use pg_dump with a regular user,

26:57.920 --> 27:03.040
you will get a dump with the regular data.

27:07.040 --> 27:10.240
So we're going to use this dump to refresh environments.

27:10.240 --> 27:13.040
And this is, again, a regular pg_dump.

27:13.040 --> 27:17.360
It's not a wrapper, and you can use any kind of option you

27:17.360 --> 27:20.720
use with a classic pg_dump.

27:20.720 --> 27:26.600
And notably, the custom format works.

27:26.600 --> 27:27.720
OK.

27:27.720 --> 27:33.520
So let's go again, next principle is data minimization.

27:33.520 --> 27:38.160
So in most cases, when you anonymize something for someone,

27:38.160 --> 27:41.400
they don't really need all of the data.

27:41.400 --> 27:44.280
A sample of the data is sufficient for tests,

27:44.280 --> 27:47.520
for analytics, for demo, for training data.

27:47.520 --> 27:51.640
Most of the time, just one small part of the entire data

27:51.640 --> 27:53.440
set is enough.

27:53.440 --> 27:55.920
So this is called sampling.

27:55.920 --> 27:57.960
Maybe you know that Postgres already

27:57.960 --> 27:59.160
has this?

27:59.160 --> 28:02.760
It's a clause called TABLESAMPLE. Has anyone ever

28:02.760 --> 28:04.800
used it here?

28:04.800 --> 28:06.080
Oh, two guys.

28:06.080 --> 28:07.080
Great.

28:07.080 --> 28:09.240
So yeah, Postgres has this already,

28:09.240 --> 28:11.920
for, I don't know, years and years.

28:11.920 --> 28:16.280
And you're able to say, I just want this fraction

28:16.280 --> 28:19.080
of the result.

28:19.080 --> 28:23.520
So let's say we have a big HTTP log table with the date

28:23.520 --> 28:28.040
with the IP address, the URL, everything.

28:28.040 --> 28:31.800
So we're going to put a security label on the IP address

28:31.800 --> 28:34.440
thing, so we're just going to destroy the IP address.

28:34.440 --> 28:37.240
We don't need it for the stats, for example.

28:37.240 --> 28:41.120
And we're going to just send 10% of the table

28:41.120 --> 28:42.520
to the masked users.

28:42.520 --> 28:46.600
So we'll see 100% of the table.

28:46.600 --> 28:51.040
The masked users will see only 10% of the table.
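
NOTE
A minimal Python sketch of the sampling idea described above: each row
is kept independently with a fixed probability, which is the principle
behind the BERNOULLI method of PostgreSQL's TABLESAMPLE clause. The name
bernoulli_sample, the seed parameter, and the log rows are hypothetical,
for illustration only; this is not how the extension is implemented.
```python
import random
def bernoulli_sample(rows, percent, seed=None):
    """Keep each row independently with probability percent / 100."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    rng = random.Random(seed)
    return [row for row in rows if rng.random() * 100 < percent]
# Made-up log rows: (id, url, ip); the IP address is already destroyed (None).
logs = [(i, "/page/%d" % (i % 7), None) for i in range(10_000)]
sample = bernoulli_sample(logs, 10, seed=42)
print(len(sample), "rows out of", len(logs))
```
The sample size is only approximately 10% of the table, because every
row is an independent draw rather than part of a fixed quota.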

28:51.040 --> 28:53.640
All right.

28:53.640 --> 28:58.880
And we can also sample with RLS policies.

28:58.880 --> 29:01.640
Another great feature of Postgres.

29:01.640 --> 29:07.240
Again, who has used row-level security policies here?

29:07.240 --> 29:10.000
OK, yeah, good.

29:10.000 --> 29:12.760
So basically, row-level security policies

29:12.760 --> 29:14.840
are like filters you're going to define

29:14.840 --> 29:18.080
at the row level of each table.

29:18.080 --> 29:25.400
And so with this one, again, we're going to apply those rules.

29:25.400 --> 29:30.280
So this is, again, pure SQL, nothing fancy.

29:30.280 --> 29:33.320
So we're going to create a policy on the logs.

29:33.320 --> 29:38.800
And we're going to say that we're going to apply this rule

29:38.800 --> 29:40.400
only for the masked user.

29:40.400 --> 29:45.080
That is, if the current user is masked,

29:45.080 --> 29:50.040
and we're going to only apply it for the values

29:50.040 --> 29:52.840
that are less than six months old.

29:52.840 --> 29:57.200
So the masked user will only see the latest data.

29:57.200 --> 30:01.000
Only the data of the last six months.
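
NOTE
A minimal Python sketch of the row filter described above: a masked
reader only sees recent rows, while other readers see everything. The
names visible_rows, is_masked and max_age_days, the 183-day window, and
the sample dates are assumptions for illustration; in the talk this is
expressed as a row-level security policy in SQL, not in Python.
```python
from datetime import datetime, timedelta
def visible_rows(rows, is_masked, now, max_age_days=183):
    """Masked readers see only rows newer than max_age_days; others see everything."""
    if not is_masked:
        return list(rows)
    cutoff = now - timedelta(days=max_age_days)
    return [row for row in rows if row["date"] >= cutoff]
# Arbitrary reference date and made-up log rows.
now = datetime(2025, 2, 2)
logs = [
    {"date": datetime(2025, 1, 15), "url": "/home"},
    {"date": datetime(2023, 6, 1), "url": "/old"},
]
print(len(visible_rows(logs, True, now)), len(visible_rows(logs, False, now)))
```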

30:01.000 --> 30:03.120
All right?

30:03.120 --> 30:04.720
OK.

30:04.720 --> 30:08.880
I really like this example, actually.

30:08.880 --> 30:11.280
OK, let's go with risk evaluations.

30:11.280 --> 30:12.280
This one is hard.

30:12.280 --> 30:16.680
We still have a lot of work to do in this area.

30:16.680 --> 30:19.560
But we have two features for that.

30:19.560 --> 30:21.680
One is called k-anonymity.

30:21.680 --> 30:23.640
And the other is a detection function.

30:23.640 --> 30:24.800
So let's go.

30:24.800 --> 30:28.040
And talk about k-anonymity.

30:28.040 --> 30:29.760
It's an industry standard, basically.

30:29.760 --> 30:34.160
You will find it on other tools too, hopefully.

30:34.160 --> 30:40.920
And it's a factor that measures the risk

30:40.920 --> 30:45.200
of re-identifying someone within your data set.

30:45.200 --> 30:48.160
So maybe you applied your masking rules,

30:48.160 --> 30:51.640
but there's still some guy who is an edge case

30:51.640 --> 30:55.240
and you're still able to find one unique person

30:55.240 --> 30:56.520
within your data set.

30:56.520 --> 30:57.560
Right?

30:57.560 --> 31:00.240
So it's just a function, actually,

31:00.240 --> 31:02.520
that you can run on your table.

31:02.520 --> 31:09.680
And it will try to guess how many single people you can find.

31:09.680 --> 31:14.760
No, not how many single, but how difficult it

31:14.760 --> 31:17.160
would be to identify someone.

31:17.160 --> 31:19.960
So here the factor is three, which is good,

31:19.960 --> 31:22.720
but not great, not the best.

31:22.720 --> 31:24.760
The higher the value, the better.

31:24.760 --> 31:25.840
All right?

31:25.840 --> 31:28.120
If it's one, it means that there's one guy

31:28.120 --> 31:32.920
who is unique in your data set.
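
NOTE
A minimal Python sketch of the k-anonymity factor described above. It is not the
extension's implementation: the function name, the table and the column names are
hypothetical. It assumes the factor is the size of the smallest group of rows that
share the same quasi-identifier values, so a factor of 1 means someone is unique.
```python
from collections import Counter
def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group of rows sharing the same quasi-identifier values."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0
# Hypothetical masked table: zip codes and birth dates were already generalized.
patients = [
    {"zip": "750**", "birth_year": 1980, "disease": "flu"},
    {"zip": "750**", "birth_year": 1980, "disease": "asthma"},
    {"zip": "750**", "birth_year": 1980, "disease": "flu"},
    {"zip": "690**", "birth_year": 1975, "disease": "diabetes"},
    {"zip": "690**", "birth_year": 1975, "disease": "flu"},
    {"zip": "690**", "birth_year": 1975, "disease": "asthma"},
]
k = k_anonymity(patients, ["zip", "birth_year"])  # every group holds 3 rows
# One extra row with a unique combination drops the factor to 1.
edge_case = {"zip": "130**", "birth_year": 1931, "disease": "flu"}
k_with_edge_case = k_anonymity(patients + [edge_case], ["zip", "birth_year"])
```
Which columns count as quasi-identifiers is a judgment call; the sketch takes them as input.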

31:32.920 --> 31:36.560
And we also have a detection function.

31:36.560 --> 31:43.080
So we're going to scan all your tables and all the columns.

31:43.080 --> 31:46.080
And with heuristics, we're going to try to guess

31:46.080 --> 31:51.680
that maybe this customer first name column

31:51.680 --> 31:53.400
doesn't have a masking rule.

31:53.400 --> 31:59.640
And maybe this one is a direct identifier.

31:59.640 --> 32:01.880
So it's not perfect.

32:01.880 --> 32:05.720
I think there's a lot of room to improve this one.

32:05.720 --> 32:09.160
But this one is like a checker to avoid

32:09.160 --> 32:10.760
missing one column.

32:10.760 --> 32:18.200
And finally, privacy by default, like I said,

32:18.200 --> 32:24.800
if you don't know whether a column holds personal data,

32:24.800 --> 32:28.360
just mask it by default.

32:28.360 --> 32:33.360
So again, we're going to take the HTTP log table.

32:33.360 --> 32:38.680
And so yeah, we have a date, the IP address, et cetera,

32:38.840 --> 32:47.200
and we're going to just activate the privacy by default parameter.

32:47.200 --> 32:50.600
And instead of masking the data, we're going to unmask the data.

32:50.600 --> 32:54.040
So now, we've activated privacy by default.

32:54.040 --> 32:56.840
So everything is going to be masked.

32:56.840 --> 33:00.480
And we're going to unmask the thing we want to see.

33:00.480 --> 33:08.280
So maybe, yeah, maybe the URL of the log, we want to see it.

33:08.280 --> 33:10.240
So we're going to unmask it.

33:10.240 --> 33:12.520
I'm going to say it's not masked.

33:12.520 --> 33:17.320
And the date, we're going to just keep the year.

33:17.320 --> 33:20.680
And we remove the month and the day.

33:20.680 --> 33:25.160
So we generalize the date.

33:25.160 --> 33:31.560
And so with this, once I anonymize the table,

33:31.560 --> 33:35.760
what I get is, OK, the date has been changed.

33:35.760 --> 33:40.320
I just keep the year.

33:40.320 --> 33:42.560
The IP address has disappeared.

33:42.560 --> 33:45.320
For the rest, I took the default value.

33:45.320 --> 33:48.120
So I have the default value of this column.

33:48.120 --> 33:52.160
And I unmask the URL.
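
NOTE
A minimal Python sketch of the privacy-by-default idea described above, not the
extension's code. The column names, the NOT_MASKED marker and the column_defaults
mapping are hypothetical. It assumes a column without any rule falls back to its
declared default value, or to nothing at all when there is no default.
```python
from datetime import date
NOT_MASKED = object()  # hypothetical marker meaning "expose this column as-is"
def mask_row(row, rules, column_defaults):
    """Mask every column unless a rule explicitly says otherwise."""
    masked = {}
    for column, value in row.items():
        rule = rules.get(column)
        if rule is NOT_MASKED:
            masked[column] = value  # explicitly unmasked
        elif rule is not None:
            masked[column] = rule(value)  # explicit masking rule
        else:
            masked[column] = column_defaults.get(column)  # default value, else None
    return masked
rules = {
    "url": NOT_MASKED,
    "date_open": lambda d: d.replace(month=1, day=1),  # keep only the year
}
column_defaults = {"user_agent": "unknown"}
log = {
    "date_open": date(2023, 5, 17),
    "ip_addr": "192.0.2.10",
    "user_agent": "curl/8.0",
    "url": "/index.html",
}
masked = mask_row(log, rules, column_defaults)
```
The point of the design is that a forgotten column ends up hidden rather than exposed.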

33:52.160 --> 33:55.000
Right.

33:55.000 --> 33:57.080
OK, I'm almost finished now.

33:57.080 --> 34:01.800
So yeah, the battle for privacy is happening right now.

34:01.800 --> 34:05.360
Whether you want it or not, you're in the middle of it.

34:05.360 --> 34:07.000
So you have a responsibility.

34:07.000 --> 34:08.800
And you're also a target for data leaks.

34:12.240 --> 34:17.920
So you need to step up and take actions.

34:17.920 --> 34:23.520
But you don't have to take all actions at once.

34:23.520 --> 34:28.480
You can improve things in iterations.

34:28.480 --> 34:31.080
And so first things first, the one thing you need

34:31.080 --> 34:33.720
to do is privacy by design.

34:33.720 --> 34:40.720
So you need to go talk to the developers about privacy

34:40.720 --> 34:45.760
and ask them which columns hold private data, which one

34:45.760 --> 34:49.080
is OK to publish, et cetera, et cetera.

34:49.080 --> 34:54.720
And maybe, if this is software that you

34:54.720 --> 35:00.720
bought, you need to go and reach out to the editor of the software.

35:00.720 --> 35:04.880
And say it's their responsibility, their software.

35:04.880 --> 35:09.120
They have designed a database model.

35:09.120 --> 35:14.120
It's up to them to tell you how you should apply masking

35:14.120 --> 35:15.200
upon this thing.

35:15.200 --> 35:17.920
It's not up to you to reverse engineer

35:17.920 --> 35:21.120
that database model to guess which columns should

35:21.120 --> 35:22.400
be masked or not.

35:22.400 --> 35:23.680
It's their responsibility.

35:23.680 --> 35:27.760
Don't let them avoid that responsibility.

35:27.760 --> 35:30.000
And once again, they can do it with PostgreSQL Anonymizer.

35:30.000 --> 35:31.560
It's free and open source.

35:31.560 --> 35:34.760
If you want to do it with another tool, maybe they

35:34.760 --> 35:37.440
would want to develop their own tooling.

35:37.440 --> 35:40.400
Maybe a lot of editors, right now,

35:40.400 --> 35:45.800
would provide you with their own scripts.

35:45.800 --> 35:49.120
I would say that it's just like rolling your own crypto

35:49.120 --> 35:53.080
or writing your own database backup scripts.

35:53.080 --> 35:53.720
It's nice.

35:53.720 --> 35:57.040
It's fun to do, but basically, just don't reinvent

35:57.040 --> 36:00.240
the wheel, use a proper masking tool.

36:00.240 --> 36:04.360
Do not reinvent your own masking tools.

36:04.360 --> 36:05.440
I did it.

36:05.440 --> 36:06.840
I know what it costs.

36:10.840 --> 36:13.040
Then once you've done this, yeah,

36:13.040 --> 36:16.080
once you have this philosophical mindset

36:16.080 --> 36:20.080
of doing privacy by design, everything else is easier.

36:20.080 --> 36:22.320
Once you've said that, in everything we do,

36:22.320 --> 36:24.800
we think about privacy at the very beginning,

36:24.800 --> 36:26.800
everything else is easier.

36:26.800 --> 36:31.440
And so the next steps are role separation.

36:31.440 --> 36:34.480
Attack surface reduction.

36:34.480 --> 36:37.400
And finally, anonymization, privacy

36:37.400 --> 36:39.040
by default, et cetera.

36:39.040 --> 36:42.240
So of course, you can do it in a different order,

36:42.240 --> 36:45.800
but that's my advice.

36:45.800 --> 36:48.120
And do not fight alone.

36:48.120 --> 36:49.920
Get a team.

36:49.920 --> 36:56.040
Because again, it involves a lot of different talents

36:56.040 --> 36:56.880
to do this.

36:56.880 --> 36:59.760
So you have to have the developers with you.

36:59.760 --> 37:01.840
You have to have the sysadmins, the DPO,

37:01.840 --> 37:06.440
the software editor, if that's software that you bought,

37:06.440 --> 37:08.880
etc., etc., yeah.

37:08.880 --> 37:12.640
Do not do this by yourself on your own.

37:12.640 --> 37:14.040
It's teamwork.

37:16.200 --> 37:19.920
And again, different strategies for different use cases.

37:19.920 --> 37:24.920
There's not one single strategy that works for everyone.

37:24.920 --> 37:27.240
You even have some databases where you

37:27.240 --> 37:29.320
have multiple masking policies.

37:29.320 --> 37:33.440
So maybe one policy just for GDPR.

37:33.440 --> 37:37.520
And another policy for, I don't know, commercial secrets.

37:37.520 --> 37:39.960
And commercial secrets are not concerned

37:39.960 --> 37:42.600
by GDPR.

37:42.600 --> 37:45.320
So different tools, different techniques

37:45.320 --> 37:48.440
for different contexts.

37:48.440 --> 37:51.040
And that's about it.

37:51.040 --> 37:56.640
So yeah, the extension is available on GitLab.

37:56.640 --> 37:58.720
I wrote a four-hour tutorial, so

37:58.720 --> 38:00.720
if you want to try it, there's documentation, and

38:00.720 --> 38:03.280
you can try the tutorial.

38:03.280 --> 38:06.320
And you can contact me if you have any questions.

38:06.320 --> 38:09.680
And also, if some people already use the extension

38:09.680 --> 38:13.840
in the world... I should have asked at the beginning.

38:13.840 --> 38:16.720
Does anyone already use the extension?

38:16.720 --> 38:19.000
OK, one guy, yeah, good.

38:19.000 --> 38:20.960
So if you look at the issues on GitLab,

38:20.960 --> 38:23.360
there's a survey right now where

38:23.360 --> 38:26.760
I try to get feedback from the users to know

38:26.760 --> 38:28.280
what they do with the extension.

38:28.280 --> 38:34.240
So yeah, you can answer that if you're using it.

38:34.240 --> 38:36.240
And that's it for me.

38:36.240 --> 38:39.760
So please do not move, wait for the end of the questions

38:39.760 --> 38:42.960
to move; it's easier for me to answer that way.

38:42.960 --> 38:43.960
OK.

38:43.960 --> 38:54.640
That was a very interesting talk.

38:54.640 --> 38:55.840
Thank you.

38:55.840 --> 38:56.680
Any questions?

38:56.680 --> 38:57.680
Ooh.

39:06.000 --> 39:07.640
All right, yeah, sure.

39:07.640 --> 39:08.800
I have a question.

39:08.800 --> 39:09.840
One more.

39:09.840 --> 39:10.640
One more.

39:10.640 --> 39:13.480
OK, so just a moment for the microphone.

39:13.480 --> 39:17.080
And after this one, OK.

39:17.080 --> 39:18.440
Hey, thanks for the great talk.

39:18.440 --> 39:19.680
I have a question.

39:19.680 --> 39:24.840
Can you use logical replication with the masked data?

39:24.840 --> 39:29.280
I thought because sometimes dumping a big database

39:29.280 --> 39:31.480
is very time consuming and still we

39:31.480 --> 39:35.160
want to have masked data somewhere

39:35.160 --> 39:38.200
on the staging environment, let's say.

39:38.200 --> 39:41.240
Yeah, that's a great question.

39:41.240 --> 39:44.720
We actually, so currently, we are in V2.

39:44.720 --> 39:48.040
And we're working on v3, version 3.

39:48.040 --> 39:50.480
And so we are working on something

39:50.480 --> 39:55.360
that we're going to call the masking logical decoder,

39:55.360 --> 39:58.520
where you can subscribe to the publication

39:58.520 --> 40:02.480
and have your logical replication, but with the masking

40:02.480 --> 40:05.080
rules applied on the fly.

40:05.080 --> 40:08.880
So yeah, it's not ready at all, but yeah,

40:08.880 --> 40:12.560
we're really thinking about that, because a lot of people...

40:12.560 --> 40:15.000
Most of them, some people,

40:15.000 --> 40:20.960
cannot... well, as I said, do this.

40:20.960 --> 40:23.320
Yeah, a lot of people cannot do this,

40:23.320 --> 40:24.840
because there's too much data.

40:24.840 --> 40:28.040
It takes too long, OK?

40:28.040 --> 40:32.080
So yeah, again, because maybe the regular dump

40:32.080 --> 40:33.760
will take one hour.

40:33.760 --> 40:38.400
But depending on how many rules you have on your data set,

40:38.400 --> 40:41.320
the anonymized dump will take maybe three hours.

40:41.320 --> 40:42.080
All right?

40:42.080 --> 40:43.960
It does have a cost.

40:43.960 --> 40:44.320
OK.

40:47.960 --> 40:49.320
Yeah, thanks.

40:49.320 --> 40:51.640
I was wondering about determinism in the masking.

40:51.640 --> 40:55.400
So is it deterministic also across different databases?

40:55.400 --> 40:58.760
If I have the same input values in one of my databases,

40:58.760 --> 41:01.480
anonymized according to the same rule, will it end up

41:01.480 --> 41:03.200
fully consistent across my data?

41:03.200 --> 41:05.960
Yeah, actually, it depends.

41:05.960 --> 41:06.840
You choose.

41:06.840 --> 41:11.400
So you have some, where is it?

41:11.400 --> 41:12.200
All right.

41:12.200 --> 41:13.040
Yeah.

41:13.040 --> 41:17.520
So among these types of masking rules,

41:17.520 --> 41:19.480
some of them are deterministic.

41:19.480 --> 41:22.000
Meaning that we always return the same result

41:22.000 --> 41:24.520
for the same value, all right?

41:24.520 --> 41:27.800
But some are completely random.

41:27.800 --> 41:33.320
So basically, random and faking are always different.

41:33.320 --> 41:38.520
But if you want to have the same fake value across

41:38.520 --> 41:46.000
different databases, this is called pseudonymization.

41:46.000 --> 41:50.640
And so pseudonymization is not anonymization.

41:50.640 --> 41:52.240
It's not the same thing.

41:52.240 --> 41:57.760
Because there's a way to go back, because there's a deterministic,

41:57.760 --> 42:00.120
once there's a deterministic transformation,

42:00.120 --> 42:01.760
you can always go back.

42:01.760 --> 42:04.240
So you need to be very careful about that.

42:04.240 --> 42:10.360
And the GDPR regulation is very clear that if you have

42:10.360 --> 42:14.760
done pseudonymization, your data is still

42:14.760 --> 42:18.640
concerned by it; it's still personal data, actually.
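
NOTE
A minimal Python sketch of why a deterministic transformation can be walked back,
as explained above. It is not how the extension pseudonymizes: the keyed hash
(HMAC-SHA256), the salt value and the names are assumptions chosen for illustration,
and the sketch returns an opaque token instead of a fake name.
```python
import hashlib
import hmac
def pseudonymize(value, salt):
    """Deterministic keyed hash: same value and same salt always give the same token."""
    return hmac.new(salt.encode(), value.encode(), hashlib.sha256).hexdigest()
salt = "hypothetical-shared-salt"
token_db1 = pseudonymize("Alice", salt)  # computed on one database
token_db2 = pseudonymize("Alice", salt)  # computed on another database
# Anyone holding the salt can rebuild the mapping from a list of candidate values.
candidates = ["Alice", "Bob", "Carol"]
lookup = {pseudonymize(name, salt): name for name in candidates}
recovered = lookup[token_db1]
```
Consistency across databases and reversibility come from the same property, which is
why the salt has to be protected like any other secret.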

42:22.440 --> 42:24.960
My question is a bit of a follow-up.

42:24.960 --> 42:29.240
The TABLESAMPLE functionality:

42:29.480 --> 42:33.000
will that always return the same n elements,

42:33.000 --> 42:37.400
or is it a completely random sample on each and every call?

42:37.400 --> 42:39.720
No, it's a random sample.

42:39.720 --> 42:40.880
It's random by default.

42:40.880 --> 42:43.840
You can't know which ones you'll get.

42:43.840 --> 42:45.400
If you want something deterministic,

42:45.400 --> 42:47.320
you need something like that.

42:47.320 --> 42:51.560
So if you want to use that function to limit the accessibility

42:51.560 --> 42:55.600
of some analyst, if he calls the function often

42:55.600 --> 42:57.680
enough, he can still get your full data set.

43:00.960 --> 43:03.160
Yeah, that's a good question.

43:03.160 --> 43:07.200
So this is linked to another topic

43:07.200 --> 43:10.240
called differential privacy, where at some point

43:10.240 --> 43:13.080
with some of these, actually, you have this also

43:13.080 --> 43:14.200
when you have noise.

43:14.200 --> 43:22.200
If you put noise upon, for example, a date of birth,

43:22.200 --> 43:28.320
the masked role can ask for the date of birth multiple times.

43:28.320 --> 43:32.520
And the noise will be applied with the same

43:32.520 --> 43:34.360
random function upon it.

43:34.360 --> 43:38.800
So he's going to find out the truth at some point,

43:38.800 --> 43:42.440
just by doing the mean of all the results he got.

43:42.440 --> 43:46.120
So this is an area we're working on, which is called

43:46.120 --> 43:51.360
differential privacy, but for now, yes,

43:51.360 --> 43:56.040
you need to somehow track how many times your masked role

43:56.040 --> 44:00.640
accesses the data and how close it is getting

44:00.640 --> 44:03.760
to the truth.
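
NOTE
A minimal Python sketch of the averaging attack described above. The secret value,
the uniform noise and its spread are assumptions for illustration; the extension's
noise functions may behave differently. A single noisy answer hides the value, but
the mean of many answers converges back to it.
```python
import random
from statistics import mean
TRUE_BIRTH_YEAR = 1984  # hypothetical secret value behind the mask
def noisy_birth_year(rng, spread=10):
    """Return the secret shifted by fresh uniform noise in [-spread, +spread]."""
    return TRUE_BIRTH_YEAR + rng.uniform(-spread, spread)
rng = random.Random(42)  # fixed seed so the sketch is reproducible
single_answer = noisy_birth_year(rng)  # one query: up to 10 years off
estimate = mean(noisy_birth_year(rng) for _ in range(10_000))  # many queries
error = abs(estimate - TRUE_BIRTH_YEAR)
```
With 10,000 queries the error shrinks to a small fraction of a year, which is why
the number of queries a masked role can run matters.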

44:03.760 --> 44:06.760
Just a little addition on the table sample.

44:06.760 --> 44:08.560
It supports seeding.

44:08.560 --> 44:13.080
So it's possible, maybe it's possible to repeat

44:13.080 --> 44:14.600
at least the same data set.

44:14.600 --> 44:15.280
Yeah, sure.

44:15.280 --> 44:19.240
Yeah, you can define a seed, and so the seed will be

44:19.240 --> 44:23.200
used. It's a good point.

44:23.200 --> 44:25.640
Great talk.

44:25.640 --> 44:29.840
I wanted to ask if someone has an already established database

44:29.840 --> 44:32.080
that is not anonymized at all.

44:32.080 --> 44:35.640
What's the best approach to go about that?

44:35.640 --> 44:39.960
Yeah, so the question is what to do when the database

44:39.960 --> 44:41.160
is already in production?

44:41.160 --> 44:45.440
Yeah, it's the worst scenario, because you need to look

44:45.440 --> 44:48.440
at the database model and try to guess what

44:48.440 --> 44:50.360
the developers tried to do.

44:50.360 --> 44:56.560
So if you can just try to find the people that made it,

44:56.560 --> 44:59.280
otherwise, yeah, just privacy by default,

44:59.280 --> 45:02.960
just turn it on and everything is private.

45:02.960 --> 45:06.360
Can I ask another question?

45:06.360 --> 45:11.680
The sampling of the database, how does it work with foreign keys

45:11.680 --> 45:12.360
and relationships?

45:12.360 --> 45:14.680
Yeah, that's a very, very good question.

45:14.680 --> 45:18.320
So basically, the extension does not guarantee

45:18.320 --> 45:20.840
you that foreign keys will be respected.

45:20.840 --> 45:22.640
So you can break it.

45:22.640 --> 45:23.640
Yeah, it's up to you.

45:23.640 --> 45:25.920
You can check it or you can destroy it.

45:25.920 --> 45:29.000
But yeah, it's up to you to decide whether or not

45:29.000 --> 45:33.240
you want to keep a referential integrity, right?

45:33.240 --> 45:35.400
Because in some cases, it's important to break

45:35.400 --> 45:36.880
referential integrity.

45:36.880 --> 45:40.080
But for some use cases, you want to keep it.

45:40.080 --> 45:42.160
It really depends, actually.

45:42.160 --> 45:44.160
So there's no default rule about that.

45:44.160 --> 45:49.760
It's like a NOT NULL... yeah, in this example,

45:49.760 --> 45:54.480
for example, there's a NOT NULL constraint on the ID.

45:54.480 --> 45:58.600
But I'm putting a NULL value, all right?

45:58.600 --> 46:02.400
And so the user will get a result that does

46:02.400 --> 46:04.800
not respect the database schema.

46:04.800 --> 46:07.000
But I do it on purpose.

46:07.000 --> 46:09.240
Of course, yes.

46:09.240 --> 46:11.080
OK, thank you for the talk over here.

46:12.160 --> 46:16.600
Did you measure the overhead and performance impacts

46:16.600 --> 46:18.040
of the extension?

46:18.040 --> 46:21.120
Because if you do it on the fly, it has to.

46:21.120 --> 46:22.320
Yeah.

46:22.320 --> 46:26.720
So the overhead really depends on different factors.

46:26.720 --> 46:29.600
So there's the number of lines, the number of rules you have,

46:29.600 --> 46:31.560
and the complexity of rules.

46:31.560 --> 46:35.440
Of course, this kind of rule is very fast.

46:35.440 --> 46:36.880
Destroying is fast.

46:36.880 --> 46:40.040
But generating a random value is very slow.

46:42.520 --> 46:47.920
But the thing to understand is that with different strategies,

46:47.920 --> 46:50.560
the cost is not paid by the same users.

46:50.560 --> 46:54.280
With dynamic masking, the cost is paid by the masked role.

46:54.280 --> 46:56.280
So you can say, OK, you're a masked role, you're

46:56.280 --> 46:58.120
going to pay for this.

46:58.120 --> 47:00.960
So you're going to have a very, very slow result

47:00.960 --> 47:04.480
whereas the normal user gets regular performance.

47:04.480 --> 47:06.800
Whereas, if you do anonymized dumps,

47:06.800 --> 47:10.720
the price is paid by everyone, right?

47:10.720 --> 47:14.800
So it depends on who pays the cost.

47:14.800 --> 47:18.960
In the use case of a masked role, why would I,

47:18.960 --> 47:23.120
when do I need to use a masked role instead of simply

47:23.120 --> 47:26.840
restricting access to some columns?

47:26.840 --> 47:30.880
So I could grant a data scientist from your example

47:30.880 --> 47:34.280
with access to the required columns only.

47:34.280 --> 47:37.760
And just restrict access to the columns, which you actually

47:37.760 --> 47:39.440
mask.

47:39.440 --> 47:43.120
Maybe I don't see the whole picture here.

47:43.120 --> 47:47.520
When do I need to mask data instead of simply

47:47.520 --> 47:50.480
restricting access to this data?

47:50.480 --> 47:55.720
Oh, but maybe because this data scientist, for example,

47:55.720 --> 48:02.600
he needs to run some queries upon... sorry, I went too fast.

48:02.600 --> 48:04.320
Now, this one, yeah.

48:04.320 --> 48:09.000
So maybe he wants to run stats over the years.

48:09.000 --> 48:13.160
And he wants to make some stats about the HTTP logs.

48:13.160 --> 48:15.560
So he needs the year.

48:15.560 --> 48:18.160
He's going to group by year.

48:18.160 --> 48:24.160
So you can just erase the day and the month,

48:24.160 --> 48:25.840
but you keep the year, right?

48:25.840 --> 48:26.680
I see.

48:26.680 --> 48:28.920
So that's called generalization.

48:28.920 --> 48:30.840
And it's enough for him.
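
NOTE
A minimal Python sketch of the generalization described above: the day and the month
are erased, the year is kept, and a group-by-year statistic still works. The sample
dates and the function name are hypothetical; this is not the extension's code.
```python
from collections import Counter
from datetime import date
def generalize_to_year(d):
    """Erase the day and the month, keep the year."""
    return d.replace(month=1, day=1)
# Hypothetical request dates from an HTTP log table.
requests = [
    date(2022, 3, 14),
    date(2022, 11, 2),
    date(2023, 1, 9),
    date(2023, 6, 30),
    date(2023, 12, 24),
]
generalized = [generalize_to_year(d) for d in requests]
# The masked user can still group by year on the generalized values.
hits_per_year = Counter(d.year for d in generalized)
```
Simply revoking access to the date column would make this yearly statistic impossible.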

48:30.840 --> 48:35.000
So he has access, but he has a different access.

48:35.000 --> 48:38.480
So when even more granular control is required,

48:38.480 --> 48:40.680
yes, like inside the data, for example, one part

48:40.680 --> 48:42.560
is visible, but another is not.

48:42.560 --> 48:43.280
Yeah.

48:43.280 --> 48:44.120
OK, thank you.

48:44.120 --> 48:45.920
And you can write your own, of course.

48:45.920 --> 48:51.360
So the main use case for this is NoSQL and JSON data.

48:51.360 --> 48:53.520
Because if you have JSON data,

48:53.520 --> 48:55.640
it's going to be a nightmare to mask,

48:55.640 --> 48:59.160
because there's no schema.

48:59.160 --> 49:05.400
And so probably you're going to have to write your own functions

49:05.400 --> 49:11.680
to go deep inside the JSON values to modify them.

49:11.680 --> 49:14.440
You mentioned that indexes are not masked.

49:14.440 --> 49:18.320
But if I'm a mask user, and I do a query,

49:18.320 --> 49:21.840
that says select from users where last name

49:21.840 --> 49:26.520
is Connor, would I get Sara Spellman or would

49:26.520 --> 49:28.280
the query actually run on the anonymized data?

49:28.280 --> 49:32.080
OK, the result, if you have a non-deterministic value,

49:32.080 --> 49:33.600
the result of the query will change every time.

49:33.600 --> 49:35.080
Yes.

49:35.080 --> 49:37.320
The result will change every time.

49:37.320 --> 49:38.520
Again and again.

49:38.520 --> 49:42.480
No, I mean, the queries are actually going to match

49:42.480 --> 49:45.520
against the masked values, not against the original value.

49:45.520 --> 49:47.720
So right.

49:47.720 --> 49:50.800
Yeah, no, you're going to match it against the masked value,

49:50.800 --> 49:51.760
not the real value.

49:51.760 --> 49:52.280
OK.

49:52.280 --> 49:53.200
So it won't work.

49:53.240 --> 49:57.960
You cannot... if it's where name is Connor, no, it won't work.

49:57.960 --> 49:59.840
But I need to go actually.

49:59.840 --> 50:01.000
Just take a picture.

50:01.000 --> 50:02.000
Yes.

50:02.000 --> 50:03.360
Thank you very much.

50:08.360 --> 50:12.440
I'll be outside if you have questions.

50:12.440 --> 50:13.040
Sorry.

50:13.040 --> 50:13.840
Privacy.

50:13.840 --> 50:15.080
It's amazing.

