WEBVTT

00:00.000 --> 00:11.800
Okay, hello everybody, nice to see all of you interested in the topic.

00:11.800 --> 00:18.440
My name is Alenka, I'm a software developer, I like to be, I'm also an engineer if needs

00:18.440 --> 00:19.440
to be.

00:19.440 --> 00:27.300
Data engineer. My friend Rok is also here with me; we're all interested in Arrow, working

00:27.300 --> 00:36.180
on Arrow as committers, mostly Python; Rok is also involved in C++ and Parquet.

00:36.180 --> 00:43.380
So before we start, I would like to also get to know all of you a bit, with exercising

00:43.380 --> 00:44.740
our limbs.

00:44.740 --> 00:50.340
So if you're involved in data, if you work with data as an engineer, analyst, whatever,

00:50.340 --> 00:53.540
can you raise your hand up.

00:53.540 --> 01:01.060
Okay, keep your hands up and raise the second one if you know or if you use PyArrow in any way,

01:01.060 --> 01:09.460
through pandas or directly. That's very high, okay, that's good, thank you. And all of the others

01:09.460 --> 01:14.180
that didn't raise your hands, is it correct for me to assume that you're here because you

01:14.260 --> 01:21.740
heard of Arrow and you're not really sure what it is? Yeah, okay.

01:21.740 --> 01:27.980
So our goal here is, if you know PyArrow, to get to know it better and to see what you can do

01:27.980 --> 01:35.820
with it, extra, and otherwise to kind of get a glimpse of Arrow, like what is it about?

01:35.820 --> 01:41.460
We hear a lot about it and we don't use it, let's get to know it a bit better.

01:41.460 --> 01:48.860
So I'm going to take the introduction, as I think it will be necessary, in its essence,

01:48.860 --> 01:55.660
Arrow is a format, it's a specification for a columnar format, so if you have tabular data,

01:55.660 --> 02:02.420
it defines how you have it stored in memory, column by column versus row by row, so it's

02:02.420 --> 02:04.380
columnar, right?

02:04.380 --> 02:11.100
It works well for analytics and some other complicated words like SIMD operations

02:11.100 --> 02:16.860
and all that, but it's really, really good for analytical workloads, it's not designed

02:16.860 --> 02:20.540
for fetching rows of data, okay?

02:20.540 --> 02:26.460
So here you have a diagram to compare how it's stored in memory, like row by row, or the

02:26.460 --> 02:32.020
second one column by column, so it's together, columns are together in memory.

02:32.020 --> 02:38.020
It comes not only with the specifications, so the implementations are done in multiple languages,

02:38.060 --> 02:44.500
C++; Python has a binding to the C++ one; you have a Rust implementation, etc.

02:44.500 --> 02:49.860
And you can just move around between languages in the same process and also you can move

02:49.860 --> 02:59.260
around outside, through the wire, and so if every part of your workload understands Arrow,

02:59.260 --> 03:05.060
it's fast and it's efficient and it's zero copy, well, not necessarily if you go through

03:05.060 --> 03:09.540
the wire, but it's really efficient if everybody knows Arrow.

03:09.540 --> 03:14.500
So yeah, the implementations implement the spec, but also other functionality built on top

03:14.500 --> 03:19.940
of this spec for efficient storage, etc.

03:19.940 --> 03:25.700
So we got a bit of an understanding of what Arrow is, so in essence it's a way to store data,

03:25.700 --> 03:27.780
so it's more efficient.

03:27.780 --> 03:33.540
PyArrow is a Python binding to the C++ implementation, so whatever you do in C++,

03:33.620 --> 03:38.980
we take it, we connect it, and we use it in Python, with maybe some extra stuff.

03:38.980 --> 03:47.620
It works really well with NumPy, with pandas, or with built-in Python objects.

03:47.620 --> 03:52.180
And, the same as other implementations, it also has other functionality.

03:52.180 --> 03:58.900
For example, most of you who use PyArrow, I would assume you use it for

03:59.060 --> 04:01.060
Parquet, probably.

04:03.060 --> 04:08.100
Okay, so let's start with the first topic that we have, like we will share knowledge about

04:08.100 --> 04:14.820
four topics of PyArrow: one is array interchange, one is storage, the third one is compute,

04:14.820 --> 04:21.620
and then the final one will be Flight. So first, array interchange; let's start with what an

04:21.700 --> 04:28.900
array is: it's a building block in Arrow. An array has the same data type for all the values

04:28.900 --> 04:34.980
and a known length, so it's the array stored together in memory, and then you can put

04:34.980 --> 04:40.340
arrays together in a record batch, which would be like a data frame, if you know Python data frame

04:40.340 --> 04:48.340
libraries. But you can also stack together arrays into a chunked array, and then get those chunked

04:48.420 --> 04:52.740
arrays as columns, and you get a table, which is like a data frame, but chunked.

04:53.860 --> 04:56.580
Okay, so the arrays are the smallest building blocks.

04:58.820 --> 05:07.860
Not the timer? Okay, this is a diagram of what an array would look like in memory.

05:09.060 --> 05:13.860
An example is for Boolean and string data types, which is not the simplest example,

05:14.820 --> 05:22.820
but just to see the various variables, the difference. So for every column, every array,

05:22.820 --> 05:31.700
you have different buffers that live in memory of your computer. For Boolean, you have two

05:31.700 --> 05:39.540
bitmaps, because in Arrow that's a special case, it uses bitmaps for Boolean. If it were

05:39.620 --> 05:47.460
a primitive, it would have one values buffer, and then a bitmap buffer to indicate null values in the array.

05:48.340 --> 05:54.980
The string type has an extra offsets buffer to know where each string begins and ends.

05:54.980 --> 06:00.420
Okay, so that's kind of a general view of how arrays are stored in memory.

06:00.420 --> 06:07.380
If you want to learn more, there's a link with all the different diagrams for all the functions for

06:08.260 --> 06:17.220
sure you know from it arrays. Yes, anything? Okay, so let's go to the interchange,

06:18.420 --> 06:26.100
done by PyArrow. If you have a NumPy array, you can go to a PyArrow array by just creating an array

06:26.100 --> 06:33.940
out of the NumPy one, and you can check that it has the same pointer to the buffer, the values buffer.

06:34.020 --> 06:41.940
So that means that NumPy holds the same data, the same part of memory, as PyArrow. So there's no

06:41.940 --> 06:47.940
copy, that means zero copy. You can just use NumPy, you can use PyArrow on the same data without

06:47.940 --> 06:55.700
any extra copying. So you have a NumPy array, we turn it into a PyArrow array, and then we

06:55.780 --> 07:05.940
went to pandas, why not? Okay, so we got the pandas Series, we check the buffer, so the pointer to the

07:05.940 --> 07:12.900
buffer through the array interface is the same. We can go from PyArrow to Polars.

07:13.860 --> 07:20.260
Again, check the buffer, it's still the same. So all these Python libraries operate on the same

07:20.340 --> 07:27.540
data, and you can use all of this functionality from different libraries. You can also use

07:27.540 --> 07:32.820
DataFusion, not with an array, but you can stack arrays together in a data frame and move it

07:32.820 --> 07:41.860
to DataFusion, or DuckDB; you can go to nanoarrow, to Narwhals, to arro3, which is a Python

07:41.940 --> 07:52.340
binding to the Arrow Rust implementation, et cetera. So all zero copy. The previous

07:53.700 --> 08:01.860
interchange is built on top of the C Data Interface; it's kind of a way that the Arrow project

08:02.580 --> 08:12.020
supports other libraries, moving data, or getting access to the data without copying.

08:12.020 --> 08:19.620
Okay, but you also have other protocols, for example DLPack, and you can also go from PyArrow

08:19.620 --> 08:28.820
through DLPack to Torch without copying. Okay, so that's kind of a short view of arrays

08:28.900 --> 08:34.260
in PyArrow, and how you can move to different libraries and use all of those without copying.

08:35.460 --> 08:47.300
The next part is storage in PyArrow. Thank you. Yeah, so of course data does not always live only in memory.

08:50.100 --> 08:57.700
So sometimes you want to store it places, and of course, having column-oriented data, then

08:58.420 --> 09:05.060
it's nice to have a fitting column oriented serialization format, which is what

09:05.060 --> 09:12.260
Parquet is designed to do. It's designed to be efficient for storage and retrieval, right, both,

09:12.260 --> 09:20.180
and it provides efficient compression and encoding schemes to handle data in bulk.

09:20.340 --> 09:28.980
It provides various other things, such as encryption, and it comes with decent amount of

09:28.980 --> 09:36.820
metadata to enable certain optimizations. And PyArrow provides an implementation, or rather PyArrow

09:36.820 --> 09:45.300
wraps the C++ implementation of the Parquet reader and writer. Here is a very tiny image of

09:45.700 --> 09:52.700
how data looks on disk when it's serialized. Unfortunately, it's unreadable, but basically

09:52.700 --> 10:01.700
the blue parts are metadata, and then you see the boxes on the left are pages of data stored on disk.

10:01.700 --> 10:09.860
And here's an example of how one could use Parquet to store images on disk. Now, we recently

10:09.940 --> 10:16.660
introduced to Arrow and also PyArrow a fixed shape tensor format. And here we have an example where

10:16.660 --> 10:25.860
we read some images. We take their NumPy representations. We zero copy those to a table,

10:25.860 --> 10:32.660
right, to well, first to an array of tensors. Then we put that into a table. Again, a zero

10:32.660 --> 10:39.140
copy operation, and we finally write everything to Parquet. Okay, here we don't compress, but we

10:39.860 --> 10:44.420
could. And then we read the table back, and the table is still the same. Of course, in this case,

10:44.420 --> 10:52.180
we wrote it once, so we serialized and then we deserialized back. And we hope that, kind of in the

10:52.740 --> 10:58.340
day and age of people doing a lot of deep learning and stuff, they might start using this.

10:59.380 --> 11:05.460
I think Hugging Face actually uses fixed shape tensor for some of their internal things.

11:06.020 --> 11:14.180
So here's another example of what you can do with Parquet, because there's metadata attached.

11:14.180 --> 11:21.300
You can check it, check whether it fits certain Boolean conditions. And if it does,

11:21.860 --> 11:29.460
you read pages that do and you omit reading others, which locally, of course, helps,

11:29.460 --> 11:36.420
but especially it's useful for remote reads. So in this case, a reader of a table that was

11:36.420 --> 11:43.620
stored onto disk will omit certain rows, right, and this is actually building on top of the

11:43.620 --> 11:50.340
first example here. It's only reading the second row of the data, and you can see that's the result

11:50.420 --> 11:59.060
indeed. It can really be useful if you read from remote, it will reduce your loads.

12:00.820 --> 12:09.540
Further, for reading and writing remote, this is a working example of writing to an

12:09.540 --> 12:15.860
S3 bucket. It will work with other object storages and self-hosted ones too.

12:17.700 --> 12:24.660
What you need to provide is AWS credentials, or some sort of credentials for your remote system, and then you

12:24.660 --> 12:33.380
basically just write to the URI and read back. Yeah, and here, right, because of the network traffic

12:34.100 --> 12:43.060
optimizations with metadata come in really handy. Further, I mentioned encryption earlier.

12:43.860 --> 12:51.460
I hope this is approximately, yeah, I guess in the back you can't read it. But basically, this is an

12:51.460 --> 13:01.060
example of working encryption. So everything is on this slide. Sorry again, it got kind of big,

13:01.060 --> 13:10.660
but it does work, I promise. So the way we implement modular encryption in Parquet

13:10.660 --> 13:16.660
is you have a key management system, so that can be, an open source example is HashiCorp

13:16.660 --> 13:24.020
Vault. It keeps your keys, right, you provide the KMS connector, a key management connector.

13:24.340 --> 13:34.980
You create an encryption configuration that tells you which keys to use for which columns,

13:34.980 --> 13:39.300
and which keys to use to encrypt the footer, the footer only contains the metadata,

13:39.300 --> 13:42.900
but it's important because you have statistics there and you can leak some data out of it.

13:44.740 --> 13:50.980
Then you have a crypto factory that takes the key management system and it takes the configuration

13:50.980 --> 13:57.540
properties, so the mapping of keys, and finally you pass everything to a Parquet writer and it will

13:57.540 --> 14:04.820
encrypt your data to disk. So for instance, with all the past three slides, one could make like

14:05.540 --> 14:12.340
training, one could write an encrypted data set to S3, right, and read it back, and data at rest would

14:12.340 --> 14:21.380
be encrypted. And here is then reading back, this data, this is luckily a bit smaller,

14:22.740 --> 14:33.060
because you don't need to deal with encryption configuration. So you read back data that was encrypted.

14:33.060 --> 14:40.420
In this case, this example only encrypts one of the columns, the other one is not encrypted,

14:40.420 --> 14:47.700
so that means that you could also just read the non-encrypted one back and ignore the encrypted

14:47.700 --> 14:58.820
column and that would still work. Right, compute. Okay, back to me. The third part we want to show you

14:58.820 --> 15:09.940
is the compute module of PyArrow, again binding to C++. With the module, you can do basic operations

15:10.020 --> 15:16.580
on arrays, on scalars, you can do stuff with tables and datasets. If needed, for example,

15:16.580 --> 15:24.020
group-by aggregations, joins, and filtering. I'm going to show examples of basic,

15:25.300 --> 15:34.820
basic compute functions, but you have a reference for what we support here in the link on the slide.

15:35.780 --> 15:45.700
To note, PyArrow is not meant to be a library to give you compute functions for every need.

15:45.700 --> 15:50.900
It's more like the basic things. If you really need to do more than the minimal stuff, for other

15:50.900 --> 16:00.180
numerical operations, you use other Python libraries to facilitate that. So first, a small example

16:00.180 --> 16:11.060
is using sum, minimum, or maximum of an array, so we're back to arrays. So if you have an array

16:11.060 --> 16:17.540
and you just want to see the sum of an array, which is a column, you can use this compute

16:17.540 --> 16:26.500
basic compute function, and the same for min or for the others available. You can also do some transformations

16:26.500 --> 16:32.020
on arrays. For example, if you have two arrays, and you want to create another one. So another

16:32.020 --> 16:39.620
column that checks the equality of those two arrays, you would use the equal function of the compute

16:39.620 --> 16:48.420
module. You could also multiply a column with a scalar. So one number, multiplying all the

16:48.420 --> 16:55.780
elements in an array, or you can multiply two arrays to get another one. So like basic

16:55.860 --> 17:04.660
examples of how you would use the compute module in PyArrow. So this is the third part we wanted

17:04.660 --> 17:13.860
to show, the functionality built on top of the Arrow format, and the last one is Flight, for

17:14.820 --> 17:26.500
going over the wire. Thank you. So Flight RPC is kind of a specification for

17:26.500 --> 17:32.580
or how to use it as a specification, or is it like an implementation, a specification, okay?

17:34.580 --> 17:41.060
Arrow Flight is a specification for sending data through the wire using gRPC. So it's kind of like

17:41.220 --> 17:50.660
a thin wrapper on top of gRPC that gives you convenience methods to send columnar data on the wire.

17:52.580 --> 18:02.900
So gRPC brings some benefits, but also some complications. So we want to abstract those away from you.

18:03.220 --> 18:12.580
And basically, as it's an RPC framework, it gives you an option to define vocabulary for

18:12.580 --> 18:18.740
custom services, right? You can write, you can make your own endpoints, and call them;

18:18.740 --> 18:29.460
it comes with an ability to put everything under TLS. So your traffic will be encrypted,

18:30.020 --> 18:36.740
and the idea is that it would be used for Arrow-native storage services,

18:38.900 --> 18:46.420
connections to databases, etc. And then there's another thing built on top of that.

18:46.420 --> 18:54.020
It's Arrow Flight SQL, which enables sending SQL requests, so it reduces the functionality a little bit,

18:54.580 --> 19:01.940
but already defines the SQL part. All right, so this is just a diagram right about the difference

19:01.940 --> 19:11.380
between an RPC and a REST endpoint or a REST system. So instead of having these predefined

19:11.380 --> 19:19.620
words that you can use to pass data, you can actually write your own vocabulary for passing objects.

19:20.260 --> 19:31.220
It comes with its own downsides where you can really get into trouble, but it gives you a lot of power too.

19:32.100 --> 19:38.740
Right, so the way that communication works is a client would connect to the server,

19:38.740 --> 19:44.980
to get back the information about the endpoints that are available there, and then the client can

19:45.940 --> 19:52.020
issue a get command and then get data back in record batches from the server.

19:53.700 --> 19:58.820
I think it's possible to also parallelize that so you can have multiple servers returning the data.

20:00.500 --> 20:06.740
So I didn't want to include the example here because it would not fit in a single slide,

20:07.620 --> 20:14.100
but you're welcome to look at the cookbooks. We have multiple cookbooks for different language

20:14.180 --> 20:20.740
implementations, but this is the Python one. So there's a Flight example there,

20:20.740 --> 20:28.420
and some other stuff we covered already. So yeah, we're an open source project. We're currently

20:28.420 --> 20:35.060
community driven; PRs are very welcome. Also input on what people want,

20:35.540 --> 20:45.380
exactly; come open issues, ask. We have a biweekly call also where committers meet and discuss stuff,

20:45.380 --> 20:49.380
so you're welcome there as well. What else should we say?

20:52.020 --> 20:58.100
Questions, yes, questions. Please, we're lucky to have one here also. So if there's a complicated question,

20:58.100 --> 21:14.020
he can help. So do you have any question for Rok or Alenka? Yes, I'm going to run.

21:14.980 --> 21:20.660
No, yeah. Let's far up.

21:24.820 --> 21:31.540
Hello. Okay, yeah, thanks for the talk. One question about a here. Arrow IPC,

21:31.540 --> 21:36.740
so inter-process communication. Can you maybe explain a bit? Let's say in this scenario

21:36.740 --> 21:44.020
where we have Arrow data in a Python process for example, but we also want to access from Java or

21:44.020 --> 21:51.300
from Rust, are there ways to also kind of get zero copy in those situations? Yes, so you can

21:53.140 --> 22:04.340
basically IPC is just the way that the data is represented in memory, you just send it as a buffer

22:04.420 --> 22:09.700
as IPC, right? And you can indeed use it for inter-process communication.

22:10.820 --> 22:16.100
We didn't cover it here much because we had to limit the topics.

22:17.380 --> 22:24.580
And inter-language communication is pretty easy and I believe there are examples in the cookbooks.

22:26.180 --> 22:33.700
Another use for IPC is also, if you want to, you can store it to disk. That's now called a

22:33.780 --> 22:40.420
Feather file or an Arrow file, interchangeably a little bit, and it's basically just dumping the IPC

22:40.420 --> 22:47.380
directly to disk, and it gives you the benefit that you can memory map it so you can do

22:47.380 --> 22:55.060
stuff that's larger than memory. Yeah, I don't know. Cool, thanks. Thanks for the question. Anything else?

22:55.060 --> 22:56.020
Any other questions?

23:04.660 --> 23:10.980
Hi, thanks for the presentation. My question is about the static type analysis tools because like

23:10.980 --> 23:15.540
how does PyArrow deal with them? Because when I was working with pandas, it was a huge headache

23:15.540 --> 23:23.380
constantly. Yeah, yeah, as far as I understand there's people who want this and there was some

23:23.380 --> 23:32.020
POC in progress, yeah, I think there's actually a PR in there. We just need, you know,

23:32.980 --> 23:39.380
people to help and get it to the finish line. Yeah, we want that also. Yeah, yeah, indeed.

23:40.180 --> 23:46.820
Thanks so much then. Yeah. Thank you. Thank you. And if it goes, we're out of time I believe.

23:48.820 --> 23:52.820
Thank you so much. Thank you.