WEBVTT

00:00.000 --> 00:08.000
Hi, my name is Rob, this is Alenka here.

00:08.000 --> 00:15.000
We'll try to convince you that if you're using Arrow or Parquet or something in that ecosystem,

00:15.000 --> 00:23.000
you should try Apache, I mean, you should try tensor arrays to store your tensor data or multi-dimensional array data.

00:23.000 --> 00:27.000
So our talk will consist of briefly talking about Arrow extension types,

00:27.000 --> 00:31.000
then we'll talk about the fixed-shape tensor, the variable-shape tensor,

00:31.000 --> 00:37.000
a little bit about the integration with NumPy, and then finally DLPack.

00:37.000 --> 00:41.000
So what are Arrow extension types?

00:41.000 --> 00:48.000
So Arrow currently provides several data types, but of course, people always want more.

00:48.000 --> 00:56.000
So at some point we decided to add user extension types to enable people to build their own.

00:56.000 --> 01:02.000
However, because some extension types are more often used than others,

01:02.000 --> 01:05.000
we also decided to make a registry of well-known extension types,

01:05.000 --> 01:09.000
or rather canonical types that live in the Arrow namespace.

01:09.000 --> 01:16.000
And currently we have implemented arrow.fixed_shape_tensor,

01:16.000 --> 01:23.000
arrow.variable_shape_tensor, JSON, opaque, and 8-bit boolean.

01:23.000 --> 01:29.000
These are currently available specs for extension arrays.

01:29.000 --> 01:33.000
So let's first talk about the fixed-shape tensor.

01:33.000 --> 01:42.000
So, fixed-shape tensor arrays are multi-dimensional arrays of a certain type,

01:42.000 --> 01:48.000
and we represent them in Arrow with the fixed-size list,

01:49.000 --> 01:55.000
which means that we have an array where you can see this.

01:55.000 --> 02:04.000
Each row is a set of values, and there need to be as many values as you get

02:04.000 --> 02:08.000
if you multiply all the shape numbers, right?

02:08.000 --> 02:13.000
So if you have a 2 by 2 tensor, you should have 4 values there.

02:13.000 --> 02:23.000
And besides the data itself, we also carry metadata, which is the shape of these tensors.

02:23.000 --> 02:29.000
And, actually, we also bring dimension names and permutation,

02:29.000 --> 02:37.000
so permutation kind of helps you to calculate strides of the tensors out of the shape.

02:38.000 --> 02:40.000
Yeah, basically that's it.

02:40.000 --> 02:45.000
Data itself is stored in this fixed-size list,

02:45.000 --> 02:51.000
in what we call C-contiguous or row-major order.

02:51.000 --> 02:56.000
So that means that you can slice the array, right?

02:56.000 --> 03:02.000
And each of these rows is C-contiguous data for a tensor.

03:02.000 --> 03:07.000
So every cell in this array is a tensor by itself, right?

03:07.000 --> 03:14.000
So here's an example of how the metadata of such a tensor array is serialized.

03:14.000 --> 03:19.000
So we have shape, we have the names of dimensions, and then the permutation of dimensions.

03:19.000 --> 03:25.000
So, out of the permutation, again, you can calculate the strides of the tensors.
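
NOTE
A minimal sketch of the stride calculation described above, in plain Python.
It assumes the shape in the metadata is the stored (physical) row-major shape
and that logical dimension i maps to physical dimension permutation[i]; the
function names and the example values are illustrative, not from the slides.
```python
def row_major_strides(shape):
    # Strides in elements (not bytes) for a C-contiguous layout.
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides
def logical_view(shape, permutation):
    # Reorder the physical shape and strides by the permutation.
    physical = row_major_strides(shape)
    logical_shape = [shape[p] for p in permutation]
    logical_strides = [physical[p] for p in permutation]
    return logical_shape, logical_strides
shape, strides = logical_view([100, 200, 500], [2, 0, 1])
```
No data is moved: only the shape and strides of the view change.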

03:26.000 --> 03:31.000
Now, on to the second one, the variable-shape tensor array.

03:31.000 --> 03:34.000
So this case is a little bit more complicated.

03:34.000 --> 03:39.000
Every row of the array is a struct,

03:39.000 --> 03:44.000
because not only do we carry the data of the tensors,

03:44.000 --> 03:47.000
we also need to carry its shape.

03:47.000 --> 03:51.000
So the way that struct arrays are done in Arrow,

03:51.000 --> 03:55.000
it basically means you have two child arrays next to each other.

03:55.000 --> 03:57.000
Correct me if I'm wrong guys.

03:57.000 --> 03:59.000
Yeah? Yeah, okay.

03:59.000 --> 04:04.000
So these are, yeah, we carry them.

04:04.000 --> 04:09.000
And one contains the actual data.

04:11.000 --> 04:12.000
Was I muted?

04:12.000 --> 04:13.000
Yes.

04:13.000 --> 04:14.000
Oh.

04:14.000 --> 04:16.000
There goes our promotions.

04:16.000 --> 04:20.000
Hi stream, nice to meet you.

04:20.000 --> 04:24.000
So yeah, the variable.

04:24.000 --> 04:30.000
So first we have the data as the first child array, and the second child array

04:30.000 --> 04:34.000
carries the shape of each individual tensor.

04:34.000 --> 04:42.000
Besides that, we also carry, of course, the dimension names,

04:43.000 --> 04:50.000
permutations, again to calculate strides out of the shapes.

04:50.000 --> 04:52.000
And then we have this uniform shape.

04:52.000 --> 05:01.000
So this is for the case where only one of the dimensions changes in the data.

05:01.000 --> 05:08.000
Let's say you don't need to read the whole shape for every row.

05:08.000 --> 05:12.000
And data is also stored in row-major or contiguous,

05:12.000 --> 05:14.000
C-contiguous order.

05:14.000 --> 05:19.000
So here's an example of the uniform shape parameter.

05:19.000 --> 05:25.000
You can see that here the convention is that the first,

05:25.000 --> 05:28.000
the first value of shape will always be 400.

05:28.000 --> 05:33.000
The second one can change, and the third one will always be three.

05:33.000 --> 05:37.000
So when the reader or writer, when the reader works with this,

05:37.000 --> 05:42.000
they don't need to read the first and third parameter.

05:42.000 --> 05:44.000
Yeah, oh yeah.

05:44.000 --> 05:49.000
And here the shape changes from row to row,

05:49.000 --> 05:52.000
but the number of dimensions always stays the same.

05:52.000 --> 05:55.000
That was kind of the design decision that we made.
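
NOTE
A plain-Python sketch of the variable-shape layout described above: one entry
per row holding the flattened data and the shape, plus a uniform_shape with
None on the dimension that may change. The dictionaries and check_row are
illustrative stand-ins, not the Arrow API.
```python
from math import prod
# 400 and 3 are fixed for every row; the middle dimension may vary.
uniform_shape = [400, None, 3]
rows = [
    {"data": [0] * (400 * 2 * 3), "shape": [400, 2, 3]},
    {"data": [0] * (400 * 5 * 3), "shape": [400, 5, 3]},
]
def check_row(row, uniform_shape):
    shape = row["shape"]
    # The number of dimensions is the same for every row.
    if len(shape) != len(uniform_shape):
        return False
    # The flattened data must hold exactly prod(shape) values.
    if len(row["data"]) != prod(shape):
        return False
    # Uniform dimensions must match; None marks a ragged dimension.
    return all(u is None or u == s for u, s in zip(uniform_shape, shape))
all_valid = all(check_row(row, uniform_shape) for row in rows)
```
With uniform_shape known up front, a reader only has to look at the ragged
dimension of each row's shape.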

05:55.000 --> 06:00.000
Yeah, and over to you, Alenka.

06:00.000 --> 06:05.000
Okay, so this is strange because this is for video.

06:05.000 --> 06:08.000
And I have to.

06:08.000 --> 06:14.000
This canonical extension type, the fixed-shape one, is implemented in C++,

06:14.000 --> 06:17.000
Arrow C++, and it has bindings in Python.

06:17.000 --> 06:20.000
So we can check it out and play with it.

06:20.000 --> 06:24.000
I thought it would be nice to have an example in Python because it's nice to visualize.

06:24.000 --> 06:29.000
Here we import, we define which type we want.

06:29.000 --> 06:32.000
This is the extension type that's already in PyArrow.

06:32.000 --> 06:40.000
To use it, you have to tell it which data type you need, and you tell it the shape of an individual tensor.

06:40.000 --> 06:44.000
Then you give it the data from wherever you need it.

06:44.000 --> 06:49.000
And you define the storage type, which is, as we saw, the list.

06:49.000 --> 06:57.000
It has to have the same data type and the length of the list has to match.

06:57.000 --> 07:05.000
So then you define the extension array with this method.

07:05.000 --> 07:11.000
So you go from storage, which is the list created out of the data.

07:11.000 --> 07:15.000
And then you give it the tensor type you need.

07:15.000 --> 07:16.000
Thank you.

07:16.000 --> 07:20.000
I had to trim it out a little bit, so it would be visible.

07:20.000 --> 07:24.000
This is how the object looks in PyArrow if you print it out.

07:24.000 --> 07:32.000
So there's a list, and each element of the list is a tensor, an element of the array of the tensor type.

07:32.000 --> 07:39.000
Now if you go to NumPy, you can see it's an ndarray with the shape (4, 2, 2).

07:39.000 --> 07:47.000
And if you go back, the four in the shape is the length of the array.

07:47.000 --> 07:51.000
Okay, is that clear?

07:51.000 --> 07:55.000
Yeah, so the first dimension is the length of the array.

07:55.000 --> 08:06.000
And then you have individual elements, which are the individual rows, the tensors in PyArrow, when you go to NumPy.

08:06.000 --> 08:12.000
Yeah, so these are the individual tensors in the PyArrow array.

08:12.000 --> 08:17.000
Okay, and then you can also have a NumPy ndarray and go back.

08:17.000 --> 08:20.000
So you go forward.

08:20.000 --> 08:23.000
Another one, another one.

08:23.000 --> 08:24.000
Yes.

08:24.000 --> 08:31.000
So you could also have a NumPy array and go back to the PyArrow tensor array.

08:31.000 --> 08:34.000
Okay, so you can go back and forth between them.

08:34.000 --> 08:37.000
Okay, so that's it for the example.

08:37.000 --> 08:39.000
I hope that was useful.

08:40.000 --> 08:47.000
I would like to take a minute or so for DLPack. DLPack is a protocol.

08:47.000 --> 08:55.000
And it enables interchange between Python libraries that have arrays, the array libraries or tensor libraries.

08:55.000 --> 08:58.000
It's device-aware.

08:58.000 --> 09:01.000
So the data can live on CPU or GPU.

09:01.000 --> 09:08.000
It's aware of that, and it's meant to be a zero-copy interchange.

09:08.000 --> 09:19.000
We would like to have this nicely implemented in Arrow and PyArrow also. For now it's implemented in PyArrow for Arrow arrays.

09:19.000 --> 09:24.000
Only to produce. We would like to have the consumption also.

09:24.000 --> 09:34.000
And we would like to connect these extension tensor arrays with these methods to have seamless interchange with other Python libraries that use tensors.

09:34.000 --> 09:38.000
For example, CuPy, TensorFlow, PyTorch, etc.

09:38.000 --> 09:40.000
Okay, so now it's implemented for arrays.

09:40.000 --> 09:45.000
You can go from a PyArrow array to any of those.

09:45.000 --> 09:48.000
So consumption you can.

09:48.000 --> 09:51.000
Sorry, to produce. You could go into any of those.

09:51.000 --> 09:57.000
But not go back, and it's not implemented for the extension arrays yet, but that's what we would like to do.

09:57.000 --> 10:02.000
If there's any wish for that, a thumbs up would be awesome.

10:02.000 --> 10:05.000
Thank you.

