WEBVTT

00:00.000 --> 00:08.000
OK, so let's start.

00:08.000 --> 00:11.840
It's already after noon, so we can talk about beer.

00:11.840 --> 00:18.120
So I want to present today a talk on how you can use Postgres to create a recommendation

00:18.120 --> 00:19.120
system.

00:19.120 --> 00:24.600
The idea is to use beer as an example, but if you don't like beer, you can easily prepare

00:24.600 --> 00:29.840
something like that for juice, tea, places, movies, etc.

00:29.840 --> 00:36.960
So I'll also cover the general concept, and then you can use it yourself to start your own simple

00:36.960 --> 00:38.720
recommendation system.

00:38.720 --> 00:43.280
So, first slide, about me: my name is Andrzej Nowicki, I'm from Poland.

00:43.280 --> 00:50.720
I have roughly 12 years of experience as a DBA, mostly Oracle, but for the past eight years I've

00:50.720 --> 00:58.840
also been doing Postgres administration, and I've been a database engineer at CERN since 2020.

00:58.840 --> 01:05.280
So yeah, I've also contributed a little bit to Postgres.

01:05.280 --> 01:12.920
In 2018, I found a memory leak in the replication manager by EDB.

01:12.920 --> 01:18.040
So there is proof that I actually contributed a little bit, but yeah, I wouldn't

01:18.040 --> 01:19.440
call myself a contributor.

01:19.440 --> 01:21.440
It was just very simple.

01:21.440 --> 01:24.680
So, CERN: you have probably heard about it.

01:24.720 --> 01:30.440
We are the European Organization for Nuclear Research.

01:30.440 --> 01:34.040
We host, I think, the biggest laboratory in the world.

01:34.040 --> 01:38.920
So here is a short video to show you what we have.

01:38.920 --> 01:48.600
In terms of the IT department, we are roughly 400 people, and there are, I think,

01:48.600 --> 01:50.800
more than 10 people here from CERN.

01:50.800 --> 01:54.680
So we have a booth, as well, if you want to check it out, I will also mention the booth

01:54.680 --> 01:59.040
at the very end, if you want to know where it is.

01:59.040 --> 02:04.800
And of course, for the amount of data that we are creating at CERN, we need to have databases.

02:04.800 --> 02:11.040
So my team, we are responsible for making sure that the databases that we are offering

02:11.040 --> 02:20.040
are working, essentially, that we have backups, that they are highly available, etc., etc.

02:20.040 --> 02:28.280
So in terms of the databases, we have Oracle, MySQL, Postgres, etc. I think the biggest ones

02:28.280 --> 02:33.760
are Oracle, because we started with Oracle in the 80s.

02:33.760 --> 02:38.640
And then, of course, we try to share the knowledge, one of the concepts behind CERN

02:38.640 --> 02:43.360
is to have the outreach to share the knowledge that we create, some of the software that

02:43.360 --> 02:49.160
we create as open source, hence we are here to share, let's say, the information.

02:49.160 --> 02:58.160
So, as I mentioned, databases at CERN, we have more than 100 Oracle databases since the 80s.

02:58.160 --> 03:05.320
And then since 2011, we have open source alternatives, let's say, so we have 600 instances

03:05.320 --> 03:09.360
of MySQL, 400 of Postgres and 200 of InfluxDB.

03:09.360 --> 03:15.320
This is not going to be covered in this presentation; this is just to present a little bit

03:15.320 --> 03:16.320
what we do.

03:16.320 --> 03:20.120
We'll be at the booth later at some point, so if you want to talk a little bit about that,

03:20.120 --> 03:22.400
feel free to check it out.

03:22.400 --> 03:28.080
And just to give you an idea of how big our environment is, we have more than

03:28.080 --> 03:34.440
five petabytes of data in Oracle databases, 150 terabytes of data in DBOD and roughly

03:34.440 --> 03:42.440
three petabytes of backups, so that's quite a big environment.

03:42.440 --> 03:47.280
But this talk is not going to be about that; we are going to talk about pgvector.

03:47.280 --> 03:52.800
So let's start by building a simple beer recommendation system.

03:52.800 --> 03:56.880
But before we do it, I need to have a disclaimer that I'm not here officially, it's not

03:56.880 --> 04:02.560
what we do officially at work. Of course, if you want to drink, that's your

04:02.560 --> 04:10.160
responsibility. I don't want to promote excessive alcohol consumption, so let's just

04:10.160 --> 04:16.880
mention that. But let's start with vectors.

04:16.880 --> 04:19.440
Let's start with a little bit of theory.

04:19.440 --> 04:26.880
So in AI, and also in math, a vector is a list of numbers, of scalars, that represent

04:26.880 --> 04:33.880
a point in a multidimensional space. If we write it down mathematically, as you can

04:33.880 --> 04:39.360
see here, n is the dimensionality of the vector.

04:39.360 --> 04:44.640
So let's say we have a two-dimensional vector, so this would be just two scalars.

04:44.640 --> 04:51.040
If it was 300, then we have 300 numbers that we keep, we store as an array.

04:51.040 --> 04:54.160
And then we need to introduce the concept of embedding.

04:54.160 --> 04:59.440
An embedding is essentially a numerical representation of an object that we know from the

04:59.440 --> 05:02.000
real world.

05:02.000 --> 05:08.600
And the machine learning and artificial intelligence systems are using this numerical representation

05:08.600 --> 05:13.040
of objects to understand the world around us.

05:13.040 --> 05:19.040
So for example, a bird's nest and a lion's den are similar, and they are analogous as

05:19.040 --> 05:26.120
a pair, while day and night would be opposite terms, and embeddings simply convert

05:26.120 --> 05:33.000
those ideas into numbers, and in order to create embeddings, we need to have a model.

05:33.080 --> 05:41.160
Embedding models are AI machine learning models that we could use to convert from any type

05:41.160 --> 05:42.160
of input

05:42.160 --> 05:46.680
(I will mention it a little bit later) into this vector, and the vector is the thing that

05:46.680 --> 05:48.680
we will store in our database.

05:48.680 --> 05:54.320
So we can imagine that we have some input, and it could be something like a movie or a picture

05:54.320 --> 05:59.600
or some text, maybe an audio fragment with some voice.

05:59.600 --> 06:04.440
We use the embedding model to convert it into the embedding, so we will get the vector

06:04.440 --> 06:07.960
so the array of scalars, at the end.

06:07.960 --> 06:14.200
So for example, let's switch to beer again: we will have a description of the beer, which

06:14.200 --> 06:20.880
might be citrusy with a sweet aroma, and then our embedding model will convert it into a vector

06:20.880 --> 06:25.640
so a set of numbers, a set of scalars that we will keep in our database.

06:25.640 --> 06:31.160
So of course, this is a very simplified idea of what's going on.

06:31.160 --> 06:35.800
If your input is longer, for example, if you want to embed a full document, something

06:35.800 --> 06:41.440
longer, you would cut this text into smaller chunks. There are many

06:41.440 --> 06:44.760
different ways to do it, so we will not be covering that.

06:44.760 --> 06:51.320
This is just a very simplified version of what a system like that is doing.

06:51.320 --> 06:56.280
So you can see that with different input, we will have different outputs.

06:56.280 --> 07:07.920
So for example, our embedding model will create the vectors based on the input, and probably

07:07.920 --> 07:13.320
we will get different answers, different values for different inputs.

07:13.320 --> 07:17.920
But the general idea, the general concept is that similar inputs should give a similar

07:17.920 --> 07:23.960
result, similar values in these vectors, and we could somehow calculate the difference

07:23.960 --> 07:32.600
between the values that we get, and we can use it to find similarities in our dataset.

07:32.600 --> 07:37.000
So our recommendation system will be super easy, it will be super simple, it will just

07:37.000 --> 07:41.800
use the similarity of the vectors that are there.

07:41.800 --> 07:45.240
So how do we calculate the similarity?

07:45.240 --> 07:52.240
Let's assume for simplicity's sake that we have a two-dimensional vector, and let's

07:52.240 --> 07:59.040
say that the x axis is sweetness and the y axis is the number of legs of the object that we want

07:59.040 --> 08:04.600
to, let's say, convert into a vector.

08:04.600 --> 08:07.480
So we have dogs and we have cats.

08:07.480 --> 08:13.480
In our embedding model, they should be somewhat close to each other, they should be similar,

08:13.480 --> 08:16.480
we can calculate the cosine distance between them.

08:16.480 --> 08:22.720
So here, let's just draw a line from the starting point of our space.

08:22.720 --> 08:30.560
So from 0, 0, we calculate the cosine value of the angle alpha, and we can say that it's

08:30.560 --> 08:31.560
quite similar.

08:31.560 --> 08:36.000
In comparison, bananas are also sweet, but they don't have legs, or they have fewer legs

08:36.000 --> 08:37.560
than the animals.

08:37.560 --> 08:40.160
So they are somewhere else in our space.

08:40.160 --> 08:47.960
So we can calculate the angle again, and we can see easily on this slide that the beta

08:47.960 --> 08:50.320
angle is bigger than alpha.

08:50.320 --> 08:57.080
So the distance is bigger, so similarity is smaller, so we can claim based on this information

08:57.080 --> 09:02.080
that we got, that a dog is more similar to a cat than it is similar to a banana.

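NOTE
Editor's sketch of the comparison described above, assuming made-up (sweetness, legs) coordinates that are not taken from the slide. It computes the cosine of the angle between two vectors drawn from the origin and turns it into a distance, so a smaller value means "more similar".
```python
import math
def cosine_similarity(a, b):
    """Cosine of the angle between two vectors drawn from the origin."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
def cosine_distance(a, b):
    """0 means same direction; larger values mean less similar."""
    return 1.0 - cosine_similarity(a, b)
# Hypothetical (sweetness, number of legs) coordinates, for illustration only.
dog = (1.0, 4.0)
cat = (1.5, 4.0)
banana = (8.0, 0.0)
# The dog should end up closer to the cat than to the banana.
print(cosine_distance(dog, cat) < cosine_distance(dog, banana))
```
The same function works unchanged for 384-dimensional vectors; only the length of the tuples changes.
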
09:02.080 --> 09:09.680
This sounds trivial to us, but of course our computer doesn't know by itself the reality

09:09.680 --> 09:14.960
of the world around us, and essentially the same thing will happen in our similarity search,

09:14.960 --> 09:18.880
but the model that I will be using, the model that I'll present

09:18.880 --> 09:22.760
a little bit later, will have 384 dimensions.

09:22.760 --> 09:30.520
So essentially, the same calculation will be done for us, but using the 384 dimensions.

09:30.520 --> 09:35.280
And of course there are some caveats, some limitations, so for example, we can also say that

09:35.320 --> 09:46.160
some other animals in our space are also sweet and have fewer legs, and if we calculate

09:46.160 --> 09:54.400
it when our vector is just two-dimensional, we would get very weird results.

09:54.400 --> 09:59.520
So this is why we need to have more dimensions, and they will help us to calculate the distances

09:59.520 --> 10:03.800
more realistically, and we'll get back to that a little bit later.

10:03.800 --> 10:10.440
So there are of course some limitations of the similarity when we calculate it, when the

10:10.440 --> 10:12.440
vectors are somewhat limited.

10:12.440 --> 10:18.880
So for example, if you think about it, healthy and unhealthy, very basically intuitively

10:18.880 --> 10:25.800
for us, this is something that's completely on two opposite sides of how the world functions.

10:25.800 --> 10:31.240
If you drink coke, it's probably going to be unhealthy, but drinking water is healthy, so

10:31.280 --> 10:38.240
they are at two ends of the same scale, but for our model, they might be somewhat similar,

10:38.240 --> 10:39.840
and this is actually the case.

10:39.840 --> 10:45.280
So for example, with the model that I'm using, the healthy and unhealthy are somewhat similar,

10:45.280 --> 10:52.320
healthy and not healthy are even more similar, and this is because both can be put in the

10:52.320 --> 10:57.320
multidimensional space as adjectives that are related to the health status, so in a way they

10:57.320 --> 10:59.160
are quite similar to each other.

10:59.200 --> 11:03.000
So this is for example, the limitation of the model that I'm using.

11:03.000 --> 11:07.600
But for example, a dog and a banana are definitely less similar, but there

11:07.600 --> 11:12.400
is some similarity, because both one and the other are objects, as opposed to being, I don't

11:12.400 --> 11:19.880
know, a concept, just an idea, not an object in a physical world, and we can also use it

11:19.880 --> 11:25.400
to calculate distances between even full sentences or longer pieces of text.

11:25.400 --> 11:31.400
So for example, "I like beer" and "table partitioning is an amazing feature of RDBMSs"

11:31.400 --> 11:36.640
are not similar at all, because if you think about it, these are completely different topics.

11:36.640 --> 11:40.680
But for example, "I like beer" and "I like indexes and databases", this is already a little

11:40.680 --> 11:48.840
bit more similar, and "I like to index my data" and "I like indexes", these are quite

11:48.840 --> 11:49.840
similar.

11:49.840 --> 11:55.760
So with that in mind, we need to make sure that we are aware of this limitation of this,

11:55.760 --> 12:04.840
let's say, simplified idea of the model that we might get some results that could be a

12:04.840 --> 12:06.840
little bit off, not ideal.

12:06.840 --> 12:14.360
But this is the reality of the world that we live in, so the opposite is not well-defined.

12:14.360 --> 12:17.800
We can have a discussion like, what is the opposite of "king"?

12:17.800 --> 12:20.680
Do we have any ideas in the room?

12:20.680 --> 12:26.680
Yeah, maybe president, maybe queen. The first thing that came to my mind

12:26.680 --> 12:32.760
was queen, but also prince, or maybe a poor man, or maybe a banana.

12:32.760 --> 12:40.720
So trying to convert the multidimensional world that we live

12:40.720 --> 12:47.720
in into a one-dimensional space, trying to simplify it into one number, that

12:47.720 --> 12:50.160
is virtually impossible.

12:50.160 --> 12:57.960
That's why we will have these limitations, and for these limitations, depending on the

12:57.960 --> 13:01.200
model that we are using, there will be different results.

13:01.200 --> 13:05.280
So of course, how you choose your model is going to have a big impact.

13:05.280 --> 13:08.320
But how do we handle the vectors in the database?

13:08.320 --> 13:13.840
Now we have a little bit of theory on how it works and what we are trying to achieve.

13:13.840 --> 13:15.520
So let's have a look.

13:15.520 --> 13:20.480
One of the ideas on how we can do it is pgvector.

13:20.480 --> 13:24.760
So pgvector is an extension that was written and developed by Andrew Kane.

13:24.760 --> 13:30.640
I think he did more than 99.9% of the commits to the repo.

13:30.640 --> 13:36.040
So it's quite amazing what he has managed to achieve.

13:36.040 --> 13:41.880
But it's an extension to Postgres that we can install easily to do nearest neighbor

13:41.960 --> 13:45.000
search and to do indexing.

13:45.000 --> 13:49.920
We can do single-precision, half-precision, binary and sparse vectors.

13:49.920 --> 13:54.400
So these are different types of vectors that depending on the model that we are using, we

13:54.400 --> 13:59.640
will get different output vectors with different definitions.

13:59.640 --> 14:01.120
So how to start?

14:01.120 --> 14:03.000
We need to build or download the extension.

14:03.000 --> 14:08.520
Of course, we need to build it with the Postgres binaries, or we just download the binaries.

14:08.560 --> 14:11.120
For me, it was very easy on the Mac.

14:11.120 --> 14:15.160
It's a part of Homebrew, so it was just installing one package.

14:15.160 --> 14:20.560
Then, of course, in our database, we need to create the extension to handle the vector

14:20.560 --> 14:21.560
data.

14:21.560 --> 14:30.160
And then we can just add a column, the extension will give us a new data type, which is called

14:30.160 --> 14:31.160
vector.

14:31.160 --> 14:35.720
And, optionally, here where the three dots are, we can specify the dimension of the vectors

14:35.720 --> 14:37.720
that we're going to keep in it.

14:37.720 --> 14:41.960
But it's not obligatory.

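NOTE
Editor's sketch of the setup steps described above, written as Python helpers that build the SQL. The table name "beers" and the column name "embedding" are assumptions; the extension is registered under the name "vector", and the dimension in parentheses can be left out.
```python
def create_extension_sql():
    # pgvector is enabled per database under the extension name "vector".
    return "CREATE EXTENSION IF NOT EXISTS vector"
def add_embedding_column_sql(table, column="embedding", dimensions=None):
    # The dimension is optional: plain "vector" accepts any length.
    # Identifiers are interpolated directly, so pass trusted names only.
    col_type = "vector" if dimensions is None else f"vector({int(dimensions)})"
    return f"ALTER TABLE {table} ADD COLUMN {column} {col_type}"
print(create_extension_sql())
print(add_embedding_column_sql("beers"))
print(add_embedding_column_sql("beers", dimensions=384))
```
Declaring the dimension lets the database reject vectors of the wrong length, which is usually worth it once the model is chosen.
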
14:41.960 --> 14:47.400
And then, if you want to connect to your database for some of the languages, you need to

14:47.400 --> 14:50.160
add a library.

14:50.160 --> 14:55.720
I think they are available for any language that is able to connect to Postgres.

14:55.720 --> 15:00.280
So for example, you just need to do a pip install and then import pgvector in Python.

15:00.280 --> 15:03.080
But this is optional; for some of the queries

15:03.080 --> 15:05.720
it's not even needed.

15:05.720 --> 15:10.040
So how are we going to query our data?

15:10.040 --> 15:17.800
So here, we introduce a new operator, which will calculate the distance of the column

15:17.800 --> 15:20.880
that we created in the previous step, so the embedding column.

15:20.880 --> 15:26.560
And we will calculate the distance from this vector that we are using here as an example.

15:26.560 --> 15:34.720
And there are many operators like that, as I mentioned, different ways to calculate the

15:34.800 --> 15:37.920
distance between two vectors.

15:37.920 --> 15:43.360
So for our use case for our presentation, we will use the cosine distance.

15:43.360 --> 15:49.520
That's what I presented on the slide with the cat, the dog, the banana and the flamingo.

15:49.520 --> 15:53.440
We can also use inner product or Euclidean distance.

15:53.440 --> 16:00.800
So there are different ways to calculate it. Since 0.7.0, you can also use the Manhattan

16:00.800 --> 16:01.800
distance.

16:01.960 --> 16:06.760
There are different ways; you should probably check, when you select your model,

16:06.760 --> 16:10.440
which works the best for you, which one gives you the best results.

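NOTE
Editor's sketch of the distance functions mentioned above, in plain Python, keyed by the pgvector operator that (to my understanding of the pgvector documentation) computes each one. Note that the inner-product operator returns the negative inner product, so that smaller still means closer.
```python
import math
def l2_distance(a, b):
    # Euclidean distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
def l1_distance(a, b):
    # Manhattan (taxicab) distance.
    return sum(abs(x - y) for x, y in zip(a, b))
def negative_inner_product(a, b):
    # Negated so that ascending order puts the best match first.
    return -sum(x * y for x, y in zip(a, b))
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))
# pgvector operator mapped to the quantity it computes.
OPERATORS = {
    "<->": l2_distance,
    "<#>": negative_inner_product,
    "<=>": cosine_distance,
    "<+>": l1_distance,
}
a, b = (1.0, 0.0), (0.0, 1.0)
for op, fn in OPERATORS.items():
    print(op, fn(a, b))
```
Which one ranks best depends on the embedding model, so it is worth comparing them on a few known-good queries.
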
16:11.880 --> 16:13.400
And what about the indexes?

16:13.400 --> 16:21.400
So by default, the nearest neighbor search or a search in comparison to your vector will perform an

16:21.400 --> 16:22.440
exact search.

16:22.440 --> 16:24.120
So that's a very slow operation.

16:24.120 --> 16:28.920
If your data set is quite big, it will be a full table scan.

16:29.080 --> 16:37.160
And pgvector will calculate the distance between what you provided and the data

16:37.160 --> 16:38.040
in your data set.

16:38.040 --> 16:40.840
So this is quite slow if your data set is big.

16:41.880 --> 16:47.160
So there are two types of indexes that you can use, but the indexes will give you approximate results.

16:47.720 --> 16:52.280
The idea is that you want to sacrifice a little bit of the quality of your results,

16:52.280 --> 16:55.960
but get results faster.

16:55.960 --> 17:01.400
So there are two most popular approaches to indexing for vector lookups.

17:01.400 --> 17:07.480
So this is the hierarchical navigable small world, HNSW, and IVFFlat, so inverted file flat.

17:09.480 --> 17:11.640
The general concept of the first one,

17:11.640 --> 17:19.080
so HNSW, is that you put your data in a space.

17:19.080 --> 17:21.960
Here you can see a two-dimensional representation.

17:22.040 --> 17:28.680
And then randomly you choose only, let's say, 5% of your data, and you put it on a different layer.

17:28.680 --> 17:32.840
And then out of this layer, you also put like 5% on this additional layer.

17:32.840 --> 17:39.240
So when you want to do a lookup, you just traverse through these layers to find potentially

17:39.240 --> 17:41.240
something that's the closest to your results.

17:42.360 --> 17:43.880
So that's one way.

17:44.840 --> 17:49.400
A second way is to try to do kind of a clustering of your data.

17:49.480 --> 17:52.440
Try to cluster the vectors that are similar to each other.

17:52.440 --> 17:57.640
So, again, this is a simplification because here we have a two-dimensional space.

17:58.680 --> 18:05.320
You try to create centroids, grouping things together.

18:05.320 --> 18:09.800
And then here with the green color, you can see that if you try to do a lookup,

18:09.800 --> 18:17.800
it will probably return results from this cluster of information as well as the neighboring ones.

18:18.520 --> 18:23.800
So if you want to learn more about the indexes, there is an amazing tutorial that's linked here

18:23.800 --> 18:26.600
from Alex G, from Neon.

18:26.600 --> 18:33.080
It's a tutorial on how to write your own search using those two algorithms.

18:33.080 --> 18:37.640
But this is, let's say, out of the scope of this presentation;

18:37.640 --> 18:45.560
we will use those indexes, but without diving deeper into the caveats and the limitations of those indexes.

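NOTE
Editor's sketch of how the two index types could be created, again as a Python helper that builds the SQL. The table and column names are assumptions, and lists = 100 is only an illustrative value for the number of IVFFlat clusters, not a recommendation from the talk. The operator class has to match the distance operator used in queries.
```python
# Operator classes for the distance functions both index types support.
OPCLASS = {
    "l2": "vector_l2_ops",
    "ip": "vector_ip_ops",
    "cosine": "vector_cosine_ops",
}
def create_index_sql(method, table="beers", column="embedding", distance="cosine", lists=100):
    if method not in ("hnsw", "ivfflat"):
        raise ValueError("method must be 'hnsw' or 'ivfflat'")
    sql = f"CREATE INDEX ON {table} USING {method} ({column} {OPCLASS[distance]})"
    if method == "ivfflat":
        # IVFFlat needs to know how many clusters (lists) to build.
        sql += f" WITH (lists = {int(lists)})"
    return sql
print(create_index_sql("hnsw"))
print(create_index_sql("ivfflat"))
```
An IVFFlat index is best created after the table has data, since the clusters are computed from the rows present at build time.
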
18:45.560 --> 18:49.320
There is one thing that you need to know about pgvector and indexes.

18:49.960 --> 18:57.160
So the index lookup will be the first operation that will be done.

18:57.160 --> 19:01.560
And then the filtering will be done on top of the results from your index lookup.

19:02.440 --> 19:06.840
So just because the filtering is applied after the index lookup,

19:06.840 --> 19:13.320
it is possible that in this query you will get fewer than five rows with the default configuration.

19:13.400 --> 19:23.560
The idea is that when you order by distance, pgvector will start with, I think, 40 candidates

19:23.560 --> 19:25.320
as the default set.

19:25.320 --> 19:32.760
If out of your 40 entries that were looked up with this query,

19:32.760 --> 19:37.960
if only one fits your criteria of category, let's say,

19:37.960 --> 19:43.000
then you might get one result instead of the five that you would actually expect.

19:43.000 --> 19:49.320
So, as I said, the candidate list is 40 by default,

19:49.320 --> 19:52.040
but you can also control it as a parameter if you want to do more.

19:52.840 --> 19:55.480
And there is also the possibility to do iterative scans.

19:55.480 --> 20:00.120
So pgvector can understand that, oh, by the way, your limit was five,

20:00.120 --> 20:05.320
but I only managed to get two results, so I'm going to go and do another index lookup,

20:05.320 --> 20:11.320
try to get more data for you and do it several times until you get the full result set of five.

20:11.480 --> 20:18.200
But of course, there is a limit, also controllable with a parameter,

20:18.200 --> 20:23.160
on how many times it should do the lookup, because you want to essentially get your data out.

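NOTE
Editor's toy simulation of the behaviour described above; it is not pgvector's actual algorithm. The index hands back a fixed-size candidate list, the filter runs afterwards, and an iterative scan keeps fetching further candidates until the limit is met or a scan cap is reached. To my understanding the corresponding pgvector settings for HNSW are hnsw.ef_search (candidate list size, default 40), hnsw.iterative_scan and hnsw.max_scan_tuples; the function and argument names below are made up for illustration.
```python
def filtered_search(rows, distance, predicate, limit=5, candidates=40, iterative=False, max_scan=20000):
    ranked = sorted(rows, key=distance)  # stand-in for the order the index returns
    results, scanned = [], 0
    while scanned < min(len(ranked), max_scan):
        batch = ranked[scanned:scanned + candidates]
        scanned += len(batch)
        # The filter is applied only after the index has picked its candidates.
        results.extend(r for r in batch if predicate(r))
        if len(results) >= limit or not iterative:
            break
    return results[:limit]
# Hypothetical data: 200 rows, every 25th one is in the category we filter on.
rows = [{"id": i, "category": "ipa" if i % 25 == 0 else "lager", "distance": float(i)} for i in range(200)]
by_distance = lambda r: r["distance"]
is_ipa = lambda r: r["category"] == "ipa"
print(len(filtered_search(rows, by_distance, is_ipa)))
print(len(filtered_search(rows, by_distance, is_ipa, iterative=True)))
```
With a single pass only the matches among the first 40 candidates survive; with the iterative variant the loop continues until five matches are collected.
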
20:23.800 --> 20:29.240
But okay, that's enough theory. Let's finally build the system.

20:29.240 --> 20:32.120
So for that, we need to have a data set on beers.

20:32.120 --> 20:38.920
So on Kaggle.com I found a data set on beer profiles and ratings.

20:38.920 --> 20:46.040
So what we took from this data set is only the beer name and the info, which is a column that

20:46.040 --> 20:51.000
sometimes contains information on the taste of the beer; sometimes it's just some marketing information

20:51.000 --> 20:56.680
about the beer. Sometimes it's some technical details about the beer, the percentage, the

20:56.680 --> 21:02.040
IBU. So we will see how that goes. I didn't do any data filtering of this data set.

21:02.040 --> 21:09.880
So in our database, we have, I think, around 3,300 beers. And of course,

21:09.880 --> 21:20.680
this was just a CSV file that I imported into our database. So then I created the column

21:20.680 --> 21:26.360
for the embedding, which we will store the vectors in, and I've specified the dimensionality of the

21:26.360 --> 21:32.280
vector as 384, because this is the dimensionality of the model that we are using.

21:33.240 --> 21:40.360
So how to put it all together? I wrote a simple Python application that connects to our database,

21:40.360 --> 21:46.840
gets the text from the description, puts it through the embedding model, which converts it into

21:46.840 --> 21:54.120
a vector, and writes the vector back. And for the embedding model, again, it's up to you

21:54.120 --> 22:01.080
to choose your model. So for example, the model that I'm using is all-MiniLM from the library of

22:01.080 --> 22:05.720
Sentence Transformers. It's quite a popular model; as you can see, it was downloaded more than

22:06.280 --> 22:12.840
3 million times last month, but what's also important is the license. So it's an Apache 2.0

22:12.840 --> 22:22.360
license. It's a model that allows you to transform text in English to an embedding. So there is a limitation

22:22.440 --> 22:27.080
for example, this model, it's not a multi-language model, so it will not understand descriptions

22:27.080 --> 22:35.480
of the beer in Polish or French or other languages. So there are multi-language models available.

22:35.480 --> 22:41.640
Depending on the use case, depending on the data set, you should choose accordingly, and potentially

22:41.640 --> 22:49.400
choose something that works best for you, just by doing some trial and error. So how do we do

22:49.960 --> 22:56.760
the embedding? So, the easiest way to do an embedding: of course, one of the options

22:56.760 --> 23:02.680
is to call an API provided by one of the cloud providers. You can use any of the major providers

23:02.680 --> 23:11.640
of LLMs, such as OpenAI, but we try to do stuff locally. So the Sentence Transformers library for

23:11.640 --> 23:17.480
Python will automatically download this model for you. So when you run this code for the first time,

23:17.480 --> 23:23.400
it will actually call an external service that will deliver the file to your laptop, to your computer,

23:23.400 --> 23:29.320
and it will be hosted locally. So after that, this will work whether you have internet access or not.

23:30.120 --> 23:38.440
We just instantiate this object. As you can see, we encode the data

23:39.320 --> 23:46.040
using model.encode. And then we receive the embedding. As I mentioned, the model that we are

23:46.040 --> 23:52.600
using provides a vector with 300-plus dimensions. So it's going to be an array of 384

23:53.880 --> 24:00.920
scalars. So for our process of embedding the data, on the database side it's super simple.

24:00.920 --> 24:07.720
It's a column like any other. Just the data type will be this vector. So of course,

24:07.880 --> 24:19.880
we'll do it like: set embedding equal to this parameter. So we have 3,361 beers, which we'll try to embed.

24:19.880 --> 24:28.200
So doing it locally on my MacBook, it's 43 seconds. Doing it locally, but in four processes,

24:28.840 --> 24:35.080
it's roughly 20 seconds. So for a small data set that's completely acceptable to test

24:35.160 --> 24:43.000
several models. It's probably a good idea to test it before doing it on full scale.

24:43.000 --> 24:49.480
For example, if my real data set were 3 million beers, then I would have to worry about

24:49.480 --> 24:55.240
it a little bit more. But for this use case, this was fairly simple. And this was super easy to

24:55.240 --> 25:00.760
parallelize, because I've copied my very simple code in Python and just asked ChatGPT to

25:00.840 --> 25:06.200
parallelize it, and it gave me an output that was using the multiprocessing Pool in Python. So it was

25:07.480 --> 25:14.760
fairly quick and easy. And this is essentially the code. So we have a cursor that's connected to

25:14.760 --> 25:21.800
our Postgres. It's outside of the slide because it's pretty standard. So you can see that here we

25:21.800 --> 25:29.640
are downloading the information from the beers table. And then for each of them, we do the

25:31.400 --> 25:38.280
encoding. We append everything into the binds array, because we want to do it the nice way with

25:39.080 --> 25:45.080
reusing the statement. So at the very end of our code, we just do execute many. We run the

25:45.080 --> 25:49.720
update SQL, which, as I presented in the previous slide, is just setting the embedding to this

25:49.720 --> 25:55.080
value that we have calculated. And that's it. Of course, if your data set were way bigger, you

25:55.080 --> 26:00.440
would have to split it into smaller operations; probably it would be a good idea to distribute it across

26:00.440 --> 26:06.520
some compute nodes. But for this case, for a project like that, like a side project, this is

26:06.520 --> 26:12.840
more than enough. It was super simple to write that code. So how do we look up our information?

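NOTE
Editor's sketch of the embedding step described above, not the speaker's actual script. Assumptions: a table "beers" with columns "id", "info" and "embedding", a hypothetical connection string, psycopg 3 as the driver, and the all-MiniLM-L6-v2 variant of the model (384 dimensions). The bind list is built first and then sent with executemany so the UPDATE statement is reused.
```python
UPDATE_SQL = "UPDATE beers SET embedding = %s WHERE id = %s"
def build_binds(rows, encode):
    """Turn (id, info) rows into (embedding, id) bind tuples for executemany."""
    return [(encode(info), beer_id) for beer_id, info in rows]
def main():
    # Third-party imports live inside main() so build_binds has no dependencies.
    import psycopg
    from pgvector.psycopg import register_vector
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model variant
    with psycopg.connect("dbname=beer") as conn:  # hypothetical connection string
        register_vector(conn)  # lets the driver send the embeddings as vectors
        rows = conn.execute("SELECT id, info FROM beers").fetchall()
        with conn.cursor() as cur:
            cur.executemany(UPDATE_SQL, build_binds(rows, model.encode))
```
Calling main() needs the psycopg, pgvector and sentence-transformers packages plus a reachable database; build_binds can be tried on its own with any function in place of the model.
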
26:13.400 --> 26:21.400
We do a simple query against Postgres. We order by the embedding; as I mentioned, we have this new

26:21.400 --> 26:27.720
operator that will calculate the cosine distance from the input that we are providing, so the

26:27.800 --> 26:32.600
vector that we are providing, and we will limit, because we don't want to sort all of our data set.

26:33.160 --> 26:40.680
We will limit; in the examples, I think I have the limit set to five.

26:41.640 --> 26:48.440
So again, let's have a look a little bit into how our code works for the lookup, because

26:48.440 --> 26:53.640
previously we've seen how to vectorize our data. Now we will see how to do the lookup.

26:53.800 --> 27:03.160
So as I mentioned, there is this additional library that you can use to import the function that's

27:03.160 --> 27:09.720
called register_vector. Essentially, register_vector is responsible for making sure that our

27:09.720 --> 27:15.000
input is converted into an array before it's fed into Postgres, so it's making sure that the

27:15.000 --> 27:19.320
types are compatible. We don't have to worry about the conversions, because it's going to be done

27:19.320 --> 27:24.920
automatically for us. So I have a piece of code that we will hopefully manage to run in a second

27:24.920 --> 27:31.480
and have a live demo of how it works. We have some user input, and from the user input we

27:31.480 --> 27:36.520
create an embedding using the model that was loaded before, and then we simply select

27:36.520 --> 27:44.040
the five top beers, and then we will print the beers in an output. The real code that I'm going to

27:44.040 --> 27:47.720
use is a little bit more complicated than that, just because I do some fancy

27:47.720 --> 27:56.520
modification of how we print out the results, just for readability, so it's easier to see.

27:56.520 --> 28:04.200
But so far it's quite a simple code, and the full code is 60 lines, so that's nothing major.

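NOTE
Editor's sketch of the lookup described above, under the same assumptions as before (table "beers" with "name", "info" and "embedding" columns, psycopg 3, a connection on which register_vector has already been called, and a loaded Sentence Transformers model). The query orders by cosine distance to the embedded user input and keeps the closest rows.
```python
SEARCH_SQL = """
SELECT name, info, embedding <=> %(q)s AS distance
FROM beers
ORDER BY embedding <=> %(q)s
LIMIT %(k)s
"""
def search(conn, model, text, k=5):
    """Embed the user input and return the k nearest beers as (name, info, distance)."""
    query_vec = model.encode(text)
    return conn.execute(SEARCH_SQL, {"q": query_vec, "k": k}).fetchall()
def format_hits(rows):
    """Render rows as 'distance  name: info' lines, smallest distance first."""
    return [f"{distance:.3f}  {name}: {info}" for name, info, distance in rows]
```
For example, format_hits(search(conn, model, "something with lemon")) would give printable lines; the first column is a distance, so lower means a closer match.
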
28:05.080 --> 28:13.400
So I mentioned the indexes. Without the indexes, we will get the exact results, so as I said,

28:14.360 --> 28:21.160
it's going to be a full scan on the beers table. You can see that we have 3,360 rows;

28:21.160 --> 28:29.080
the data set is 3,361, but what I'm doing here is, I have a query that's trying to

28:29.080 --> 28:38.600
find the beer that's similar to a beer with the ID 2363. So this is just to simplify the

28:39.560 --> 28:44.920
explain plan, and you can see that without the indexes, the execution time is 15 milliseconds;

28:44.920 --> 28:52.520
if we create the hierarchical navigable small world index, we can get it down to 3 milliseconds,

28:52.520 --> 28:59.640
and with IVFFlat, the execution time is less than a millisecond, it's around half

28:59.640 --> 29:06.520
a millisecond, and you can see that in the explain plan it's visible as a normal index

29:07.480 --> 29:14.680
operation on our table. So this is just the backup slide in case the demo doesn't work,

29:14.680 --> 29:19.880
but our expected behavior of the system is that we type in a prompt, so let's say we are looking

29:19.880 --> 29:26.520
for something that has lemon, and we get some results, so this is not a full text search,

29:26.520 --> 29:31.160
so we are expecting to see other things that are probably relevant to us, that's not necessarily

29:31.800 --> 29:37.320
so we know that lemon is a citrus, so we have like a proposal of a beer that has

29:37.320 --> 29:43.480
citrus and lemon zest, we have something that has lemon juice in it, we have more citrusy

29:43.480 --> 29:50.600
beers and lemongrass aroma, so there are some good recommendations as well, but let's do the

29:50.680 --> 30:06.120
fun part, let's do a demo, so I don't know if that's visible for people in the back row,

30:07.400 --> 30:16.040
hopefully yes, so let's start with the first one, so it's looking for beers that are herbal

30:16.120 --> 30:22.600
but fruity, so here we can see that we are getting some, the first column will be the distance,

30:22.600 --> 30:28.760
so it's not similarity, it's distance, so the lower it is, the closer we are to a perfect hit,

30:28.760 --> 30:38.920
let's say, so here we see some proposals, but maybe it will be more visible here, so the next

30:39.000 --> 30:46.280
query that I'm using is berries, and what my code is doing, it's highlighting

30:46.280 --> 30:52.440
with the blue background the direct hits, let's say, of the words, so we might see that,

30:52.440 --> 31:00.440
you know, the first proposal is here because of the blueberries, raspberries, but the third one

31:01.080 --> 31:07.560
has raspberry, but only one, so singular, but our model knew that it was similar, so the

31:07.560 --> 31:17.000
similarity search knew that they are somewhat similar, but this is quite simple, this is something

31:17.000 --> 31:22.760
that probably a full-text search could also handle for us, so let's start to do something a little

31:22.760 --> 31:30.520
bit more complicated, so let's try to find if our database has a beer that has a taste of

31:30.520 --> 31:37.320
berries and caramel, so this is a little bit more complicated, we will not have a

31:37.320 --> 31:44.840
direct hit in terms of string comparison, but here we can see that for example, we have fruit

31:44.840 --> 31:55.560
ale, which has subtle fruit flavor, but actually no hit on the caramel, but for example,

31:55.640 --> 32:02.360
the package hack will have a dark fruit aroma with rich caramel flavor, so this is probably

32:02.360 --> 32:08.200
something that could be of interest to somebody that wanted to have a beer like that, and I have

32:08.200 --> 32:15.880
some other ideas on what we can try, but maybe we have a suggestion from the room, what should we look for?

32:15.880 --> 32:27.960
Yeah, I knew that this would be one of the, so FOSDEM doesn't give us that many, let's say

32:27.960 --> 32:34.280
relevant stuff, but something that worked for me at a different conference was SQL,

32:34.280 --> 32:40.680
right, because we were talking about PostgreSQL, so the first hit is a DBA, don't mention

32:40.680 --> 32:50.360
the third one, so here we can see that it's a very interesting example, because with the

32:50.360 --> 32:59.000
SQL, actually we didn't put the name of the beer into what we are vectorizing, so we

32:59.000 --> 33:06.680
are only vectorizing this long text, this description, so we can see that our model knows that

33:06.680 --> 33:12.360
DBA is somehow related to SQL, but of course we know that the Double Barrel Ale is

33:12.360 --> 33:22.280
not the same DBA that we are thinking of, so I have also met, yes, the first one, the first one,

33:22.280 --> 33:28.680
the first one, the first one is called to beer, oh yeah, yeah, for FOSDEM, yeah, maybe it

33:28.680 --> 33:33.880
actually knows that, yeah, FOSDEM is happening in Belgium and the Trappist is a Belgian beer,

33:34.040 --> 33:41.560
it's true, yeah, I could probably also import the information about the breweries because it's also in the

33:43.560 --> 33:49.800
data set, and our model is also sometimes hallucinating a bit,

33:49.800 --> 33:55.080
giving us results that are not that interesting, so for example, the second result is just

33:55.080 --> 34:01.560
IBUs of the beer, so it's not really relevant to our case, the third one, the Oracle,

34:02.440 --> 34:11.800
is somewhat relevant, right? But yeah, let's see some other interesting results, I don't know why,

34:11.800 --> 34:17.240
but maybe somebody would try to look for beer that would go very well with his cigarettes.

34:20.200 --> 34:26.360
So here we can see that our model understands that cigarettes are somewhat related to smoke,

34:26.440 --> 34:31.800
so the results that we are getting are smoked brown ale, smoked stout, it's probably a different type of

34:31.800 --> 34:37.720
smoke than the cigarette smoke, but still the model is somewhat understanding the realities of the

34:37.720 --> 34:49.160
world that we live in, let's do the Queen, so our model understands that Queen is somewhat similar to King,

34:49.160 --> 34:56.200
so the first result is King Crimson, the second one is not very relevant, but the third one,

34:56.200 --> 35:06.680
actually mentions the Queen here, another interesting example is Poland, so there are only, I think,

35:06.680 --> 35:12.920
like one or two Polish beers in this data set, I think like this data set was based mostly on the

35:12.920 --> 35:20.200
US market, so the first one is actually a Grodziskie type of beer, which is a Polish type of beer,

35:20.200 --> 35:26.120
the second one is also related to Poland, Danzig is Gdańsk, so it's also relevant to Poland,

35:26.120 --> 35:33.000
and then I think it didn't have that many Polish results in the data set, so it knew that for example

35:33.000 --> 35:39.320
Czech is somewhat similar to Poland, so it was trying to come up with an answer that

35:39.320 --> 35:46.760
was somewhat relevant, so you can see that the other ones are related to Czech, but I also mentioned that

35:46.760 --> 35:56.200
the indexes are allowing us to get faster results, so you can see that here like the results were

35:56.200 --> 36:03.880
super simple, but this is just because our data set is super small for the needs, like 3000

36:03.880 --> 36:12.200
rows that's nothing, but we can have an example of how creating an index will change our results,

36:12.200 --> 36:17.720
so let's do a simple search, it will try to look for beers that have the lime in their description,

36:17.720 --> 36:24.680
so potentially something that was lime, citrusy, similar, I will just move on now here to

36:28.760 --> 36:35.400
another console, so let's create an index on the beers using IVFFlat,

36:36.200 --> 36:42.760
and we are trying to create the index on the embedding column, and this is an operator that we need

36:42.760 --> 36:49.160
to mention when creating the indexes, that the comparison operator that we will be using is the

36:49.160 --> 36:56.760
cosine operator, so we can go now back to our example and let's look for lime again,
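
NOTE
Editor's sketch of the index creation described above, written as Python helpers that only build SQL strings. The statements follow pgvector's documented syntax (index methods ivfflat and hnsw, operator class vector_cosine_ops, operator <=>), but the table name "beers", the columns "id" and "embedding" and the helper names are assumptions, not the speaker's schema.
```python
def create_index_sql(method, table="beers", column="embedding"):
    """Build pgvector index DDL that uses the cosine operator class."""
    if method not in ("ivfflat", "hnsw"):
        raise ValueError(f"unsupported index method: {method}")
    return f"CREATE INDEX ON {table} USING {method} ({column} vector_cosine_ops);"
def explain_sql(beer_id, table="beers", column="embedding"):
    """EXPLAIN ANALYZE for a 'beers similar to this beer' lookup."""
    return (
        f"EXPLAIN ANALYZE SELECT id FROM {table} WHERE id <> {int(beer_id)} "
        f"ORDER BY {column} <=> (SELECT {column} FROM {table} WHERE id = {int(beer_id)}) "
        "LIMIT 5;"
    )
print(create_index_sql("ivfflat"))
print(create_index_sql("hnsw"))
print(explain_sql(2363))
```
The operator class chosen at index creation has to match the operator in ORDER BY, otherwise the planner cannot use the index for that query.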

36:57.560 --> 37:07.960
this was supposed to give me different results, so here you can see that the first one didn't

37:09.240 --> 37:16.200
use the index just yet, but the second one you can see that the similarity search took

37:16.200 --> 37:24.280
one tenth of what this one was using, you can see that it's 0.02 seconds and this is 0.002,

37:24.360 --> 37:29.400
so it's 10 times faster, but also the results are a little different, right?

37:30.200 --> 37:35.880
The Govanus Gold was the first hit and that's the first hit on this one that's using the index as

37:35.880 --> 37:41.160
well, but here on the second point we already get a different result, so this is the approximate

37:41.160 --> 37:49.160
result that I was mentioning that when we are using the indexes, the results are let's say they are

37:49.160 --> 37:56.680
not exact, but if our data set is large enough, that's something that we should make sure that

37:56.680 --> 38:05.320
our system can handle, and just by having a vast amount of data we will, let's say, get the

38:05.320 --> 38:10.760
good result, so again, a recommendation to drink responsibly, I know it's Belgium, remember that

38:10.760 --> 38:17.960
Belgian beers are quite strong, but what about the real life usage of a system like that?

38:18.680 --> 38:27.320
So let's move a little bit away from the beer example, let's have some text as an input, so for

38:27.320 --> 38:38.520
this case I've put some first phrases of Romeo and Juliet, so in production we would split it into

38:38.520 --> 38:45.640
smaller chunks, so in terms of texts like Shakespeare's poems we can chunk it into different lines

38:45.640 --> 38:52.760
into different verses, so we would split our text line by line, and then, using an embedding

38:52.760 --> 38:58.040
model, we would create the vectors, and we would of course store the vectors

38:58.040 --> 39:04.600
in our database, so with a table like that, when we have somebody that's looking for information

39:04.600 --> 39:10.760
about houses, of course in the context of the dataset that we have, what we can do is we can

39:10.760 --> 39:16.200
embed the query from our user through the same embedding model, and this is important, we need to

39:16.200 --> 39:20.680
use the same embedding model because different models will give us different results, so they are

39:21.720 --> 39:26.600
just a mix of numbers that will not make sense, so we always have to stick with one model

39:27.240 --> 39:34.600
to get consistent results, the query from the user will give us a vector, and using this vector we can

39:34.600 --> 39:41.560
do a similarity search to retrieve some information, so for example,

39:41.560 --> 39:50.680
let's just query the most interesting part here, most relevant to the houses, so the first phrase,

39:50.680 --> 39:56.280
the two households, will probably be the most similar to the query from

39:56.280 --> 40:03.400
our user about the houses, so if we feed that relevant information along with the user prompt

40:03.480 --> 40:11.640
into an LLM, then, if we phrase it correctly, we can ask it to be our

40:11.640 --> 40:17.160
assistant, get the information and provide us with a result, so the LLM along with the

40:17.160 --> 40:24.280
internal information will know that the households are, this is just Shakespeare's way of saying

40:24.280 --> 40:29.400
that these are families and not really houses like physical objects, so this will give us the

40:29.400 --> 40:34.840
result that both houses had people of equal status, so a system like that is essentially

40:35.880 --> 40:42.680
capable of trying to answer our questions, in production we would actually add a rerank operation

40:43.720 --> 40:49.160
we would get a lot of information from the database and try to run what is called a rerank operation,

40:49.160 --> 40:53.800
so query a lot of information from the database, we do a rerank to see what's relevant to the

40:54.760 --> 41:01.240
input from our user, potentially using a different algorithm or a different model to do that,

41:01.240 --> 41:07.720
and we are selective about what we feed into the LLM, and if we do it like that, we have a RAG,

41:07.720 --> 41:14.440
a simple RAG model that we could use to have, let's say, an AI assistant based on our data set,
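
NOTE
Editor's sketch of the retrieve, rerank and prompt steps described above. Everything here is illustrative: the 2-dimensional vectors are invented, the word-overlap reranker is only a stand-in for whatever reranking model a real system would use, and no LLM is called (the code stops at building the prompt).
```python
import math
def cosine_distance(a, b):
    """Cosine distance between two vectors (1 - cosine similarity)."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))
def tokens(text):
    """Lowercased words with surrounding punctuation removed."""
    return {word.strip(",.?!").lower() for word in text.split()}
def retrieve(query_vec, rows, k):
    """Stage 1: broad similarity search over (text, vector) rows."""
    return sorted(rows, key=lambda row: cosine_distance(query_vec, row[1]))[:k]
def rerank(question, candidates, keep):
    """Stage 2: toy reranker that counts words shared with the question."""
    wanted = tokens(question)
    return sorted(candidates, key=lambda row: len(wanted & tokens(row[0])), reverse=True)[:keep]
def build_prompt(question, chunks):
    """Stage 3: put only the selected chunks in front of the LLM."""
    context = "\n".join(f"- {text}" for text, _ in chunks)
    return f"Answer using only this context:\n{context}\nQuestion: {question}"
# Invented vectors; in practice they come from the same embedding model as the query.
ROWS = [
    ("Two households, both alike in dignity,", [0.9, 0.1]),
    ("In fair Verona, where we lay our scene,", [0.6, 0.4]),
    ("From ancient grudge break to new mutiny,", [0.1, 0.9]),
]
question = "What do we know about the two households?"
candidates = retrieve([1.0, 0.0], ROWS, k=2)
print(build_prompt(question, rerank(question, candidates, keep=1)))
```
Retrieving more rows than needed and then keeping only the best few is what lets the system stay selective about what reaches the LLM.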

41:15.320 --> 41:24.600
and with that, I just want to reference the projects that I used in preparation for this talk,

41:24.600 --> 41:29.480
what could be interesting is the write-up of CERN's internal knowledge chatbot,

41:29.480 --> 41:37.800
so there is a working team at CERN in the department that's trying to come up with a chatbot

41:37.800 --> 41:43.000
that would be able to scan all of the internal documentation that we have and provide people with the

41:43.000 --> 41:49.480
answers, and yeah if you're ever in the Geneva region feel free to give us a visit we have an

41:49.480 --> 41:55.400
amazing public space with the Science Gateway that you can see here on the photos where we have

41:55.400 --> 41:59.960
some exhibitions about what we do about physics it's also a nice place if you have kids it's very

41:59.960 --> 42:06.120
interactive kids can play around with a lot of stuff, but here at FOSDEM we also have a booth in the

42:06.120 --> 42:14.040
building yeah I can't spell building F on level two so you can you can see us there we have I

42:14.040 --> 42:20.600
think we still have some swag we also have people that are responsible for some open source

42:20.600 --> 42:26.760
projects at CERN so yeah feel free to come by and that's it thank you

42:26.760 --> 42:42.360
okay do we have any questions in the room yeah

42:45.960 --> 42:51.560
hi thanks for the amazing talk so I have the question regarding indexes I mean are there any

42:51.560 --> 42:59.720
limitations as the number of vector dimensions increases or the embedding field so I think that

42:59.720 --> 43:07.960
pgvector has a limitation of dimensions of 4,000 this is the current limitation of pgvector

43:07.960 --> 43:12.680
but I know that as a workaround there are some algorithms I haven't used that but there are some algorithms

43:12.680 --> 43:22.200
that let you in an intelligent way convert the 8,000 you know 8,000 dimensional vector into a smaller

43:22.200 --> 43:36.520
dimensional value okay thank you yeah maybe if you can yeah just probably not not a perfect

43:36.520 --> 43:41.560
question but can you explain to us not knowing the difference between different kind of distance

43:41.560 --> 43:47.640
calculations when is some of the others maybe more appropriate than the one that you used today

43:47.640 --> 43:55.320
yes so the question is about the different types of calculating the distances

43:55.320 --> 44:02.840
so it's actually something that I skipped in my demo so for example in my data in in my case

44:02.840 --> 44:11.400
I have used it's very simple to compare the results and in my case as you can see here like

44:11.400 --> 44:16.840
I prepared three different versions of the of the query here so by using different

44:21.080 --> 44:23.160
by using different operators different

44:25.960 --> 44:32.200
comparators here we can compare different results so

44:33.720 --> 44:38.200
how you should choose the distance calculation how you can choose the

44:39.640 --> 44:44.920
similarity search whether you use cosine, Euclidean etc it depends on your data set

44:44.920 --> 44:50.440
and also the model that you are using so I checked and this is just something that I found online

44:50.440 --> 44:55.720
that the recommended way for this model that I was using was to use the cosine distance so

44:56.040 --> 45:02.520
this is why I was using just that any other question
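
NOTE
Editor's sketch comparing the distance functions mentioned in this answer. The operator names in the docstrings are pgvector's documented operators; the vectors are invented to show one case where cosine and Euclidean distance rank two candidates differently.
```python
import math
def l2_distance(a, b):
    """Euclidean distance; pgvector operator <->."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
def negative_inner_product(a, b):
    """Negative dot product; pgvector operator <#>."""
    return -sum(x * y for x, y in zip(a, b))
def cosine_distance(a, b):
    """One minus cosine similarity; pgvector operator <=>."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - sum(x * y for x, y in zip(a, b)) / norm
query = [1.0, 0.0]
same_direction_but_longer = [3.0, 0.0]
slightly_rotated = [0.9, 0.3]
for name, vec in [("longer", same_direction_but_longer), ("rotated", slightly_rotated)]:
    print(name, round(l2_distance(query, vec), 3), round(cosine_distance(query, vec), 3))
```
Cosine distance ignores vector length and only looks at direction, while Euclidean distance is sensitive to both, which is one reason the recommended metric depends on how the embedding model was trained.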

45:04.360 --> 45:05.640
well then we still have five

45:08.040 --> 45:11.640
right questions yeah the read the is the question oh yes sorry

45:11.640 --> 45:27.960
hi thanks for the talk I have a question about the chunk size or you talked about chunking

45:27.960 --> 45:32.440
the text are there any recommendations for a size limit that should not be exceeded or would

45:32.440 --> 45:41.080
that depend on the transformer used so the question was how to do chunking how to decide on the

45:41.160 --> 45:47.560
chunk size and I cannot give you a simple answer on that just because I didn't investigate it enough

45:47.560 --> 45:56.440
from what I've seen in real life use cases is you should first of all for it so in my case when

45:56.440 --> 46:02.360
I was presenting the Romeo and Juliet part I was just cutting line by line

46:03.160 --> 46:09.800
which is not necessarily ideal but what you could do for example try to have a little bit of the previous

46:09.800 --> 46:15.960
chunk in your new chunk because you still get some context another idea would be to split

46:15.960 --> 46:21.960
by sentences whenever you have a dot you could split it into different parts so there are many

46:21.960 --> 46:28.360
different options and I think as it is in machine learning and AI it's mostly you need to test and

46:28.360 --> 46:32.520
see what works best for you because of course depending on your data set depending on the

46:33.480 --> 46:40.520
model that you are using you will get different results and it's hard to find the correct answer

46:40.520 --> 46:47.240
there is like no magical ball that will tell you the best chunk size is you know 250 bytes you know

46:47.240 --> 46:53.800
maybe it's 17 words or for a different model maybe it's something completely different also

46:53.800 --> 47:00.680
the chunking is something that I think is quite problematic but it impacts your results quite

47:00.680 --> 47:09.320
dramatically yeah yeah I was wondering thank you for the talk I was wondering if you have some
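
NOTE
Editor's sketch of the two chunking ideas mentioned in this answer: line-by-line chunks that carry a little of the previous chunk, and splitting on sentence boundaries. The function names and the overlap parameter are assumptions for illustration; neither is presented as a recommended setting.
```python
import re
def chunk_lines(text, overlap=1):
    """One chunk per line, prefixed with up to `overlap` previous lines for context."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return [" ".join(lines[max(0, i - overlap): i + 1]) for i in range(len(lines))]
def chunk_sentences(text):
    """Split after sentence-ending punctuation that is followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
PROLOGUE = """Two households, both alike in dignity,
In fair Verona, where we lay our scene,
From ancient grudge break to new mutiny,
Where civil blood makes civil hands unclean."""
for chunk in chunk_lines(PROLOGUE, overlap=1):
    print(chunk)
```
Setting overlap to 0 gives the plain line-by-line split used in the Romeo and Juliet example; which variant works best has to be tested against the data set and the model.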

47:09.320 --> 47:17.080
usage at CERN of like what you presented yeah so one of the use cases was that I presented

47:17.080 --> 47:23.560
that in the references we have a team that's working on developing an internal chatbot that

47:23.560 --> 47:31.080
will give us some kind of information so you can you can ask questions like a like an

47:31.080 --> 47:35.800
assistant that will give you information based on the internal knowledge base articles at

47:35.800 --> 47:41.240
CERN so that's one use case I know that another use case that they are looking into is the

47:41.240 --> 47:48.680
experiments they create huge amounts of data which are also like numerical values etc so potentially

47:48.680 --> 47:55.240
by creating our own model we could find similarities in the results that we are getting but I

47:55.240 --> 48:00.040
don't think it's in production stage yet it's just you know it's in the research phase

48:03.400 --> 48:09.240
thanks for a great talk my question is do you foresee especially with the advent of more

48:09.880 --> 48:18.040
like open source LLMs do you foresee the popularity of training your own LLMs on your data

48:18.680 --> 48:25.000
do you foresee that to be like more popular in the future over using RAG I think that training

48:25.000 --> 48:31.800
your own model is not going to get like super popular because it's it's expensive so I think what

48:31.800 --> 48:37.240
people are doing right now partially because of the necessity so using a pre-trained model and trying

48:37.240 --> 48:42.440
to train it a little bit more on your specific data this is this is the way to go or trying to

48:42.440 --> 48:49.240
come up with something like I presented in the RAG part this is for at least for now the

48:49.240 --> 48:55.480
way to go just because of the cost because the cost of you know of course now we have

48:55.480 --> 49:01.640
DeepSeek you know they claim to reduce the cost dramatically so maybe it will become more and more popular

49:01.640 --> 49:04.440
but I think like for now this is what everyone is doing

49:04.840 --> 49:15.960
considering your use case at CERN how do you plan to handle aggregate questions to this

49:15.960 --> 49:21.000
chatbot I mean when some calculations are required RAG is not enough in this case

49:23.000 --> 49:30.520
this I am not sure what's the plan of the team because I'm not directly involved in this project

49:30.520 --> 49:39.080
so that's a very very good question yeah I've put my LinkedIn information so yeah just send it

49:39.080 --> 49:44.920
to me and I can get to the team and I'll get back to you on that any further questions from

49:44.920 --> 49:53.400
the room I don't see any hands so I'll ask one myself yeah did you actually find a beer that you

49:53.400 --> 49:59.560
hadn't tried yet which you really liked using this method I found something I don't remember it was

50:00.680 --> 50:11.000
it was I think it was soup or for I found something that was yeah for spiced porter this sounds

50:11.000 --> 50:14.840
super interesting but I haven't tried it yet I haven't found it yet the problem with this data

50:14.840 --> 50:22.520
set is I think it's like 20 years old and it's mostly US breweries so it's quite

50:22.520 --> 50:29.320
tricky to actually get your hands on that but I think this one stood out a lot the fall beer I would

50:29.320 --> 50:35.400
give it a try it was probably just just to taste it but not have a full pint okay well then

50:35.400 --> 50:42.600
thank you for your talk we will start something to do now and get on

