WEBVTT

00:00.000 --> 00:05.000
Good morning, everyone.

00:05.000 --> 00:08.000
I will be doing my best to speak loudly.

00:08.000 --> 00:10.000
Is it OK in the back?

00:10.000 --> 00:12.000
OK, it comes up.

00:12.000 --> 00:13.000
Hi, everyone.

00:13.000 --> 00:14.000
I'm Gabor.

00:14.000 --> 00:21.000
I'm here to talk about graph databases, specifically where they come from, where they are, and where they are headed.

00:21.000 --> 00:28.000
So my background for this is that, for the last 13 years, I did a lot of work on graph databases.

00:28.000 --> 00:32.000
I did my masters, my PhD and postdoc in graph databases.

00:32.000 --> 00:37.000
Two years ago, I pivoted to the relational databases at DuckDB for my day job.

00:37.000 --> 00:41.000
But I still have a special place in my heart for graph databases.

00:41.000 --> 00:46.000
And I'm happy to come here and tell you my experiences and lessons I learned about them.

00:46.000 --> 00:48.000
So what is this talk about?

00:48.000 --> 00:53.000
I would try to bring you up to speed in 50 minutes on what graph databases are.

00:53.000 --> 00:56.000
And then you can decide whether they are something that you want to use.

00:56.000 --> 01:01.000
There is a chance that you don't really have a graph workload, but I encourage you to stick around,

01:01.000 --> 01:07.000
because I will show you some new research results that are applicable for graphs and also outside of graphs.

01:07.000 --> 01:11.000
And then finally, I believe performance is really important for data analytics,

01:11.000 --> 01:16.000
so I will give you a bunch of pointers for benchmarks and how to compare graph database systems.

01:16.000 --> 01:18.000
So let's get started.

01:18.000 --> 01:23.000
For this talk, I will use a running example that's rather complicated graph.

01:23.000 --> 01:25.000
It has four types.

01:25.000 --> 01:28.000
Countries, cities, people, and messages.

01:28.000 --> 01:32.000
And it encapsulates this weird social network in the low countries,

01:32.000 --> 01:39.000
where everyone and every city has a name with only three or four letters in the sake of fitting on the slide.

01:39.000 --> 01:45.000
We have four cities, a bunch of people, and you have the sort of edges between them that you would imagine.

01:45.000 --> 01:51.000
People know each other, people write messages, people are located in a city and so on.

01:51.000 --> 01:55.000
So what are graph databases in the first place?

01:55.000 --> 02:01.000
The definitions vary a lot, and it's really difficult to agree on what constitutes a graph database.

02:01.000 --> 02:06.000
But if you go to a graph database, meet up, or if you open the website of any of the vendors,

02:06.000 --> 02:10.000
there seems to be one thing that almost everyone agrees on.

02:10.000 --> 02:12.000
And that common theme is that joins our bad.

02:12.000 --> 02:16.000
If you go to a graph database, meet up, you get handed merch like this.

02:16.000 --> 02:22.000
Graph constructions own no joins behind this point, or they say no expensive runtime joins,

02:22.000 --> 02:27.000
or you should leave join those who left joins behind and leave joins behind.

02:27.000 --> 02:32.000
So what are they on about? It's kind of a bit murky, they don't really give any specific examples.

02:32.000 --> 02:40.000
I argue that sometimes they try to imply that relational databases cannot join, which is obviously an overstatement.

02:40.000 --> 02:43.000
Relational databases can join actually rather well.

02:43.000 --> 02:51.000
But then what's the problem? Well, let's take the middle part of our social network, where we have four cities and seven people,

02:51.000 --> 02:56.000
and we want to ask a question, where do the friends of the person then live?

02:56.000 --> 03:00.000
We can take this graph and then we can represent it as two tables.

03:00.000 --> 03:08.000
We have a no stable with person one and person two pairs, and we have a location table that says where a specific person lives.

03:08.000 --> 03:15.000
Now it's easy to see from the graph that then has two friends, car and deer, and they live in mall and os.

03:15.000 --> 03:19.000
And you can also see that these are represented in the tables.

03:19.000 --> 03:23.000
But if you try to write a secret creatively, you may end up writing something like this.

03:23.000 --> 03:32.000
You want to select the city from the no stable joint on the location table on the corresponding attributes, and you filter for Dan.

03:32.000 --> 03:36.000
But you will get only one city, you will get only os. What did we miss?

03:36.000 --> 03:42.000
We miss the fact that the no's edge is bi-directional, and this is not captured in the data.

03:42.000 --> 03:47.000
So what do we do? Well, one thing we can do is we can rewrite the query to make up for this.

03:47.000 --> 03:54.000
Another thing we can do is we can insert those edges into the no stable flipped, so that we will introduce some redundancy,

03:54.000 --> 03:56.000
but then the query will work.

03:56.000 --> 04:03.000
Or we can just do a view that captures the undirected no's edge by flipping the edges in union or all.

04:03.000 --> 04:10.000
All of these work, but already at this very simple complexity level, it starts to become a bit clunky.

04:10.000 --> 04:13.000
So this is one thing graphed at a basis, got really right.

04:13.000 --> 04:20.000
Most graphed at a basis have a syntax that is more elegant for capturing joints on something like this.

04:20.000 --> 04:28.000
They provide really concise and elegant syntax, and they also provide an elegant syntax for handling the bi-directionality.

04:28.000 --> 04:35.000
For example, the cipher query language, which is shown here on the right, has this construct where you can do brackets,

04:35.000 --> 04:38.000
and just dash is saying this is an undirected edge.

04:38.000 --> 04:49.000
And the rest is also simple, you say match for pattern matching p1 for person1, the name of the person location as a bracketed edge, and then return what you need.

04:49.000 --> 04:57.000
So most graphed databases have some sort of syntax sugar for expressing joints concisely.

04:57.000 --> 05:03.000
And there are actually quite a few graphed databases, and in the next couple minutes I will go through them.

05:03.000 --> 05:10.000
You can think about them as some systems along the transactional to analytical spectrum.

05:10.000 --> 05:16.000
So the more transactional system is, the more right oriented it is, and the more focused it is on high throughput,

05:16.000 --> 05:23.000
and the more analytical is, the more focused it is on running queries that are long running and read a lot of data.

05:23.000 --> 05:30.000
And I will argue that other than providing syntax sugar for joints, they also address other joint problems.

05:30.000 --> 05:36.000
Namely, the first category addresses something that is called the repeated joint problem.

05:36.000 --> 05:37.000
What does this mean?

05:37.000 --> 05:43.000
Well, let's take another slice of the social network with just the bagency of more.

05:43.000 --> 05:48.000
And imagine we have a client application that needs to render this in the user interface in some form.

05:48.000 --> 05:55.000
What happens here is that you can do maybe five joints and then return the table.

05:55.000 --> 05:59.000
It's actually not just five joints, you have to do left joints, you have to do some filtering,

05:59.000 --> 06:04.000
and ultimately you will end up with this kind of universal schema of all your data.

06:04.000 --> 06:07.000
But it has a bunch of problems. A big problem is redundancy.

06:07.000 --> 06:14.000
You have to spell out the city name, the postcode and the population for every person who lives there.

06:14.000 --> 06:18.000
Then you have to spell out all the person data for every message that they created.

06:18.000 --> 06:25.000
You have a bunch of nows, and this is just difficult. It's also quite difficult to reconstruct the graph on the client side from all this data.

06:25.000 --> 06:28.000
This is known as the overfetching problem.

06:28.000 --> 06:32.000
You run complex queries, you overfetch, and then the client has to do extra work.

06:32.000 --> 06:38.000
So in practice, almost no one does this. What they do instead is that they do iterative ramp trapping.

06:38.000 --> 06:43.000
They first get the cities, they then get the people in the city, and then they get their messages.

06:43.000 --> 06:48.000
But this is known as underfetching, because you don't get enough data to actually answer your problem.

06:48.000 --> 06:53.000
You end up doing n plus one iterations, just always have to get more data and more data.

06:53.000 --> 06:57.000
This is expensive, you run trip on the network.

06:57.000 --> 07:04.000
So this is kind of seems like a tough problem, and as most tough problems in computer science, it has a name.

07:04.000 --> 07:07.000
This is the impadence mismatch problem from object relational mapping.

07:07.000 --> 07:16.000
The impadence mismatch problem states that if you have a graph like object oriented model and relational model for your relational database,

07:16.000 --> 07:21.000
those are really difficult to map, and they don't really play nicely with each other.

07:22.000 --> 07:31.000
So one potential solution to this was invented by the API folks. This is GraphQL, which allows you to have a query on the left,

07:31.000 --> 07:39.000
where you can specify in this nest structure that you want the social network, then the city, it's people, and it's message and their messages.

07:39.000 --> 07:48.000
And then you can return an SD data structure, which usually is represented as JSON, and is a rather nice encapsulation of what you had.

07:48.000 --> 07:55.000
We don't over-pfetch, we don't under-fetch, there is no redundancy, and it's easy to reconstruct the graph from what you received.

07:55.000 --> 07:58.000
So this was actually picked up by a bunch of graph databases.

07:58.000 --> 08:12.000
Another solution that graph databases do, like Ori and DB, is they extend SQL by having some sort of a dot operator for introducing newer edges and navigating along the edges of the graph.

08:12.000 --> 08:20.000
In the stock, because we are at first them, I will also focus on the licensing aspects of projects, and Ori and DB is an interesting one to start with.

08:20.000 --> 08:32.000
So it was originally a patchy project, then it was bought by SAP, the company was bought, they spun out and enterprise version of it, and they saw licenses for it, then they abandoned it.

08:32.000 --> 08:40.000
Ori and DB lives on in three different versions, one is a conceptual fork represented by the light bulb, that's called Arcade DB.

08:40.000 --> 08:50.000
One is a fork called Utrecht DB, which actually powers the JetBrains Utrecht system for issue tracking, and there is still the original system.

08:50.000 --> 08:54.000
So it kind of forked off into three different projects.

08:54.000 --> 09:01.000
Another system is Microsoft Cosmos DB, this is an interesting system because it runs in the Microsoft Cloud, so it's obviously a proprietary piece of software.

09:01.000 --> 09:09.000
And it has this really famous use case of powering GenGPT, so that's kind of something a graph database can do.

09:09.000 --> 09:14.000
And Microsoft offers another SQL-like rule language, which is similar to Ori and DBs.

09:14.000 --> 09:20.000
It offers the Grammly language, and very interestingly, it offers a MongoDB compatible API.

09:20.000 --> 09:26.000
So this already illustrates that some graph databases don't really do complex pathways or pattern matching.

09:26.000 --> 09:31.000
They just do some document database style operations, because that's what people want.

09:31.000 --> 09:37.000
They want to think about their data as a graph, but they want to create in terms of nested JSON-like data structures.

09:37.000 --> 09:48.000
And one very nice evidence for this is that Microsoft actually spun out an open source fork of Cosmos DB, which is called Document DB.

09:48.000 --> 09:54.000
So this is really for document processing, it's licensed MIT, and it is running on top of Postgres.

09:54.000 --> 10:06.000
A bunch of other graph database systems are here, what you can note here is that graphQL and various graphQL forks are quite popular, as well as data log and the Grammly language.

10:06.000 --> 10:14.000
If you ever look for something like a graph serving system that does this sort of nested data returning queries,

10:14.000 --> 10:21.000
you should look for something that's backed by a key value store, a document store, or it's called a real-time graph database,

10:21.000 --> 10:33.000
if you go to the DB engines ranking and you open the list of graph databases, these will be called multi-modal systems, which are also document stores as kind of a second data model.

10:33.000 --> 10:39.000
So if I want to put a label on this, I would call this the Mongo DBs of the graph database space.

10:39.000 --> 10:49.000
The second category is interesting too. I would call them classic graph databases as a temporary name, and they basically solve the problem of recursive joins.

10:49.000 --> 11:00.000
Well, if you want to do something like shortest path finding, no way, just try to find the shortest path between two people, then you would do something like this,

11:00.000 --> 11:07.000
between Dan and Finn, obviously everyone can see that the shortest path is through Ada and Carr.

11:07.000 --> 11:16.000
But if you want to get the same answer from a SQL system, you will end up writing something known as a recursive query with the recursive closing SQL,

11:16.000 --> 11:27.000
and you end up with a monstrosity like this. I have spent a lot of time refining this query, but you do have to use lists to keep track of where you've been, you have to use window functions to look around the table,

11:27.000 --> 11:37.000
and it's not efficient to write, it's not nice to read, and it's also not efficient to run, so this is obviously a big limitation of SQL 99.

11:38.000 --> 11:45.000
Luckily, graph databases have another trick up their sleeves, inside for you can just say, you want the nose asterisk for cleaning star,

11:45.000 --> 11:54.000
and then you want a shortest path and then you return this. This is basically as simple as it gets, obviously to start different compared to the SQL query.

11:54.000 --> 12:02.000
And this language sci-fi was introduced by the Neo4j database system, Neo4j is kind of the first modern graph database,

12:02.000 --> 12:09.000
and they did great work in introducing sci-fi, turning it into open source with a specification called open sci-fi,

12:09.000 --> 12:14.000
and then pushing it through the ISO standardization process to EADGQL.

12:14.000 --> 12:21.000
Neo4j also did amazing advocacy works, they ran more than 10 fourth-dumb dev rooms, published books when meetups and so on.

12:21.000 --> 12:28.000
Neo4j is a transactional system in nature, but it leans into analytics, it has an analytical suite for graph algorithms,

12:28.000 --> 12:34.000
and if you get the enterprise version, it also has a parallel runtime for more complex queries.

12:34.000 --> 12:41.000
Another system that you may have come across in the last 10 years is called Titan, or it's called Genus Graph, or it's called Huge Graph.

12:41.000 --> 12:47.000
This has also been forked multiple times, and most of those forks are no longer maintained.

12:47.000 --> 12:52.000
The current version that's maintained is maintained mostly by by-do, and it's an Apache project,

12:52.000 --> 12:59.000
and it focuses on highly transactional setups where you store the graph distributably, mostly create with the Gramlin language,

12:59.000 --> 13:04.000
and focus on short reads and high throughput transactions.

13:04.000 --> 13:11.000
Other systems in this category include AWS, Neptune, Galaxy-based graph scope, mamm graph, Nebula graph,

13:11.000 --> 13:18.000
notably Google Spender Cloud, which is a new player, just being a few years, a few days ago, and Altypa.

13:19.000 --> 13:25.000
And you can see that there are three languages here, Gramlin systems, Cypher systems, and GQS systems.

13:25.000 --> 13:31.000
So it's kind of not unified yet in terms of three languages, but these are the key players.

13:31.000 --> 13:37.000
This is the category, which is really popular in terms of mapping it back to relational database system.

13:37.000 --> 13:47.000
Most of the legacy vendors, IBM, Oracle, Microsoft, they have versions of their systems of different maturity levels that do graph queries.

13:47.000 --> 13:52.000
There are also postgres extensions and working progress implementations in postgres.

13:52.000 --> 13:57.000
So I think it's fair to say that this is the postgres of graph database systems.

13:57.000 --> 14:02.000
They do transactional workloads, bunch of joins, and they also do recursive joins.

14:02.000 --> 14:06.000
And the final category is graph BI or graph overlap.

14:06.000 --> 14:12.000
If you will, this focus is on many too many joins, complex recursive joins, and other graph patterns.

14:12.000 --> 14:15.000
So what are complex recursive joins?

14:15.000 --> 14:21.000
Well, instead of doing shortest path, we can do weighted shortest paths, also known as cheapest path.

14:21.000 --> 14:28.000
For example, if you take our random social graph, we can do some crazy queries like, let's pick two cities.

14:28.000 --> 14:33.000
Let's get the residents of those people, then let's do a Cartesian product between them,

14:33.000 --> 14:36.000
and then try to find all the cheapest path between them.

14:36.000 --> 14:42.000
This is, of course, a very heavy-hitting operation, and it needs very special algorithms to deliver it.

14:42.000 --> 14:49.000
Namely, because you need something like a multi-source, multi-destination bi-directional cheapest path algorithm,

14:49.000 --> 14:53.000
which is something that isn't even there in terms of research as far as I'm aware.

14:53.000 --> 14:59.000
So this is something where real innovation can happen if someone wants to run these queries quickly.

14:59.000 --> 15:09.000
Maybe a more mundane example is cyclic queries, so a typical use case for databases in the graph space are the recommendation problems.

15:09.000 --> 15:14.000
So you can do something like, try to find people who have a lot of shared interest.

15:14.000 --> 15:20.000
So I added a bunch of topics that people are interested in, mostly based on music genres,

15:20.000 --> 15:23.000
and we can do a query like this.

15:23.000 --> 15:28.000
Let's take the nosages, join interest one, interest two, and make sure those two are the same.

15:28.000 --> 15:33.000
So visually, you can see that Ada and Bob have two shared interests, rock and ska,

15:33.000 --> 15:37.000
and Ada and Carl have one shared interest, rap music.

15:37.000 --> 15:46.000
So this is a triangle, and the triangle queries have this interesting problem that if you start to calculate them,

15:46.000 --> 15:50.000
they will have huge intermediates, even if the end result is actually not that big.

15:50.000 --> 15:57.000
The intermediates can blow up, but they only blow up if you do it with the so-called binary joins,

15:57.000 --> 15:59.000
when you join two operands together.

15:59.000 --> 16:05.000
So if you do interest one, nose and interest two, with binary joins, no matter how you order them,

16:05.000 --> 16:10.000
it can blow up, and it can blow up quadratically, where m is the number of edges.

16:10.000 --> 16:16.000
Obviously, this is not going to finish, but interestingly enough, if you do multi-way joins,

16:16.000 --> 16:21.000
then you can get it down to n on the power of 1.5, which is still worse than n log n,

16:21.000 --> 16:26.000
but it's something that we'll finish eventually.

16:26.000 --> 16:29.000
And the third category is acyclic queries.

16:29.000 --> 16:39.000
This is also kind of new research, which is, if you run a query saying, which persons have more topics of interest than friends,

16:39.000 --> 16:47.000
kind of an evil query, then we can do this actually quite simple secret query that says, take the nose edge,

16:47.000 --> 16:53.000
join it on the interest, and then just do group by on the person in the middle,

16:53.000 --> 16:58.000
and try to find the number of topics that they are interested in, and the number of friends that they have,

16:58.000 --> 17:02.000
and if there are more topics than friends, then that's a match for this query.

17:02.000 --> 17:10.000
So we can see that both, for example, are two friends, but five interest, so he will qualify for the results of this query.

17:10.000 --> 17:16.000
The problem is, is that if we do this join, we will get a huge blow up again,

17:16.000 --> 17:23.000
because you can see that we have a topic, person one and person two, and every time Bob would get a new friend,

17:23.000 --> 17:28.000
have to repeat all of his hobbies in the table, and every time he would get a new hobby,

17:28.000 --> 17:33.000
we would have to list all those friends that he already has next to the new hobby.

17:33.000 --> 17:40.000
So this is kind of another quadratic style blow up, and this is caused by a multi-value dependency,

17:40.000 --> 17:48.000
which means that the person one in the middle is conditional independent on the two sides, both the topics,

17:48.000 --> 17:52.000
and the friends that they have are independent.

17:52.000 --> 17:55.000
And this is a common pattern that's introduced by many to many joins,

17:55.000 --> 17:59.000
and it's not only common, it's very predictable, so it's easy to compress away,

17:59.000 --> 18:02.000
and this compression is captured by something known as factorization.

18:02.000 --> 18:06.000
factorization is a lost-less compression method for relational tables,

18:06.000 --> 18:13.000
and you can take a big flat table on the left, and turn it into something known as a factorized table.

18:13.000 --> 18:18.000
And the factorized table is basically this union of little Cartesian product,

18:18.000 --> 18:24.000
so we have put in the topics for ADA, topics for Bob, and all the people,

18:24.000 --> 18:29.000
and this is such an elegant and concise representation that you can answer the secret query

18:29.000 --> 18:31.000
by just looking at the table.

18:31.000 --> 18:35.000
You can say, well ADA has an equal amount of topics and friends,

18:35.000 --> 18:41.000
Bob has more topics, so that's a return, and then Cartesian has fewer topics.

18:41.000 --> 18:45.000
So you can just eyeball it and answer the query, which would be a lot more difficult,

18:45.000 --> 18:48.000
and memory intensive if you use the actual table.

18:48.000 --> 18:55.000
So it's kind of obvious at this point in research that the workloads that have a lot of these

18:55.000 --> 18:59.000
many to many join operations, or N to M edge join operations.

18:59.000 --> 19:04.000
They could benefit from factorization, but there are a lot of open questions around there,

19:04.000 --> 19:08.000
and there are also a lot of open questions on how to implement this efficiently,

19:08.000 --> 19:13.000
so both theoretical and implementation side, this is still up for grabs.

19:13.000 --> 19:18.000
And another thing that is quite interesting is that advanced ways of this factorization,

19:18.000 --> 19:23.000
called D representations, could also work for returning combat grabs.

19:23.000 --> 19:29.000
So if you remember for the first category of systems, where we tried to return a graph,

19:29.000 --> 19:32.000
but it was kind of difficult, factorization can also help with that.

19:32.000 --> 19:39.000
So this is, I think, a really promising direction that may be overarching across different graph database systems.

19:40.000 --> 19:45.000
The interesting thing about factorization and virtual optimal joins is that they are

19:45.000 --> 19:52.000
results that follow this really traditional and kind of idea path of first being published in theory papers,

19:52.000 --> 19:58.000
then being picked up by the database researchers, then being designed systems by systems architects,

19:58.000 --> 20:01.000
and that ultimately they will end up in production systems.

20:01.000 --> 20:04.000
But what's interesting is that this is really long journey,

20:04.000 --> 20:10.000
so even though the ideas themselves are quite simple, it takes 10 years to get the first good process access

20:10.000 --> 20:15.000
and maybe another 10 years to get the first good mature industry systems.

20:15.000 --> 20:20.000
Speaking of, there is one that's called Kuzu, this is from the University of Waterloo,

20:20.000 --> 20:24.000
and this implements both virtual optimal joins and factorization.

20:24.000 --> 20:29.000
It supports the Cypher Career Language, and it also has strong focus on path queries.

20:29.000 --> 20:35.000
It also has two academic precursors, graph flow, which was their system written in Java,

20:35.000 --> 20:38.000
and then they had grain DB, which was a dark DB fork.

20:38.000 --> 20:42.000
They abandoned both of those systems, but they reused all the knowledge that they accumulated

20:42.000 --> 20:49.000
and built Kuzu, which is this dark DB style single node in process system for graph processing.

20:49.000 --> 20:53.000
Another system, just to show you the variety, is called Tiger Graph.

20:53.000 --> 20:58.000
This has a completely different QR language called GSQL, and it focuses on traverses.

20:58.000 --> 21:06.000
So this is less on the join heavy, and more on the recursive cheapest path queries and other queries,

21:06.000 --> 21:09.000
where you do some sort of a fixed point computation.

21:09.000 --> 21:15.000
So if I want to put a final label on this, I would say this is the teradata of graph database systems.

21:15.000 --> 21:21.000
So once you have this categorization, it will help you select the system, would you something like MongoDB,

21:21.000 --> 21:24.000
or Postgres, or teradata for your problems?

21:25.000 --> 21:28.000
What about schema and distribution? There is no easy answer to this.

21:28.000 --> 21:31.000
Distribution can happen for all of these systems.

21:31.000 --> 21:36.000
The only rule of thumb that I could say is that if you have really heavy analytical queries,

21:36.000 --> 21:39.000
which are illustrated here by purple, then you need a schema.

21:39.000 --> 21:46.000
There is really not really a huge chance of being very fast if you don't have the schema of your data.

21:46.000 --> 21:51.000
But other than that, there are single node and distributed systems in all categories.

21:51.000 --> 21:54.000
So I promised that I will talk about benchmarks.

21:54.000 --> 21:59.000
I have been a longstanding member of the IDBC, the link data benchmark console,

21:59.000 --> 22:04.000
which is a non-profit with the mission to accelerate progress in graph data management.

22:04.000 --> 22:09.000
IDBC has a membership about 100 individuals, mostly researchers and practitioners,

22:09.000 --> 22:13.000
and also 25 organizations. Who are those?

22:13.000 --> 22:16.000
Well, we have a bunch of different categories here.

22:16.000 --> 22:21.000
We have traditional vendors like Oracle, we have AWS, graph database vendors,

22:21.000 --> 22:29.000
like Neo4j, Tiger Graph, the end group, Alibaba, and the bunch of other companies from hardware,

22:29.000 --> 22:33.000
from cloud and databases and also academic institutes.

22:33.000 --> 22:37.000
And the idea of IDBC is that it will take all of these members,

22:37.000 --> 22:40.000
the database companies, the hardware vendors and the cloud vendors,

22:40.000 --> 22:45.000
and it will force them to collaborate on standards, like benchmarks standards,

22:45.000 --> 22:49.000
and create language standards, and it will force them to compete on performance,

22:49.000 --> 22:54.000
purely because if they don't compete, they are left out from a leaderboard.

22:54.000 --> 22:58.000
So to this hand, IDBC defines a bunch of benchmarks and create languages.

22:58.000 --> 23:03.000
The first benchmark, one of the most popular workloads of ours,

23:03.000 --> 23:07.000
is called the social network benchmark interactive workload.

23:07.000 --> 23:12.000
And it has this sort of short read queries that fall into the classic graph data base,

23:12.000 --> 23:15.000
or post-grast of granted databases category.

23:15.000 --> 23:20.000
So for example here, we do a person and then we do one hope to hope and fetch the messages

23:20.000 --> 23:23.000
that were created before a given day.

23:23.000 --> 23:25.000
And obviously this is a graph database benchmark,

23:25.000 --> 23:29.000
so if you formulate the queries in SQL, you can formulate them,

23:29.000 --> 23:31.000
but sometimes they will be a bit unwieldy.

23:31.000 --> 23:34.000
For example, this is already a rather long query.

23:34.000 --> 23:38.000
But this allows me to demonstrate two new query languages.

23:38.000 --> 23:41.000
One is called SQL property graph queries.

23:41.000 --> 23:44.000
This will be mentioned in the next store by Daniel with more details.

23:44.000 --> 23:47.000
But this is an extension of SQL 1023,

23:47.000 --> 23:52.000
where you can say, from graph table and then you can do things like match a cyclic.

23:52.000 --> 23:55.000
And then you have something very similar to the Cypher language.

23:55.000 --> 23:57.000
You have the visual graphs syntax.

23:57.000 --> 24:01.000
You can say, one comma two instead of spelling out the joins,

24:01.000 --> 24:05.000
and you can return your results as a table.

24:05.000 --> 24:09.000
Another language which was standardized just last year is called GQL.

24:09.000 --> 24:14.000
This basically shares the pattern matching core with Cypher PGQ,

24:14.000 --> 24:18.000
but it's more oriented towards processing just graphs.

24:18.000 --> 24:22.000
The interactive workload has been out for a couple of years now,

24:22.000 --> 24:27.000
and we have received a lot of official results from vendors.

24:27.000 --> 24:33.000
And I think this is really impressive diagram because it shows that we have mostly

24:33.000 --> 24:36.000
like a doubling of performance every year,

24:36.000 --> 24:41.000
and even better if you take the performance and the price of the systems into account.

24:41.000 --> 24:49.000
But what's really interesting in this diagram is that all those systems that competed here

24:49.000 --> 24:52.000
and achieve better results are based in China.

24:52.000 --> 24:56.000
So apparently China has a really strong venture capitalic system,

24:56.000 --> 25:00.000
and they also do a lot of open source, two of these systems are open source.

25:00.000 --> 25:07.000
So I think this is something to think about two weeks after DeepSeek R1 came out.

25:07.000 --> 25:10.000
This is obviously something that hit like the mainstream media,

25:10.000 --> 25:14.000
but in graph databases there's also a huge advantage in Chinese vendors

25:14.000 --> 25:16.000
when it comes to performance.

25:16.000 --> 25:18.000
I cannot really speak about the usability of these systems,

25:18.000 --> 25:20.000
but performance wise they're good.

25:20.000 --> 25:24.000
And one system that's really impressive is the little red dot,

25:24.000 --> 25:28.000
which is not the fastest, but it's the fastest that uses the declarative language.

25:28.000 --> 25:32.000
This is who by a GES, it's the graph engine service,

25:32.000 --> 25:36.000
and it uses the cipher language, and it uses factorization internally

25:36.000 --> 25:38.000
to make the queries go fast.

25:38.000 --> 25:44.000
Another workload that I've worked a few years on is called the Business Intelligence workload.

25:44.000 --> 25:48.000
This is graphful app, complex patterns like finding triangles,

25:48.000 --> 25:51.000
complex pathways like finding cheapest paths,

25:51.000 --> 25:56.000
and doing like a long run of power test,

25:56.000 --> 26:00.000
with individual query run times, and throughput test with overlapping query run times.

26:00.000 --> 26:02.000
We also ordered a bunch of systems for this,

26:02.000 --> 26:06.000
but the vendors were very careful not to overlap in the scale factors.

26:06.000 --> 26:09.000
So I cannot do any systems system comparison here.

26:09.000 --> 26:14.000
We will do more all this this year, and hopefully we can compare their results.

26:14.000 --> 26:18.000
And the final benchmark that I would like to mention is the financial benchmark.

26:18.000 --> 26:21.000
This is a very interesting one because it was driven by the end group,

26:21.000 --> 26:24.000
and create link and another company called Altipo.

26:24.000 --> 26:28.000
They do very low latency queries,

26:28.000 --> 26:30.000
with relaxed consistency requirements.

26:30.000 --> 26:32.000
So think about the nosey,

26:32.000 --> 26:34.000
well, eventual consistency,

26:34.000 --> 26:37.000
and they have very interesting queries like they have path queries,

26:37.000 --> 26:41.000
where you have to have not only regular expressions type path,

26:41.000 --> 26:43.000
but also path where you keep some memory.

26:43.000 --> 26:45.000
You keep some state during the traversal,

26:45.000 --> 26:49.000
and you check that the amount of money that's strictly in across the path

26:49.000 --> 26:52.000
is only decreasing by small amounts.

26:53.000 --> 26:56.000
How do you use these TPC benchmarks?

26:56.000 --> 26:59.000
Well, we have a complicated benchmark kit.

26:59.000 --> 27:01.000
I have spent some time in this talk,

27:01.000 --> 27:04.000
making fun of the vendors who kept relicens in their software.

27:04.000 --> 27:07.000
We also did this because we picked GPL,

27:07.000 --> 27:10.000
and then our members out there, so we also relicens.

27:10.000 --> 27:13.000
And we do detailed specifications, papers,

27:13.000 --> 27:17.000
generators, data sets, and all sorts of frameworks

27:17.000 --> 27:20.000
to make our benchmarks be easy to implement.

27:20.000 --> 27:22.000
The benchmarks are easy to implement,

27:22.000 --> 27:24.000
but they're not easy to get audited.

27:24.000 --> 27:27.000
We have TPC grade benchmark audits,

27:27.000 --> 27:29.000
which means that we have certified auditors.

27:29.000 --> 27:32.000
And if you're a vendor, you have to talk to an auditor.

27:32.000 --> 27:35.000
It's usually something like 30 to 50,000 euros.

27:35.000 --> 27:37.000
It's a multi-week process.

27:37.000 --> 27:40.000
The auditor, we'll talk to the vendor,

27:40.000 --> 27:43.000
they will compile the specifications sheets,

27:43.000 --> 27:45.000
and then they will, of course,

27:45.000 --> 27:49.000
rerun the benchmark and write a full disclosure report out of it.

27:49.000 --> 27:52.000
This is kind of a painful process,

27:52.000 --> 27:53.000
but vendors go through it,

27:53.000 --> 27:55.000
because this is kind of the gold standard

27:55.000 --> 27:56.000
in the graph database space,

27:56.000 --> 27:58.000
and we have been doing audits for five years,

27:58.000 --> 28:00.000
published 50 results,

28:00.000 --> 28:03.000
and never had to retract any of the results.

28:03.000 --> 28:05.000
The most issues I could find was a few typos

28:05.000 --> 28:08.000
in old food disclosure reports.

28:08.000 --> 28:10.000
And as you can see, we have grown quite well

28:10.000 --> 28:11.000
in the last five years,

28:11.000 --> 28:13.000
so we have more and more members,

28:13.000 --> 28:15.000
lots of user community meetings,

28:15.000 --> 28:18.000
benchmarks, results, and so on.

28:18.000 --> 28:21.000
So what are the challenges in the graph database space?

28:21.000 --> 28:22.000
What is a big decline?

28:22.000 --> 28:25.000
I don't think that needs to be denied.

28:25.000 --> 28:28.000
The hype cycle has moved over to AI technologies,

28:28.000 --> 28:30.000
and I think there's a huge amount of confusion.

28:30.000 --> 28:32.000
I'm trying to have the confusion here a bit

28:32.000 --> 28:34.000
by setting these categories,

28:34.000 --> 28:36.000
but when vendors go out saying,

28:36.000 --> 28:38.000
we are a graph database,

28:38.000 --> 28:39.000
you don't need joins,

28:39.000 --> 28:41.000
and we will replace all relational systems.

28:41.000 --> 28:43.000
I don't think that does anything any, any one any good.

28:43.000 --> 28:45.000
I don't think anyone is seriously saying any more

28:46.000 --> 28:49.000
that graph database is very replace relational systems.

28:49.000 --> 28:51.000
There are many niche use cases,

28:51.000 --> 28:54.000
and there are vendors who are basically tied to one or two of their customers.

28:54.000 --> 28:56.000
They over optimized for that,

28:56.000 --> 28:59.000
and that ends up in this very fragmented space

28:59.000 --> 29:01.000
with dozens of clear languages.

29:01.000 --> 29:03.000
You can see the effect of that,

29:03.000 --> 29:06.000
so between 2013 and 2022,

29:06.000 --> 29:09.000
there was a 13-fold increase in the DB engines

29:09.000 --> 29:11.000
ranking for graph databases,

29:11.000 --> 29:13.000
but since the summer of 2022,

29:13.000 --> 29:14.000
it has been a decline,

29:14.000 --> 29:18.000
so we are now back to basically two thirds of the peak,

29:18.000 --> 29:21.000
and the decline doesn't look very good.

29:21.000 --> 29:23.000
If you have heard about graph frag,

29:23.000 --> 29:25.000
the vendors have heard about this,

29:25.000 --> 29:27.000
this is the life that they are trying to cling onto.

29:27.000 --> 29:31.000
This is something that the vendors think is going to help them,

29:31.000 --> 29:33.000
and indeed it produces a lot of attention,

29:33.000 --> 29:35.000
and it's a nice graph use case.

29:35.000 --> 29:37.000
Okay, to sum up,

29:37.000 --> 29:39.000
graph databases have syntax sugar

29:39.000 --> 29:41.000
and optimizations for joins.

29:41.000 --> 29:43.000
If you have ten joins in your query,

29:43.000 --> 29:44.000
and for some reason,

29:44.000 --> 29:46.000
it doesn't work with a relational system,

29:46.000 --> 29:47.000
you should try a graph database,

29:47.000 --> 29:49.000
but then based on the problem you may want,

29:49.000 --> 29:50.000
the MongoDB,

29:50.000 --> 29:51.000
the Postgres,

29:51.000 --> 29:53.000
or the Teradata of your database.

29:53.000 --> 29:55.000
And licensing is difficult,

29:55.000 --> 29:58.000
check the licenses before using the software,

29:58.000 --> 30:00.000
and check the performance results.

30:00.000 --> 30:01.000
Thanks a lot.

