WEBVTT

00:00.000 --> 00:15.000
Hi. In this talk, we are going to talk about YDB. My name is Evgeny, or Eugene.

00:15.000 --> 00:36.000
So, my name is Evgeny, or Eugene. I am a YDB developer, and the talk is, as expected,

00:37.000 --> 00:43.000
about YDB. But first of all, I am going to start the talk with a very important thing:

00:44.000 --> 00:50.000
the rumors about our relations with YugabyteDB, because there are actually two kinds of people.

00:51.000 --> 00:56.000
Some people believe that YugabyteDB and YDB are exactly the same thing.

00:57.000 --> 01:04.000
In that case, Frank and I would be colleagues. Others go even further and believe that once upon

01:05.000 --> 01:12.000
a time we had a bar fight. And today, you are going to learn the actual truth. And the actual

01:13.000 --> 01:21.000
truth is that we have never had any bar fight, fortunately. And YDB and YugabyteDB, despite

01:21.000 --> 01:30.000
the common characters in our names, are different distributed databases. And we communicate a lot

01:31.000 --> 01:36.000
with each other. And actually we enjoy talking about databases, distributed systems,

01:37.000 --> 01:44.000
and getting things done related to benchmarking. So, YDB is not just a database.

01:44.000 --> 01:49.000
It's not just a distributed database. It's a platform. Originally, we started as an OLTP

01:50.000 --> 01:57.000
system. But later, because there was very high demand, we added support for Kafka-like

01:58.000 --> 02:04.000
topics, that is, a distributed queue. And recently, we have added support for OLAP

02:04.000 --> 02:13.000
queries. And there are even more projects based on YDB, but mostly we are known as an open

02:14.000 --> 02:20.000
source distributed SQL database. SQL database means that we are a relational database.

02:21.000 --> 02:26.000
We support OLTP and OLAP. And we are a major system. We have installations with

02:27.000 --> 02:34.000
1000 servers. Apache 2 license, as expected. If you have a laptop or phone, feel free to go

02:35.000 --> 02:41.000
to GitHub and star our project. We are strictly consistent, because we care about

02:42.000 --> 02:48.000
developers, application developers, and strict consistency is really important for them.

02:49.000 --> 02:56.000
It means that, in the CAP theorem, between consistency and availability under partition, we

02:57.000 --> 03:03.000
choose consistency when partitioned. And our transactions have serializable execution.

03:04.000 --> 03:09.000
We are highly available and fault-tolerant, as expected. We tolerate the loss of an availability

03:09.000 --> 03:16.000
zone. Imagine a situation when one availability zone is not available. It crashed.

03:17.000 --> 03:22.000
Maybe a fire or water, something happened in the data center. And also, you lose a

03:23.000 --> 03:29.000
rack in another availability zone. In this case, YDB is still available for reads and writes.

03:30.000 --> 03:35.000
Which makes us a mission-critical database. So we have no downtime for maintenance,

03:35.000 --> 03:42.000
et cetera. And a fun fact about YDB: we are not just a cloud-native database. I call

03:43.000 --> 03:48.000
this feature of a database bootstrappable. We are actually bootstrappable because

03:49.000 --> 03:55.000
there are some clouds which are based on YDB. They use YDB to store their metadata.

03:56.000 --> 04:03.000
Moreover, they use YDB to implement their own versions of an elastic block

04:03.000 --> 04:10.000
store. And finally, they provide YDB as a service. And in this case, when you get YDB as a

04:11.000 --> 04:17.000
service, you actually get YDB over YDB over YDB. So it's an infinite number of YDBs.

04:18.000 --> 04:23.000
But even more, you might know that inside almost any database there is an LSM.

04:24.000 --> 04:31.000
Now, imagine the number of LSMs over LSMs over LSMs. And everything ends in an SSD, where

04:31.000 --> 04:41.000
there is yet another LSM. So let's go to the main topic and start with a brief overview of our

04:42.000 --> 04:50.000
architecture. As expected, we are a SQL database, so we have tables, and each table has a primary key.

04:51.000 --> 04:58.000
A primary key can consist of a number of columns. And tables are sorted by the primary key.

04:58.000 --> 05:06.000
And we split tables into ranges, and there is a partition responsible for each range.

05:07.000 --> 05:15.000
Also, we use slightly different naming for partitions: we use the name tablet.

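NOTE
Editor's sketch, not part of the talk. It illustrates how a table sorted by primary key can be split into ranges, with one partition (tablet) per range, using a binary search over split points. The names SPLIT_KEYS and partition_for and the sample keys are hypothetical, and this is not a description of how YDB itself routes keys.
```python
from bisect import bisect_right
# Hypothetical split points: each partition (tablet) owns one contiguous key range.
# Partition 0 holds keys below ("m",), partition 1 holds keys from ("m",) up to ("t",),
# and partition 2 holds keys from ("t",) upwards.
SPLIT_KEYS = [("m",), ("t",)]
def partition_for(primary_key: tuple) -> int:
    """Return the index of the partition whose range contains primary_key."""
    return bisect_right(SPLIT_KEYS, primary_key)
print(partition_for(("alice", 1)))  # 0
print(partition_for(("mike", 7)))   # 1
print(partition_for(("zoe", 3)))    # 2
```
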
05:16.000 --> 05:21.000
Here is our logical architecture. At the bottom, there is distributed storage.

05:22.000 --> 05:29.000
Distributed storage is responsible for storing all user data and system metadata.

05:30.000 --> 05:36.000
Also, it's responsible for redundancy and replication, and it's a major component for consensus.

05:37.000 --> 05:44.000
And there is a tablet layer above distributed storage. In this case, tablets are reliable components.

05:44.000 --> 05:51.000
And they actually implement some sort of database, or rather database logic.

05:52.000 --> 05:57.000
Together with the distributed transactions layer, tablets and distributed transactions

05:58.000 --> 06:05.000
implement ACID distributed transactions between tablets, between different tables, et cetera.

06:06.000 --> 06:15.000
But because we are limited in time, today we are going to focus on the YDB platform components: distributed storage and tablets.

06:17.000 --> 06:25.000
The most important feature of our architecture is that we have separated compute and storage right from the beginning.

06:26.000 --> 06:32.000
There are compute nodes, which implement database logic. Compute nodes contain tablets,

06:33.000 --> 06:37.000
distributed transactions, query processing, and even gRPC.

06:38.000 --> 06:42.000
And storage nodes: they provide the distributed storage.

06:43.000 --> 06:49.000
Our system works on a shared-nothing architecture and commodity hardware, so it's not like Spanner.

06:49.000 --> 06:59.000
We don't require any special hardware, and separating compute and storage allows us to scale them independently.

07:00.000 --> 07:08.000
And of course, YDB can be run in virtual machines, in containers under Kubernetes, or on bare metal.

07:09.000 --> 07:15.000
This room is for cloud-native databases, so probably I should not say this, but I prefer bare metal.

07:15.000 --> 07:24.000
Because there is no overhead. It's really better: no fake machines.

07:26.000 --> 07:29.000
Here is an illustration of a YDB cluster.

07:30.000 --> 07:42.000
Usually, each cluster contains multiple databases, and these databases, in this case called dedicated databases, share the same distributed storage.

07:42.000 --> 07:53.000
In really huge installations, there are thousands of databases running in a single cluster and sharing a huge distributed storage.

07:54.000 --> 08:06.000
This is done for efficiency, to reduce the number of resources, because a separate cluster per database is much more expensive.

08:06.000 --> 08:20.000
Because we are cloud native, and even cloud friendly, you can take a dedicated database and share this database between multiple serverless databases.

08:21.000 --> 08:26.000
In this case, such a database is called not a dedicated but a shared database.

08:26.000 --> 08:31.000
Still, storage is shared between multiple databases.

08:32.000 --> 08:36.000
So let's dive deeper and have a look at distributed storage.

08:37.000 --> 08:49.000
Distributed storage is a kind of special-purpose distributed key-value store, and it is used to store immutable blobs.

08:49.000 --> 08:57.000
Blobs are of different sizes. Blobs can be very small, just one byte, or 10 megabytes.

08:58.000 --> 09:06.000
And tablets, which are located on the second layer, use distributed storage for two main purposes.

09:07.000 --> 09:16.000
First of all, because a tablet is a reliable component and, as you might guess, contains a replicated state machine, it uses distributed storage to write log records

09:17.000 --> 09:20.000
or to read ranges of log records.

09:21.000 --> 09:37.000
Also, because, as usual, there is a log-structured merge tree inside a tablet, tablets use distributed storage, or blob storage, to store immutable blobs, which are actually parts of the LSM tree.

09:38.000 --> 09:44.000
Blob storage is actually an old name for distributed storage,

09:45.000 --> 09:51.000
so I use the names blob storage and distributed storage interchangeably.

09:52.000 --> 09:58.000
Distributed storage supports multiple redundancy schemes, with support for erasure coding.

09:59.000 --> 10:03.000
It's rather special in comparison with other databases.

10:04.000 --> 10:12.000
For example, you can install YDB inside a single availability zone, and you can use erasure coding to cut your expenses.

10:13.000 --> 10:23.000
Because in the case of erasure coding, you get just 1.5 times redundancy, not 3 times redundancy: you don't have to store 3 replicas.

10:24.000 --> 10:29.000
And in the case of a single availability zone, that's a nice feature, and it works really fast.

10:30.000 --> 10:36.000
Why 1.5? Because for erasure coding, we use block 4 plus 2.

10:37.000 --> 10:43.000
It means we have 4 parts of a blob with data written to the blob storage and two extra parity parts.

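NOTE
Editor's sketch, not part of the talk. The 1.5 figure follows from simple arithmetic: with 4 data parts and 2 parity parts, 6 parts are stored for every 4 parts of user data. The function name storage_overhead is hypothetical, and modelling three replicas as 1 data copy plus 2 extra copies is an assumption made only for the comparison.
```python
def storage_overhead(data_parts: int, parity_parts: int) -> float:
    """Return how many bytes are stored per byte of user data."""
    return (data_parts + parity_parts) / data_parts
# Block 4 plus 2: four data parts and two parity parts.
print(storage_overhead(4, 2))  # 1.5
# Three full replicas, modelled as one data copy plus two extra copies.
print(storage_overhead(1, 2))  # 3.0
```
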
10:44.000 --> 10:46.000
A more classic scheme is replication.

10:47.000 --> 10:51.000
It's the case when you have 3 availability zones or 3 data centers.

10:52.000 --> 10:56.000
And in this case, you get 3 replicas, everything as expected.

10:57.000 --> 11:03.000
But it's designed in a way that you can add more redundancy schemes depending on your needs.

11:04.000 --> 11:08.000
What do I mean by a special purpose key value store?

11:09.000 --> 11:15.000
The key is a tuple of tablet ID, generation, step, and probably some other things.

11:16.000 --> 11:19.000
It's expected from any distributed system: you should have

11:20.000 --> 11:22.000
a generation, you should have a step, to solve consensus.

11:23.000 --> 11:25.000
The value is an immutable blob.

11:26.000 --> 11:28.000
Put and get methods are as expected.

11:29.000 --> 11:33.000
And I would like to tell you more about a very special method called block.

11:34.000 --> 11:36.000
It's used with a tablet ID and a generation.

11:37.000 --> 11:39.000
When a tablet wants to become a leader,

11:40.000 --> 11:43.000
it uses the block method to say to distributed storage:

11:44.000 --> 11:46.000
Hey, blob storage, I would like to be a leader.

11:46.000 --> 11:54.000
And this is actually something between consensus and election.

11:55.000 --> 11:58.000
Actually, it's a kind of election, but part of the consensus algorithm.

11:59.000 --> 12:01.000
Well, blob storage answers:

12:02.000 --> 12:06.000
OK, tablet, now you are the leader in this generation.

12:07.000 --> 12:09.000
Then the tablet can do useful things.

12:10.000 --> 12:11.000
It can store data.

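NOTE
Editor's sketch, not part of the talk. It is a toy in-memory model of the interface described above: keys are (tablet ID, generation, step), values are immutable blobs, and block fences off older generations so that a stale leader can no longer write. The class ToyBlobStorage, its method signatures, and the rule that block(tablet, g) rejects generations below g are assumptions for illustration, not the actual YDB API.
```python
class ToyBlobStorage:
    """Toy single-process model; real distributed storage spreads data over many disks."""
    def __init__(self):
        self.blobs = {}    # (tablet_id, generation, step) -> bytes
        self.blocked = {}  # tablet_id -> highest generation that may no longer write
    def block(self, tablet_id, generation):
        # A tablet asking to lead in `generation` fences every generation below it.
        self.blocked[tablet_id] = max(self.blocked.get(tablet_id, 0), generation - 1)
    def put(self, tablet_id, generation, step, data):
        if generation <= self.blocked.get(tablet_id, 0):
            raise PermissionError("generation is blocked; a newer leader exists")
        key = (tablet_id, generation, step)
        if key in self.blobs:
            raise KeyError("blobs are immutable; this key is already written")
        self.blobs[key] = bytes(data)
    def get(self, tablet_id, generation, step):
        return self.blobs[(tablet_id, generation, step)]
storage = ToyBlobStorage()
storage.block(1, 2)                  # tablet 1 becomes the leader in generation 2
storage.put(1, 2, 0, b"log record")  # the new leader can write
try:
    storage.put(1, 1, 5, b"stale")   # a leftover leader from generation 1 cannot
except PermissionError as err:
    print("rejected:", err)
```
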
12:11.000 --> 12:16.000
Also, another very special thing about our distributed storage is the garbage collection.

12:17.000 --> 12:19.000
It's barrier-based.

12:20.000 --> 12:25.000
So tablets use barriers to collect garbage, and they must move this barrier to tell

12:26.000 --> 12:29.000
blob storage that it's time to do garbage collection.

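NOTE
Editor's sketch, not part of the talk. It illustrates the idea of barrier-based garbage collection in a simplified form: once a tablet moves its barrier to some (generation, step), every blob of that tablet at or below the barrier may be dropped. The function collect_garbage and the sample keys are hypothetical, and any further rules of the real system are deliberately left out.
```python
def collect_garbage(blobs: dict, tablet_id: int, barrier: tuple) -> dict:
    """Keep other tablets' blobs and this tablet's blobs strictly above the barrier."""
    return {key: value for key, value in blobs.items() if key[0] != tablet_id or key[1:] > barrier}
# Keys are (tablet_id, generation, step).
blobs = {(1, 1, 0): b"a", (1, 1, 1): b"b", (1, 2, 0): b"c", (2, 1, 0): b"d"}
# Tablet 1 moves its barrier to generation 1, step 1.
blobs = collect_garbage(blobs, tablet_id=1, barrier=(1, 1))
print(sorted(blobs))  # [(1, 2, 0), (2, 1, 0)]
```
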
12:30.000 --> 12:32.000
Let's have a look at a distributed storage node.

12:33.000 --> 12:38.000
Each distributed storage node owns devices, block devices,

12:38.000 --> 12:41.000
and manages these devices.

12:42.000 --> 12:47.000
You might see that there is no file system in between the device and YDB.

12:48.000 --> 12:53.000
We don't use a file system at all, because there are many benefits.

12:54.000 --> 13:01.000
The main benefit is that we can do our own caching instead of relying on the virtual file system layer.

13:01.000 --> 13:06.000
Also, here is a special component called PDisk.

13:07.000 --> 13:10.000
PDisk is the component which manages a device.

13:11.000 --> 13:15.000
And we can run multiple virtual disks on a single PDisk.

13:16.000 --> 13:23.000
We have a special scheduler which is responsible for fair use of the device between different

13:24.000 --> 13:25.000
VDisks.

13:27.000 --> 13:31.000
Distributed storage consists of distributed storage groups.

13:32.000 --> 13:37.000
A distributed storage group is a thing which is reliable,

13:38.000 --> 13:42.000
and it consists of unreliable virtual disks.

13:43.000 --> 13:48.000
These groups are managed by the Blob Storage Controller.

13:48.000 --> 13:53.000
It's a special-purpose tablet, which is responsible for managing these groups.

13:54.000 --> 14:02.000
I think the way to think about distributed storage is to think about it as a distributed RAID.

14:03.000 --> 14:09.000
In my opinion, it is the best intuition about this component.

14:10.000 --> 14:13.000
A tablet is located above a blob storage group.

14:14.000 --> 14:21.000
It uses a special component, the distributed storage proxy, which is aware of the ways to communicate with virtual disks

14:22.000 --> 14:25.000
and hides the complexity of using virtual disks.

14:26.000 --> 14:31.000
A tablet is attached to one or multiple blob storage groups.

14:31.000 --> 14:40.000
And this is actually a kind of API above the blob storage group.

14:41.000 --> 14:45.000
A virtual disk is an active component.

14:46.000 --> 14:48.000
For example, when you lose some virtual disk,

14:49.000 --> 14:54.000
say it's away for some time, and then you get the node back,

14:55.000 --> 14:57.000
the virtual disk communicates with the others.

14:57.000 --> 15:03.000
It's a peer-to-peer system: it gets all the changes and synchronizes.

15:04.000 --> 15:07.000
So in case of failure, the virtual disk will come back,

15:08.000 --> 15:14.000
and it will be able to communicate with the others in the group to synchronize.

15:15.000 --> 15:23.000
An important feature of a distributed storage group is that it is not static.

15:23.000 --> 15:29.000
It's dynamic. You can easily move a VDisk from one node to another node, from one device to another device.

15:30.000 --> 15:38.000
And distributed storage isolates users, because each dedicated database has its own blob storage groups.

15:39.000 --> 15:41.000
So they are isolated.

15:42.000 --> 15:44.000
Usually they share the same physical devices.

15:45.000 --> 15:50.000
But depending on your needs, you can easily configure the system to not use the same physical devices,

15:50.000 --> 15:55.000
to not share the same physical devices between dedicated databases.

15:56.000 --> 16:03.000
So if a device is broken and replaced, it's possible to do the replication automatically.

16:04.000 --> 16:08.000
We have a self-heal process which is responsible for this kind of replication.

16:09.000 --> 16:15.000
Distributed storage is very scalable because you can scale everything independently.

16:15.000 --> 16:22.000
And the Blob Storage Controller is a single component, but it's very efficient and doesn't require much CPU.

16:23.000 --> 16:29.000
For example, we have cases when we have 10K nodes and everything is okay.

16:30.000 --> 16:32.000
Let's briefly go over the tablets.

16:33.000 --> 16:39.000
Tablets are reliable components and encapsulate stateful logic.

16:39.000 --> 16:41.000
They run on compute nodes.

16:42.000 --> 16:48.000
And if a tablet dies, our infrastructure is responsible for starting the tablet somewhere else

16:49.000 --> 16:52.000
and restoring it in the same state it was in before dying.

16:53.000 --> 16:55.000
A tablet contains a replicated state machine.

16:56.000 --> 16:59.000
And it has an LSM. So everything is as expected.

17:00.000 --> 17:05.000
And there are different types of tablets.

17:05.000 --> 17:11.000
The logic is above the RSM and LSM, so it's easy to add new types of tablets.

17:12.000 --> 17:16.000
Tablets communicate with blob storage groups via channels.

17:17.000 --> 17:21.000
So a channel hides the complexity of blob storage groups.

17:22.000 --> 17:25.000
There might be multiple channels, and some columns can be stored on HDD

17:26.000 --> 17:32.000
and some columns can be stored on SSD, depending on how you use the actual columns.

17:32.000 --> 17:34.000
And here is the final slide.

17:35.000 --> 17:38.000
It actually contains a bug.

17:39.000 --> 17:42.000
Sorry for this. I forgot to remove extra text.

17:43.000 --> 17:45.000
So there is the DataShard tablet.

17:46.000 --> 17:51.000
It's responsible for the row store and row-store-related SQL queries.

17:52.000 --> 17:56.000
We use our own YQL language, which is very similar to SQL.

17:56.000 --> 18:03.000
But also we are Postgres-compatible, or almost Postgres-compatible.

18:04.000 --> 18:07.000
We have ColumnShards. They are responsible for the column store.

18:08.000 --> 18:11.000
And there are tablets like SchemeShard, which manages all the metadata.

18:12.000 --> 18:16.000
Also there is a very special tablet called Hive, which manages tablets.

18:17.000 --> 18:19.000
It's responsible for starting, stopping, and balancing tablets.

18:20.000 --> 18:24.000
We have coordinators, mediators, transaction allocators.

18:24.000 --> 18:28.000
Because YDB was originally inspired by Calvin.

18:29.000 --> 18:32.000
So that's our similarity with Calvin.

18:33.000 --> 18:36.000
And things like the cluster management system, the self-heal process, et cetera.

18:37.000 --> 18:40.000
Thank you for your attention. Please ask your questions.

18:41.000 --> 18:43.000
Feel free to hit me up outside the conference.

18:44.000 --> 18:46.000
Here's my Twitter.

18:47.000 --> 18:49.000
Feel free to ask any questions.

18:49.000 --> 18:54.000
Subscribe to me. I will be happy to answer.

18:57.000 --> 19:00.000
You mentioned YDB is open source, right?

19:01.000 --> 19:02.000
Yeah, Apache 2 license.

19:03.000 --> 19:06.000
Is the whole stack open source, including blob storage and PDisk?

19:07.000 --> 19:08.000
Everything you mentioned, all of that?

19:09.000 --> 19:11.000
Yes, everything is open source.

19:12.000 --> 19:15.000
Including SDKs and all the layers are open source.

19:15.000 --> 19:18.000
Would they be able to run it, say, on AWS Cloud?

19:19.000 --> 19:20.000
Yeah. Okay, thanks.

19:23.000 --> 19:26.000
Anyone else with a question for Evgeny?

19:28.000 --> 19:29.000
All good?

19:30.000 --> 19:31.000
One more.

19:34.000 --> 19:36.000
Hello, again, it was nice.

19:37.000 --> 19:42.000
It was a kind of boring start, because I hoped you would tell more about bar fights or something like that.

19:42.000 --> 19:46.000
But still, it was kind of impressive.

19:47.000 --> 19:49.000
I just have one small question.

19:50.000 --> 19:55.000
You talked about replication of data, like a couple of algorithms to do this.

19:56.000 --> 19:57.000
What about deduplication?

19:58.000 --> 20:00.000
As far as I remember, you talked about key-value storage,

20:01.000 --> 20:02.000
a kind of simple one.

20:03.000 --> 20:05.000
Are you using some kind of deduplication?

20:06.000 --> 20:10.000
To the extent of my knowledge, we don't use any deduplication.

20:11.000 --> 20:19.000
So applications can do the deduplication on their own.

20:28.000 --> 20:33.000
Hello, this is my first time hearing of YDB, and thank you so much for the amazing talk

20:33.000 --> 20:34.000
and for walking us through it.

20:35.000 --> 20:41.000
I would like to ask you, would you recommend that I should be considering using this for my next project?

20:42.000 --> 20:44.000
Or what is a good environment to consider?

20:45.000 --> 20:47.000
Could you please repeat? There is some kind of noise.

20:48.000 --> 20:53.000
I was saying that this is my first time hearing of YDB, and I was curious to hear from you:

20:54.000 --> 20:56.000
Should I consider using this for my next project?

20:57.000 --> 20:58.000
Should I consider introducing this to my company?

20:59.000 --> 21:02.000
Or is this something that you expect to be really a community thing?

21:03.000 --> 21:10.000
Well, I think it's really easy to use YDB externally, because we have external clients,

21:11.000 --> 21:18.000
and they use YDB without any issues, and they have also checked whether they can scale YDB.

21:19.000 --> 21:31.000
And you probably would like to check our blog to see performance numbers to understand if it matches your needs and requirements.

21:31.000 --> 21:32.000
Thank you so much.

21:33.000 --> 21:34.000
Thank you for your question.

21:38.000 --> 21:40.000
Any other questions?

21:41.000 --> 21:43.000
Cool, thanks, Evgeny.