WEBVTT

00:00.000 --> 00:11.800
Max is a good friend of mine for IBM. He's going to be talking about LM tools in use in

00:11.800 --> 00:15.000
VOLM. Take it away, Max.

00:15.000 --> 00:23.120
Thanks. So hi, everyone. I'm Max from I work at, yeah, okay. I'm Max. I work at IBM research.

00:23.120 --> 00:29.200
So I have a solution problem here. So some things I cut out below, but in the interest of

00:29.200 --> 00:33.840
time, let's move on. So my goal with this presentation is that by the end, you will know

00:33.840 --> 00:43.960
how tool calling with LM's works at the lowest level. So what is VLLM? Are there any

00:43.960 --> 00:49.120
VLLM users out there? Oh, okay, there's some, but yeah, it's nice to know that I'm not

00:49.120 --> 00:55.520
preaching to the choir. But anyway, the project's goal is to be the fastest and easiest to

00:55.520 --> 01:03.720
use open source, LLM inference and serving engine, right? So it started in, yeah, okay, you would

01:03.720 --> 01:10.360
see 2023 below there, but it started as a PhD project at the University of Berkeley. It has

01:10.360 --> 01:18.200
grown a lot since then and has become a Linux foundation project. There are contributions

01:18.200 --> 01:26.760
by many companies, including IBM, but also Intel, Mr. All the model providers and hardware

01:26.760 --> 01:34.440
providers contribute. And some of these companies also use it in production, like RIDU.

01:34.440 --> 01:41.080
Okay, so there are two ways to use VLLM. One is in Python with the offline-badged inference

01:41.080 --> 01:48.120
API. So this is basically like using the transformers library and only a bit faster, right?

01:48.120 --> 01:57.120
And then there is online serving using with the OpenAI API. So you can use VLLM to host

01:57.120 --> 02:06.000
models for your LLM based apps. All right, so just to contextualize this a little bit.

02:06.000 --> 02:12.840
So when you think about the program, it has a central logic, right, a control flow that,

02:13.400 --> 02:19.840
well, where you program what you want to do. And it uses sub-retains in your own program,

02:19.840 --> 02:26.120
it uses libraries, it uses operating system to interact with the world. So you could see these

02:26.120 --> 02:33.720
things as tools for your business logic, right? But the problem with programs is that this

02:33.720 --> 02:38.920
business logic is fixed. So if you want new behavior, you have to program it and update all

02:38.920 --> 02:44.520
your deployments, right? So with LLMs, you can do new things. You can put an LLM, they're in

02:44.520 --> 02:51.160
the middle, it can come up with new plans depending on the user input. And it can use all these

02:51.160 --> 02:58.200
things as tools, right? So that's, that's the motivation. So this allows you to do some things,

02:58.200 --> 03:05.000
like the classical example of an AI assistant, where you give the model a bunch of APIs

03:05.000 --> 03:09.960
that it can call here. In this example, our restaurant reservation API. And so you know,

03:09.960 --> 03:17.320
you can handle user input in natural language, so that it will be the model can, yeah, satisfy

03:17.320 --> 03:24.920
the users requests that come in natural language with these API calls. Okay, so there are

03:24.920 --> 03:31.560
basically three types of tool calling out there. One is JSON based, where the model input,

03:31.640 --> 03:36.920
I mean the function descriptions are in JSON and the model output is also JSON. Then there's

03:36.920 --> 03:43.720
code based tool calling, where the model generates code. For example, in LLM3 models, they generate

03:43.720 --> 03:49.960
Python code, and that has to be executed. And there's built into, so for example, again, in the

03:49.960 --> 03:56.760
LLM3 models, during instruction tuning, the model learns to use brave search and wealth from

03:56.840 --> 04:04.520
often. Okay, so in this talk, we're going to focus on JSON tool calling. So in this modality,

04:04.520 --> 04:09.960
the model, what you give the model is a description of all the functions that it can call

04:09.960 --> 04:19.240
in this JSON format that is very similar to open API, if not identical. And what you do is you

04:19.240 --> 04:26.360
describe your function, you give a function name, you tell the model what it does in the description,

04:26.920 --> 04:34.120
then you list all the parameters, the types, and names, and also descriptions. And so when you

04:34.120 --> 04:40.520
get a user input in natural language, the model should generate an output, where it identifies

04:41.720 --> 04:48.200
the arguments for these functions from the user input and generates and select the correct function,

04:48.200 --> 04:56.440
and returns your JSON like this one. Okay, so this is modeled in the OpenAI API as a

04:56.760 --> 05:03.880
chat. So here you have a sequence of different messages with different roles. So in the beginning,

05:03.880 --> 05:11.320
you usually, as the application developer, you start with a system prompt, where you tell the model

05:11.320 --> 05:17.240
what is, is it's role as, for example, you would say something you're helpful assistant, in this case,

05:17.240 --> 05:22.600
you're going to know a little bit about weather prediction, then you have a user request,

05:22.680 --> 05:30.920
right, with a user role, coming in with a user request, right, and then there can be some

05:30.920 --> 05:36.600
back and forth between the model and the user, and at some point the model will generate a tool

05:36.600 --> 05:44.360
call in JSON format as we've seen, then you can put this output back into the model, and here,

05:44.360 --> 05:51.240
well, it's cut below, but the model, when it sees this output from tool, it can generate an explanation

05:51.320 --> 05:57.880
in natural language, so that you can send it back to the user. All right, so we know that

05:57.880 --> 06:05.000
the model only handles text, so to translate from this list of messages to actual text, that the model

06:05.000 --> 06:12.280
can process, we have a, we use chat templates, that chat templates usually bundled with the tokenizer,

06:12.280 --> 06:19.160
in VLLM, we curate some chat templates specifically for tool use. So this example is for IBM

06:19.240 --> 06:27.560
granite 3.1, so as you can see in the beginning, the available tools are inserted with a special

06:27.560 --> 06:32.920
role called available tools, but depending on the model, it can also be in the system role,

06:32.920 --> 06:39.160
or in the first user message, then there is a fall loop iterating over all of the messages,

06:39.160 --> 06:46.040
and formatting accordingly, if they're depending on the role, so in most roles, the message is just

06:46.120 --> 06:53.320
copied plain text into the prompt, but in some cases, like with tool calls, there has to be some

06:53.320 --> 07:02.440
JSON manipulation, and we can't see it, but if you want the model to complete text as the

07:02.440 --> 07:09.880
system, then you also start inserting start of role, assistant, and of role, so that the model

07:09.880 --> 07:17.480
knows it has to go from there. All right, so the text looks like this, at the beginning, as I

07:17.480 --> 07:27.720
mentioned, you have the functions, then the system prompt, then the user input, and then just the

07:27.720 --> 07:35.960
beginning of the assistant role, so that the model will complete, so it will complete with a tool call,

07:36.040 --> 07:43.560
there's a special format, that's a model specific, so for each different model in VLLM, we have

07:43.560 --> 07:52.520
a different parser, to get the model output, and return JSON, then we can put the response back,

07:53.320 --> 07:57.800
and by the end, the model can generate a well, we can see it here, but it can generate a

07:57.800 --> 08:04.440
natural language description for the user. All right, so from an application developer perspective,

08:04.840 --> 08:14.520
wait, why is it not correctly formed? Okay, but anyway, yeah, as a client developer, in Python,

08:14.520 --> 08:24.120
for example, you can use the open AI Python client, our API in VLLM is compatible, so you define

08:24.120 --> 08:33.880
your tools as JSON digs, and it's a message as sorry, as a list, and then you call, when you call

08:33.960 --> 08:39.880
the API, you get back, this chat completion object, where the finished reason is two calls,

08:39.880 --> 08:45.640
and you get a nice array of the two calls that were passed from the model return, so

08:47.320 --> 08:52.600
notice that there's a special parameter here, tool choice auto, which I will explain now.

08:52.680 --> 09:04.760
Okay, so the tool choice auto parameter, it let's you generate the two calls in different ways,

09:04.760 --> 09:11.560
so when you, there's one, the first option is to actually pass it a JSON with the function that you

09:11.560 --> 09:18.600
want to call, so in VLLM, this forces the model to use only one specific function,

09:19.480 --> 09:26.200
using structured output, right, so with structured outputs, we restrict which tokens the model can

09:26.200 --> 09:31.880
generate, so it has to generate, for example, JSON schema that follows the output for that specific

09:31.880 --> 09:38.280
function, so if you use input has nothing to do with that function, the model can be forced to make

09:38.280 --> 09:43.960
upper arguments, then this tool choice required, which is similar, but now the model can choose

09:44.040 --> 09:51.320
between different functions, then this tool choice none, where you don't want the model to

09:51.320 --> 09:55.480
generate a two call, so you could use this, for example, if you want to do more prompt engineering

09:55.480 --> 10:00.280
after you insert a third to tools, for example, a chain of thought or something like this,

10:00.840 --> 10:06.200
and finally, this tool choice auto where the model free to either generate text to send back

10:06.200 --> 10:15.640
to the user or to generate a function call, okay, so as we have seen, the model only handles text,

10:15.640 --> 10:22.200
VLLM translates this from into JSON, but VLLM does not call, actually call the tools,

10:22.760 --> 10:28.680
so as an application developer, this is your responsibility, right, so what you need to do is to

10:28.680 --> 10:36.520
basically implement this executor box to orchestrate this interaction, so you take the user prompt,

10:36.520 --> 10:43.080
send it to the model, the model returns the tool call, then sends it to you execute the tool,

10:43.080 --> 10:47.480
get the response, show it to the model, so that model can generate output for the user,

10:48.200 --> 10:57.640
and then finally, this sequence ends with text sent to the user, all right, so when you're using

10:57.720 --> 11:04.920
tool calling, it's nice to know how your model was trained, right, so I like this paper from IBM

11:04.920 --> 11:11.560
research a lot because it details the tasks which the model was trained, and this also another data set

11:11.560 --> 11:21.000
on the data set that was used, so highly recommended, I have to rush through these other

11:21.000 --> 11:28.120
slides now because I'm getting out of time, but yeah, wrapping up, so based on natural language

11:28.120 --> 11:35.960
LLM can figure out what functions call, the chat API is used for that, and so since the model

11:35.960 --> 11:43.000
and it inference server only handle text in JSON, the orchestration is your responsibility as a developer,

11:43.000 --> 11:48.200
so if that's a lot of work, you might also want to check out agent frameworks such as the

11:49.160 --> 11:54.040
agent framework, and with that, I'll leave you with some pointers, and thank you.

11:55.880 --> 11:56.520
Give me a hug.

12:01.000 --> 12:05.240
Thank you so much, Max, and I'm going to kick him out over there, so you have questions for him,

12:05.240 --> 12:08.440
you should find him over there, because Peter, you're up next, buddy.

12:11.400 --> 12:14.840
All right, we'll give it about two minutes, or however quickly he gets us about

12:18.200 --> 12:19.880
one.