WEBVTT

00:00.000 --> 00:15.860
So, hello everyone, this is also going to be a short presentation.

00:15.860 --> 00:21.240
Together with Mathias and Christina, we will present PICO Scholar, which has the aim

00:21.240 --> 00:25.560
to advance open research with an open source AI platform.

00:26.440 --> 00:31.920
Currently, we are an open source proof of concept, this is our stage, and we are trying to leverage

00:31.920 --> 00:37.000
domain-adapted AI and real-time collaboration to create this tool.

00:37.000 --> 00:42.520
And our goal in the future is to gather the community of developers and researchers

00:42.520 --> 00:47.360
around the tool in such a way that we can advance systematic reviews.

00:47.360 --> 00:52.920
We are a group of master's students in computer science from Georgia Tech. Our story started

00:52.920 --> 00:57.400
with, initially, a team of four software engineers, then we moved to five software engineers

00:57.400 --> 01:04.280
and we participated in two hackathons, and also we collaborated with Robert Gordon University

01:04.280 --> 01:12.720
researchers, and with the help of DevPost, we actually managed to win a hackathon.

01:12.720 --> 01:17.920
We got second prize in the Red Hat and Intel AI hackathon, which actually motivated us to continue

01:17.920 --> 01:25.120
with this project and try to move it from a proof of concept to an actual tool that

01:25.120 --> 01:28.160
people can use and find useful.

01:28.160 --> 01:33.520
The reason why we want to do that is because we believe that currently there is an existing

01:33.520 --> 01:40.480
challenge: users face certain issues, and existing tools in the field of systematic reviews

01:40.480 --> 01:47.480
struggle with the increasing volume and complexity of technical literature, specialized terminology,

01:47.480 --> 01:53.000
transparent selection workflows, and also there is a limited ability to integrate new

01:53.000 --> 01:58.720
publications into ongoing or past literature reviews.

01:58.720 --> 02:03.640
In terms of the users that we believe might find this tool useful, first of all, the

02:03.640 --> 02:08.680
academic researchers, we believe that it's challenging for them to navigate through vast amounts

02:08.680 --> 02:15.240
of literature, and new knowledge is hard to integrate into published literature reviews.

02:15.240 --> 02:18.920
Also conference organizers: imagine that you have to organize a conference with thousands

02:18.920 --> 02:24.800
of publications, and you have to skim through all the articles that are being submitted

02:24.800 --> 02:31.400
to the conference, and this tool is actually helping to avoid manually reviewing submissions,

02:31.400 --> 02:39.480
and also avoid a situation that might lead to time-consuming and potentially biased selection

02:39.560 --> 02:43.560
processes, and also students and educators.

02:43.560 --> 02:49.960
They currently lack access to comprehensive resources and structured guidance for conducting

02:49.960 --> 02:53.640
effective literature reviews.

02:53.640 --> 03:00.520
In terms of how we solve this problem, this proof of concept currently leverages AI-powered search

03:00.520 --> 03:06.760
and ranking, which helps with enhancing literature relevance.

03:06.760 --> 03:14.040
So basically, when we search, it's not just like a typical search tool, it uses AI to search

03:14.040 --> 03:19.720
through all the papers in such a way that the search becomes more relevant.

03:19.720 --> 03:26.600
It has a scalable architecture, uses microservices for flexibility, and this in the future

03:26.600 --> 03:32.200
will help us adapt and use different, for example, plug-and-play models.

03:32.200 --> 03:38.360
Also, we are thinking about collaboration tools, and the tool is designed for shared workspaces

03:38.360 --> 03:45.480
and role-based access, and in terms of the systematic review flow that we have, this includes

03:45.480 --> 03:51.800
metadata extraction and selection criteria, and currently it's inspired by biomedicine

03:51.800 --> 03:57.160
methodologies; actually, PICO stands for population, intervention, comparison, and outcome.

03:57.160 --> 04:02.600
And yeah, if you didn't understand how the tool works so far, I will do a little demo for you.

04:04.200 --> 04:08.120
This is one of the main functionalities, this is the search tool.

04:08.120 --> 04:11.880
So you can see here the user inserting the search.

04:11.880 --> 04:17.960
You can see that there's an advanced search option with also the possibility to have scientific

04:17.960 --> 04:24.920
notation translation, and also we have an AI query optimizer that helps users optimize their

04:25.320 --> 04:36.920
question for the search tool. After inserting the question, basically we get to this funnel section

04:36.920 --> 04:43.080
that has an AI summary on top with all the papers that are being displayed on the first page,

04:44.200 --> 04:50.840
and the user can select the paper and move it throughout the funnel and give it different

04:50.840 --> 04:57.000
other statuses. For example, in this case, it has moved from identified to screened and also

04:57.000 --> 05:01.480
the user can archive that selection and also bring it back, of course.

05:04.840 --> 05:12.280
And also it has the feature of showing only the items that have PDFs. For example, if we get

05:12.920 --> 05:18.120
in a search, we get like 2000 results, for example, but not all of them have PDFs.

05:18.680 --> 05:26.440
This will just retrieve only the items with PDFs, and we have the possibility of chatting

05:26.440 --> 05:31.240
with a PDF AI assistant about that PDF specifically, and you can see a display here.

05:31.240 --> 05:35.480
Now let me hand over to my colleague, who will explain more about the rest.

05:36.360 --> 05:50.760
Okay, hopefully that works all right. Hi, everyone. Thanks for your attention.

05:52.600 --> 05:57.160
So yes, so now we covered the problem space as well as some of the key pain points

05:58.280 --> 06:03.400
and you watched the tool in action. So now I want to show you how PICO

06:03.400 --> 06:09.400
Scholar works behind the scenes and also what we see for the future

06:09.400 --> 06:14.040
because we are a proof of concept, right, but we want to push this into like an MVP,

06:14.040 --> 06:20.360
where it becomes functional for the users. So here's a high-level view of PICO Scholar and

06:20.360 --> 06:25.640
the solution. Essentially, users have the ability to upload their own documents and this triggers

06:25.640 --> 06:31.960
a data cleaning pipeline. After that pipeline, we have fine-tuned AI models that will perform

06:31.960 --> 06:36.840
metadata extraction. So in the case of biomedicine, we extract the PICO elements, but

06:36.840 --> 06:45.320
the idea is to tailor this to specific domains with fine-tuned adapters. So once we have this,

06:45.320 --> 06:50.440
we embed everything, so the documents as well as the metadata, and store it in a vector store

06:50.440 --> 06:56.840
that is able to perform semantic and similarity search. So yeah, in essence, like we're ranking

06:56.840 --> 07:04.040
these stories. Finally, we have a RAG chatbot. So this can plug into like your favorite LLM

07:04.600 --> 07:10.360
and that can provide some rich summaries and insights. Additionally, if you see all of those

07:10.360 --> 07:15.400
like feedback loops, that's the automation, where every annotation or user decision

07:16.360 --> 07:21.560
feeds back into the model, making this whole process smarter and more accurate over time.

07:22.440 --> 07:30.360
In terms of our current stack, we use FastAPI and Python, and we also power our AI features

07:30.360 --> 07:35.880
with fine-tuned SciBERT and LlamaIndex. We have React in the front end and we have

07:35.880 --> 07:42.440
containerized all of these applications. As we mentioned before, we participated in the Red Hat

07:42.440 --> 07:47.880
and Intel Hackathon. So, you know, we actually host this on the Red Hat OpenShift environment

07:48.040 --> 07:52.760
and we take advantage of Intel OpenVINO model formats for faster inference.

07:53.560 --> 08:00.920
And the data at the moment sits on TiDB for scalable vector search. So I think that they have

08:00.920 --> 08:06.120
a free tier that we were able to use, but in the future we would like to host or

08:06.120 --> 08:09.880
provide a path to host the solution on fully open source infrastructure.

08:10.920 --> 08:14.440
We have microservices and we have AI adapters.

08:15.400 --> 08:20.680
So in terms of the potential and how we see this tool, it's centered around some key aspects,

08:20.680 --> 08:26.440
right? So it's all about like this principle of collaboration and transparency within the systematic

08:26.440 --> 08:31.880
review and how, you know, let's say a professor can interact with their TAs and form their own projects.

08:31.880 --> 08:35.960
But as well, how can that move like across projects, right? And how can we build like

08:37.800 --> 08:42.200
an archive of different annotations that could be used for training on specific domains.

08:43.160 --> 08:48.840
For this, human-AI collaboration is quite important for us, and we also aim to integrate

08:48.840 --> 08:55.000
with other tools such as automatic annotation tools or reference annotation managers.

08:56.280 --> 09:00.200
We want to create an intuitive user experience, hopefully you got that from having a look at the

09:00.200 --> 09:09.640
demo. Also dashboards and data-driven insights. So we are here at the POC and we aim to go to

09:09.640 --> 09:15.800
a self-hosted solution. We have some personas, which I'll talk about in the next slide, but essentially

09:15.800 --> 09:21.000
we're taking one little thin slice, producing a valuable product there, and then we plan to

09:21.000 --> 09:25.480
expand to a more complicated use case like, you know, a full-on systematic review.

09:26.440 --> 09:31.160
So our first persona that we aim to serve is the conference organizer.

09:31.160 --> 09:35.960
And we picked this because it's quite like a close-knit, bring your own documents type of a scenario.

09:36.840 --> 09:42.040
Currently, in terms of this visual, you can see that we are nearly at the end of phase one,

09:42.040 --> 09:46.920
which is all about the back end and the core features. These are like the CRUD operations for

09:46.920 --> 09:53.640
managing documents, the AI-driven scoring, so on and so forth. And in the next phase,

09:53.640 --> 09:57.880
we want to focus more on, you know, how to make this collaborative, and more user management.

09:59.240 --> 10:04.040
And yeah, as well as increasing the efficiency of our ranking processes.

10:04.760 --> 10:10.760
Lastly, provide the API too on top of all of that. So this is just a little bit more of, you know,

10:10.760 --> 10:15.480
once we nail that first scenario, our focus areas are just improving the system, improving the

10:15.480 --> 10:20.600
automation and connecting to more data stores, right? So at the moment on our demo, we only have, like,

10:20.600 --> 10:26.840
you know, 30,000 documents, but we want to link to more databases like PubMed or arXiv.

10:27.480 --> 10:33.240
In terms of, you know, if you want to contribute, we welcome contributors, especially,

10:33.240 --> 10:38.200
like some back end developers, right? So if you like Go, for instance, we're planning to build

10:38.200 --> 10:43.640
a back end based on that, as well as, if you make very intuitive interfaces, come and talk to us.

10:44.520 --> 10:49.480
Ultimately, you know, I think we're interested in developing a lot of the AI models, and we have

10:49.480 --> 10:54.840
some help from Intel and some of the sponsors of the hackathon. So if you're interested in those

10:54.920 --> 11:01.400
technologies, come talk to us, or if you're a user, right? If you want to use this proof of concept,

11:01.400 --> 11:06.760
now for your current project or, you know, maybe on future projects, or to change this tool

11:06.760 --> 11:12.920
towards, you know, something that might help you, we're happy to hear your ideas. In terms of how to

11:12.920 --> 11:19.080
join this community, so we have our GitHub repo, currently we're meeting weekly, so usually on

11:19.080 --> 11:26.440
Tuesdays, and yeah, we're happy to engage online as well. So I think that's the end of the presentation.

11:49.720 --> 12:02.600
Sure, so the question was, what do we use for the extraction pipeline? I assume you

12:02.600 --> 12:08.520
mean the metadata extraction. Yeah, so that was like the initial stage of this tool, right? So

12:08.520 --> 12:14.120
it was based on one paper from Robert Gordon University on training adapters, specifically for

12:14.120 --> 12:23.160
biomedicine, and there are also other open source databases for extracting PICO. So we used

12:23.160 --> 12:29.160
that, about like a thousand records, and fine-tuned a BERT model, a SciBERT model, and that will

12:29.160 --> 12:36.040
become our metadata extractor. So at that point, you know, it knows how to extract these four key elements

12:37.240 --> 12:42.040
and put them into our tool. And we do some further cleaning on top of that, right? But the idea is like if you

12:42.600 --> 12:47.800
bring in another domain with the core elements for your systematic review or your selection criteria,

12:48.520 --> 12:54.680
you can decide those four, five key elements, bring in a thousand annotations, and then you can

12:55.560 --> 12:58.040
explore the literature at scale.

