WEBVTT

00:00.000 --> 00:11.040
Okay. Thank you. So, thanks everybody for being here, at what seems to me to be extremely

00:11.040 --> 00:18.500
late on Saturday night. I'm going to talk about research software, sustainability and

00:18.500 --> 00:23.920
RSEs, and kind of in the context of one project, and with a few kind of other things

00:23.920 --> 00:29.680
along the way. And I'll mention the second author, Ben, is also here in the audience, and so

00:29.680 --> 00:35.760
if I get tough questions, I'll see if he can answer them. Okay. So, first, just to start, the

00:35.760 --> 00:42.320
project that I am part of that inspired this is called Parsl, which sometimes stands for

00:42.320 --> 00:46.600
Parallel Scripting Library, but not really officially. But it's a parallel programming

00:46.600 --> 00:51.440
library in Python, and it has the idea that you have apps that define opportunities for

00:51.440 --> 00:57.440
parallelism, and they're basically decorators on Python functions, or on bash apps, external

00:57.440 --> 01:02.720
applications. And these apps then return futures rather than values, and the fact that you're

01:02.720 --> 01:06.560
doing that then lets them return immediately, and so you can do them in parallel, and

01:06.560 --> 01:12.880
you can have them run remotely, in some cases. And the fact that you can do this then

01:12.880 --> 01:18.640
lets the runtime basically figure out what concurrency is needed, what is ready to run, what has

01:18.720 --> 01:22.560
to wait because it's waiting for something else to run first, things like that. And so

01:22.560 --> 01:27.760
this lets you not have to specify a lot of the parallelism by hand; just from the dependencies,

01:27.760 --> 01:34.240
it can calculate this automatically. And this also, at least in theory, is independent of where it

01:34.240 --> 01:40.080
is running, because the code doesn't say anything about where this happens, we say where things

01:40.080 --> 01:45.760
are going to happen in a separate object that's not related to the compute parts of things. So,

01:45.760 --> 01:49.920
anyhow, so that's just a little bit about Parsl. If you're interested, you can try it;

01:49.920 --> 01:55.920
there's a Binder instance, so you don't have to download anything. Okay, so we wrote this,

01:55.920 --> 02:01.280
the software that does this, and we found out that it's being used by a bunch of different kinds

02:01.280 --> 02:08.000
of people. So, these are individual researchers and large research consortia, and industry

02:08.000 --> 02:14.720
to some extent, and across a lot of different scientific and other domains. And so some examples

02:14.720 --> 02:22.800
of what Parsl's been used for: producing an interconnected simulated sky survey in preparation

02:22.800 --> 02:32.880
for an astronomical telescope that's coming online soon. And some biomedical applications looking

02:32.880 --> 02:40.880
at, well, basically looking at genomes and looking at molecular structures. So, it is a bunch

02:40.880 --> 02:44.640
of different things that it can do. It basically tries to do things, or it's most useful when you're

02:44.640 --> 02:48.960
trying to do something that's at a very large scale, that would be hard to do on a single computer.

02:50.480 --> 02:55.360
And so, we also then have external stakeholders. And so, these are, in some cases, they're the

02:55.360 --> 03:00.960
direct users, like the astronomers, the astronomy project, for example. In some cases, they're

03:00.960 --> 03:07.920
platform developers. So, we have different projects that are using Parsl as a component,

03:08.000 --> 03:12.960
and then they provide that platform to their end users. And so, Parsl is not directly used by

03:12.960 --> 03:18.080
their end users. And a couple of examples of this are QCArchive, which is working in molecular

03:18.080 --> 03:23.920
sciences, and Globus Compute. There are also cyberinfrastructure, or e-infrastructure,

03:23.920 --> 03:29.680
or e-research infrastructure providers. So, national labs in the U.S. like Argonne and NERSC,

03:29.680 --> 03:33.680
for example, support Parsl on their systems. And so, they are part of our stakeholders.

03:34.640 --> 03:42.640
There are also linked contributors. So, because we're providing a library

03:42.640 --> 03:46.960
that's useful in a lot of different cases, other people that are providing their own libraries

03:46.960 --> 03:52.320
sometimes see working with us as a way of increasing the impact of what they're doing

03:52.320 --> 03:58.800
and trying to share work to do something bigger. And so, there's a project called Work Queue that was

03:58.800 --> 04:04.480
developed at the University of Notre Dame. And they basically have integrated this in some sense

04:04.480 --> 04:10.960
into part of Parsl, which makes people that are using Parsl able to use Work Queue and adds to

04:10.960 --> 04:17.600
Work Queue's usage as well. And then there are also funders. And so, for us, national funders in the

04:17.600 --> 04:22.960
U.S., NSF and DOE, Chan Zuckerberg Initiative, as a philanthropic funder, and then collaborating

04:22.960 --> 04:27.520
projects like the astronomy project that gives us a bit of money to support their own needs

04:28.240 --> 04:34.400
as well. So, I've talked about Parsl, but let me kind of do a little bit of the history for a second.

04:34.400 --> 04:39.680
This was initially supported by an NSF award, a five-year award that we got a one-year

04:39.680 --> 04:48.080
no-cost extension for, from 2016 to 2022. In 2020, we released version one, and we have now

04:48.080 --> 04:53.120
moved on to weekly releases since then, rather than version releases in the same way,

04:53.120 --> 04:58.160
semantic versioning. The focus since that version one has mostly been on maintenance rather than

04:58.160 --> 05:01.760
adding new features, so we do a little bit of different stuff now than we did initially.

05:01.760 --> 05:05.840
And I'll talk more about that as we go on. We had a funded development team that was somewhere

05:05.840 --> 05:11.280
between two and four people or FTEs per year when we were doing this building initially.

05:12.720 --> 05:20.480
And currently, under the funding that we have now, we are basically supporting about one FTE a

05:20.480 --> 05:24.720
year of maintenance and development and about half an FTE a year of community management.

05:25.680 --> 05:30.480
And I'll just point out that there's kind of a few different metrics that you can look at,

05:30.480 --> 05:36.800
and our metrics are going up, and that's good. So we have lots of downloads, lots of website

05:36.800 --> 05:42.080
visitors, lots of users. Whether you believe these are actually indicative of anything,

05:42.080 --> 05:49.440
I don't know, but we show them to our funders and they're happy. Okay, so sustainability,

05:49.440 --> 05:54.000
then, if we want to talk about that part, the way that I kind of define sustainability is that we

05:54.000 --> 05:59.840
want to balance resources and work. And so the resources that we have are grants,

06:00.480 --> 06:04.160
as I've talked about, external funding from projects that use Parsl, as I said,

06:04.880 --> 06:09.120
volunteer effort from groups that develop tools that use Parsl, where we're not really

06:09.120 --> 06:14.000
directly involved in some cases, and in other cases we are, and then companies that use Parsl

06:14.000 --> 06:19.280
in their services. And then, on the other side, we have the work. And so the funded Parsl team

06:19.280 --> 06:24.480
is doing this core work, which includes managing the community and reviewing code contributions

06:24.480 --> 06:29.120
and fixing bugs and supporting users and developing new features, although not so many of those,

06:29.120 --> 06:35.280
at this point, and releasing new versions of the software. And so, if we're sustainable,

06:35.280 --> 06:40.960
basically, these columns match up. But one of the challenges is that the provided and the volunteer

06:40.960 --> 06:46.080
resources can add features to Parsl, and they can support some use cases, like somebody

06:46.080 --> 06:50.000
can support their own use case pretty easily, but they're not as likely to support somebody else's

06:50.000 --> 06:56.560
use case. And so, the challenge is that those provided and volunteer resources aren't currently

06:56.560 --> 07:02.560
sufficient or aligned well enough to fully support what we need over multiple years. And so,

07:02.560 --> 07:08.640
we don't think, at least at our scale, that we could survive without any funding just based on volunteers.

07:08.640 --> 07:15.680
And what we've tried to do, basically, let me see if I can, can I do this as well?

07:15.680 --> 07:22.480
Yeah, okay. We basically had this period where we were getting this initial funding. And we were doing

07:22.480 --> 07:28.960
lots of development, a fair amount of user support, maintenance was increasing,

07:30.080 --> 07:33.840
not so much research and development, and the number of users was slowly increasing.

07:33.840 --> 07:39.120
And what we kind of pitched is that we got into this period where we started ideally wanting to do

07:39.120 --> 07:43.280
less of these things, less development, less maintenance, less user support,

07:43.280 --> 07:47.280
or at least have less resources going into those things and do them more efficiently,

07:47.280 --> 07:51.120
but have some R&D that continues and have the number of users grow.

07:51.120 --> 07:55.120
And we're hoping that we can get into some kind of a sustainable period where we don't need a huge

07:55.120 --> 08:00.080
number of resources to do the maintenance part of this, and we can find those resources.

08:00.080 --> 08:03.040
And that's the question: can we actually get there?

08:04.480 --> 08:09.040
So in terms of actually then doing this, we see that there are projects that are successful

08:09.040 --> 08:14.320
and we'd like to be able to follow their model. And so Astropy in astronomy is one that,

08:14.320 --> 08:19.040
I think, is a good model. There's a challenge that it's much bigger than Parsl, and I'm not

08:19.040 --> 08:26.000
actually sure if that's a qualitative difference as well as a quantitative difference. I suspect it is,

08:26.720 --> 08:32.800
but we'll see. And yt is another project that we also were looking at. And so then we will work

08:32.800 --> 08:37.600
and are working on community and governance and funding streams and innovation and training and

08:37.600 --> 08:42.400
outreach and engagement, trying to improve things to get to that point. We are trying to work with

08:42.400 --> 08:46.320
other related communities, so there's lots of other people working on software sustainability,

08:46.320 --> 08:52.240
the SSI in the UK, the Research Software Alliance, and others. And so we want to kind of share

08:52.240 --> 08:56.960
work with them as well as the workflow community, because if we can bring together more community,

08:56.960 --> 09:01.440
then we have less work that we have to do ourselves and it ideally becomes more sustainable.

09:02.320 --> 09:07.520
We also, in our proposal, said we were going to capture and document and share sustainability lessons.

09:07.520 --> 09:11.600
We've done a horrible job of this. And it's partly because what we said we were going to

09:11.600 --> 09:16.560
do is to track everybody's time on all of their activities and be able to report on what we did

09:16.560 --> 09:21.120
to increase our sustainability and it's really hard to track time at the level that we wanted.

09:21.120 --> 09:24.960
And we could never find any good tools and we tried to do things manually and then people forgot

09:24.960 --> 09:29.600
to do it and then we tried to remind ourselves every month and it just, it was a mess. So I think we're

09:29.600 --> 09:34.560
going to end up kind of failing on this, unless you count this talk as that, and then maybe we're

09:34.560 --> 09:39.440
successful. Okay. And the other thing that we're also trying to do is to reduce technical debt and this

09:39.440 --> 09:45.680
is what I would say has been a lot of what Ben has been working on. So if we can make the code simpler

09:45.680 --> 09:50.480
and make it easier to maintain, then we don't actually need as many resources to continue maintaining it.

09:50.480 --> 09:57.360
So all right. So let me kind of go on and talk a little bit about how we've done this and where we are.

09:57.360 --> 10:02.080
So we kind of think of the project as having these stages and I think this is probably general

10:02.080 --> 10:07.680
for a bunch of other projects as well where we had an initial concept testing and an initial

10:07.680 --> 10:13.040
development kind of proof of concept effectively. We then started growing a little bit doing some

10:13.040 --> 10:18.240
testing with initial users and continuing development to move past the initial stage.

10:19.040 --> 10:23.840
And then we got to expanded usage and support and still a little bit of development but not really

10:23.840 --> 10:28.640
anywhere near as much. And then we think we're hopefully getting into this point where we're in

10:28.640 --> 10:33.920
community usage and some sustained maintenance and support, but not very much development at this point.

10:33.920 --> 10:39.200
Okay. So these four things are kind of not exactly completely distinct from each other but we feel

10:39.200 --> 10:46.000
like we've kind of progressed through these different stages. And so how we did this in some ways is that

10:46.000 --> 10:52.160
Parsl started as an idea in 2016 based on a previous project that was probably 10 or 15 years old

10:52.320 --> 10:56.800
where we'd done a lot of work and a lot of thinking but that code was all getting very old

10:56.800 --> 11:01.200
and it was written in some bad ways and it used a language that didn't really make any sense anymore.

11:02.080 --> 11:06.480
And so we basically said to ourselves, if we were doing the same thing from 15 years ago, if we were doing it

11:06.480 --> 11:11.360
today what would we do differently and we thought well let's try that and see if we can do that and see how it compares.

11:12.560 --> 11:17.840
So effectively a simple tool or language and a runtime for fast, easy scripting on big machines.

11:18.800 --> 11:24.480
And we realized we should do this in Python today, that that would make much more sense than what we were doing.

11:25.280 --> 11:29.680
One person basically did the initial exploration and proof of concept and proved that this worked.

11:30.720 --> 11:37.120
And this one person had about four people that were kind of managing and helping them, which I guess worked.

11:38.480 --> 11:43.440
And they built the initial usable system and they were the main developer and they did things the way that they wanted

11:43.440 --> 11:44.880
because they were the main developer.

11:45.840 --> 11:51.280
Once we had a second developer that became active the two developers then needed to agree on and define

11:51.280 --> 11:56.560
and document processes and that actually was kind of an interesting process because they didn't have the same things in mind.

11:58.960 --> 12:02.640
Yeah so kind of going from one to two is a remarkably huge change.

12:03.680 --> 12:08.880
And then as we moved to this more open community process, again two to four FTE a year of developers, like

12:08.880 --> 12:13.120
73 contributors at the time that we finished this initial work in 2022.

12:13.840 --> 12:19.840
These processes became more important because all these community members had to know what they were going to try to do and how they were going to be able to fit in.

12:20.960 --> 12:27.120
And so these processes include things like how do we make design and architecture decisions, what coding style do we use,

12:27.120 --> 12:33.120
what testing is sufficient, what documentation is sufficient, how do we actually engage with and support users,

12:33.120 --> 12:37.040
what properties do contributions and changes need to have to become acceptable.

12:38.000 --> 12:43.360
How do contributions and changes actually get accepted then, how do we encourage and develop contributors,

12:44.080 --> 12:50.160
how do we mix CS research as well as software product development which kind of have completely different goals often.

12:50.880 --> 12:54.880
And then things like who writes papers and who's listed as co-authors on those papers.

12:55.680 --> 13:01.520
And so all of these things we had initial answers for and all the answers I would say changed over the life of the project.

13:02.320 --> 13:08.960
And so the thing that we're trying to do then is to be consistent not about what the processes are but be consistent about documenting the processes.

13:10.960 --> 13:16.960
Okay, and so then to look at these two different things, the developer work has changed. So, as I've said before,

13:16.960 --> 13:23.920
we are currently looking at maintenance and outreach and support. That's a lot different than earlier in the project, and so we need different kinds of contributions now.

13:24.480 --> 13:32.720
And the development activity then includes maintaining the code base, adding more tests, responding to issues, supporting development on different resources.

13:33.600 --> 13:41.680
As the community has grown, the number of use cases has grown and the range of challenges has grown and part of this is because every HPC system is unique in some way.

13:42.400 --> 13:50.160
And so we don't have, so we have to kind of worry about different configurations, different operating systems, different hardware.

13:51.120 --> 14:05.600
And then also reviewing contributed code, and so this has led us to develop minimum requirements on that code, starting ideally with a pre-coding discussion and then talking about plans for future maintenance: if we accept this, what's going to happen to it, what are you committing to do.

14:07.600 --> 14:19.040
So internally, then, we have different developer roles over time. So we had initially this research developer, or research programmer, focusing on doing things quickly, on what the software can do more than who could actually use it.

14:20.000 --> 14:29.840
This person really is essential to new research software projects, to do the initial development and testing, but they could take shortcuts that are going to hurt the project's later sustainability.

14:30.800 --> 14:45.760
And so this is really important in this initial development and maybe is looking at new features and later stages and then we get into a software developer at a later stage and this is a person who really focuses on developing professional class research software.

14:46.320 --> 14:51.520
They focus on the software itself and its users rather than new ideas they want to explore or something else.

14:52.400 --> 15:02.080
They're dedicated to making the software as useful as possible, making it clean and relatively beautiful, increasing simplicity and compatibility and future maintainability and reducing technical debt.

15:03.040 --> 15:14.000
And it's important to have this kind of a person involved in everything except maybe the first stage, because these processes might actually impede the development of the initial thinking that we need to do.

15:14.560 --> 15:18.560
And then there's also a user developer.

15:18.560 --> 15:26.560
So this is somebody whose main job is as a scientist or a disciplinary researcher, and they maybe are adding some features relevant to their own work.

15:26.560 --> 15:34.720
They're focusing on their own usage, they're writing code somewhere else, and so they may not be writing code in the same style that we want, which can be a challenge.

15:34.720 --> 15:43.280
They can take shortcuts that harm the project's future sustainability, but they're important to have because they're the users and they bring in new ideas as well.

15:43.840 --> 15:50.160
And then finally is the collaborating developer, and just because I'm running a little bit short on time, I'm not going to go into this in great detail,

15:50.160 --> 16:04.000
other than to say that this is somebody whose main job is developing their own software for their own project, and somehow that has to be integrated into ours, and so there can be challenges in terms of differing styles or differing code contributions.

16:04.000 --> 16:12.640
And if we want to be sustainable, we really need to define the interface to this kind of person as well as we can, because again these are the resources that we're not paying for

16:12.640 --> 16:16.160
but that are coming in to help the project be sustained.

16:16.160 --> 16:25.160
Okay, community work also changes so in addition to the developers there's members of the community that do other stuff.

16:25.160 --> 16:33.400
And this is stuff like answering user questions which actually happens sometimes down to managing social media which never happens.

16:33.400 --> 16:42.600
So these different things are all the things that need to happen, ranging from how likely it is that somebody external to the core team is going to do them to how unlikely it is that

16:42.600 --> 16:57.600
somebody external to the core team is going to do them, but we need to do all of them. So the challenge then, as I said before, is that volunteers and other people can do some of these but they can't do all of these, and so we still need some way of doing some of these other ones.

16:57.600 --> 17:10.600
And specifically, or particularly, one way of thinking about this (I'm sorry that Abby isn't here, who gave one of the first couple of talks in this room earlier, so I kind of borrowed this from her)

17:10.600 --> 17:19.600
is that we think of community as having different stages, from discovery to contact to participation and all the way up to leadership, hopefully.

17:19.600 --> 17:37.600
And at each stage there's a way that people get involved, and there's kind of ways that we can try to promote them, and the community manager, I would say one of their main jobs is to kind of push people up along this path as they can and to make it easy for people to move along this path. And again, that's something that somebody has to do; it doesn't just happen automatically.

17:37.600 --> 17:47.600
Okay, the last thing I want to say, kind of a new topic, is NumFOCUS. So NumFOCUS is an

17:47.600 --> 17:55.600
open source software foundation, kind of an umbrella foundation. We joined NumFOCUS in September of last year.

17:56.600 --> 18:08.600
So we are a fiscally sponsored project under NumFOCUS now, and the process for doing this involved: first we applied, and then we were accepted, and then we needed to sign a financial sponsorship agreement.

18:08.600 --> 18:23.600
The core work in Parsl was developed at the University of Chicago and the University of Illinois, and so they own the copyright on most of the code, because those of us working for either of those don't actually own our copyright; our employer does.

18:23.600 --> 18:32.600
University of Illinois agreed to transfer the ownership of its IP to us so that we could transfer it to NumFOCUS relatively quickly, in a couple of months.

18:32.600 --> 18:38.600
It took them a little bit of time to understand what we were trying to do, but once we had a call and they understood it, they were fine with it.

18:38.600 --> 18:42.600
University of Chicago took about nine months. First they didn't understand it.

18:42.600 --> 18:48.600
Then they needed to have lawyers involved and then those lawyers didn't understand it and then they needed to have more lawyers involved.

18:48.600 --> 18:59.600
And then they didn't understand it. So anyhow, there were lots of discussions, and the lawyers didn't actually use email or calendaring; they wanted to do everything over the phone, which was very weird.

18:59.600 --> 19:07.600
But eventually they did agree to this, and so we see being part of NumFOCUS as part of our sustainability plan.

19:07.600 --> 19:14.600
In particular because ownership of the project in a neutral place we think encourages others to move up in leadership and governance.

19:14.600 --> 19:22.600
It's not something where these two universities are going to be in charge forever; other people that want to come in and have a say in what happens are more than welcome to.

19:22.600 --> 19:40.600
And we think that being in NumFOCUS actually helps do this. I'll just mention quickly there is a funding devroom tomorrow, and HPSF, the High Performance Software Foundation, is going to be giving a talk there, which is another one of these foundations, so if you're not familiar with this, that will be an interesting talk, I think, to hear.

19:40.600 --> 19:51.600
And the other thing is that NumFOCUS actually gives us some mechanism to hire staff outside the US when needed and to contract for specific work items to the best available person regardless of their affiliation.

19:51.600 --> 20:01.600
So we don't have to have everything go through the universities we can move money around more easily and we can move money outside of a particular country as well which is relatively important.

20:01.600 --> 20:11.600
So where we are now: there's some good news, which is good community growth, and that's due at least in part to the community manager. Lots of contributors, lots of users; I think all this is quite happy.

20:11.600 --> 20:16.600
The code has gotten better, thanks to our core maintainer at least in part.

20:16.600 --> 20:27.600
We're moving more to plugins to reduce what the core code has to do removing old code that isn't used or doesn't work or things like that and having more and better tests and I think this is all very important.

20:27.600 --> 20:38.600
But on the less good news side we actually are not really sure how we're going to sustain the community manager and the core developer and I feel very awkward saying this with Ben in the room sorry.

20:38.600 --> 20:46.600
But again, this core community and maintenance work is really hard or impossible for us to rely fully on volunteers to do.

20:46.600 --> 20:53.600
And so, at least for a project of our size. I think if we were at like 10 times the size we'd be in a different situation.

20:53.600 --> 21:03.600
But where we are, I think this is a challenge. Okay, so finally, lessons. Sometimes some of the choices we made are just choices, and we made them for the sake of having made a choice.

21:03.600 --> 21:16.600
And once you make a choice, it can be hard to change it, but it's important, I think, to consider these changes regularly, and particularly when you get new developers coming on, becoming part of the core team, that's a good opportunity to think about this.

21:16.600 --> 21:21.600
Going from one to two developers was a big step, and again it's an opportunity to think about changes.

21:21.600 --> 21:26.600
Research software sustainability is a hard problem and there are no simple answers.

21:26.600 --> 21:31.600
The existence of these different types of developers and their utility during the different phases,

21:31.600 --> 21:39.600
I think, was interesting, and this emerged during the process, and I would be interested if this matches other people's experience. And if anybody's in computer science research,

21:39.600 --> 21:44.600
I think there's some interesting research we could do to find out if this does match other processes.

21:44.600 --> 21:56.600
And then finally, the boundaries between the types of people and their roles are kind of fuzzy, so different people can have different roles. This isn't a statement about people; it's a statement about roles.

21:56.600 --> 22:01.600
Okay and then I'll just close with having some acknowledgments up for a second as we end all right so thank you.

22:02.600 --> 22:07.600
Thank you.

22:07.600 --> 22:20.600
One question the most motivated guess it is you.

22:21.600 --> 22:24.600
Yeah so question yes I'm trying to yes.

22:24.600 --> 22:35.600
Okay, so the question is about the possible role industry can play. I think I can imagine industry playing actually a pretty significant role.

22:35.600 --> 22:47.600
The challenge I think that we have is that what we're doing is potentially relevant to some industry, and we do have one kind of industrial partner that we work pretty closely with, and they have been

22:47.600 --> 22:53.600
helping with some volunteer time from their staff that actually have done some nice things.

22:53.600 --> 22:59.600
I think if we had done a better job, or if we will do a better job, of selling to industry,

22:59.600 --> 23:10.600
I think that that would actually help. And again, the HPSF model, that other foundation, is much more focused on things related to industry, but they're building code

23:10.600 --> 23:21.600
that computer vendors rely on to sell their computers, and so there's a much stronger incentive. We're providing a library that people can use, but

23:21.600 --> 23:28.600
it's not the only library. We think it's probably the best in some situations, but in other situations there's probably a different one that might be better.

23:28.600 --> 23:37.600
We don't have that kind of, I don't know, I don't know that I would say we're the killer app for industry in general; in some particular problem places we are.

23:37.600 --> 23:46.600
But as a general statement I don't think I can say that, and so that makes it hard to make that case, I think, to industry. But that's actually another piece of why

23:46.600 --> 23:54.600
we're interested in joining NumFOCUS: because of the industry connections, and we've had some discussions already, so it may work, I don't know.

23:54.600 --> 23:56.600
Thank you very much.

23:56.600 --> 24:01.600
Okay thank you.

