[S9 E7 Transcript] Interdisciplinary Roots and Inclusive Pathways in Data Science (feat. Mike Ludkovski & Alex Franks)

Eric Van Dusen 0:01

Welcome to the UC Berkeley Data Science Education podcast. We're happy you're listening in today. In this space, you'll hear from a variety of distinguished data science educators and professionals. The individuals we'll speak with are diverse in experience and perspective, but share the common goal of shaping the future of data science education. Our idea is to have some informal conversations with the goal of creating community and letting people hear from practitioners in this growing new field.

Lauren Chu 0:31

Thanks, Eric. I'm Lauren, your other host, helping to dive into the world of data science education. Let's meet today's guest.

Eric Van Dusen 0:40

Hi everybody. Today we've got Mike Ludkovski and Alex Franks from UC Santa Barbara, people that I have known for a couple years as Santa Barbara has been building out what they're doing in the data science space. But to start off, can you both share how you got into statistics and data science teaching, and what first drew you in? And how did you end up joining up at UC Santa Barbara?

Alex Franks 1:05

Yeah, so I can start off. So this is Alex. I got my degree in applied math and computer science in 2009, and this was right around the time that data science as a term was starting to kick off. And there was lots of hype in the news around, you know, statistics being, like, the next big, sexy job. And so this was already an area that I was pretty interested in. And I think this really just started to reinforce my interest in the field. So I ended up getting a PhD in statistics, and that took me down, you know, sort of the usual winding academic route, where I was fortunate enough to get an offer from UCSB, where I first met Mike.

Mike Ludkovski 1:48

I'm Mike Ludkovski. I've been at UCSB for 17 years now. It's been a while. I'm not really a data scientist. My research is in stochastic modeling, and honestly, I got involved in the whole data science arena essentially through administrative duties. So back in 2018 I got nudged to become department chair. That was exactly when Alex joined. So I did not get to hire Alex, but he came just before that. And then there were kind of the early stages of creating, for example, the precursor for the California Alliance for Data Science Education. So I think the very first meeting, I kind of got dragged along as department chair to be there. And then we submitted the proposal. We got one NSF funded project, another, and so forth. And then I kind of strung along. And here I am running multiple things at once.

Eric Van Dusen 2:43

Awesome, awesome. So between the two of you, you just have a really wide range of research topics, from finance and stochastic modeling, which you mentioned, to biology and sports analytics. You both bring in exposure to so many different research topics. How does that affect how you think about teaching data science?

Alex Franks 3:05

Yeah, so I'm happy to take this one. So there's a famous quote by a statistician, John Tukey, who's often associated with sort of introducing and promoting the concept of exploratory data analysis. And his quote is that the best thing about being a statistician is that you get to play in everyone's backyard, by which he means, as a data scientist, you get to dabble in all of these different areas. So part of your question is, sort of, well, how does that work? And how does that inform our thinking? And I think that the longer you work in statistics, data science and adjacent fields, you really start to see that all these stories around data that come up in different disciplines, they're actually linked through the language of statistics and mathematics. So when I start in a new domain, I will usually try to start by reasoning by analogy. A simple example, just sort of a simple, made-up example: I'm working with an astronomer, and, you know, they're telling me about the number of photons hitting a telescope.

And I might think, well, hey, actually, that's kind of analogous, at least mathematically, to the number of cars passing through an intersection in that project I just did with the urban planner last week, right? Or maybe an even simpler example, right? If I'm modeling the number of made free throws in basketball, well, that's analogous, I guess, to coin flips in some way. And so you go back to the binomial distribution. So I think, you know, the longer you work in statistics and data science, the more you start to see sort of more complex connections like this. It's one of my favorite things about data science. And so in my own teaching, I try to incorporate a lot of these different examples so that students, hopefully, will start to develop their own appreciation for reasoning by analogy and the power of statistical modeling across disparate disciplines.
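The analogies above can be made concrete with a short Python sketch. It is only an illustration, not something from the conversation: the rates, shot counts, and shooting percentage are made-up assumptions, and the Poisson model for photon and car counts is one common way to read "analogous, at least mathematically".

```python
import math

def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k events when the expected count is `rate`."""
    return math.exp(-rate) * rate**k / math.factorial(k)

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials, each with chance p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical numbers, chosen only for illustration.
photon_rate = 4.0  # assumed mean photons per exposure
car_rate = 4.0     # assumed mean cars per minute

# Same formula, different story: the two probabilities are identical.
p_photons = poisson_pmf(3, photon_rate)
p_cars = poisson_pmf(3, car_rate)

# Free throws as weighted coin flips: 10 attempts by an assumed 75% shooter.
p_free_throws = binomial_pmf(8, n=10, p=0.75)
# A fair coin is the same model with p = 0.5.
p_coin_flips = binomial_pmf(8, n=10, p=0.5)

print(f"P(3 photons) = {p_photons:.4f}, P(3 cars) = {p_cars:.4f}")
print(f"P(8 of 10 free throws) = {p_free_throws:.4f}")
print(f"P(8 of 10 heads) = {p_coin_flips:.4f}")
```

The point of the sketch is that only the parameter values and the labels change between domains; the functions themselves do not.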

Mike Ludkovski 4:53

Yeah, and I would really echo Alex. You know, I do a lot of different interdisciplinary projects. I work with environmental scientists. I work with power engineers on the electrical grid. I work with quant finance folks. I work with insurance folks. And, you know, as you start doing these projects, you know, you have to relearn some things, but then you also realize that there is, you know, something called data analysis, data science. It's a field to itself which has its own methods, its own commonalities, and you can see, you know, that thing playing out across different topics. It really helps to kind of, I would say, be more deliberate, and kind of understand what the pitfalls are that you might face. And I think that really helps with how to talk to other folks, because the language of data is kind of easier to get a common ground on.

And, you know, a lot of the other lingo is so specialized, and often the same word means opposite things. But you know, data wrangling or interpreting plots is, like, always the same thing, again and again. And so, in fact, you know, one thing I found out is that when I do the, you know, data plus X, I kind of start with the X first, and then the data kind of happens later. And so I think, you know, working on different data driven projects in different domains was really kind of helpful to see how the students, you know, because you end up being kind of like a student yourself, you're learning something and being able to connect those dots. I think this is really helpful for, you know, explaining the same thing to students. You know, that's how you go from a particular domain problem to having the data pipeline in place.

Eric Van Dusen 6:40

Nice. Thank you. All right, the next question is a little bit related, but sort of thinking, like, how do you do that? Like, both of you are collaborating with people in other disciplines, and do you have advice for faculty or program people building new programs or trying to build collaborations? And both your labs really seem to have these real world examples that you connect with, energy markets, biology. So how do you find those partnerships, or those applied projects? There's a lot of interest in data science, and people say, let's have real world projects that the students do. What are some recommendations for finding real world projects?

Mike Ludkovski 7:20

Yeah, let me take this first. I think, you know, you just have to be, well, it really helps to be curious and be able to, you know, talk to people in their domain, and kind of, you know, be willing to learn. And so I kind of enjoy feeling like a beginner. I'm learning something new, and I'm trying to understand an area I don't know very much about, and being humble about it. And, you know, again, over time, you build kind of an arsenal of tools that are helpful, and you realize, you know, unexpected connections pop up, and that kind of makes it easier to keep going. But you know, again, to me, it's really, I guess that's why I'm a professor. I just like learning new things, and lots of things make me interested. And that's kind of the starting point for connecting with folks.

Alex Franks 8:15

And I'll just really briefly add to that. I think, in my experience, researchers in different domains are really eager for help with programming and statistics and data science, and so if you just reach out, and, like Mike said, if you are curious, if you do show interest in their research, usually they're pretty excited to engage and reciprocate.

Eric Van Dusen 8:42

That's awesome. Okay, now let's talk about UCSB specifically. How has statistics, becoming a statistics and data science curriculum, evolved to meet the students coming in with different backgrounds and different interests? My understanding is that the majors really grew a lot. You know, what do you see as working? What do you want to, like, evolve and improve?

Mike Ludkovski 9:09

Yeah, so let me break it into a couple, because it's quite a long question. So let me give you a few different answers. So I think, just as history, for a variety of reasons, you know, at this point we are actually sort of the second largest, I believe, statistics undergraduate program in the nation, which is pretty impressive. Actually, multiple UCs are also in the top 10 list, but we are now up there with Berkeley, with UCLA. We're like number two now. And so we had this, you know, virtuous or vicious cycle, or whatever I call it, where we have students who are interested in classes, so we get to hire faculty. Faculty come in, you know, assistant professors like Alex and other folks, and they want to teach, you know, something new and something exciting for themselves, which kind of was lining up nicely with data science generally. And so they offer a class, and the class is successful, and it creates new demand, and then there's more students, and then we get more positions. And so this has been kind of happening, really, for the last decade, you know, with a few ups and downs, but really that's been the process.

So it's been really, you know, easy to create new classes, because administration was like, yeah, we know you have lots of students, and they need more classes. So go ahead, if you want to have something new, go for it. At the same time, this growth was organic in terms of curriculum. So there was no master plan, like we did not have a thing saying we want to have, you know, a set of six, four, two, whatever number you like, of data science classes. It was, you know, one class at a time. So at a pace of, let's say, one per year, and now there is a large collection covering pretty much all the different, you know, nooks and crannies of data science, broadly speaking, from a statistical perspective, in place. So because there was no master plan, it's a little bit idiosyncratic.

It's a little bit reflecting the specific people who were creating the class and teaching it originally, and it's not necessarily aligning with how other schools do it. So now we are trying to go back a little bit and fix it and say, okay, you know, this is what the other places ended up doing, and this is kind of the common curriculum now. And how do we, you know, realign ourselves so that it helps with transfer students, so that we cover the same topics in the same sequence? And so we kind of keep tinkering with our own curriculum. But, yeah, I mean, we are blessed or cursed with, you know, we have a class, and, well, next thing we know, there's 200 students who want to take it, and then we have to think, okay, who's going to offer it besides the one person who actually designed it?

Lauren Chu 11:44

Great. And Michael, you're involved in two major initiatives focused on expanding data science pathways, the Southern California Consortium and the Pacific Alliance for Low Income Inclusion. Could you talk a little bit about what those efforts look like on the ground and how they're helping shape a more inclusive future for the field?

Mike Ludkovski 12:05

Yeah, so the Palisades project, this is an NSF S-STEM. S-STEM stands for scholarships in STEM. That's been a flagship program of the NSF undergraduate education division for about 30 years, and so it's a very well established program, and literally hundreds of projects have been run through that. But we are, actually, I think we are the third in history ever to have data science as part of our mission. And so the S-STEM program is about giving scholarships to low income students. So it's a very simple model. You find low income students who are, you know, basically this means Pell eligible for us, and you offer them a scholarship. You know, they have to have a good GPA, and they have to be interested in, you know, specific topics. It has to be very, it's very narrow in terms of the major.

So it's kind of aligning nicely with our major, and you support them, and then you offer them support around that. So we have graduate mentors, which I think is critical to kind of have some, you know, accessible role models. We have faculty mentors, and we have, you know, a lecture series. Next month, we're going to have an in-person event here at UCSB. So the main point about Palisades is that, you know, the NSF wants to have, like, a large scale thing. So we have seven different schools involved in this Palisades program, several Cal States, Irvine, University of Washington up in Seattle, so kind of up and down the West Coast. And so right now we have 47 scholars. So we are targeting juniors, seniors and first year graduate students. We have 47 scholars right now, and we're supposed to end up with a total of about 125 over five years of the program.

So this Palisades project is all about inclusion, because it's really supposed to give a, you know, leg up and support to the low income students who generally, you know, without the scholarship money, have to work, and they have a hard time understanding how the career tracks work, because it's still a bit, you know, mysterious for most of them. They are even more, you know, unsure about how graduate school works and if it's worth it for them or not.

And, yeah, it's trying to kind of level the playing field for them. The SECDS, it's quite a different animal. This is run through the California Learning Lab, which is a California legislature funded initiative which is also broad, and they had one year where they had a data science theme for their projects. This is also a large consortium. But there, the idea is all about transfer students and having a pathway so that students can move across institutions. So in California, this is a big thing for our sort of state legislature and the governor, so the students can move as seamlessly as possible between the community colleges. And so we have several community college partners, including Santa Barbara City College.

I know that Natalie was your guest a few weeks ago. We have several Cal States, which are in different stages of where they are with data science, as well as UCSB, which is kind of the one UC, let's say the lead, in this project. So our particular project has a big outreach component. So one thing we found out, and again, you know, I always feel like I'm kind of learning something new. So one thing I learned along the way was that even though, you know, we're in the same state and often very close to each other, the awareness of data science is vastly different across campuses within just a few miles of each other.

So just an example, one partner we have is Long Beach, Cal State Long Beach. They've been doing what they call "Data Day" at the beach for a few years now. And they feel like, you know, in their local neighborhood, people know about data science a lot, and so they have contacts with high schools and so forth. But then Cal State Channel Islands, which is in Camarillo, which is, you know, an hour drive from Long Beach, I mean, when they joined our project, they had no awareness. I mean, I think, you know, the department didn't really know about data science, had no classes. Their students were really unaware of what's going on, what the opportunities are. And so we are trying to help, you know, different places to stand up data science courses, programs, and share best practices. We organize events.

So we've been very successful in building what we call datathons, which is a Saturday half day event where students come, high school students or community college students come, and we kind of give them a basic coding exercise, some data manipulation and some really kind of, you know, practical questions, like, you know, what does this data tell you about, you know, kind of a problem we can relate to, you know, in day to day life? You know, how do you phrase it as a business type question? And so they spend about three, four hours doing that, and we give them a little certificate, and we tell them, here's what data science feels like as a class, and here's what it feels like as a kind of potential career.

Lauren Chu 17:31

Great. Thank you! Alex, your work on sensitivity analysis and causal inference touches on some pretty advanced topics. How do you make these ideas approachable for students, especially if it's the first time that they're encountering them?

Alex Franks

Yeah, so this is a great question. I like this question because it's about stuff I think about all the time, and it relates to a broader theme, not just in my research, but in my teaching. So I really like to just tell students to start by thinking about the story behind the data. So every data set has some underlying story, and we're trying to figure out what that story might be. And with people just getting started thinking about data, I try to kind of introduce different ideas of how different stories could be consistent with the data that we have. So just thinking about, you know, Stats 101, when you teach hypothesis testing, we're considering two different kinds of stories. So let's say we're testing for the difference in means in two groups. And one story is that the observed difference in means just arose due to chance, right? That actually the two groups were generated from the same process, and that there's just an observed difference due to chance.

And then the other story is that actually, no, the observed difference is not due to chance, that actually there's something fundamentally different about the two groups: they are generated from distributions that have different means. And then we introduce this notion of a statistical test to actually maybe reject the first story in favor of the second story. So this kind of thinking, thinking about the stories underlying the data, is especially important when trying to infer causal effects, especially from observational data, because you have so many stories that could possibly explain the data, some of which are consistent with causal interpretations, and some that may not be, right?
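The "two stories" framing can be sketched as a permutation test in Python. This is a minimal illustration under assumptions, not a method named in the conversation: the data are made up, and shuffling group labels is just one simple way to simulate the first story, in which both groups come from the same process.

```python
import random

def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_shuffles=5000, seed=0):
    """Return the observed difference in means and a two-sided permutation p-value.

    Story one (same process): the labels are arbitrary, so reshuffling them
    should often produce a difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles

# Made-up measurements for two hypothetical groups.
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.9]
group_b = [4.2, 4.8, 4.5, 5.0, 4.4, 4.7, 4.3, 4.9]

observed, p_value = permutation_test(group_a, group_b)
print(f"observed difference in means: {observed:.2f}")
print(f"share of shuffles at least as extreme: {p_value:.4f}")
```

A small share of shuffles matching the observed difference means the first story explains the data poorly, which is the sense in which a test "rejects the first story in favor of the second".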

You might be familiar with this saying, correlation does not imply causation, right? Well, correlation could be due to causation, or maybe it's not due to causation. So what's kind of unique in causal inference is that to get causal effects from observational data, we have to make assumptions, and those assumptions are usually not testable from data in the same way as the previous example. You really have to reason about those assumptions. And fundamentally, that's what sensitivity analysis is. It's probing how sensitive your conclusions are to different kinds of assumptions about the process that generated your data.
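One simple way to picture this kind of probing is sketched below in Python. It is not Alex's method, just a textbook-style illustration under a strong assumption: a linear model with a single unmeasured confounder, where the bias in a difference in means equals `gamma * delta`. Here `gamma` (the confounder's effect on the outcome) and `delta` (how much more common the confounder is in one group) are hypothetical parameters that cannot be estimated from the data, so the sketch simply sweeps over plausible values.

```python
def adjusted_effect(observed_diff: float, gamma: float, delta: float) -> float:
    """Effect estimate after removing the bias implied by an assumed confounder.

    Assumes a linear outcome model, so the bias is gamma * delta.
    """
    return observed_diff - gamma * delta

# Hypothetical observed difference in means between two groups.
observed_diff = 0.95

# Assumed, untestable sensitivity parameters to sweep over.
gammas = [0.0, 0.5, 1.0, 2.0]    # confounder's effect on the outcome
deltas = [0.0, 0.25, 0.5, 0.75]  # imbalance of the confounder across groups

# Collect the assumption settings under which the conclusion would flip sign.
sign_flips = []
for gamma in gammas:
    for delta in deltas:
        effect = adjusted_effect(observed_diff, gamma, delta)
        print(f"gamma={gamma:.2f} delta={delta:.2f} -> adjusted effect {effect:+.2f}")
        if effect < 0:
            sign_flips.append((gamma, delta))

print("assumptions that would reverse the conclusion:", sign_flips)
```

If only implausibly strong confounding reverses the sign, the conclusion is fairly robust; if mild confounding does, the causal story deserves more scrutiny.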

Lauren Chu 20:09

Great. And whether it's through advising, teaching, lab work, at the end of the day, you're both supporting students in a field that's very quickly evolving. Could you tell us a little bit about the capstone course, kind of as a means for helping students build confidence and grow?

Alex Franks 20:26

Yeah. So this came out of an NSF funded project, the NSF HDR, Harnessing the Data Revolution, and we offer this capstone sequence in which we have groups of, typically, about four upper division students working with a mentor in one of the sort of data related disciplines, so either statistics or computer science, and then another mentor that comes from a sponsoring company or a research lab on campus. And this is a three quarter sequence, so a full year course. For the first quarter in the fall, they actually have slightly more of a lecture style course, where they learn about collaboration and do mini projects, what it's like to sort of start a project and take it end to end, start to finish. And then starting in winter quarter, in January, they actually get paired up with their team and one of these sponsoring groups.

They learn about this domain. They learn all about the tools that are used in this domain, and what it takes to actually see a project through from start to finish. So I think it's a really good opportunity, and I think it really gives the students a sense of what it's like to do data science in the real world. And you're right, because things are evolving so quickly, we actually see a lot of change in some of the tools that are being used, and oftentimes, you know, some of these research labs or companies are actually quicker to pick up these new tools than we might be in the classroom.

So it's a way to expose students to some of the state of the art ideas, right? Like, we have capstone groups now doing projects related to AI and large language models, and those, of course, were not around even just a couple of years ago, and it's something that is sort of hard for us to start to incorporate into our curriculum on a short timeline, but it is giving students that early exposure and experience with a tool that I think is certainly going to be prevalent in the workplace.

Mike Ludkovski 22:50

Yeah, maybe I can add one more angle on the capstone. So one thing that happens in the capstone is that the students have a sort of external sponsor, somebody kind of from the domain, who is there as a mentor, and then they have a second mentor. You know, typically, right now, it's a graduate student, but it doesn't have to be. It can be, you know, teaching professors, regular faculty and so forth. And so there's really, you know, a very nice feature, I think, for students, which is that they have these multiple mentors coming from different domains, with kind of different, complementary perspectives. I think this is, you know, one of the downsides of the normal, you know, undergraduate education, that, you know, there's always, like, well, there's a class and there's a professor and that's it.

You know, you just get to see the subject and you get it from one point of view, and you never get to feel like, well, it's actually multifaceted that really has its own perspective. And of course, you know, if you're doing a math class and it's just, you know, theory and proof, then it doesn't really matter who is teaching that. But you know, for data science, when you have really kind of, you know, anything from an ethical to, you know, a cultural or aesthetic perspective about how to visualize things or how to describe things, I think it's super helpful to see multiple mentors grappling with the same topic, especially if they're coming from different disciplines. So I think this is working really well. It does create a big challenge for the coordinator, because, you know, sustainability becomes difficult, because, you know, maybe people don't get along, or you have to find twice as many people.

So it's tricky to sustain this, but I think otherwise. And I'm a big fan of co-teaching in general, and we've tried this for a couple of classes. Again, it's hard to sustain co-teaching, but when it does work, I think it's the best for data science education. And Alex and I have tried this multiple times, and at some point you kind of get burned out, because it ends up being, you know, 80% of the work for each person. But again, I think this is kind of the ideal scenario, to have more than one instructor.

Alex Franks 24:55

I was just going to add to that. I think the other thing about the capstone, and it touches on some of the themes we've already discussed, is that, you know, the students are in this cohort. We have about 60 students, I think, give or take, in the capstone sequence this year, and they're working in these groups of four. But throughout the remainder of the school year, they're giving presentations and updates about their work to the other students in this cohort. So they have exposure to everything that all of the students are doing across these very disparate disciplines, like we addressed.

So, you know, one group is doing, you know, generative models for music, artificially generated music. The other one is working with a research lab studying, you know, the sound that orcas make in the ocean and environmental issues, right? So fairly disparate things. And again, it's another opportunity for them to start making these connections, and be exposed to the different kinds of tools that arise across these fields.

Lauren Chu 26:00

Great. Thank you. And as we reach the end of the interview, something that we always ask all of our guests is, do you have any parting thoughts or words of wisdom for data science educators around the world who are listening in?

Alex Franks 26:16

You know, this is a really tough question, because, as we pointed out, things are really evolving so quickly. So I would say, I mean, I think the first thing is something I think I've already discussed to a degree, which is always trying to motivate what you're doing with interesting, real world examples. I think if you can't motivate the students, then it's going to be really hard to progress from there. But beyond that, I think, you know, for me, the advice, I guess, would be not to get too attached to a particular tool or method.

Focus on developing skills that are transportable, that will generalize to new areas, and every so often, take stock of where the field is, where it's going, what some of these new tools are, so that you can incorporate them into your teaching. I think, you know, where I see things sometimes not working out so well is when there are classes that have sort of been unchanged and taught the same way for 15-plus years, right? And because of the pace of change, I think we always have to continually take stock of where the field is, where it's going, and think about how we need to update our curriculum.

Mike Ludkovski 27:34

And I would just give a quick plug for, for lack of a better term, what I call a community of practice, you know, connecting to other educators and kind of discussing in a semi-formal environment what's happening, what's new, and, you know, what are the best practices. You know, data science is new, and so there's not a well formed community of, you know, the instructors or the faculty or the educators who are actually discussing or designing new courses. And I find it always super helpful to have an informal but also focused discussion, with a clear sort of agenda, in a medium sized group, about what are the challenges, what's coming up, what are the new tools that are happening.

Because, again, the infrastructure is evolving quite fast. You know, JupyterHub did not exist, really, 10 years ago at all, and five years ago, it was still a novelty. The whole Quarto environment, which actually Alex is a big fan of, is also, like, three years old, really. And so, you know, you kind of have to stay on top of this and just chat to other folks. I think this is so helpful, and always just fun.

Eric Van Dusen 28:42

Nice. Thank you. Thanks for a great conversation.

Mike Ludkovski 28:47

Great to be here.

Alex Franks 28:47

Thank you.

Eric Van Dusen 28:57

Thanks for listening to today's podcast. If you're interested in learning more about data science education resources, please subscribe to our Substack to get notified when we release any future podcasts, and join our community Slack channel through the link provided in this episode's description. Thank you.

Transcribed by https://otter.ai