If you enjoyed this video, here are additional resources to look at:
Coursera + Duke Specialization: Building Cloud Computing Solutions at Scale Specialization: https://www.coursera.org/specializations/building-cloud-computing-solutions-at-scale
Python, Bash, and SQL Essentials for Data Engineering Specialization: https://www.coursera.org/specializations/python-bash-sql-data-engineering-duke
AWS Certified Solutions Architect - Professional (SAP-C01) Cert Prep: 1 Design for Organizational Complexity:
https://www.linkedin.com/learning/aws-certified-solutions-architect-professional-sap-c01-cert-prep-1-design-for-organizational-complexity/design-for-organizational-complexity?autoplay=true
O'Reilly Book: Practical MLOps: https://www.amazon.com/Practical-MLOps-Operationalizing-Machine-Learning/dp/1098103017
O'Reilly Book: Python for DevOps: https://www.amazon.com/gp/product/B082P97LDW/
O'Reilly Book: Developing on AWS with C#: A Comprehensive Guide on Using C# to Build Solutions on the AWS Platform
https://www.amazon.com/Developing-AWS-Comprehensive-Solutions-Platform/dp/1492095877
Pragmatic AI: An Introduction to Cloud-based Machine Learning: https://www.amazon.com/gp/product/B07FB8F8QP/
Pragmatic AI Labs Book: Python Command-Line Tools: https://www.amazon.com/gp/product/B0855FSFYZ
Pragmatic AI Labs Book: Cloud Computing for Data Analysis: https://www.amazon.com/gp/product/B0992BN7W8
Pragmatic AI Book: Minimal Python: https://www.amazon.com/gp/product/B0855NSRR7
Pragmatic AI Book: Testing in Python: https://www.amazon.com/gp/product/B0855NSRR7
Subscribe to Pragmatic AI Labs YouTube Channel: https://www.youtube.com/channel/UCNDfiL0D1LUeKWAkRE1xO5Q
Subscribe to 52 Weeks of AWS Podcast: https://52-weeks-of-cloud.simplecast.com
View content on noahgift.com: https://noahgift.com/
View content on Pragmatic AI Labs Website: https://paiml.com/
[00:00.000 --> 00:02.260] Hey, three, two, one, there we go, we're live.
[00:02.260 --> 00:07.260] All right, so welcome Simon to Enterprise ML Ops interviews.
[00:09.760 --> 00:13.480] The goal of these interviews is to get people exposed
[00:13.480 --> 00:17.680] to real professionals who are doing work in ML Ops.
[00:17.680 --> 00:20.360] It's such a cutting edge field
[00:20.360 --> 00:22.760] that I think a lot of people are very curious about.
[00:22.760 --> 00:23.600] What is it?
[00:23.600 --> 00:24.960] You know, how do you do it?
[00:24.960 --> 00:27.760] And very honored to have Simon here.
[00:27.760 --> 00:29.200] And do you wanna introduce yourself
[00:29.200 --> 00:31.520] and maybe talk a little bit about your background?
[00:31.520 --> 00:32.360] Sure.
[00:32.360 --> 00:33.960] Yeah, thanks again for inviting me.
[00:34.960 --> 00:38.160] My name is Simon Stiebellehner, or Simon.
[00:38.160 --> 00:40.440] I am originally from Austria,
[00:40.440 --> 00:43.120] but currently working in the Netherlands, in Amsterdam,
[00:43.120 --> 00:46.080] at Transaction Monitoring Netherlands.
[00:46.080 --> 00:48.780] Here I am the lead ML Ops engineer.
[00:49.840 --> 00:51.680] What are we doing at TMNL, actually?
[00:51.680 --> 00:55.560] We are a data processing company, actually.
[00:55.560 --> 00:59.320] We are owned by the five largest banks of the Netherlands.
[00:59.320 --> 01:02.080] And our purpose is kind of what the name says.
[01:02.080 --> 01:05.920] We are basically doing anti-money laundering,
[01:05.920 --> 01:08.040] so anti-money laundering models that run
[01:08.040 --> 01:11.440] on the pseudonymized transactions of businesses
[01:11.440 --> 01:13.240] we get from these five banks
[01:13.240 --> 01:15.760] to detect unusual patterns on that transaction graph
[01:15.760 --> 01:19.000] that might indicate money laundering.
[01:19.000 --> 01:20.520] That's in a nutshell what we do.
[01:20.520 --> 01:21.800] So as you can imagine,
[01:21.800 --> 01:24.160] we are really focused on building models
[01:24.160 --> 01:27.280] and obviously ML Ops is a big component there
[01:27.280 --> 01:29.920] because that is really the core of what you do.
[01:29.920 --> 01:32.680] You wanna do it efficiently and effectively as well.
[01:32.680 --> 01:34.760] In my role as lead ML Ops engineer,
[01:34.760 --> 01:36.880] I'm on the one hand the lead engineer
[01:36.880 --> 01:38.680] of the actual ML Ops platform team.
[01:38.680 --> 01:40.200] So this is actually a centralized team
[01:40.200 --> 01:42.680] that builds out lots of the infrastructure
[01:42.680 --> 01:47.320] that's needed to do modeling effectively and efficiently.
[01:47.320 --> 01:50.360] But also I am the craft lead
[01:50.360 --> 01:52.640] for the machine learning engineering craft.
[01:52.640 --> 01:55.120] These are actually in our case, the machine learning engineers,
[01:55.120 --> 01:58.360] the people working within the model development teams
[01:58.360 --> 01:59.360] and cross functional teams
[01:59.360 --> 02:01.680] actually building these models.
[02:01.680 --> 02:03.640] That's what I'm currently doing
[02:03.640 --> 02:05.760] during the evenings and weekends.
[02:05.760 --> 02:09.400] I'm also a lecturer at the University of Applied Sciences in Vienna.
[02:09.400 --> 02:12.080] And there I'm teaching data mining
[02:12.080 --> 02:15.160] and data warehousing to master students, essentially.
[02:16.240 --> 02:19.080] Before TMNL, I was at bol.com,
[02:19.080 --> 02:21.960] which is the largest e-commerce retailer in the Netherlands.
[02:21.960 --> 02:25.040] So I always tend to say it's the Amazon of the Netherlands,
[02:25.040 --> 02:27.560] or of the Benelux, actually.
[02:27.560 --> 02:30.920] It is still the biggest e-commerce retailer in the Netherlands,
[02:30.920 --> 02:32.960] even before Amazon, actually.
[02:32.960 --> 02:36.160] And there I was an expert machine learning engineer.
[02:36.160 --> 02:39.240] So doing somewhat comparable stuff,
[02:39.240 --> 02:42.440] a bit more still focused on the actual modeling part.
[02:42.440 --> 02:44.800] Now it's really more on the infrastructure end.
[02:45.760 --> 02:46.760] And well, before that,
[02:46.760 --> 02:49.360] I spent some time in consulting, leading a data science team.
[02:49.360 --> 02:50.880] That's actually where I kind of come from.
[02:50.880 --> 02:53.360] I really come from originally the data science end.
[02:54.640 --> 02:57.840] And there I kind of started drifting towards ML Ops
[02:57.840 --> 02:59.200] because we started building out
[02:59.200 --> 03:01.640] a deployment and serving platform
[03:01.640 --> 03:04.440] that would, as a consulting company, make it easier
[03:04.440 --> 03:07.920] for us to deploy models for our clients
[03:07.920 --> 03:10.840] to serve these models, to also monitor these models.
[03:10.840 --> 03:12.800] And that kind of then made me drift further and further
[03:12.800 --> 03:15.520] down the engineering lane all the way to ML Ops.
[03:17.000 --> 03:19.600] Great, yeah, that's a great background.
[03:19.600 --> 03:23.200] I'm kind of curious in terms of the data science
[03:23.200 --> 03:25.240] to ML Ops journey,
[03:25.240 --> 03:27.720] that I think would be a great discussion
[03:27.720 --> 03:29.080] to dig into a little bit.
[03:30.280 --> 03:34.320] My background is originally more on the software engineering
[03:34.320 --> 03:36.920] side and when I was in the Bay Area,
[03:36.920 --> 03:41.160] I did individual contributor and then ran companies
[03:41.160 --> 03:44.240] at one point and ran multiple teams.
[03:44.240 --> 03:49.240] And then as the data science field exploded,
[03:49.240 --> 03:52.880] I hired multiple data science teams and worked with them.
[03:52.880 --> 03:55.800] But what was interesting is that I found that
[03:56.840 --> 03:59.520] I think the original approach of data science
[03:59.520 --> 04:02.520] from my perspective was lacking
[04:02.520 --> 04:07.240] in that there weren't really, like, deliverables.
[04:07.240 --> 04:10.520] And I think when you look at a software engineering team,
[04:10.520 --> 04:12.240] it's very clear there's deliverables.
[04:12.240 --> 04:14.800] Like you have a mobile app and it has to get better
[04:14.800 --> 04:15.880] each week, right?
[04:15.880 --> 04:18.200] Whereas, what are you doing?
[04:18.200 --> 04:20.880] And so I would love to hear your story
[04:20.880 --> 04:25.120] about how you went from doing kind of more pure data science
[04:25.120 --> 04:27.960] to now it sounds like ML Ops.
[04:27.960 --> 04:30.240] Yeah, yeah, actually.
[04:30.240 --> 04:33.800] So back then in consulting,
[04:33.800 --> 04:36.200] at least back then in Austria,
[04:36.200 --> 04:39.280] data science and everything around it was still kind of
[04:39.280 --> 04:43.720] in its infancy back then, 2016 and so on.
[04:43.720 --> 04:46.560] It was still really, really new to many organizations,
[04:46.560 --> 04:47.400] at least in Austria.
[04:47.400 --> 04:50.120] It might be some years behind the US and stuff.
[04:50.120 --> 04:52.040] But back then it was still relatively fresh.
[04:52.040 --> 04:55.240] So in consulting, what we very often struggled with was
[04:55.240 --> 04:58.520] on the modeling end, problems could be solved,
[04:58.520 --> 05:02.040] but actually then easy deployment,
[05:02.040 --> 05:05.600] keeping these models in production at client side.
[05:05.600 --> 05:08.880] That was always a bit more of the challenge.
[05:08.880 --> 05:12.400] And so naturally kind of I started thinking
[05:12.400 --> 05:16.200] and focusing more on the actual bigger problem that I saw,
[05:16.200 --> 05:19.440] which was not so much building the models,
[05:19.440 --> 05:23.080] but it was really more, how can we streamline things?
[05:23.080 --> 05:24.800] How can we keep things operating?
[05:24.800 --> 05:27.960] How can we make that move easier from a prototype,
[05:27.960 --> 05:30.680] from a PoC to a productionized model?
[05:30.680 --> 05:33.160] Also how can we keep it there and maintain it there?
[05:33.160 --> 05:35.480] So personally I was really more,
[05:35.480 --> 05:37.680] I saw that this problem was coming up
[05:38.960 --> 05:40.320] and that really fascinated me.
[05:40.320 --> 05:44.120] So I started jumping more on that exciting problem.
[05:44.120 --> 05:45.080] That's how it went for me.
[05:45.080 --> 05:47.000] And back then we then also recognized it
[05:47.000 --> 05:51.560] as a potential product in our case.
[05:51.560 --> 05:54.120] So we started building out that deployment
[05:54.120 --> 05:56.960] and serving and monitoring platform, actually.
[05:56.960 --> 05:59.520] And that then really for me, naturally,
[05:59.520 --> 06:01.840] I fell into that rabbit hole
[06:01.840 --> 06:04.280] and I also never wanted to get out of it again.
[06:05.680 --> 06:09.400] So the system that you built initially,
[06:09.400 --> 06:10.840] what was your stack?
[06:10.840 --> 06:13.760] What were some of the things you were using?
[06:13.760 --> 06:17.000] Yeah, so essentially we had,
[06:17.000 --> 06:19.560] when we talk about the stack on the backend,
[06:19.560 --> 06:20.560] there was a lot of,
[06:20.560 --> 06:23.000] so the full backend was written in Java.
[06:23.000 --> 06:25.560] We were using more from a user perspective,
[06:25.560 --> 06:28.040] the contract that we kind of had,
[06:28.040 --> 06:32.560] our goal was to build a drag and drop platform for models.
[06:32.560 --> 06:35.760] So basically the contract was you package your model
[06:35.760 --> 06:37.960] as an MLflow model,
[06:37.960 --> 06:41.520] and then you basically drag and drop it into a web UI.
[06:41.520 --> 06:43.640] It's gonna be wrapped in containers.
[06:43.640 --> 06:45.040] It's gonna be deployed.
[06:45.040 --> 06:45.880] It's gonna be,
[06:45.880 --> 06:49.680] there will be a monitoring layer in front of it
[06:49.680 --> 06:52.760] based on whatever the dataset is you trained it on.
[06:52.760 --> 06:55.920] You would automatically calculate different metrics,
[06:55.920 --> 06:57.360] different distributional metrics
[06:57.360 --> 06:59.240] around your variables that you are using.
[06:59.240 --> 07:02.080] And so we were layering this approach
[07:02.080 --> 07:06.840] so that eventually, for every incoming request,
[07:06.840 --> 07:08.160] you would have a nice dashboard.
[07:08.160 --> 07:10.040] You could monitor all that stuff.
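The monitoring layer described here, which compares live inputs against the distributions of the training dataset, is commonly implemented with a metric like the Population Stability Index. Below is a minimal sketch in plain Python; the metric choice, bin count, and thresholds are illustrative, not necessarily what this platform actually used:

```python
import math
from collections import Counter

def psi(train_sample, live_sample, bins=10):
    """Population Stability Index between a training sample and live
    traffic for one numeric feature. Rough rule of thumb: < 0.1 stable,
    0.1-0.25 some drift, > 0.25 significant drift."""
    lo, hi = min(train_sample), max(train_sample)
    width = (hi - lo) / bins or 1.0

    def bucket(x):
        # clamp into [0, bins-1] so live values outside the training
        # range fall into the edge buckets
        return min(bins - 1, max(0, int((x - lo) / width)))

    def dist(sample):
        counts = Counter(bucket(x) for x in sample)
        n = len(sample)
        # small epsilon avoids log(0) for empty buckets
        return [max(counts.get(b, 0) / n, 1e-6) for b in range(bins)]

    p, q = dist(train_sample), dist(live_sample)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

train = [float(i % 100) for i in range(1000)]
same = [float(i % 100) for i in range(500)]
shifted = [float(i % 100) + 40.0 for i in range(500)]

print(round(psi(train, same), 4))   # 0.0: identical distribution
print(psi(train, shifted) > 0.25)   # True: clear drift
```

Computing this per variable for each batch of requests is what feeds the kind of dashboard mentioned above.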
[07:10.040 --> 07:12.600] So stackwise it was actually MLflow.
[07:12.600 --> 07:15.480] Specifically MLflow models a lot.
[07:15.480 --> 07:17.920] Then it was Java in the backend, Python.
[07:17.920 --> 07:19.760] There was a lot of Python,
[07:19.760 --> 07:22.040] especially PySpark component as well.
[07:23.000 --> 07:25.880] It's been quite a while actually,
[07:25.880 --> 07:29.160] but there was quite some part written in Scala.
[07:29.160 --> 07:32.280] Also, because there was a component of this platform
[07:32.280 --> 07:34.800] was also a bit of an auto ML approach,
[07:34.800 --> 07:36.480] but that died then over time.
[07:36.480 --> 07:40.120] And that was also based on PySpark
[07:40.120 --> 07:43.280] and vanilla Spark written in Scala.
[07:43.280 --> 07:45.560] So we could facilitate the auto ML part.
[07:45.560 --> 07:48.600] And then later on we actually added that deployment,
[07:48.600 --> 07:51.480] the easy deployment and serving part.
[07:51.480 --> 07:55.280] So that was kind of, yeah, a lot of custom build stuff.
[07:55.280 --> 07:56.120] Back then, right?
[07:56.120 --> 07:59.720] There wasn't that much MLOps tooling out there yet.
[07:59.720 --> 08:02.920] So you need to build a lot of that stuff custom.
[08:02.920 --> 08:05.280] So it was largely custom built.
[08:05.280 --> 08:09.280] Yeah, the MLflow concept is an interesting concept
[08:09.280 --> 08:13.880] because they provide this package structure
[08:13.880 --> 08:17.520] that at least you have some idea of,
[08:17.520 --> 08:19.920] what is gonna be sent into the model
[08:19.920 --> 08:22.680] and like there's a format for the model.
[08:22.680 --> 08:24.720] And I think that part of MLflow
[08:24.720 --> 08:27.520] seems to be a pretty good idea,
[08:27.520 --> 08:30.080] which is you're creating a standard where,
[08:30.080 --> 08:32.360] you know, if in the case of,
[08:32.360 --> 08:34.720] if you're using scikit learn or something,
[08:34.720 --> 08:37.960] you don't necessarily want to just throw
[08:37.960 --> 08:40.560] like a pickled model somewhere and just say,
[08:40.560 --> 08:42.720] okay, you know, let's go.
[08:42.720 --> 08:44.760] Yeah, that was also our thinking back then.
[08:44.760 --> 08:48.040] So we thought a lot about what would be a,
[08:48.040 --> 08:51.720] what would be, what could become the standard actually
[08:51.720 --> 08:53.920] for how you package models.
[08:53.920 --> 08:56.200] And back then MLflow was one of the little tools
[08:56.200 --> 08:58.160] that was already there, already existent.
[08:58.160 --> 09:00.360] And of course there was Databricks behind it.
[09:00.360 --> 09:02.680] So we also made a bet on that back then and said,
[09:02.680 --> 09:04.920] all right, let's follow that packaging standard
[09:04.920 --> 09:08.680] and make it the contract how you would as a data scientist,
[09:08.680 --> 09:10.800] then how you would need to package it up
[09:10.800 --> 09:13.640] and submit it to the platform.
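For context on the packaging contract being discussed: an MLflow model is ultimately just a directory containing an `MLmodel` YAML descriptor alongside the serialized artifact, which is what makes it workable as a drag-and-drop contract. Here is a rough stdlib-only sketch of that layout; in practice you would call `mlflow.sklearn.save_model`, which also records the environment, and the descriptor fields below are abbreviated:

```python
import os
import pickle
import tempfile

def save_model_as_mlflow_layout(model, path):
    """Write a minimal MLflow-style model directory: an MLmodel YAML
    descriptor plus a pickled artifact. This only illustrates the
    layout; real code should use mlflow's save_model functions."""
    os.makedirs(path, exist_ok=True)
    with open(os.path.join(path, "model.pkl"), "wb") as f:
        pickle.dump(model, f)
    descriptor = "\n".join([
        "artifact_path: model",
        "flavors:",
        "  python_function:",
        "    loader_module: mlflow.sklearn",
        "    model_path: model.pkl",
        "    python_version: 3.10.0",
    ])
    with open(os.path.join(path, "MLmodel"), "w") as f:
        f.write(descriptor + "\n")

# a stand-in "model": anything picklable works for the sketch
model = {"coef": [0.4, -1.2], "intercept": 0.1}
target = os.path.join(tempfile.mkdtemp(), "my_model")
save_model_as_mlflow_layout(model, target)
print(sorted(os.listdir(target)))  # ['MLmodel', 'model.pkl']
```

A platform receiving such a directory can read the `python_function` flavor to know how to load and serve the model, regardless of which framework produced it.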
[09:13.640 --> 09:16.800] Yeah, it's interesting because the,
[09:16.800 --> 09:19.560] one of the, this reminds me of one of the issues
[09:19.560 --> 09:21.800] that's happening right now with cloud computing,
[09:21.800 --> 09:26.800] where in the cloud AWS has dominated for a long time
[09:29.480 --> 09:34.480] and they have 40% market share, I think globally.
[09:34.480 --> 09:38.960] And Azure's now gaining and they have some pretty good traction
[09:38.960 --> 09:43.120] and then GCP's been down there a bit, you know,
[09:43.120 --> 09:45.760] in maybe the 10% range or something like that.
[09:45.760 --> 09:47.760] But what's interesting is that it seems like
[09:47.760 --> 09:51.480] in the case of all of the cloud providers,
[09:51.480 --> 09:54.360] they haven't necessarily been leading the way
[09:54.360 --> 09:57.840] on things like packaging models, right?
[09:57.840 --> 10:01.480] Or, you know, they have their own proprietary systems
[10:01.480 --> 10:06.480] which have been developed and are continuing to be developed
[10:06.640 --> 10:08.920] like Vertex AI in the case of Google,
[10:09.760 --> 10:13.160] the SageMaker in the case of Amazon.
[10:13.160 --> 10:16.480] But what's interesting is, let's just take SageMaker,
[10:16.480 --> 10:20.920] for example, there isn't really like this, you know,
[10:20.920 --> 10:25.480] industry wide standard of model packaging
[10:25.480 --> 10:28.680] that SageMaker uses, they have their own proprietary stuff
[10:28.680 --> 10:31.040] that kind of builds in and Vertex AI
[10:31.040 --> 10:32.440] has their own proprietary stuff.
[10:32.440 --> 10:34.920] So, you know, I think it is interesting
[10:34.920 --> 10:36.960] to see what's gonna happen
[10:36.960 --> 10:41.120] because I think your original hypothesis which is,
[10:41.120 --> 10:44.960] let's pick, you know, this looks like it's got some traction
[10:44.960 --> 10:48.760] and it wasn't necessarily tied directly to a cloud provider
[10:48.760 --> 10:51.600] because Databricks can work on anything.
[10:51.600 --> 10:53.680] It seems like that in particular,
[10:53.680 --> 10:56.800] that's one of the more sticky problems right now
[10:56.800 --> 11:01.800] with MLOps is, you know, who's the leader?
[11:02.280 --> 11:05.440] Like, who's developing the right, you know,
[11:05.440 --> 11:08.880] kind of a standard for tooling.
[11:08.880 --> 11:12.320] And I don't know, maybe that leads into kind of you talking
[11:12.320 --> 11:13.760] a little bit about what you're doing currently.
[11:13.760 --> 11:15.600] Like, do you have any thoughts about the, you know,
[11:15.600 --> 11:18.720] current tooling and what you're doing at your current company
[11:18.720 --> 11:20.920] and what's going on with that?
[11:20.920 --> 11:21.760] Absolutely.
[11:21.760 --> 11:24.200] So at my current organization,
[11:24.200 --> 11:26.040] Transaction Monitoring Netherlands,
[11:26.040 --> 11:27.480] we are fully on AWS.
[11:27.480 --> 11:32.000] So we're really almost cloud native AWS.
[11:32.000 --> 11:34.840] And so that also means everything we do on the modeling side
[11:34.840 --> 11:36.600] really evolves around SageMaker.
[11:37.680 --> 11:40.840] So for us, specifically for us as MLops team,
[11:40.840 --> 11:44.680] we are building the platform around SageMaker capabilities.
[11:45.680 --> 11:48.360] And on that end, at least company internal,
[11:48.360 --> 11:52.880] we have a contract how you must actually deploy models.
[11:52.880 --> 11:56.200] There is only one way, what we call the golden path,
[11:56.200 --> 11:59.800] in that case, this is the streamlined highly automated path
[11:59.800 --> 12:01.360] that is supported by the platform.
[12:01.360 --> 12:04.360] This is the only way how you can actually deploy models.
[12:04.360 --> 12:09.360] And in our case, that is actually a SageMaker pipeline object.
[12:09.640 --> 12:12.680] So in our company, we're doing large scale batch processing.
[12:12.680 --> 12:15.040] So we're actually not doing anything real time at present.
[12:15.040 --> 12:17.040] We are doing post transaction monitoring.
[12:17.040 --> 12:20.960] So that means you need to submit essentially DAGs, right?
[12:20.960 --> 12:23.400] This is what we use for training.
[12:23.400 --> 12:25.680] This is what we also deploy eventually.
[12:25.680 --> 12:27.720] And this is our internal contract.
[12:27.720 --> 12:32.200] You need to provide, in your model repository,
[12:32.200 --> 12:34.640] you've got to have one place,
[12:34.640 --> 12:37.840] and there must be a function with a specific name,
[12:37.840 --> 12:41.440] and that function must return a SageMaker pipeline object.
[12:41.440 --> 12:44.920] So this is our internal contract, actually.
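A sketch of what such a contract can look like on both sides: the model team's entry point and a platform-side check. The entry point name `get_pipeline` and the stand-in `Pipeline` class are assumptions for illustration; with the real SDK the function would build and return a `sagemaker.workflow.pipeline.Pipeline`:

```python
import inspect

# Stand-in for sagemaker.workflow.pipeline.Pipeline, so the contract
# itself is the focus rather than the AWS calls.
class Pipeline:
    def __init__(self, name, steps):
        self.name, self.steps = name, steps

def get_pipeline():
    """The agreed-upon, specifically named entry point (name assumed
    here) that every model repository must expose."""
    preprocess = {"name": "preprocess", "kind": "ProcessingStep"}
    train = {"name": "train", "kind": "TrainingStep"}
    return Pipeline("aml-model-pipeline", [preprocess, train])

def validate_contract(module_namespace, entry_point="get_pipeline"):
    """What a platform-side check might do: the repo must expose a
    zero-argument callable with the agreed name returning a Pipeline."""
    fn = module_namespace.get(entry_point)
    if not callable(fn) or len(inspect.signature(fn).parameters) != 0:
        return False
    return isinstance(fn(), Pipeline)

print(validate_contract(globals()))  # True: this module honors the contract
```

Enforcing one narrow entry point like this is what makes a "golden path" automatable: the deployment machinery never needs to understand what is inside the pipeline, only where to find it.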
[12:44.920 --> 12:46.600] Yeah, that's interesting.
[12:46.600 --> 12:51.200] I mean, and I could see like for, I know many people
[12:51.200 --> 12:53.880] that are using SageMaker in production,
[12:53.880 --> 12:58.680] and it does seem like where it has some advantages
[12:58.680 --> 13:02.360] is that AWS generally does a pretty good job
[13:02.360 --> 13:04.240] at building solutions.
[13:04.240 --> 13:06.920] And if you just look at the history of services,
[13:06.920 --> 13:09.080] the odds are pretty high
[13:09.080 --> 13:12.880] that they'll keep getting better, keep improving things.
[13:12.880 --> 13:17.080] And it seems like what I'm hearing from people,
[13:17.080 --> 13:19.080] and it sounds like maybe with your organization as well,
[13:19.080 --> 13:24.080] is that potentially the SDK for SageMaker
[13:24.440 --> 13:29.120] is really the win versus some of the UX tools they have
[13:29.120 --> 13:32.680] and the interface for Canvas and Studio.
[13:32.680 --> 13:36.080] Is that what's happening?
[13:36.080 --> 13:38.720] Yeah, so I think, right,
[13:38.720 --> 13:41.440] what we try to do is we always try to think about our users.
[13:41.440 --> 13:44.880] So who are our users?
[13:44.880 --> 13:47.000] What capabilities and skills do they have?
[13:47.000 --> 13:50.080] And what freedom should they have
[13:50.080 --> 13:52.640] and what abilities should they have to develop models?
[13:52.640 --> 13:55.440] In our case, we don't really have use cases
[13:55.440 --> 13:58.640] for stuff like Canvas because our users
[13:58.640 --> 14:02.680] are fairly mature teams that know how to do their,
[14:02.680 --> 14:04.320] on the one hand, the data science stuff, of course,
[14:04.320 --> 14:06.400] but also the engineering stuff.
[14:06.400 --> 14:08.160] So in our case, things like Canvas
[14:08.160 --> 14:10.320] do not really play so much of a role
[14:10.320 --> 14:12.960] because obviously due to the high abstraction layer
[14:12.960 --> 14:15.640] of more like graphical user interfaces,
[14:15.640 --> 14:17.360] drag and drop tooling,
[14:17.360 --> 14:20.360] you are also limited in what you can do,
[14:20.360 --> 14:22.480] or what you can do easily.
[14:22.480 --> 14:26.320] So in our case, really, it is the strength of the flexibility
[14:26.320 --> 14:28.320] that the SageMaker SDK gives you.
[14:28.320 --> 14:33.040] And in general, the SDK around most AWS services.
[14:34.080 --> 14:36.760] But also it comes with challenges, of course.
[14:37.720 --> 14:38.960] You give a lot of freedom,
[14:38.960 --> 14:43.400] but also you're creating a certain ask,
[14:43.400 --> 14:47.320] certain requirements for your model development teams,
[14:47.320 --> 14:49.600] which is also why we've also been working
[14:49.600 --> 14:52.600] about abstracting further away from the SDK.
[14:52.600 --> 14:54.600] So our objective is actually
[14:54.600 --> 14:58.760] that you should not be forced to interact with the raw SDK
[14:58.760 --> 15:00.600] when you use SageMaker anymore,
[15:00.600 --> 15:03.520] but you have a thin layer of abstraction
[15:03.520 --> 15:05.480] on top of what you are doing.
[15:05.480 --> 15:07.480] That's actually something we are moving towards
[15:07.480 --> 15:09.320] more and more as well.
[15:09.320 --> 15:11.120] Because yeah, it gives you the flexibility,
[15:11.120 --> 15:12.960] but also flexibility comes at a cost,
[15:12.960 --> 15:15.080] comes often at the cost of speeds,
[15:15.080 --> 15:18.560] specifically when it comes to the 90% default stuff
[15:18.560 --> 15:20.720] that you want to do, yeah.
[15:20.720 --> 15:24.160] And one of the things that I have as a complaint
[15:24.160 --> 15:29.160] against SageMaker is that it only uses virtual machines,
[15:30.000 --> 15:35.000] and it does seem like a strange strategy in some sense.
[15:35.000 --> 15:40.000] Like for example, I guess if you're doing batch only,
[15:40.000 --> 15:42.000] it doesn't matter as much,
[15:42.000 --> 15:45.000] which I think is a good strategy actually
[15:45.000 --> 15:50.000] to get your batch based predictions very, very strong.
[15:50.000 --> 15:53.000] And in that case, maybe the virtual machines
[15:53.000 --> 15:56.000] make a little bit less of a complaint.
[15:56.000 --> 16:00.000] But in the case of the endpoints with SageMaker,
[16:00.000 --> 16:02.000] the fact that you have to spin up
[16:02.000 --> 16:04.000] these really expensive virtual machines
[16:04.000 --> 16:08.000] and let them run 24/7 to do online prediction,
[16:08.000 --> 16:11.000] is that something that your organization evaluated
[16:11.000 --> 16:13.000] and decided not to use?
[16:13.000 --> 16:15.000] Or like, what are your thoughts behind that?
[16:15.000 --> 16:19.000] Yeah, in our case, doing real time
[16:19.000 --> 16:22.000] or near real time inference is currently not really relevant
[16:22.000 --> 16:25.000] for the simple reason that when you think a bit more
[16:25.000 --> 16:28.000] about the money laundering or anti money laundering space,
[16:28.000 --> 16:31.000] typically, right,
[16:31.000 --> 16:34.000] every individual bank must do anti-money laundering
[16:34.000 --> 16:37.000] and they have armies of people doing that.
[16:37.000 --> 16:39.000] But on the other hand,
[16:39.000 --> 16:43.000] the time it actually takes from one of their systems,
[16:43.000 --> 16:46.000] one of their AML systems actually detecting something
[16:46.000 --> 16:49.000] that's unusual that then goes into a review process
[16:49.000 --> 16:54.000] until it eventually hits the governmental institution
[16:54.000 --> 16:56.000] that then takes care of the cases that have been
[16:56.000 --> 16:58.000] at least twice validated that they are indeed,
[16:58.000 --> 17:01.000] they look very unusual.
[17:01.000 --> 17:04.000] So this takes a while, this can take quite some time,
[17:04.000 --> 17:06.000] which is also why it doesn't really matter
[17:06.000 --> 17:09.000] whether you ship your prediction within a second
[17:09.000 --> 17:13.000] or whether it takes you a week or two weeks.
[17:13.000 --> 17:15.000] It doesn't really matter, hence for us,
[17:15.000 --> 17:19.000] that problem so far thinking about real time inference
[17:19.000 --> 17:21.000] has not been there.
[17:21.000 --> 17:25.000] But yeah, indeed, for other use cases,
[17:25.000 --> 17:27.000] for also private projects,
[17:27.000 --> 17:29.000] we've also been considering SageMaker Endpoints
[17:29.000 --> 17:31.000] for a while, but exactly what you said,
[17:31.000 --> 17:33.000] the fact that you need to have a very beefy machine
[17:33.000 --> 17:35.000] running all the time,
[17:35.000 --> 17:39.000] specifically when you have heavy GPU loads, right,
[17:39.000 --> 17:43.000] and you're actually paying for that machine running 24/7,
[17:43.000 --> 17:46.000] although you do have quite fluctuating load.
[17:46.000 --> 17:49.000] Yeah, then that definitely becomes quite a consideration
[17:49.000 --> 17:51.000] of what you go for.
[17:51.000 --> 17:58.000] Yeah, and I actually have been talking to AWS about that,
[17:58.000 --> 18:02.000] because one of the issues that I have is that
[18:02.000 --> 18:07.000] the AWS platform really pushes serverless,
[18:07.000 --> 18:10.000] and then my question for AWS is,
[18:10.000 --> 18:13.000] so why aren't you using it?
[18:13.000 --> 18:16.000] I mean, if you're pushing serverless for everything,
[18:16.000 --> 18:19.000] why is SageMaker not serverless?
[18:19.000 --> 18:21.000] And so maybe they're going to do that, I don't know.
[18:21.000 --> 18:23.000] I don't have any inside information,
[18:23.000 --> 18:29.000] but it is interesting to hear you had some similar concerns.
[18:29.000 --> 18:32.000] I know that there's two questions here.
[18:32.000 --> 18:37.000] One is someone asked about what do you do for data versioning,
[18:37.000 --> 18:41.000] and a second one is how do you do event based MLOps?
[18:41.000 --> 18:43.000] So maybe kind of following up.
[18:43.000 --> 18:46.000] Yeah, what do we do for data versioning?
[18:46.000 --> 18:51.000] On the one hand, we're running a data lakehouse,
[18:51.000 --> 18:54.000] where after data we get from the financial institutions,
[18:54.000 --> 18:57.000] from the banks that runs through massive data pipeline,
[18:57.000 --> 19:01.000] also on AWS, we're using Glue and Step Functions actually for that,
[19:01.000 --> 19:03.000] and then eventually it ends up modeled to some extent,
[19:03.000 --> 19:06.000] sanitized, quality checked in our data lakehouse,
[19:06.000 --> 19:10.000] and there we're actually using Hudi on top of S3.
[19:10.000 --> 19:13.000] And this is also what we use for versioning,
[19:13.000 --> 19:16.000] which we use for time travel and all these things.
[19:16.000 --> 19:19.000] So that is hoodie on top of S3,
[19:19.000 --> 19:21.000] when then pipelines,
[19:21.000 --> 19:24.000] so actually our model pipelines plug in there
[19:24.000 --> 19:27.000] and spit out predictions, alerts,
[19:27.000 --> 19:29.000] what we call alerts eventually.
[19:29.000 --> 19:33.000] That is something that we version based on unique IDs.
[19:33.000 --> 19:36.000] So processing IDs, we track pretty much everything,
[19:36.000 --> 19:39.000] every line of code that touched
[19:39.000 --> 19:43.000] or is related to a specific row in our data.
[19:43.000 --> 19:46.000] So we can exactly track back for every single row
[19:46.000 --> 19:48.000] in our predictions and in our alerts,
[19:48.000 --> 19:50.000] what pipeline ran on it,
[19:50.000 --> 19:52.000] which jobs were in that pipeline,
[19:52.000 --> 19:56.000] which code exactly was running in each job,
[19:56.000 --> 19:58.000] which intermediate results were produced.
[19:58.000 --> 20:01.000] So we're basically adding lineage information
[20:01.000 --> 20:03.000] to everything we output along that line,
[20:03.000 --> 20:05.000] so we can track everything back
[20:05.000 --> 20:09.000] using a few tools we've built.
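The row-level lineage stamping described here can be illustrated with a small sketch: every output row gets a processing ID plus fingerprints of the code that produced it. All field names below are invented for the illustration, not TMNL's actual schema:

```python
import hashlib
import json
import uuid

def code_fingerprint(source: str) -> str:
    """A short, stable hash of the code that ran; recording the hash of
    each job's source lets you trace any row back to exact code."""
    return hashlib.sha256(source.encode()).hexdigest()[:12]

def attach_lineage(rows, pipeline_name, job_sources):
    """Stamp every output row with a processing ID plus the pipeline
    name and per-job code fingerprints (field names are illustrative)."""
    run_id = str(uuid.uuid4())
    jobs = {name: code_fingerprint(src) for name, src in job_sources.items()}
    return [
        {**row,
         "_processing_id": run_id,
         "_pipeline": pipeline_name,
         "_job_code_hashes": jobs}
        for row in rows
    ]

alerts = [{"account": "a-123", "score": 0.97}]
stamped = attach_lineage(
    alerts,
    pipeline_name="aml-batch-v7",
    job_sources={"preprocess": "def clean(df): ...",
                 "score": "def score(df): ..."},
)
print(json.dumps(stamped[0]["_job_code_hashes"], indent=2))
```

With something like this in place, any alert can be traced back to the exact pipeline run, the jobs in it, and the code revision of each job, which is the property being described.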
[20:09.000 --> 20:12.000] So the tool you mentioned,
[20:12.000 --> 20:13.000] I'm not familiar with it.
[20:13.000 --> 20:14.000] What is it called again?
[20:14.000 --> 20:15.000] It's called Hudi?
[20:15.000 --> 20:16.000] Hudi.
[20:16.000 --> 20:17.000] Hudi.
[20:17.000 --> 20:18.000] Oh, what is it?
[20:18.000 --> 20:19.000] Maybe you can describe it.
[20:19.000 --> 20:22.000] Yeah, Hudi is essentially,
[20:22.000 --> 20:29.000] it's quite similar to other tools such as
[20:29.000 --> 20:31.000] Databricks, how is it called?
[20:31.000 --> 20:32.000] Databricks?
[20:32.000 --> 20:33.000] Delta Lake maybe?
[20:33.000 --> 20:34.000] Yes, exactly.
[20:34.000 --> 20:35.000] Exactly.
[20:35.000 --> 20:38.000] It's basically, it's equivalent to Delta Lake,
[20:38.000 --> 20:40.000] just back then when we looked into
[20:40.000 --> 20:42.000] what we were going to use,
[20:42.000 --> 20:44.000] Delta Lake was not open sourced yet.
[20:44.000 --> 20:46.000] Databricks open sourced it a while ago.
[20:46.000 --> 20:47.000] So we went for Hudi.
[20:47.000 --> 20:50.000] It essentially is a layer on top of,
[20:50.000 --> 20:53.000] in our case, S3 that allows you
[20:53.000 --> 20:58.000] to more easily keep track
[20:58.000 --> 21:03.000] of the actions you are performing on your data.
[21:03.000 --> 21:08.000] So it's essentially very similar to Delta Lake,
[21:08.000 --> 21:13.000] just already before an open sourced solution.
[21:13.000 --> 21:15.000] Yeah, that's, I didn't know anything about that.
[21:15.000 --> 21:16.000] So now I do.
[21:16.000 --> 21:19.000] So thanks for letting me know.
[21:19.000 --> 21:21.000] I'll have to look into that.
[21:21.000 --> 21:27.000] The other, I guess, interesting stack-related question is,
[21:27.000 --> 21:29.000] what are your thoughts about,
[21:29.000 --> 21:32.000] well, there are a couple of areas that I think
[21:32.000 --> 21:34.000] are interesting and that are emerging.
[21:34.000 --> 21:36.000] Oh, actually, there are multiple.
[21:36.000 --> 21:37.000] Maybe I'll just bring them all up,
[21:37.000 --> 21:39.000] and we'll do them one by one.
[21:39.000 --> 21:42.000] So these are some emerging areas that I'm seeing.
[21:42.000 --> 21:49.000] So one is the concept of event-driven
[21:49.000 --> 21:54.000] architecture versus maybe a more static architecture.
[21:54.000 --> 21:57.000] And I think, obviously, you're using Step Functions,
[21:57.000 --> 22:00.000] so you're a fan of event-driven architecture.
[22:00.000 --> 22:04.000] Maybe we'll start with that one:
[22:04.000 --> 22:08.000] what are your thoughts on going more event-driven in your organization?
[22:08.000 --> 22:09.000] Yeah.
[22:09.000 --> 22:13.000] In our case, essentially everything works event-driven.
[22:13.000 --> 22:14.000] Right.
[22:14.000 --> 22:19.000] So since we're on AWS, we're using EventBridge, or CloudWatch Events,
[22:19.000 --> 22:21.000] I think it's called EventBridge everywhere now.
[22:21.000 --> 22:22.000] Right.
[22:22.000 --> 22:24.000] This is how we trigger pretty much everything in our stack.
[22:24.000 --> 22:27.000] This is how we trigger our data pipelines when data comes in.
[22:27.000 --> 22:32.000] This is how we trigger the different Lambdas that parse
[22:32.000 --> 22:35.000] certain information from our logs and store it in different databases.
[22:35.000 --> 22:40.000] This is also how, at some point back in the past,
[22:40.000 --> 22:44.000] we triggered new deployments when new models were approved in
[22:44.000 --> 22:46.000] our model registry.
[22:46.000 --> 22:50.000] So basically everything we've been doing is fully event-driven.
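As a rough illustration of the kind of wiring described here, this is an EventBridge event pattern that matches S3 "Object Created" events for a single bucket. The bucket and rule names are placeholders, and it assumes EventBridge notifications are enabled on the bucket:

```python
import json

# EventBridge event pattern matching S3 "Object Created" events for
# one (placeholder) bucket; a rule with this pattern can target a
# Lambda, a Step Functions state machine, an SQS queue, etc.
event_pattern = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["my-data-bucket"]}},
}

# With boto3 and AWS credentials available, the rule could be created with:
# import boto3
# events = boto3.client("events")
# events.put_rule(Name="trigger-data-pipeline",
#                 EventPattern=json.dumps(event_pattern))

pattern_json = json.dumps(event_pattern)
```

One rule like this per trigger point is essentially how "data comes in, pipeline starts" gets expressed on AWS.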
[22:50.000 --> 22:51.000] Yeah.
[22:51.000 --> 22:56.000] So I think this is a key thing you bring up here:
[22:56.000 --> 23:00.000] I've talked to many people who don't use AWS, who are, you know,
[23:00.000 --> 23:03.000] experts in alternative technologies.
[23:03.000 --> 23:06.000] And one of the things that I've heard some people say is, oh,
[23:06.000 --> 23:13.000] well, AWS isn't as fast as X or Y, like Lambda isn't as fast as X or Y, or, you know, Kubernetes.
[23:13.000 --> 23:17.000] But the point you bring up is exactly the
[23:17.000 --> 23:24.000] way I think about AWS: the true advantage of the AWS platform
[23:24.000 --> 23:29.000] is the tight integration between the services,
[23:29.000 --> 23:31.000] and that you can design event-driven workflows.
[23:31.000 --> 23:33.000] Would you say that's accurate?
[23:33.000 --> 23:34.000] Absolutely.
[23:34.000 --> 23:35.000] Yeah.
[23:35.000 --> 23:39.000] I think designing event-driven workflows on AWS is incredibly easy to do.
[23:39.000 --> 23:40.000] Yeah.
[23:40.000 --> 23:43.000] And it also comes incredibly naturally, and that's extremely powerful.
[23:43.000 --> 23:44.000] Right.
[23:44.000 --> 23:49.000] And simply by having an easy way to trigger Lambdas in an event-driven fashion,
[23:49.000 --> 23:52.000] you can pretty much do everything and glue
[23:52.000 --> 23:54.000] together everything that you want.
[23:54.000 --> 23:56.000] I think that gives you tremendous flexibility.
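The glue pattern being described, a small Lambda that picks fields out of an incoming event and hands them to the next step, can be sketched like this. The event shape follows EventBridge's S3 "Object Created" format; the handler body and names are illustrative:

```python
def handler(event, context=None):
    """Minimal Lambda-style handler: pull the bucket and key out of
    an EventBridge S3 'Object Created' event and return what a real
    function would pass along (e.g. to a database write or the next
    pipeline stage)."""
    detail = event["detail"]
    return {
        "bucket": detail["bucket"]["name"],
        "key": detail["object"]["key"],
        "status": "queued",
    }

# Simulate the event EventBridge would deliver to the function:
sample_event = {
    "detail": {
        "bucket": {"name": "log-bucket"},
        "object": {"key": "2023/01/run.json"},
    }
}
result = handler(sample_event)
```

Chaining a handful of functions like this behind EventBridge rules is what "gluing everything together" amounts to in practice.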
[23:56.000 --> 23:57.000] Yeah.
[23:57.000 --> 24:00.000] So I think there are two things that come to mind now.
[24:00.000 --> 24:07.000] One is that if you are developing an MLOps platform that you