Enterprise MLOps Interview-Simon Stiebellehner

52 Weeks of Cloud

Sep 23 2022 • 56 mins

If you enjoyed this video, here are additional resources to look at:

Coursera + Duke Specialization: Building Cloud Computing Solutions at Scale Specialization: https://www.coursera.org/specializations/building-cloud-computing-solutions-at-scale

Python, Bash, and SQL Essentials for Data Engineering Specialization: https://www.coursera.org/specializations/python-bash-sql-data-engineering-duke

AWS Certified Solutions Architect - Professional (SAP-C01) Cert Prep: 1 Design for Organizational Complexity:
https://www.linkedin.com/learning/aws-certified-solutions-architect-professional-sap-c01-cert-prep-1-design-for-organizational-complexity/design-for-organizational-complexity?autoplay=true

O'Reilly Book: Practical MLOps: https://www.amazon.com/Practical-MLOps-Operationalizing-Machine-Learning/dp/1098103017

O'Reilly Book: Python for DevOps: https://www.amazon.com/gp/product/B082P97LDW/

O'Reilly Book: Developing on AWS with C#: A Comprehensive Guide on Using C# to Build Solutions on the AWS Platform
https://www.amazon.com/Developing-AWS-Comprehensive-Solutions-Platform/dp/1492095877

Pragmatic AI: An Introduction to Cloud-based Machine Learning: https://www.amazon.com/gp/product/B07FB8F8QP/

Pragmatic AI Labs Book: Python Command-Line Tools: https://www.amazon.com/gp/product/B0855FSFYZ

Pragmatic AI Labs Book: Cloud Computing for Data Analysis: https://www.amazon.com/gp/product/B0992BN7W8

Pragmatic AI Book: Minimal Python: https://www.amazon.com/gp/product/B0855NSRR7

Pragmatic AI Book: Testing in Python: https://www.amazon.com/gp/product/B0855NSRR7

Subscribe to Pragmatic AI Labs YouTube Channel: https://www.youtube.com/channel/UCNDfiL0D1LUeKWAkRE1xO5Q

Subscribe to 52 Weeks of AWS Podcast: https://52-weeks-of-cloud.simplecast.com

View content on noahgift.com: https://noahgift.com/

View content on Pragmatic AI Labs Website: https://paiml.com/

[00:00.000 --> 00:02.260] Hey, three, two, one, there we go, we're live.
[00:02.260 --> 00:07.260] All right, so welcome Simon to Enterprise ML Ops interviews.
[00:09.760 --> 00:13.480] The goal of these interviews is to get people exposed
[00:13.480 --> 00:17.680] to real professionals who are doing work in ML Ops.
[00:17.680 --> 00:20.360] It's such a cutting edge field
[00:20.360 --> 00:22.760] that I think a lot of people are very curious about.
[00:22.760 --> 00:23.600] What is it?
[00:23.600 --> 00:24.960] You know, how do you do it?
[00:24.960 --> 00:27.760] And very honored to have Simon here.
[00:27.760 --> 00:29.200] And do you wanna introduce yourself
[00:29.200 --> 00:31.520] and maybe talk a little bit about your background?
[00:31.520 --> 00:32.360] Sure.
[00:32.360 --> 00:33.960] Yeah, thanks again for inviting me.
[00:34.960 --> 00:38.160] My name is Simon Stiebellehner, or Simon.
[00:38.160 --> 00:40.440] I am originally from Austria,
[00:40.440 --> 00:43.120] but currently working in the Netherlands, in Amsterdam,
[00:43.120 --> 00:46.080] at Transaction Monitoring Netherlands.
[00:46.080 --> 00:48.780] Here I am the lead ML Ops engineer.
[00:49.840 --> 00:51.680] What are we doing at TMNL, actually?
[00:51.680 --> 00:55.560] We are a data processing company.
[00:55.560 --> 00:59.320] We are owned by the five large banks of the Netherlands.
[00:59.320 --> 01:02.080] And our purpose is kind of what the name says.
[01:02.080 --> 01:05.920] We are basically building, specifically, anti-money-laundering models.
[01:05.920 --> 01:08.040] So anti-money-laundering models that run
[01:08.040 --> 01:11.440] on pseudonymized transactions of businesses
[01:11.440 --> 01:13.240] we get from these five banks,
[01:13.240 --> 01:15.760] to detect unusual patterns on that transaction graph
[01:15.760 --> 01:19.000] that might indicate money laundering.
[01:19.000 --> 01:20.520] That's in a nutshell what we do.
[01:20.520 --> 01:21.800] So as you can imagine,
[01:21.800 --> 01:24.160] we are really focused on building models
[01:24.160 --> 01:27.280] and obviously ML Ops is a big component there
[01:27.280 --> 01:29.920] because that is really the core of what you do.
[01:29.920 --> 01:32.680] You wanna do it efficiently and effectively as well.
[01:32.680 --> 01:34.760] In my role as lead ML Ops engineer,
[01:34.760 --> 01:36.880] I'm on the one hand the lead engineer
[01:36.880 --> 01:38.680] of the actual ML Ops platform team.
[01:38.680 --> 01:40.200] So this is actually a centralized team
[01:40.200 --> 01:42.680] that builds out lots of the infrastructure
[01:42.680 --> 01:47.320] that's needed to do modeling effectively and efficiently.
[01:47.320 --> 01:50.360] But also I am the craft lead
[01:50.360 --> 01:52.640] for the machine learning engineering craft.
[01:52.640 --> 01:55.120] These are actually in our case, the machine learning engineers,
[01:55.120 --> 01:58.360] the people working within the model development teams
[01:58.360 --> 01:59.360] and cross functional teams
[01:59.360 --> 02:01.680] actually building these models.
[02:01.680 --> 02:03.640] That's what I'm currently doing
[02:03.640 --> 02:05.760] during the evenings and weekends.
[02:05.760 --> 02:09.400] I'm also a lecturer at the University of Applied Sciences, Vienna.
[02:09.400 --> 02:12.080] And there I'm teaching data mining
[02:12.080 --> 02:15.160] and data warehousing to master students, essentially.
[02:16.240 --> 02:19.080] Before TMNL, I was at bold.com,
[02:19.080 --> 02:21.960] which is the largest eCommerce retailer in the Netherlands.
[02:21.960 --> 02:25.040] So I always tend to say it's the Amazon of the Netherlands,
[02:25.040 --> 02:27.560] or of the Benelux, actually.
[02:27.560 --> 02:30.920] It is still the biggest eCommerce retailer in the Netherlands,
[02:30.920 --> 02:32.960] even ahead of Amazon, actually.
[02:32.960 --> 02:36.160] And there I was an expert machine learning engineer.
[02:36.160 --> 02:39.240] So doing somewhat comparable stuff,
[02:39.240 --> 02:42.440] a bit more still focused on the actual modeling part.
[02:42.440 --> 02:44.800] Now it's really more on the infrastructure end.
[02:45.760 --> 02:46.760] And well, before that,
[02:46.760 --> 02:49.360] I spent some time in consulting, leading a data science team.
[02:49.360 --> 02:50.880] That's actually where I kind of come from.
[02:50.880 --> 02:53.360] I really come from originally the data science end.
[02:54.640 --> 02:57.840] And there I kind of started drifting towards ML Ops
[02:57.840 --> 02:59.200] because we started building out
[02:59.200 --> 03:01.640] a deployment and serving platform
[03:01.640 --> 03:04.440] that would, as a consulting company, make it easier
[03:04.440 --> 03:07.920] for us to deploy models for our clients
[03:07.920 --> 03:10.840] to serve these models, to also monitor these models.
[03:10.840 --> 03:12.800] And that kind of then made me drift further and further
[03:12.800 --> 03:15.520] down the engineering lane all the way to ML Ops.
[03:17.000 --> 03:19.600] Great, yeah, that's a great background.
[03:19.600 --> 03:23.200] I'm kind of curious in terms of the data science
[03:23.200 --> 03:25.240] to ML Ops journey,
[03:25.240 --> 03:27.720] that I think would be a great discussion
[03:27.720 --> 03:29.080] to dig into a little bit.
[03:30.280 --> 03:34.320] My background is originally more on the software engineering
[03:34.320 --> 03:36.920] side and when I was in the Bay Area,
[03:36.920 --> 03:41.160] I did individual contributor and then ran companies
[03:41.160 --> 03:44.240] at one point and ran multiple teams.
[03:44.240 --> 03:49.240] And then as the data science field exploded,
[03:49.240 --> 03:52.880] I hired multiple data science teams and worked with them.
[03:52.880 --> 03:55.800] But what was interesting is that I found that
[03:56.840 --> 03:59.520] I think the original approach of data science
[03:59.520 --> 04:02.520] from my perspective was lacking
[04:02.520 --> 04:07.240] in that there wasn't really like deliverables.
[04:07.240 --> 04:10.520] And I think when you look at a software engineering team,
[04:10.520 --> 04:12.240] it's very clear there's deliverables.
[04:12.240 --> 04:14.800] Like you have a mobile app and it has to get better
[04:14.800 --> 04:15.880] each week, right?
[04:15.880 --> 04:18.200] Whereas otherwise, what are you doing?
[04:18.200 --> 04:20.880] And so I would love to hear your story
[04:20.880 --> 04:25.120] about how you went from doing kind of more pure data science
[04:25.120 --> 04:27.960] to now it sounds like ML Ops.
[04:27.960 --> 04:30.240] Yeah, yeah, actually.
[04:30.240 --> 04:33.800] So back then in consulting one of the,
[04:33.800 --> 04:36.200] which was still at least back then in Austria,
[04:36.200 --> 04:39.280] data science and everything around it was still kind of
[04:39.280 --> 04:43.720] in its infancy back then, in 2016 and so on.
[04:43.720 --> 04:46.560] It was still really, really new to many organizations,
[04:46.560 --> 04:47.400] at least in Austria.
[04:47.400 --> 04:50.120] It might be some years behind the US and stuff.
[04:50.120 --> 04:52.040] But back then it was still relatively fresh.
[04:52.040 --> 04:55.240] So in consulting, what we very often struggled with was
[04:55.240 --> 04:58.520] on the modeling end, problems could be solved,
[04:58.520 --> 05:02.040] but actually then easy deployment,
[05:02.040 --> 05:05.600] keeping these models in production at client side.
[05:05.600 --> 05:08.880] That was always a bit more of the challenge.
[05:08.880 --> 05:12.400] And so naturally kind of I started thinking
[05:12.400 --> 05:16.200] and focusing more on the actual bigger problem that I saw,
[05:16.200 --> 05:19.440] which was not so much building the models,
[05:19.440 --> 05:23.080] but it was really more, how can we streamline things?
[05:23.080 --> 05:24.800] How can we keep things operating?
[05:24.800 --> 05:27.960] How can we make that move easier from a prototype,
[05:27.960 --> 05:30.680] from a PoC to a productionized model?
[05:30.680 --> 05:33.160] Also how can we keep it there and maintain it there?
[05:33.160 --> 05:35.480] So personally I was really more,
[05:35.480 --> 05:37.680] I saw that this problem was coming up
[05:38.960 --> 05:40.320] and that really fascinated me.
[05:40.320 --> 05:44.120] So I started jumping more on that exciting problem.
[05:44.120 --> 05:45.080] That's how it went for me.
[05:45.080 --> 05:47.000] And back then we then also recognized it
[05:47.000 --> 05:51.560] as a potential product in our case.
[05:51.560 --> 05:54.120] So we started building out that deployment
[05:54.120 --> 05:56.960] and serving and monitoring platform, actually.
[05:56.960 --> 05:59.520] And that then really for me, naturally,
[05:59.520 --> 06:01.840] I fell into that rabbit hole
[06:01.840 --> 06:04.280] and I also never wanted to get out of it again.
[06:05.680 --> 06:09.400] So the system that you built initially,
[06:09.400 --> 06:10.840] what was your stack?
[06:10.840 --> 06:13.760] What were some of the things you were using?
[06:13.760 --> 06:17.000] Yeah, so essentially we had,
[06:17.000 --> 06:19.560] when we talk about the stack on the backend,
[06:19.560 --> 06:20.560] there was a lot of,
[06:20.560 --> 06:23.000] so the full backend was written in Java.
[06:23.000 --> 06:25.560] We were using more from a user perspective,
[06:25.560 --> 06:28.040] the contract that we kind of had,
[06:28.040 --> 06:32.560] our goal was to build a drag and drop platform for models.
[06:32.560 --> 06:35.760] So basically the contract was you package your model
[06:35.760 --> 06:37.960] as an MLflow model,
[06:37.960 --> 06:41.520] and then you basically drag and drop it into a web UI.
[06:41.520 --> 06:43.640] It's gonna be wrapped in containers.
[06:43.640 --> 06:45.040] It's gonna be deployed.
[06:45.040 --> 06:45.880] It's gonna be,
[06:45.880 --> 06:49.680] there will be a monitoring layer in front of it
[06:49.680 --> 06:52.760] based on whatever the dataset is you trained it on.
[06:52.760 --> 06:55.920] You would automatically calculate different metrics,
[06:55.920 --> 06:57.360] different distributional metrics
[06:57.360 --> 06:59.240] around your variables that you are using.
[06:59.240 --> 07:02.080] And so we were layering this approach
[07:02.080 --> 07:06.840] to, so that eventually every incoming request would be,
[07:06.840 --> 07:08.160] you would have a nice dashboard.
[07:08.160 --> 07:10.040] You could monitor all that stuff.
[07:10.040 --> 07:12.600] So stackwise it was actually MLflow.
[07:12.600 --> 07:15.480] Specifically MLflow models a lot.
[07:15.480 --> 07:17.920] Then it was Java in the backend, Python.
[07:17.920 --> 07:19.760] There was a lot of Python,
[07:19.760 --> 07:22.040] especially PySpark component as well.
[07:23.000 --> 07:25.880] There was a, it's been quite a while actually,
[07:25.880 --> 07:29.160] there was quite a large part written in Scala.
[07:29.160 --> 07:32.280] Also, because there was a component of this platform
[07:32.280 --> 07:34.800] was also a bit of an auto ML approach,
[07:34.800 --> 07:36.480] but that died then over time.
[07:36.480 --> 07:40.120] And that was also based on PySpark
[07:40.120 --> 07:43.280] and vanilla Spark written in Scala.
[07:43.280 --> 07:45.560] So we could facilitate the auto ML part.
[07:45.560 --> 07:48.600] And then later on we actually added that deployment,
[07:48.600 --> 07:51.480] the easy deployment and serving part.
[07:51.480 --> 07:55.280] So that was kind of, yeah, a lot of custom build stuff.
[07:55.280 --> 07:56.120] Back then, right?
[07:56.120 --> 07:59.720] There wasn't that much MLOps tooling out there yet.
[07:59.720 --> 08:02.920] So you need to build a lot of that stuff custom.
[08:02.920 --> 08:05.280] So it was largely custom built.
[08:05.280 --> 08:09.280] Yeah, the MLflow concept is an interesting concept
[08:09.280 --> 08:13.880] because they provide this package structure
[08:13.880 --> 08:17.520] that at least you have some idea of,
[08:17.520 --> 08:19.920] what is gonna be sent into the model
[08:19.920 --> 08:22.680] and like there's a format for the model.
[08:22.680 --> 08:24.720] And I think that part of MLflow
[08:24.720 --> 08:27.520] seems to be a pretty good idea,
[08:27.520 --> 08:30.080] which is you're creating a standard where,
[08:30.080 --> 08:32.360] you know, if in the case of,
[08:32.360 --> 08:34.720] if you're using scikit learn or something,
[08:34.720 --> 08:37.960] you don't necessarily want to just throw
[08:37.960 --> 08:40.560] like a pickled model somewhere and just say,
[08:40.560 --> 08:42.720] okay, you know, let's go.
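The point about not just throwing a pickled model somewhere can be illustrated with a toy packaging scheme. This is not MLflow's actual format, just a minimal stdlib sketch of the idea: the artifact bundles the model together with a declared input schema, so a generic serving layer can validate requests without knowing the model:

```python
import json
import os
import pickle
import tempfile

def save_model(path_prefix, model, input_schema):
    """Persist the model next to a declared input schema."""
    with open(path_prefix + ".pkl", "wb") as f:
        pickle.dump(model, f)
    with open(path_prefix + ".json", "w") as f:
        json.dump({"schema": input_schema}, f)

def load_and_predict(path_prefix, row):
    """A generic serving layer can validate any model's input."""
    with open(path_prefix + ".json") as f:
        schema = json.load(f)["schema"]
    missing = [col for col in schema if col not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    with open(path_prefix + ".pkl", "rb") as f:
        model = pickle.load(f)
    return model(row)

def tiny_model(row):
    # A 'model' is just a callable scoring function in this sketch.
    return row["amount"] * 0.5

prefix = os.path.join(tempfile.gettempdir(), "demo_model")
save_model(prefix, tiny_model, ["amount", "currency"])
print(load_and_predict(prefix, {"amount": 10.0, "currency": "EUR"}))  # → 5.0
```

MLflow's model format does essentially this at a larger scale: a standard directory layout plus metadata about flavors and input signatures.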
[08:42.720 --> 08:44.760] Yeah, that was also our thinking back then.
[08:44.760 --> 08:48.040] So we thought a lot about what would be a,
[08:48.040 --> 08:51.720] what would be, what could become the standard actually
[08:51.720 --> 08:53.920] for how you package models.
[08:53.920 --> 08:56.200] And back then MLflow was one of the little tools
[08:56.200 --> 08:58.160] that was already there, already existent.
[08:58.160 --> 09:00.360] And of course there was data bricks behind it.
[09:00.360 --> 09:02.680] So we also made a bet on that back then and said,
[09:02.680 --> 09:04.920] all right, let's follow that packaging standard
[09:04.920 --> 09:08.680] and make it the contract how you would as a data scientist,
[09:08.680 --> 09:10.800] then how you would need to package it up
[09:10.800 --> 09:13.640] and submit it to the platform.
[09:13.640 --> 09:16.800] Yeah, it's interesting because the,
[09:16.800 --> 09:19.560] one of the, this reminds me of one of the issues
[09:19.560 --> 09:21.800] that's happening right now with cloud computing,
[09:21.800 --> 09:26.800] where in the cloud AWS has dominated for a long time
[09:29.480 --> 09:34.480] and they have 40% market share, I think globally.
[09:34.480 --> 09:38.960] And Azure's now gaining and they have some pretty good traction
[09:38.960 --> 09:43.120] and then GCP's been down for a bit, you know,
[09:43.120 --> 09:45.760] in that maybe the 10% range or something like that.
[09:45.760 --> 09:47.760] But what's interesting is that it seems like
[09:47.760 --> 09:51.480] in the case of all of the cloud providers,
[09:51.480 --> 09:54.360] they haven't necessarily been leading the way
[09:54.360 --> 09:57.840] on things like packaging models, right?
[09:57.840 --> 10:01.480] Or, you know, they have their own proprietary systems
[10:01.480 --> 10:06.480] which have been developed and are continuing to be developed
[10:06.640 --> 10:08.920] like Vertex AI in the case of Google,
[10:09.760 --> 10:13.160] and SageMaker in the case of Amazon.
[10:13.160 --> 10:16.480] But what's interesting is, let's just take SageMaker,
[10:16.480 --> 10:20.920] for example, there isn't really like this, you know,
[10:20.920 --> 10:25.480] industry wide standard of model packaging
[10:25.480 --> 10:28.680] that SageMaker uses, they have their own proprietary stuff
[10:28.680 --> 10:31.040] that kind of builds in and Vertex AI
[10:31.040 --> 10:32.440] has their own proprietary stuff.
[10:32.440 --> 10:34.920] So, you know, I think it is interesting
[10:34.920 --> 10:36.960] to see what's gonna happen
[10:36.960 --> 10:41.120] because I think your original hypothesis which is,
[10:41.120 --> 10:44.960] let's pick, you know, this looks like it's got some traction
[10:44.960 --> 10:48.760] and it wasn't necessarily tied directly to a cloud provider
[10:48.760 --> 10:51.600] because Databricks can work on anything.
[10:51.600 --> 10:53.680] It seems like that in particular,
[10:53.680 --> 10:56.800] that's one of the more sticky problems right now
[10:56.800 --> 11:01.800] with MLOps is, you know, who's the leader?
[11:02.280 --> 11:05.440] Like, who's developing the right, you know,
[11:05.440 --> 11:08.880] kind of a standard for tooling.
[11:08.880 --> 11:12.320] And I don't know, maybe that leads into kind of you talking
[11:12.320 --> 11:13.760] a little bit about what you're doing currently.
[11:13.760 --> 11:15.600] Like, do you have any thoughts about the, you know,
[11:15.600 --> 11:18.720] current tooling and what you're doing at your current company
[11:18.720 --> 11:20.920] and what's going on with that?
[11:20.920 --> 11:21.760] Absolutely.
[11:21.760 --> 11:24.200] So at my current organization,
[11:24.200 --> 11:26.040] Transaction Monitor Netherlands,
[11:26.040 --> 11:27.480] we are fully on AWS.
[11:27.480 --> 11:32.000] So we're really almost cloud native AWS.
[11:32.000 --> 11:34.840] And so that also means everything we do on the modeling side
[11:34.840 --> 11:36.600] really evolves around SageMaker.
[11:37.680 --> 11:40.840] So for us, specifically for us as MLops team,
[11:40.840 --> 11:44.680] we are building the platform around SageMaker capabilities.
[11:45.680 --> 11:48.360] And on that end, at least company internal,
[11:48.360 --> 11:52.880] we have a contract how you must actually deploy models.
[11:52.880 --> 11:56.200] There is only one way, what we call the golden path,
[11:56.200 --> 11:59.800] in that case, this is the streamlined highly automated path
[11:59.800 --> 12:01.360] that is supported by the platform.
[12:01.360 --> 12:04.360] This is the only way how you can actually deploy models.
[12:04.360 --> 12:09.360] And in our case, that is actually a SageMaker pipeline object.
[12:09.640 --> 12:12.680] So in our company, we're doing large scale batch processing.
[12:12.680 --> 12:15.040] So we're actually not doing anything real time at present.
[12:15.040 --> 12:17.040] We are doing post transaction monitoring.
[12:17.040 --> 12:20.960] So that means you need to submit essentially DAGs, right?
[12:20.960 --> 12:23.400] This is what we use for training.
[12:23.400 --> 12:25.680] This is what we also deploy eventually.
[12:25.680 --> 12:27.720] And this is our internal contract.
[12:27.720 --> 12:32.200] You need to, in your model repository,
[12:32.200 --> 12:34.640] you've got to have one place,
[12:34.640 --> 12:37.840] and there must be a function with a specific name
[12:37.840 --> 12:41.440] and that function must return a SageMaker pipeline object.
[12:41.440 --> 12:44.920] So this is our internal contract actually.
[12:44.920 --> 12:46.600] Yeah, that's interesting.
[12:46.600 --> 12:51.200] I mean, and I could see like for, I know many people
[12:51.200 --> 12:53.880] that are using SageMaker in production,
[12:53.880 --> 12:58.680] and it does seem like where it has some advantages
[12:58.680 --> 13:02.360] is that AWS generally does a pretty good job
[13:02.360 --> 13:04.240] at building solutions.
[13:04.240 --> 13:06.920] And if you just look at the history of services,
[13:06.920 --> 13:09.080] the odds are pretty high
[13:09.080 --> 13:12.880] that they'll keep getting better, keep improving things.
[13:12.880 --> 13:17.080] And it seems like what I'm hearing from people,
[13:17.080 --> 13:19.080] and it sounds like maybe with your organization as well,
[13:19.080 --> 13:24.080] is that potentially the SDK for SageMaker
[13:24.440 --> 13:29.120] is really the win versus some of the UX tools they have
[13:29.120 --> 13:32.680] and the interface for Canvas and Studio.
[13:32.680 --> 13:36.080] Is that what's happening?
[13:36.080 --> 13:38.720] Yeah, so I think, right,
[13:38.720 --> 13:41.440] what we try to do is we always try to think about our users.
[13:41.440 --> 13:44.880] So how do our users, who are our users?
[13:44.880 --> 13:47.000] What capabilities and skills do they have?
[13:47.000 --> 13:50.080] And what freedom should they have
[13:50.080 --> 13:52.640] and what abilities should they have to develop models?
[13:52.640 --> 13:55.440] In our case, we don't really have use cases
[13:55.440 --> 13:58.640] for stuff like Canvas because our users
[13:58.640 --> 14:02.680] are fairly mature teams that know how to do their,
[14:02.680 --> 14:04.320] on the one hand, the data science stuff, of course,
[14:04.320 --> 14:06.400] but also the engineering stuff.
[14:06.400 --> 14:08.160] So in our case, things like Canvas
[14:08.160 --> 14:10.320] do not really play so much of a role
[14:10.320 --> 14:12.960] because obviously due to the high abstraction layer
[14:12.960 --> 14:15.640] of more like graphical user interfaces,
[14:15.640 --> 14:17.360] drag and drop tooling,
[14:17.360 --> 14:20.360] you are also limited in what you can do,
[14:20.360 --> 14:22.480] or what you can do easily.
[14:22.480 --> 14:26.320] So in our case, really, it is the strength of the flexibility
[14:26.320 --> 14:28.320] that the SageMaker SDK gives you.
[14:28.320 --> 14:33.040] And in general, the SDK around most AWS services.
[14:34.080 --> 14:36.760] But also it comes with challenges, of course.
[14:37.720 --> 14:38.960] You give a lot of freedom,
[14:38.960 --> 14:43.400] but also you're creating a certain ask,
[14:43.400 --> 14:47.320] certain requirements for your model development teams,
[14:47.320 --> 14:49.600] which is also why we've also been working
[14:49.600 --> 14:52.600] about abstracting further away from the SDK.
[14:52.600 --> 14:54.600] So our objective is actually
[14:54.600 --> 14:58.760] that you should not be forced to interact with the raw SDK
[14:58.760 --> 15:00.600] when you use SageMaker anymore,
[15:00.600 --> 15:03.520] but you have a thin layer of abstraction
[15:03.520 --> 15:05.480] on top of what you are doing.
[15:05.480 --> 15:07.480] That's actually something we are moving towards
[15:07.480 --> 15:09.320] more and more as well.
[15:09.320 --> 15:11.120] Because yeah, it gives you the flexibility,
[15:11.120 --> 15:12.960] but also flexibility comes at a cost,
[15:12.960 --> 15:15.080] comes often at the cost of speed,
[15:15.080 --> 15:18.560] specifically when it comes to the 90% default stuff
[15:18.560 --> 15:20.720] that you want to do, yeah.
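The "thin layer of abstraction" idea can be sketched as a wrapper that bakes company defaults into the 90% case while leaving overrides open for teams that need the raw SDK's flexibility. All names and default values here are invented:

```python
# Company-wide defaults the platform team maintains (all invented).
COMPANY_DEFAULTS = {
    "instance_type": "ml.m5.xlarge",
    "subnet": "private-a",
    "encryption": "kms",
}

def raw_sdk_create_job(**kwargs):
    """Stand-in for a verbose SDK call; just echoes its config."""
    return dict(kwargs)

def create_training_job(name, **overrides):
    """Thin layer: one call covers the 90% default case, while
    overrides keep the raw SDK's flexibility for the rest."""
    return raw_sdk_create_job(**{**COMPANY_DEFAULTS, **overrides, "name": name})

job = create_training_job("aml-train")
print(job["instance_type"])  # → ml.m5.xlarge (default applies)

gpu_job = create_training_job("aml-train-gpu", instance_type="ml.p3.2xlarge")
print(gpu_job["instance_type"])  # → ml.p3.2xlarge (override wins)
```

The dict-merge order is the whole design: defaults first, team overrides second, so policy is centralized but never a hard ceiling.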
[15:20.720 --> 15:24.160] And one of the things that I have as a complaint
[15:24.160 --> 15:29.160] against SageMaker is that it only uses virtual machines,
[15:30.000 --> 15:35.000] and it does seem like a strange strategy in some sense.
[15:35.000 --> 15:40.000] Like for example, I guess if you're doing batch only,
[15:40.000 --> 15:42.000] it doesn't matter as much,
[15:42.000 --> 15:45.000] which I think is a good strategy actually
[15:45.000 --> 15:50.000] to get your batch based predictions very, very strong.
[15:50.000 --> 15:53.000] And in that case, maybe the virtual machines
[15:53.000 --> 15:56.000] are a little bit less of a complaint.
[15:56.000 --> 16:00.000] But in the case of the endpoints with SageMaker,
[16:00.000 --> 16:02.000] the fact that you have to spin up
[16:02.000 --> 16:04.000] these really expensive virtual machines
[16:04.000 --> 16:08.000] and let them run 24/7 to do online prediction,
[16:08.000 --> 16:11.000] is that something that your organization evaluated
[16:11.000 --> 16:13.000] and decided not to use?
[16:13.000 --> 16:15.000] Or like, what are your thoughts behind that?
[16:15.000 --> 16:19.000] Yeah, in our case, doing real time
[16:19.000 --> 16:22.000] or near real time inference is currently not really relevant
[16:22.000 --> 16:25.000] for the simple reason that when you think a bit more
[16:25.000 --> 16:28.000] about the money laundering or anti money laundering space,
[16:28.000 --> 16:31.000] typically when, right,
[16:31.000 --> 16:34.000] all every individual bank must do anti money laundering
[16:34.000 --> 16:37.000] and they have armies of people doing that.
[16:37.000 --> 16:39.000] But on the other hand,
[16:39.000 --> 16:43.000] the time it actually takes from one of their systems,
[16:43.000 --> 16:46.000] one of their AML systems actually detecting something
[16:46.000 --> 16:49.000] that's unusual that then goes into a review process
[16:49.000 --> 16:54.000] until it eventually hits the governmental institution
[16:54.000 --> 16:56.000] that then takes care of the cases that have been
[16:56.000 --> 16:58.000] at least twice validated that they indeed
[16:58.000 --> 17:01.000] look very unusual.
[17:01.000 --> 17:04.000] So this takes a while, this can take quite some time,
[17:04.000 --> 17:06.000] which is also why it doesn't really matter
[17:06.000 --> 17:09.000] whether you ship your prediction within a second
[17:09.000 --> 17:13.000] or whether it takes you a week or two weeks.
[17:13.000 --> 17:15.000] It doesn't really matter, hence for us,
[17:15.000 --> 17:19.000] that problem so far thinking about real time inference
[17:19.000 --> 17:21.000] has not been there.
[17:21.000 --> 17:25.000] But yeah, indeed, for other use cases,
[17:25.000 --> 17:27.000] for also private projects,
[17:27.000 --> 17:29.000] we've also been considering SageMaker Endpoints
[17:29.000 --> 17:31.000] for a while, but exactly what you said,
[17:31.000 --> 17:33.000] the fact that you need to have a very beefy machine
[17:33.000 --> 17:35.000] running all the time,
[17:35.000 --> 17:39.000] specifically when you have heavy GPU loads, right,
[17:39.000 --> 17:43.000] and you're actually paying for that machine running 2047,
[17:43.000 --> 17:46.000] although you do have quite fluctuating load.
[17:46.000 --> 17:49.000] Yeah, then that definitely becomes quite a consideration
[17:49.000 --> 17:51.000] of what you go for.
[17:51.000 --> 17:58.000] Yeah, and I actually have been talking to AWS about that,
[17:58.000 --> 18:02.000] because one of the issues that I have is that
[18:02.000 --> 18:07.000] the AWS platform really pushes serverless,
[18:07.000 --> 18:10.000] and then my question for AWS is,
[18:10.000 --> 18:13.000] so why aren't you using it?
[18:13.000 --> 18:16.000] I mean, if you're pushing serverless for everything,
[18:16.000 --> 18:19.000] why is SageMaker not serverless?
[18:19.000 --> 18:21.000] And so maybe they're going to do that, I don't know.
[18:21.000 --> 18:23.000] I don't have any inside information,
[18:23.000 --> 18:29.000] but it is interesting to hear you had some similar concerns.
[18:29.000 --> 18:32.000] I know that there's two questions here.
[18:32.000 --> 18:37.000] One is someone asked about what do you do for data versioning,
[18:37.000 --> 18:41.000] and a second one is how do you do event based MLOps?
[18:41.000 --> 18:43.000] So maybe kind of following up.
[18:43.000 --> 18:46.000] Yeah, what do we do for data versioning?
[18:46.000 --> 18:51.000] On the one hand, we're running a data lakehouse,
[18:51.000 --> 18:54.000] where after data we get from the financial institutions,
[18:54.000 --> 18:57.000] from the banks that runs through massive data pipeline,
[18:57.000 --> 19:01.000] also on AWS, we're using glue and step functions actually for that,
[19:01.000 --> 19:03.000] and then eventually it ends up modeled to some extent,
[19:03.000 --> 19:06.000] sanitized, quality checked in our data lakehouse,
[19:06.000 --> 19:10.000] and there we're actually using Hudi on top of S3.
[19:10.000 --> 19:13.000] And this is also what we use for versioning,
[19:13.000 --> 19:16.000] which we use for time travel and all these things.
[19:16.000 --> 19:19.000] So that is Hudi on top of S3,
[19:19.000 --> 19:21.000] when then pipelines,
[19:21.000 --> 19:24.000] so actually our model pipelines plug in there
[19:24.000 --> 19:27.000] and spit out predictions, alerts,
[19:27.000 --> 19:29.000] what we call alerts eventually.
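The time-travel capability mentioned here can be sketched as a toy versioned table in plain Python. This captures the idea only, not Hudi's actual API: each write is a commit with a full snapshot, and reads can ask for the table as of an earlier commit:

```python
class VersionedTable:
    """Toy model of commit-based versioning on a lake table."""
    def __init__(self):
        self.commits = []  # list of (commit_time, {key: row}) snapshots

    def upsert(self, commit_time, rows):
        snapshot = dict(self.commits[-1][1]) if self.commits else {}
        snapshot.update(rows)
        self.commits.append((commit_time, snapshot))

    def read(self, as_of=None):
        """Latest snapshot, or the table as of an earlier commit."""
        for commit_time, snapshot in reversed(self.commits):
            if as_of is None or commit_time <= as_of:
                return snapshot
        return {}

table = VersionedTable()
table.upsert("2022-09-01", {"acct-1": {"balance": 100}})
table.upsert("2022-09-15", {"acct-1": {"balance": 250}})

print(table.read()["acct-1"]["balance"])                    # → 250
print(table.read(as_of="2022-09-01")["acct-1"]["balance"])  # → 100
```

Real table formats store deltas and column files rather than full snapshots, but the reader-facing contract is the same: a query pinned to a commit time is reproducible.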
[19:29.000 --> 19:33.000] That is something that we version based on unique IDs.
[19:33.000 --> 19:36.000] So processing IDs, we track pretty much everything,
[19:36.000 --> 19:39.000] every line of code that touched it,
[19:39.000 --> 19:43.000] is related to a specific row in our data.
[19:43.000 --> 19:46.000] So we can exactly track back for every single row
[19:46.000 --> 19:48.000] in our predictions and in our alerts,
[19:48.000 --> 19:50.000] what pipeline ran on it,
[19:50.000 --> 19:52.000] which jobs were in that pipeline,
[19:52.000 --> 19:56.000] which code exactly was running in each job,
[19:56.000 --> 19:58.000] which intermediate results were produced.
[19:58.000 --> 20:01.000] So we're basically adding lineage information
[20:01.000 --> 20:03.000] to everything we output along that line,
[20:03.000 --> 20:05.000] so we can track everything back
[20:05.000 --> 20:09.000] using a few tools we've built.
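A minimal sketch of that row-level lineage scheme: every output row carries a processing ID that resolves back to the pipeline run, its jobs, and the exact code versions that produced it. All field names here are invented for illustration:

```python
import uuid

RUN_REGISTRY = {}  # processing_id -> run metadata

def run_pipeline(pipeline_name, jobs, rows):
    """Stamp every output row with a processing ID that resolves to
    the pipeline, its jobs, and their exact code versions."""
    processing_id = str(uuid.uuid4())
    RUN_REGISTRY[processing_id] = {"pipeline": pipeline_name, "jobs": jobs}
    return [{**row, "processing_id": processing_id} for row in rows]

def trace(row):
    """Given any output row, recover how it was produced."""
    return RUN_REGISTRY[row["processing_id"]]

alerts = run_pipeline(
    "aml-batch",
    [{"name": "featurize", "git_sha": "abc123"},
     {"name": "score", "git_sha": "abc123"}],
    [{"account": "A1", "score": 0.97}],
)
print(trace(alerts[0])["pipeline"])  # → aml-batch
```

In production the registry would be a durable store rather than a dict, but the principle is the same: lineage travels with the data, so any alert can be traced back without guessing.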
[20:09.000 --> 20:12.000] So the tool you mentioned,
[20:12.000 --> 20:13.000] I'm not familiar with it.
[20:13.000 --> 20:14.000] What is it called again?
[20:14.000 --> 20:15.000] It's called hoodie?
[20:15.000 --> 20:16.000] Hoodie.
[20:16.000 --> 20:17.000] Hoodie.
[20:17.000 --> 20:18.000] Oh, what is it?
[20:18.000 --> 20:19.000] Maybe you can describe it.
[20:19.000 --> 20:22.000] Yeah, Hudi is essentially,
[20:22.000 --> 20:29.000] it's quite similar to other tools such as
[20:29.000 --> 20:31.000] Databricks, how is it called?
[20:31.000 --> 20:32.000] Databricks?
[20:32.000 --> 20:33.000] Delta Lake maybe?
[20:33.000 --> 20:34.000] Yes, exactly.
[20:34.000 --> 20:35.000] Exactly.
[20:35.000 --> 20:38.000] It's basically, it's equivalent to Delta Lake,
[20:38.000 --> 20:40.000] just back then when we looked into
[20:40.000 --> 20:42.000] what are we going to use.
[20:42.000 --> 20:44.000] Delta Lake was not open sourced yet.
[20:44.000 --> 20:46.000] Databricks open sourced it a while ago.
[20:46.000 --> 20:47.000] We went for Hudi.
[20:47.000 --> 20:50.000] It essentially, it is a layer on top of,
[20:50.000 --> 20:53.000] in our case, S3 that allows you
[20:53.000 --> 20:58.000] to more easily keep track of what you,
[20:58.000 --> 21:03.000] of the actions you are performing on your data.
[21:03.000 --> 21:08.000] So it's essentially very similar to Delta Lake,
[21:08.000 --> 21:13.000] just it was already an open-sourced solution before.
[21:13.000 --> 21:15.000] Yeah, that's, I didn't know anything about that.
[21:15.000 --> 21:16.000] So now I do.
[21:16.000 --> 21:19.000] So thanks for letting me know.
[21:19.000 --> 21:21.000] I'll have to look into that.
[21:21.000 --> 21:27.000] The other, I guess, interesting stack related question is,
[21:27.000 --> 21:29.000] what are your thoughts about,
[21:29.000 --> 21:32.000] I think there's two areas that I think
[21:32.000 --> 21:34.000] are interesting and that are emerging.
[21:34.000 --> 21:36.000] Oh, actually there's multiple.
[21:36.000 --> 21:37.000] Maybe I'll just bring them all up.
[21:37.000 --> 21:39.000] So we'll do one by one.
[21:39.000 --> 21:42.000] So these are some emerging areas that I'm, that I'm seeing.
[21:42.000 --> 21:49.000] So one is the concept of event-driven
[21:49.000 --> 21:54.000] architecture versus maybe a more static architecture.
[21:54.000 --> 21:57.000] And so I think obviously you're using step functions.
[21:57.000 --> 22:00.000] So you're a fan of event-driven architecture.
[22:00.000 --> 22:04.000] Maybe we'll start with that one:
[22:04.000 --> 22:08.000] what are your thoughts on going more event-driven in your organization?
[22:08.000 --> 22:09.000] Yeah.
[22:09.000 --> 22:13.000] In our case, essentially everything works event-driven.
[22:13.000 --> 22:14.000] Right.
[22:14.000 --> 22:19.000] So since we're on AWS, we're using EventBridge, or CloudWatch Events,
[22:19.000 --> 22:21.000] I think it's called EventBridge everywhere now.
[22:21.000 --> 22:22.000] Right.
[22:22.000 --> 22:24.000] This is how we trigger pretty much everything in our stack.
[22:24.000 --> 22:27.000] This is how we trigger our data pipelines when data comes in.
[22:27.000 --> 22:32.000] This is how we trigger the different Lambdas that parse
[22:32.000 --> 22:35.000] certain information from our logs and store it in different databases.
[22:35.000 --> 22:40.000] This is also how we, back in the past,
[22:40.000 --> 22:44.000] triggered new deployments when new models were approved in
[22:44.000 --> 22:46.000] our model registry.
[22:46.000 --> 22:50.000] So basically everything we've been doing is fully event-driven.
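The "new deployments when new models were approved" pattern mentioned here is commonly wired up with an EventBridge rule matching SageMaker model-package state changes, invoking a Lambda. The handler below is a hedged sketch of that shape, not the speaker's actual code; the event fields follow the generic EventBridge envelope, and the deploy step is a placeholder rather than a real API call:

```python
def handle_model_approval(event, context=None):
    """Sketch of a Lambda handler invoked by an EventBridge rule on
    model-registry state changes: deploy only when a model is approved.
    Field names are illustrative of the SageMaker event detail."""
    detail = event.get("detail", {})
    if detail.get("ModelApprovalStatus") != "Approved":
        # Ignore pending/rejected models; nothing to deploy.
        return {"deployed": False, "reason": "model not approved"}
    model_arn = detail.get("ModelPackageArn", "unknown")
    # A real handler would call SageMaker / a CI pipeline here to
    # roll out the approved model; omitted in this sketch.
    return {"deployed": True, "model": model_arn}

# Example invocations with hypothetical event payloads
approved = handle_model_approval(
    {"detail": {"ModelApprovalStatus": "Approved",
                "ModelPackageArn": "arn:example:model-package/1"}}
)
pending = handle_model_approval(
    {"detail": {"ModelApprovalStatus": "PendingManualApproval"}}
)
```

The glue is the EventBridge rule's event pattern, which filters on the approval status so the Lambda only fires for the transitions you care about.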
[22:50.000 --> 22:51.000] Yeah.
[22:51.000 --> 22:56.000] So I think this is a key thing you bring up here.
[22:56.000 --> 23:00.000] I've talked to many people who don't use AWS, who are, you know,
[23:00.000 --> 23:03.000] otherwise experts at technology.
[23:03.000 --> 23:06.000] And one of the things I've heard some people say is, oh,
[23:06.000 --> 23:13.000] well, AWS isn't as fast as X or Y, like Lambda isn't as fast as
[23:13.000 --> 23:17.000] Kubernetes or whatever. But the point you bring up is exactly the
[23:17.000 --> 23:24.000] way I think about AWS: the true advantage of the AWS platform is
[23:24.000 --> 23:29.000] the tight integration between the services, and you can design
[23:29.000 --> 23:31.000] event-driven workflows.
[23:31.000 --> 23:33.000] Would you say that's accurate? Absolutely.
[23:33.000 --> 23:34.000] Yeah.
[23:34.000 --> 23:35.000] Yeah.
[23:35.000 --> 23:39.000] I think designing event-driven workflows on AWS is incredibly easy to do.
[23:39.000 --> 23:40.000] Yeah.
[23:40.000 --> 23:43.000] And it also comes incredibly natural and that's extremely powerful.
[23:43.000 --> 23:44.000] Right.
[23:44.000 --> 23:49.000] And simply by having an easy way to trigger Lambdas event-driven,
[23:49.000 --> 23:52.000] you can pretty much do everything and glue
[23:52.000 --> 23:54.000] everything together that you want.
[23:54.000 --> 23:56.000] I think that gives you tremendous flexibility.
[23:56.000 --> 23:57.000] Yeah.
[23:57.000 --> 24:00.000] So I think there's two things that come to mind now.
[24:00.000 --> 24:07.000] One is that if you are developing an MLOps platform that you