question | answer
0Has anyone used Erlang or Elixir in production? How does it compare to Python or Go?{'summary_answer': "Yes, some people have used Erlang and Elixir in production, including v7labs and WhatsApp. In terms of comparison to Python or Go, it depends on the specific use case. Erlang and Elixir are well-established in telco and are good for zero-downtime upgrades and high availability. However, they may not be the best choice unless uptime and seamless upgrades are absolutely crucial to the product. It's also harder to hire experienced people and there is less general community support and fewer SDKs and libraries compared to more popular languages like Python and Go.", 'citations': [{'id': '2022-01-20 18:58:40.0412 UTC', 'message': 'U02U6J2HZ2P: Recently I\'ve watched a video on Erlang/Elixir and it mentions that its usage in edge computing is <https://youtu.be/-m31ag9z4VY?t=1236|increasing>\n\nHave you guys used any of them in production? How it compares to, say, using Python or Go?\nThanks!U015BH45ZK6: if I remember v7labs is using itU015BH45ZK6: "increasing" is a very relative term...U015BH45ZK6: WhatsApp uses erlang and their claims about scalability is crazyU015BH45ZK6: but the biggest problem using a tech like this is where you will have a steady supply of employees? People will be reluctant to join because they don\'t know how to use it and how will they use it when they leave you?U015BH45ZK6: even golang is problematicU02U6J2HZ2P: Maybe reduction of downtime is a reason important enough for critical systems, AFAIK erlang is well established in telco which are pretty critical systemsU02U6J2HZ2P: But ML has not been long enough to be critical, and commonly ends up as a nice to haveU02U0DGM4RH: I\'ve used Erlang on a couple of projects. Mostly around Linux-based embedded systems. It\'s really nice when dealing with zero-downtime upgrades. The functional nature helped avoid some nasty bugs.\n\nIt\'s not too hard to adopt functional programming concepts to other languages. You could easily write functional Python or Go.\n\nAt this point, I\'d be hard-pressed to recommend Erlang or Elixir unless you\'re in the domain where uptime, seamless upgrades and high availability are absolutely crucial to your product.\n\nIt\'s harder to develop in more niche languages. As Laszlo pointed out, it\'s harder to hire experienced people. General community support is less, as well as lack of SDKs and libraries.UP3T8K9M5: <@U01BCHHP8P9> is V7 using it?'}]}
1Is Object Oriented Programming (OOP) terrible when writing software that deals with huge data and has to be performant in complex cloud-native setups? What do people exactly mean when they refer to 'OOP' and 'OOP developers'?{'summary_answer': "According to the context, some people think that Object Oriented Programming (OOP) is terrible when writing software that deals with huge data and has to be performant in complex cloud-native setups. When people refer to 'OOP' and 'OOP developers', they are referring to developers who use Object Oriented Programming as their primary programming paradigm.", 'citations': [{'id': '2021-07-18 10:58:23.1146 UTC', 'message': 'U020FR2F9HU: DAE think Object Oriented Programming is absolutely terrible when writing software that deals with huge data (models) and has to be performant (infernence) in complex cloud-native setups (kubernetes) ? It\'s also very common to encounter OOP developers that are completely unable to understand infrastructure and performance and can\'t think beyond their OOP objects\\U015BH45ZK6: DAE? also this might be better for <#C0253TQTUQ0|production-code>U0147G90NDA: It\'s also very common to encounter people who think functional programming is the only true way of programming and the rest of developers are morons because they don\'t agreeU015BH45ZK6: Trying not to reignite multiple culture wars at the same time, but when someone refers to "OOP" and "OOP developers" what do they exactly mean?U0147G90NDA: I think it\'s great to have balance. For example, I was quite happy to see steams in Java 8. I could still do OOP and do some functional processing on top of thatU015BH45ZK6: What FP techniques do you use?U020FR2F9HU: DAE- \'Does Anyone Else\'U020FR2F9HU: ok moving to the other channel'}]}
2How do you track inferences over time for ML observability and what tools do you suggest to use for this? Do these tools support unstructured data like images, voice, and text?{'error': [{'body': 'Transformation prompt_completion with name prompt_completion failed on state {"input":{"prompt":"Context:[\\"U030F09BMUY: Can we talk ML observability?\\\\n\\\\nHow do you track your inferences over time?\\\\n\\\\nI\'m thinking we\'d want to see if our predictions or our input data have changed significantly from what they \\\\\\"usually\\\\\\" have been, and maybe get some kind of alert if so. But even _no_ alert and just a visualization would be great.\\\\n\\\\nI\'ve been looking at WhyLabs, but it seems expensive and potentially more advanced that what we need. I\'d be down to self-host something or just use a python library to log data to... somewhere so we can review it.\\\\n\\\\n[EDIT] This comment may not fairly represent the relative cost of WhyLabs. It may be well priced in comparison with other vendors. My comment about it being expensive is a reaction to the jump in price of paying for _any_ vendor relative to potentially building a simple solution ourselves.U031F7D87RA: I like the whylabs pitch. Don’t they have an open source version that’s free?U01QNSUJXTL: Hey Eric — sounds with error Error: Prompt is too long. Please reduce prompt in length.', 'step_name': 'prompt_completion', 'raw': 'Error: Prompt is too long. Please reduce prompt in length.'}]}
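The summary for this row failed to generate (prompt too long), but the underlying question is a concrete how-to: log inference statistics over time and get an alert or visualization when predictions or inputs drift. As a hedged illustration only (not something proposed in the truncated thread, which mentions WhyLabs), here is a minimal Python sketch that appends daily summary statistics of predictions to a JSONL file and flags a large shift in the mean; file paths, thresholds, and field names are assumptions.

```python
# Minimal sketch (not from the thread): log daily summary stats of predictions
# and flag days whose mean shifts far from the historical baseline.
import json
from datetime import date

import numpy as np
import pandas as pd


def log_daily_stats(predictions: np.ndarray, log_path: str = "inference_stats.jsonl") -> dict:
    """Append one summary record per day/batch to a JSONL file."""
    record = {
        "date": date.today().isoformat(),
        "count": int(predictions.size),
        "mean": float(np.mean(predictions)),
        "std": float(np.std(predictions)),
        "p05": float(np.quantile(predictions, 0.05)),
        "p95": float(np.quantile(predictions, 0.95)),
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


def check_drift(log_path: str = "inference_stats.jsonl", z_threshold: float = 3.0) -> bool:
    """Compare the latest day's mean against the history; True means 'alert'."""
    history = pd.read_json(log_path, lines=True)
    if len(history) < 8:
        return False  # not enough history to form a baseline yet
    baseline, latest = history.iloc[:-1], history.iloc[-1]
    z = abs(latest["mean"] - baseline["mean"].mean()) / (baseline["mean"].std() + 1e-9)
    return z > z_threshold
```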
3Is there a blog or repo with code for a simple model built and deployed in Sagemaker, Kubeflow, and MetaFlow? Any insights on DX differences? Any comparable refarchs available? Any suggestions on good use cases where these tools excel? {'summary_answer': 'The initial speaker in the conversation is looking for a blog or repo with code for a simple model built and deployed in Sagemaker, Kubeflow, and MetaFlow. Another participant suggests a tool for deploying to Sagemaker called Bentoml. There is also a helpful comparison of different authoring paradigms available on Github. However, it is noted that good refarchs around these tools are hard to come by. There is a suggestion to explore these tools further in an engineering lab project. There is no mention of insights on DX differences or suggestions on good use cases where these tools excel.', 'citations': [{'id': '2021-11-04 20:33:46.2482 UTC', 'message': "U023LG9V6R5: I wonder if someone has a blog or a repo where I can find a code for a simple model, built and deployed in: Sagemaker, Kubeflow and say:MetaFlow. Just want to compare all three from a development stand pointU02JBQ3997D: Let me know if you find/write one. I'm very interested in the DX differences you find.U02BFAGFDNV: Bentoml's tool for deploying to Sagemaker: <https://github.com/bentoml/aws-sagemaker-deploy>U01BNPM89U1: Not sagemaker or kubeflow - but a helpful comparison of how to write a very simple workflow using a few different authoring paradigms (courtesy of some very nice people at Anyscale) - <https://github.com/ray-project/ray/tree/7d1f408de966f90c95efe5b0cf488e82cd1c6e17/python/ray/workflow/examples/comparisons>U02KP9JBPUH: Sadly good refarchs around these tools are hard to come byU01A92UN1RT: Good project for an <#C0198RL5Y01|engineeringlabs> someday?U023LG9V6R5: <@U02KP9JBPUH> - yaah ! I wonder whyUP3T8K9M5: <@U0134LJPWQ3> maybe we just found the next topic?U01FVLXFDDM: BentoML caught my attention initially because it had deployments for AWS Lambda / AWS Sagemaker / Ec2U01FVLXFDDM: <@U019YUCL1EH>has been working on those tools - he's the best person to help you !U0134LJPWQ3: Great catch <@U01A92UN1RT> ! <@UP3T8K9M5> sounds like a great idea. I’ll start to think about a possible lab "}]}
4What are some best practices or open source frameworks for caching pre-computed ML predictions, especially for high cardinality use cases that require invalidation and re-triggering on a regular basis?{'summary_answer': 'Feast is an open source framework that offers a caching system to serve features quickly, and it may be a good fit for pre-computing ML predictions. However, the concept of feature in Feast seems to be limited to features that can result from simple calculations. For more complex predictions, Redis and DynamoDB are mentioned as possible options for caching pre-computed ML predictions, especially for high cardinality use cases that require invalidation and re-triggering on a regular basis. Additionally, the article from Google Cloud introduces the concept of "ML Gateway" or "Prediction Gateway" that retrieves pre-computed predictions if they exist and computes them online if they don\'t.', 'citations': [{'id': '2021-04-28 15:17:52.4943 UTC', 'message': "U01NGLDQD5M: Hi MLOps Community :wave:,\n\nMy team’s product requires serving lots of ML (and rules engines) predictions. Data for most of our entities is available in advance, so it is possible to pre-compute, or cache, the predictions before serving them. We would like to do so to make our serving faster (e.g. go from 10s required to compute predictions from scratch, to 0.100s where we would just fetch pre-computed results).\n\nWe have ideas on how we could implement this from scratch using a mix of batch processes and some database or cache system, but we were wondering if any best practices already existed, maybe even open source frameworks, to solve that problem. Indeed, building this caching infrastructure would represent quite a serious investment on top of our existing and fairly simple FastAPI / Kubernetes based stateless serving.\n\nWe found this <https://cloud.google.com/architecture/minimizing-predictive-serving-latency-in-machine-learning#precomputing_and_caching_predictions|interesting article from Google Cloud> that introduces the concept of *“ML Gateway”,* or “*Prediction Gateway”*. As described in the article, one of the functionality of this component is to retrieve pre-computed predictions if they exist; if they don’t, compute them online. Curious if the community is familiar with those concepts?\n\nNote: our use case is high cardinality (lots of entities) and would involve invalidating / re-trigger the pre-computation on a regular basis (multiple times a day)\n\nThanks!U01G8G02WBG: A few years back I used to pre-compute recommendations and persist them to Elasticsearch, which is *very* fast at serving queries.\n\nThe cardinality was ~10k and this was re-computed once every hour (via Spark jobs).U01A6GJ20NT: <@U01NGLDQD5M> can you describe the kinds of lookups you would like to do?U01A6GJ20NT: If you are looking for point lookups (look up a list of features/columns for a list of entities) then this sounds like <https://feast.dev/> may be a good fitU01A6GJ20NT: Most teams use BigQuery + Firestore and can easily hit your latency targetsU01DHAF466L: <@U01994M42BF>, this sounds like what you've been working onU01UTPFBVKM: What kind of models? Can you translate them into a DB query? I'm working on something in this space, but for now it covers easier models (regressions, trees, random forests..no NNs).U01NGLDQD5M: Alex, thanks for sharing, good to know Elasticsearch can be leveraged for that use case!U01NGLDQD5M: <@U01A6GJ20NT> thanks for the pointer! 
We’ve been looking at Feast with a lot of interest and are considering evaluating it. It indeed seems to offer a caching system to serve features quickly. Our only reservation at this stage is that the concept of feature in Feast seems to be limited to features that can result of simple calculations (e.g. aggregations that could be done in SQL or Spark). What we are trying to cache and lookup are the predictions of a CPU intensive rules engine (that cannot be expressed in simple SQL or Spark) or of an ML model. Those predictions could actually be seen as “features” too, since they potentially can be used downstream as inputs to feed other models. Hope it answers your question about the lookup typeU01NGLDQD5M: <@U01UTPFBVKM> thanks for your answer! Curious to learn more about what you are doing. We are working with a lot of text. For the time being our models are mostly rules, but we have plans to move soon to more advanced NLP approaches (most likely deep learning based - LSTMs or Transformers)U01A6GJ20NT: &gt; Those predictions could actually be seen as “features” too, since they potentially can be used downstream as inputs to feed other models. Hope it answers your question about the lookup type\nIs there a functional difference between your predictions and a feature? We consider models to be transformations on data, so they can be seen as features. I'd love to find out if there is a use case here that we don't support. At Gojek we used to serve many of our model outputs as features using Feast.U01NGLDQD5M: <@U01A6GJ20NT> I don’t think there is a functional difference between our predictions and features. This is very interesting and exciting to know that Feast can be used to serve model outputs as features! Apologies for suggesting above that this use case was not supported, my understanding of Feast is still very limited. I’m curious, how were you pre-computing those model output features in the Feast framework? Would you deploy Python deps, model artifacts in the Spark cluster and run a PySpark workload for ex?U01994M42BF: We're open-sourcing an embedding store soon that allows you to define feature generation logic (that can be a model) that outputs embeddings and run ANN, and other operations on them, is that sort of what you're thinking about? If so happy to chat. This is part of our larger feature store offering\n\n(thanks for the mention <@U01DHAF466L>)U01A6GJ20NT: &gt; I’m curious, how were you pre-computing those model output features in the\xa0Feast\xa0framework? Would you deploy Python deps, model artifacts in the Spark cluster and run a PySpark workload for ex?\nIt was a combination of\n• Stream processors. Basically it was Flink that would trigger based on events on Kafka. It would hit a model endpoint and output results to Kafka\n• Batch processors. BQ SQL transformations or models for batch predictions. Output to BQ.\nThen the output of both processes above would be sunk into the feature store. From there the values could then be used for training/prediction again, and the loop continues.U017CCSNYDV: <@U01NGLDQD5M>\n&gt; our use case is high cardinality (lots of entities) and would involve invalidating / re-trigger the pre-computation on a regular basis (multiple times a day)\nThis sounds like a Redis shaped problem.\n<https://redis.io/commands/expire>\n<https://redis.io/topics/notifications>\nI have used it for running incremental predictions on live sporting events. 
Yes, 2 orders of magnitude speed up is possible.U01N8ERHH9T: Like above, we precompute recommendations in spark and then serve from lookup via dynamodb, which is very low latency.U01N8ERHH9T: Similar to the elasticsearch option but maybe even cheaper/easier if you're in AWSU01NGLDQD5M: <@U01994M42BF> we are not using embeddings as of today but it’s quite possible it may become part of our architecture in the future. Good to know you are building open source tooling around this, we will definitely have a look!U01NGLDQD5M: <@U01A6GJ20NT> thanks a lot for the details. We’ll dig more into Feast capabilities!U01NGLDQD5M: <@U017CCSNYDV> <@U01N8ERHH9T> thank you for the pointers! We are using AWS. We will take a closer look to Redis (ElastiCache managed offering?) and DynamoDB."}]}
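The thread above points to Redis EXPIRE/notifications and DynamoDB for exactly this invalidate-and-recompute pattern. Below is a minimal prediction-gateway style sketch with redis-py; the `prediction:{entity_id}` key scheme, the 6-hour TTL, and the caller-supplied `predict_fn` are illustrative assumptions, not the thread's exact setup.

```python
# Minimal "prediction gateway": serve from cache if present, compute online on
# a miss, and let a batch job pre-warm the cache on a schedule.
import json

import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
TTL_SECONDS = 6 * 60 * 60  # re-computation happens several times a day


def get_prediction(entity_id: str, features: dict, predict_fn) -> dict:
    key = f"prediction:{entity_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)              # fast path: pre-computed result
    result = predict_fn(features)              # slow path: compute online
    cache.setex(key, TTL_SECONDS, json.dumps(result))
    return result


def warm_cache(batch, predict_fn) -> None:
    """Batch job: pre-compute predictions so online requests hit the cache."""
    pipe = cache.pipeline()
    for entity_id, features in batch:
        pipe.setex(f"prediction:{entity_id}", TTL_SECONDS, json.dumps(predict_fn(features)))
    pipe.execute()
```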
5What are some examples of ML tooling companies with good documentation?{'summary_answer': 'Some examples of ML tooling companies with good documentation are mlflow, metaflow, fastapi, django, and pytorch.', 'citations': [{'id': '2022-08-16 07:56:36.028169 UTC', 'message': "U033T45JYAK: Hey everyone, we’re re-doing the docs of our open-source ML and are looking for examples of the docs that you love. Can you add the projects that have the perfect docs for you?U02NR6C4M29: <https://www.mlflow.org/docs/latest/index.html>U02NR6C4M29: <https://docs.metaflow.org/>U015BH45ZK6: <https://fastapi.tiangolo.com/|https://fastapi.tiangolo.com/>U033QREB56V: <https://docs.djangoproject.com/en/4.1/>\n<https://pytorch.org/docs/stable/index.html>U033T45JYAK: Thanks everyone!U02A3UU4G1H: Not an example of docs, but I've found the diataxis framework to be really helpful. It divides documentation into four main categories (tutorials, how-to guides, explanation, reference) with guidance on how to approach each. <https://diataxis.fr>"}]}
6What are the opinions of MLOps.community Slack channel members on TensorFlow Extended (TFX) for model development and productionization and how does it compare to other options?{'summary_answer': 'The opinions of MLOps.community Slack channel members on TensorFlow Extended (TFX) for model development and productionization are mixed. Some members find it complex, inflexible, and unproductive, while others see its benefits for pushing models to AirFlow and KubeFlow. Some members have found success with cobbling together a solution using Kubeflow Pipelines, while others believe that TFX is not really usable for most due to heavy Beam and Tensorflow dependency. One member mentioned a nascent project called TorchX as a potential alternative. Overall, the consensus seems to be that there is no one-size-fits-all solution and that the best approach may be to compose your own solutions for the best in class.', 'citations': [{'id': '2021-10-20 07:19:20.1619 UTC', 'message': "U01J0NVNE1G: Hello. What do you think about *TensorFlow Extended (TFX)* for developing and productionizing AI/ML models? This framework seems awfully complex, unflexible, and unproductive, but I see its benefit for pushing models to AirFlow and KubeFlow. I also don't see how to integrate TFX during development, as the solution seems too verbose and opiniated.U0201H6SRS8: I’m not really interested in TFX - my answer would be different if my ML Engineers actually used Tensorflow, but I’ve found that cobbling something together with Kubeflow Pipelines has been sufficientU029CU4G518: I'll phrase it as... I'm sure it works well for GoogleU01HTTTL2JC: All the right ideas, but not really usable due for most due to heavy Beam and Tensorflow dependency.U01HTTTL2JC: <https://pytorch.org/torchx/latest/> is a very very nascent project. Will be interesting to see if it ends up like the PyTorch vs. Tensorflow battle.U01J0NVNE1G: <@U029CU4G518> <@U01HTTTL2JC> agreed, I tried to use it on a Kaggle project, and it's just intractable. Still, I would love to find a similar but simpler project.U01J0NVNE1G: More and more, I think the only real option is to compose your own solutions for the best in class."}]}
7What are the best practices for batch model & online prediction architecture, and what approach should be used for scalable API?{'summary_answer': 'The best practices for batch model and online prediction architecture include using request batching and having a dedicated endpoint for batch jobs. For a scalable API, the approach of creating a scalable endpoint that can auto scale the load is recommended, along with using request batching. Another approach is to wrap the model as a container and run it in Kubernetes, either as a Kubernetes job for batch prediction or as a Kubernetes deployment and service for online prediction. The live prediction server can be created using a service, deployment, and HPA for auto-scaling, or KEDA for auto-scaling on events.', 'citations': [{'id': '2022-09-02 02:51:40.296679 UTC', 'message': "U0415FDBW65: Hi Team, Do we have any best practices on batch model &amp; online prediction architecture?\n*Context:* I have a model which is used for batch prediction of ~100 Million rows each month and I also want to use the model as scalable API\nThe am thinking of having approaches are between\n1. loading pickle file in the code to run (Faster running + low network overhead)\n2. creating a scalable endpoint that can auto scale the load. ( Better decoupling + more parallelization)\n\nAny ideas or challenges to lookout.U03EED5HZJ8: What type of model is this, what is the latency? This averages out to around 40 QPS on average (at peak I assume it can be higher). Are the inputs to the model large since modern datacenters have pretty fast networks (100s of MB/s on the smallest instance types for both send and receive)?\n\nThe scalable endpoint approach is simpler to deploy and manage in the long term since it means when 100 million becomes 200 million, you won't need to worry about keeping up.U03V24MBARW: I'll add my 2 cents here. If you decide to go with the scalable endpoint approach I would recommend a few things.\n\nFirst, use request batching (see here for example <https://docs.bentoml.org/en/latest/guides/batching.html|https://docs.bentoml.org/en/latest/guides/batching.html>, but note that a few more tools support it). Most ML/DL models run faster when running on a batch of inputs rather than one by one.\n\nSecond, use a dedicated endpoint for your batch job, don't reuse the one for the API, unless you are either fine throwing out the window any possible SLO or your load is evenly distributed throughout the month. U01JFBUKMCG: So if you wrap the model as a container, you can run it in Kubernetes, as either a Kubernetes job (batch prediction) or as a Kubernetes deployment + service (online).\n\nAs a Kubernetes job, you would need another container in the pod, that would load the data, call the model container and place the resulting prediction back in the data lake (e.g. S3 bucket).\n\nFor the live prediction server, you can create a service + deployment + hpa for auto-scaling, or you can keda if you want to autoscale on events (e.g. number of predictions).\n\nNotes that there are other concerns that might be involved here, like data validation (for your incoming serving data) and drift detection."}]}
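To make the batch path discussed above concrete, here is a minimal chunked batch-scoring sketch so the ~100 million rows never need to fit in memory at once. It assumes a pickled scikit-learn-style model and CSV input/output; the file paths, chunk size, and feature layout are placeholders, and the online path would be a separate autoscaled endpoint with request batching as the replies describe.

```python
# Dedicated batch-scoring job: stream the input in chunks, predict per chunk,
# append results. Paths, chunk size, and model loading are illustrative.
import pickle

import pandas as pd


def run_batch_predictions(
    input_csv: str = "inputs.csv",
    output_csv: str = "predictions.csv",
    model_path: str = "model.pkl",
    chunksize: int = 100_000,
) -> None:
    with open(model_path, "rb") as f:
        model = pickle.load(f)

    first = True
    for chunk in pd.read_csv(input_csv, chunksize=chunksize):
        # A batched predict call is far faster than scoring row by row.
        chunk["prediction"] = model.predict(chunk)
        chunk.to_csv(output_csv, mode="w" if first else "a", header=first, index=False)
        first = False
```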
8How do people manage different versioning schemes on their inference services?{'error': [{'body': 'Transformation prompt_completion with name prompt_completion failed on state {"input":{"prompt":"Context:[\\"U01MGAZNKSP: :wave: Hi all\\\\n\\\\nwe’ve gotten to the stage of deploying and running ML models where the models are changing and no longer backwards compatible, and we are realising that that causes a bunch of problems that we hadn’t thought through!\\\\nLike… do you keep around the code that can infer from the old models? And will the models themselves need to know which version of the code they were created with so that same version can be used to infer from them? And how do you keep that code in parallel with the new code? So we’ll probably have to start versioning things more thoroughly\\\\n\\\\nBut it seems there are a few different ways of doing this and I’m not sure what would be the best approach!! Does anyone have experience with this, or recommendations of patterns for dealing with this?U015BH45ZK6: Pretty much _yes_ to all of your questions :wink:U015BH45ZK6: Version control everything. We keep new models in branches and merge the final superior one in with error Error: Prompt is too long. Please reduce prompt in length.', 'step_name': 'prompt_completion', 'raw': 'Error: Prompt is too long. Please reduce prompt in length.'}]}
9How do you determine whether or not you are able to reproduce models?{'summary_answer': 'The given context does not provide any information to answer this question.', 'citations': [{'id': '2023-02-15 01:48:03.109299 UTC', 'message': 'U01N8ERHH9T: Which is the right metric to use for determining whether a model will be effective?'}]}
10What were the best MLOps conferences/events attended this year and why?{'summary_answer': 'The best MLOps conferences/events attended this year were the MLOps Salon events by Verta, apply(), and the Dask Distributed Summit. The MLOps Salon events were praised for getting external speakers to talk about real-world use cases without promoting their own products too much. Apply() had high-quality speakers talking about practical examples, and many talks were recorded for later viewing. The Dask Distributed Summit had talks intended for intermediate and advanced Dask audiences, which were very informative. However, the ODSC conference seemed to relax their policy against corporate speech substantially, resulting in many shallow sales demos or meaningless marketing slides.', 'citations': [{'id': '2021-12-15 17:03:28.4682 UTC', 'message': "U026QR1HB61: Given we're nearly at the end of 2021 now, I'm interested to know what were the best MLOps conferences/events people attended this year and why?U026QR1HB61: For me, I loved the MLOps Salon events that Verta put on. They did a great job of getting external speakers to talk about real-world use cases without promoting their own products too much :raised_hands:U01DHAF466L: apply() was the one to beat for meU01VCA57PD0: Really enjoyed *apply()* (<https://www.applyconf.com/|link>). High-quality speakers talking about practical examples, and I loved that so many talks were recorded so I could go back to them later.\n\nThe *Dask Distributed Summit* (<https://summit.dask.org/|link>) was very good. There were some talks there intended for intermediate and advanced Dask audiences, which I really learned a lot from.\n\nI have loved *ODSC* (<https://odsc.com/boston/|link>) since the beginning and probably will not stop going to it, but I found that this year they seemed to relax their policy against corporate speech substantially. A lot of talks were shallow sales demos or meaningless marketing slides :disappointed: . Still some very good ones there though!"}]}
11What are the challenges faced and recommendations for alternatives for building an end-to-end MLOps engine with Kubeflow and AWS?{'summary_answer': 'According to the conversation, building an end-to-end MLOps engine with Kubeflow and AWS is a challenging undertaking. The infrastructure underlying it is considered messy, and many teams struggle with it. The challenges include the varying maturity and activity of development of multiple interconnected pieces of Kubeflow, internal Google politics, and the lack of strong tools for data, model, and metadata versioning and lineage tracking. Recommendations for alternatives include checking out the AI Infra Alliance, which has resources about different vendors and tools to accomplish various MLOps tasks. Other options in the space include Seldon for model serving, Determined and ClearML for distributed training, Airflow and Pachyderm for Kubeflow pipelines, and Pachyderm, DVC, and Arrikto for data, model, and metadata versioning and lineage tracking.', 'citations': [{'id': '2021-08-22 08:04:29.4143 UTC', 'message': 'U02C4L9AT28: Hi did anyone build an end to end MLOPS engine on KubeFlow and AWS. If yes can someone shed light on the challenges faced and recommendations for alternativesU015RQSSY1M: Hi Hari, I’m the founder of a vendor in the space and while I haven’t done kubeflow + AWS extensively, In my experience seeing our users trying to tackle Kubeflow, building anything end-to-end is a ridiculously challenging undertaking. The infra underlying it is, IMO, just a mess and we see so many teams struggle and look for alternatives.\n\nWhich pieces of Kubeflow are you most interesting in and I can suggest some other options in the space. There’s pipelines, data versioning, distributed training, model serving, etc. One good option is to check out the AI Infra Alliance — It’s essentially CNCF but for MLOps and they have a bunch of resources about different vendors and tools to accomplish various MLOps tasks.\n\n<@U016L23709X> runs AIIA and can share some resources with you, but here’s a good getting started point <https://hackernoon.com/the-ai-infrastructure-alliance-and-the-evolution-of-the-canonical-stack-for-machine-learning-v6n372o>U02C4L9AT28: Hi Joey thanks let me check , basically aim is to build a Microservices driven architecture for end to end automation of enterprise MLOps + ML , basically like a off the shelf product for accompanying any use case within the Enterprise Ecosystem U015RQSSY1M: That makes sense <@U02C4L9AT28> but still pretty broad. There are many different ways to build end-to-end MLOps automation — some that include kubeflow and many that dont. Are there any more specific requirements your teams are looking for?U02C4L9AT28: Any specific reason why Kubeflow isn’t a better option than others out there !!U015RQSSY1M: In my experience, there are a lot of reasons. Here’s an entire thread about it: <https://medium.com/mlops-community/is-kubeflow-dead-d82aadba14c0>\n\nTLDR:\n1. Kubeflow is not one tool, but multiple interconnected pieces that have widely varying maturity and activity of development. This makes it VERY hard to get started with and Use long term effectively unless you have an incredibly capable k8s and MLOps infra team.\n2. Kubeflow is dealing with a lot of internal Google politics and so there is a lot fo uncertainty about various pieces of the system. E.g. MLDL was a core piece for a while and now is getting EOLed and moved into KFP. \n3. Other tools in the space are simply better at what they do. 
Which is why I asked which parts of kubeflow you are most interested in. If it’s model serving, Seldon is great option. If it distributed training (which IMO and experience is Kubeflows strength) there is also Determined, ClearML, and many other powerful options. If it’s Kubeflow pipelines, that can work well with many other systems but there’s also Airflow, Pachyderm (my company), and at least a dozen other options. Finally, Kubeflow doesn’t have anything strong for data, model, and metadata versioning and lineage tracking and so other tools like Pachyderm, DVC, and Arrikto are viable here. U02C4L9AT28: This is great stuff let me digest it across, I have had similar feeling with Kubeflow and have known of teams who kind of had tough time implementing and managing enterprise solutions on Kubeflow '}]}
12Are there any published benchmarks for the compute (cpu/gpu) needed by trained models in production such as spacy, mt5, and bert?{'summary_answer': "The person asking the question couldn't find any published benchmarks for the compute needed by trained models in production, but another person suggested checking out <https://mlcommons.org/en/> which has training and inferencing benchmarks for various models including spacy, mt5, bert, recommendation engines, object classification, and more.", 'citations': [{'id': '2021-06-22 14:25:27.0746 UTC', 'message': "U020FR2F9HU: curious - are there any published 'benchmarks' about how much compute (cpu/gpu) do trained models in production consume? Cannot find anything on the net!\nspacy, mt5, bert and similarU022KRHMWEB: Try checking out <https://mlcommons.org/en/> they have training and inferencing benchmarks they run and I’ve been analyzing them for my monthly newsletter the last couple of months. Graphic from last montths analysis in my newsletter and on my website <http://silvertonconsulting.com|silvertonconsulting.com>U025K8HEZV3: :point_up_2:MLPerf results in MLCommon's website :point_up_2:U020FR2F9HU: thank you!U020FR2F9HU: only BERT though!U022KRHMWEB: More than BERT in their benchmarks. including recommendation engines, object classification and more. Check it out."}]}
13When deploying Tensorflow models, is there a reason not to convert to TFLite and use their runtime? Is it faster even if not using edge devices?{'summary_answer': 'There is no reason not to convert to TFLite and use their runtime when deploying Tensorflow models. It can make the image lighter and run things faster even if not using edge devices. The chance of performance improvement is also high when using ONNX conversion. However, TFLite has less features compared to the saved model and the format might be different, which can affect accuracy. The use of TFLite can be beneficial for faster loading, especially in a use-case with many models where some are used frequently and many are used less frequently. The models load about 6-8 times faster and inference is about 1½-2 as fast on standard x86 CPU without any optimizations, just going from TF/Keras to TFLite.', 'citations': [{'id': '2022-09-22 08:44:19.504809 UTC', 'message': "U02U1NJL534: Hi folks. When deploying Tensorflow models is there any reason not to convert to tflite and use their runtime? It seems like I can make a lighter image and run things faster even if I'm not using an edge devices? Am I wrong or is there just something I'm not getting?U03LPQP9LSJ: I assume the TF lite comes with a subset of operators compared with the full version. You can visualize a model as a graph with different operators like <https://www.tensorflow.org/lite/guide/op_select_allowlist>\n\nThe chance of performance improvement is also high when using ONNX conversion.U01J0NVNE1G: I haven't use tflite. From what I read, you have less features compared to saved model: <https://www.tensorflow.org/lite/guide/ops_compatibility>U01J0NVNE1G: Disk space is not a big constraint. Moreover, this format might be different (i.e., not ISO) with the initial model if it optimizes the model. You will not know if a lack of accuracy comes from the format or the model itselfU02U1NJL534: I also saw someone mentioning that tflite was fast to load. I was thinking that might make sense if you were scaling to zero (i.e. spinning up instances or loading models as required rather than having them actively loaded 24/7)?U01J0NVNE1G: Indeed, you have a good point. I'll be curious to have your feedback on your use case, it can be worth trying.U03GA57GA13: Morten, what's your goal? I'm getting (1) lighter image (reduced autoscaling cold start issues?) and (2) faster inference.\n\nPer Adrian, I'd normally go for ONNX conversion if I'm looking for acceleration, but I'm curious about your use case.U02U1NJL534: <@U03GA57GA13> Faster loading especially. I have a use-case with many models where some are used frequently and many are used less frequently. And of course the largest models are used the most frequently. There's room for latency in one application but now we want to look at another application where there's much less flexibility. But I was looking at these optimized models and wondering why I'm not just doing this with all my products. I've only just heard of ONNX this week and haven't had time to really look at it.U03GA57GA13: Hmm, I wonder how much of a cost penalty you'd pay to load them all into a single Triton container. 
Dunno if it'd work from a memory management point of view but that would allow you to share GPU resources across all the models, and get you acceleration as wellU022KRHMWEB: I’d use Tflite if you were deploying to edge devices, ARM/RISC V CPUs… memory and compute footprint should be smaller for the same precision modelU02U1NJL534: So, I finally got around to messing with TFLite. On standard x86 CPU the models load about 6-8 times faster and inference is about 1½-2 as fast. Without any optimizations, just going from TF/Keras to TFLite. I used an activation functions from TF Addons so it wasn't totally vanilla but the conversion still worked fine. I guess running a TF/Keras model with jit would be faster but again - I want to be able to swap models quickly. And it doesn't seem to work with convolutions currently. I am also thinking about setting it up so I keep maybe 5 to 10 models loaded and ditch the oldest. Use a dummy request to load the model when a user is likely to load one. But I also feel like that would then require more babysitting since it would catastrophically fail if the system wanted to use more models than the cache had room for.\nI need to also try it out with ONNX and see what the numbers are for that. Model size itself is definitely not the main determining factor in how quickly a model is loaded."}]}
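For reference, the Keras-to-TFLite path being discussed above is short; a minimal sketch with the standard TensorFlow APIs follows. The tiny Keras model and input shape are placeholders.

```python
# Convert a Keras model to TFLite, then load and run it with the interpreter,
# which is the faster-loading path reported in the thread.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation="softmax", input_shape=(32,))])

# Convert and save the flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_bytes = converter.convert()
with open("model.tflite", "wb") as f:
    f.write(tflite_bytes)

# Load and run with the TFLite interpreter.
interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.random.rand(1, 32).astype(np.float32)
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()
prediction = interpreter.get_tensor(out["index"])
```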
14What are the best practices for version controlling Jupyter Notebooks?{'error': [{'body': 'Transformation prompt_completion with name prompt_completion failed on state {"input":{"prompt":"Context:[\\"U02N5E8D68L: Hey everyone ! I have a question related to Jupyter Notebooks ! what are the best practices to track and version codes done by data scientists ? i think notebooks are pretty much impossible to version with git , how can we keep track of changes, should we create scripts based on the notebook then push it to git or there is some better way to do this. (if we take on consideration that we can’t just abandon notebooks )U027REJ3275: check out nbdime for version control <https://nbdime.readthedocs.io/en/latest/>U027REJ3275: papermill for parametrising and running notebooks like a script <https://papermill.readthedocs.io/en/latest/>U01ACU6EGMV: This question gets asked around here quite often so it might be worth looking around to other threads as well, but the gist of it generally is: “Don’t have important code in notebooks, so don’t version control them”. Instead, write normal packages that can be version controlled, and import the functions from with error Error: Prompt is too long. Please reduce prompt in length.', 'step_name': 'prompt_completion', 'raw': 'Error: Prompt is too long. Please reduce prompt in length.'}]}
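The summary generation for this row failed, but the truncated context already names the usual tools: nbdime for diffing notebooks and papermill for running them like parametrized scripts. A minimal papermill sketch follows; the notebook names and parameters are placeholders.

```python
# Run a notebook as a parametrized script, per the papermill suggestion in the
# thread. The input notebook is expected to have a cell tagged "parameters".
import papermill as pm

pm.execute_notebook(
    "train_model.ipynb",                  # input notebook (placeholder name)
    "runs/train_model_lr0p001.ipynb",     # executed copy to archive or review
    parameters={"learning_rate": 0.001, "epochs": 10},
)
# For reviewing notebook changes in git, nbdime's `nbdiff a.ipynb b.ipynb`
# gives a content-aware diff instead of raw JSON noise.
```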
15Are there any ML/MLOps/Engineering summer schools that are engineer focused for people that are not in academia?{'summary_answer': 'The person in the context is wondering if there are any ML/MLOps/Engineering summer schools that are engineer focused for people that are not in academia. Another person suggests checking out Full-stack Deep Learning, which is a hands-on, less-theoretical course on deep learning, including the ops stuff around model building and deployment, and is focused on building ML-powered products. However, it is not clear if this course is specifically for people who are not in academia.', 'citations': [{'id': '2023-01-17 13:06:31.283309 UTC', 'message': 'U01HBEHNX16: Not sure if this is the right channel. I was wondering if anyone knows of any ML/MLOps/Engineering summer schools, primarily focused on machine learning in the real world? I have seen some that are academically focused, but wondering if there are any that are engineer focused for people that are not in academia, or at least that are accepting of people not in academia.U04J20537C3: Hi there! You may want to check out Full-stack Deep Learning: <https://fullstackdeeplearning.com/>\n\nIt’s a hands-on, less-theoretical course on deep learning, including the ops stuff _around_ model building and deployment. An illustration on their homepage (please see attached screenshot) pretty well sums up the course contents: “You’ll learn how to build ML-powered products.”\n\nI’m not sure how often they offer the course - it was last held in autumn 2022 - and was $400, though they offered some scholarships if that price point was out of one’s budget.'}]}
16What are the best practices for logging, monitoring, and alerting with microservices and what are the critical things to log, monitor, and alert for initial deployment of microservices?{'error': [{'body': 'Transformation prompt_completion with name prompt_completion failed on state {"input":{"prompt":"Context:[\\"U01BZJQRN2D: Any recommendations on best practices for logging, monitoring and alerting with microservices?\\\\n\\\\nMy team owns ML microservices that we serve as internal facing APIs. The microservices are serverless functions hosted on k8s. Company mainly uses DataDog for monitoring and alerting and we have it integrated with our k8s cluster. Looking for best practices/critical things to log/moniter/alert for our initial deployment of these microservices.U018R57P83F: very interested to know about this too, I\'m currently facing this issue and need to make a decision on which tool to implement!U019G2NE5H6: So I did this work at Slack, although the services in question weren\'t serverless (just stateless). My two cents is that *standardization* of metrics and log formats is the most important thing by far, _especially_ if you can do it in a language-agnostic way. At Slack, we had Python/Java/Go-based services that all used an identical logging format and had a with error Error: Prompt is too long. Please reduce prompt in length.', 'step_name': 'prompt_completion', 'raw': 'Error: Prompt is too long. Please reduce prompt in length.'}]}
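This row's summary also failed to generate, but the surviving context makes the key point: standardize metrics and log formats across services, ideally in a language-agnostic way. As a hedged illustration using only the standard library (the service name and field names are assumptions), a minimal JSON log formatter could look like this:

```python
# Structured JSON logging so every microservice emits the same log shape,
# which monitoring tools like DataDog can then parse consistently.
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.time(),
            "level": record.levelname,
            "service": "ml-inference-api",   # assumed service name
            "message": record.getMessage(),
        }
        # Callers can attach extra structured fields via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("ml-service")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("prediction served", extra={"fields": {"model_version": "v3", "latency_ms": 42}})
```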
17Has anyone written or come across detailed lists of what MLOps engineer would require? Model deployment is a broad category, and we’re looking for something more targeted, perhaps inference resource management.{'summary_answer': 'Yes, the person with the username U017CCHCW8K has compiled a MLE curriculum that may be helpful. Additionally, the website ml-ops.org has been used as a foundational resource.', 'citations': [{'id': '2022-02-21 16:41:21.763919 UTC', 'message': "U024WRAA0D9: Hi all, I’m putting together a list of skills required for MLOps engineers (gonna be a very long list). Does anyone have any resources on that?\n\n*TLDR;*\nWe have a career skills matrix for each role at our company, and that is also mapped to the seniority. We use this to manage performance conversations and help our engineers grow mindfully - focusing on learning the right things at the right time. We have created this for our currently roles: data scientist, ML engineer, software engineer and DevOps engineer. MLOps engineers require specialised skills from MLE and DevOps, but I don’t want to simply combine the two lists.\n\nHappy to share with the community once that is done, but would be great to have some input on the bright minds here. Thanks!U024WRAA0D9: Has anyone written or come across detailed lists of what MLOps engineer would require? Model deployment is a broad category, and we’re looking for something more targeted, perhaps inference resource management.UP3T8K9M5: i feel like <@U017CCHCW8K> might have something she knows about?U01R2314L02: Sounds like a good idea. I'd love to see this when readyU017CCHCW8K: I have compiled the MLE curriculum here <https://docs.google.com/document/d/1QOPGOrtS_m2AVPH6xPWEpINb0YLV1DmyBkHrGcEM9ZQ/edit|https://docs.google.com/document/d/1QOPGOrtS_m2AVPH6xPWEpINb0YLV1DmyBkHrGcEM9ZQ/edit>\n\nI hope this helps.U024WRAA0D9: <@UP3T8K9M5>and <@U017CCHCW8K>, we are actually using <http://ml-ops.org|ml-ops.org> as a couple of our foundational categories. Glad it seems like we’re on the right track at least!U017CCHCW8K: <@U024WRAA0D9> glad to see that people are reading it :slightly_smiling_face:\nI am planning to add more chapters this year e.g about documentation in ML projects"}]}
18Is pyspark/spark still necessary for analytics on big data or are there alternative tools like Redshift, Snowflake and BigQuery? What are the downsides of other solutions compared to spark?{'error': [{'body': 'Transformation prompt_completion with name prompt_completion failed on state {"input":{"prompt":"Context:[\\"U01T78HPG3H: Is pyspark / spark still needed in the modern data world? Or are redshift, snowflake and big query filling all the requirements for analytics on big data?UP3T8K9M5: wow this is such a loaded question! I love it!U02U1H6RH53: I hear of a lot of people still using spark, I think for streams in particular spark is still idealU01FCPMHETD: I *wish* spark wasn\'t needed. But for many many use cases it\'s still the only tool out there that can do the job reliably. Every 6 months or so I go and try out the alternatives: essentially Modin/Ray and Dask. But for our use case they are not making the cut. Looking forward to trying out python on Snowpark when it becomes GA as another alternative!U02NR6C4M29: From someone with 0 experience on spark, what use case can\'t be solved with other solutions?\\\\n\\\\nStream processing?\\\\n\\\\nReally big data pipelines?U01U2BX26UW: I also am not a big fan of it because of the complications it brings (and you need to scratch with error Error: Prompt is too long. Please reduce prompt in length.', 'step_name': 'prompt_completion', 'raw': 'Error: Prompt is too long. Please reduce prompt in length.'}]}
19Is it feasible to use DVC as a data/model repo and stream models into memory for microservices in k8s? Is there a recommended way to fetch the latest committed model in PR?{'summary_answer': 'Yes, it is feasible to use DVC as a data/model repo and stream models into memory for microservices in k8s. There are multiple ways to fetch an artifact of a specific version using DVC, including using `dvc get` to download it as a file or directory, using `dvc.api` to stream it into memory, or doing `git clone` + `dvc pull mode.plk`. As for fetching the latest committed model in a PR, it is not specified in the context whether there is a recommended way to do so, but it may require modifying a config file for every change.', 'citations': [{'id': '2020-10-17 17:49:47.3747 UTC', 'message': 'U01CCDR3ZNK: Hey,\n\nI’m currently considering DVC as a data/model repo. My data scientists would be working on a Jupyter Notebook and saving their models/data experiments in DVC. I’m spinning up microservices using k8s will be loading the models on service start. Is this feasible with DVC? Should I just save the images with the models? Ideally, I’d want to be able to switch models as efficiently as possible once the DS have created SOTA models. UCVNTNL13: hi, Luigi! DVC provides a few way to get an artifact of a specific version when you need:\n\n1. `dvc get` can download it as a file (or directory). (<https://dvc.org/doc/command-reference/get#example-get-a-dvc-tracked-model>). E.g. you can run `dvc get` as part of the CI/CD that builds the image. \n2. `dvc.api` can stream it into memory if you don’t want to save it into the image - (<https://dvc.org/doc/api-reference/open#example-use-data-or-models-from-dvc-repos>).\n3. do `git clone` + `dvc pull mode.plk` - similar to 1. \nSo, it’s just a matter of how exactly do you want organizing your microservice. Saving file into the Docker image on CI/CD can save time when services starts by avoid downloading an artifact + you won’t need provisioning access to Github/Gitlab.\n\nOn the other hand fetching them into memory on start means that you make your image size smaller, no need for additional storage, etc. Propagating a new model might mean just restarting the service after you merged a new PR into master/prod.\n\nHope that answers some of your questions, please let me know if you need any additional details or examples.U01CCDR3ZNK: Hi Ivan,\n\nThis is truly great stuff thank you! Streaming into memory method is exactly what I was looking for. Do you have a recommended way to fetch the latest committed model in PR or do I need to modify a config file for every change? '}]}
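A minimal sketch of option 2 above (streaming the artifact into memory with `dvc.api` at service start) follows; the repo URL, artifact path, and revision are placeholders.

```python
# Stream a DVC-tracked model straight into memory at service start-up,
# per option 2 in the answer above.
import pickle

import dvc.api

with dvc.api.open(
    "models/model.pkl",
    repo="https://github.com/your-org/your-dvc-repo",
    rev="main",        # branch, tag, or commit holding the model you want
    mode="rb",
) as f:
    model = pickle.load(f)

# Restarting the service after a PR is merged picks up the new model revision.
```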
20What are some other signals worth watching regarding the pullback on funding for Cloud ML products and where does Pitchbook get their data from?{'summary_answer': 'The question asks for two things: other signals worth watching regarding the pullback on funding for Cloud ML products and where Pitchbook gets their data from. The context provides the following responses:\n\n- U03EHUCF53K asks for "Other signals worth watching?"\n- U01JFEFSQV7 asks "curious to know how does Pitchbook get their data from?"\n- U022KRHMWEB comments that Pitchbook is "behind email wall, yech"\n\nTherefore, the other signals worth watching are not explicitly mentioned in the context, and Pitchbook\'s data source is also not revealed due to the email wall.', 'citations': [{'id': '2022-10-07 15:13:05.129689 UTC', 'message': 'U02U1H6RH53: Cloud ML products being hit the hardest by the funding pullback :(((\n<https://pitchbook.com/news/reports/q2-2022-artificial-intelligence-machine-learning-report>UP3T8K9M5: i get the feeling you are on to something.U03EHUCF53K: Say more about this… U03EHUCF53K: Other signals worth watching?U01JFEFSQV7: curious to know how does Pitchbook get their data from?U022KRHMWEB: behind email wall, yech'}]}
21When working with large image datasets for computer vision, what are the benefits and drawbacks of running augmentations in memory vs. in a pipeline with the results stored in S3?{'summary_answer': 'The benefits of running augmentations in memory include extremely fast training and reduced computational load, while the drawbacks include the need for a lot of storage and only having a fixed number of versions per sample. On the other hand, the benefits of running augmentations in a pipeline with the results stored in S3 include a lot more potential diversity and not needing to store all the versions, while the drawbacks include significantly longer processing time and higher resource demand. The choice between the two options depends on the computation cost vs. training time and the need for transparency in reporting the applied transforms.', 'citations': [{'id': '2022-03-29 12:39:26.732199 UTC', 'message': "U01T78HPG3H: With large image datasets for use in computer vision. How often are augmentations run in memory vs. in a pipeline with the results stored in something like S3? *Offline vs online image augmentations debate.*U01T78HPG3H: And what are the relative benefits / drawbacks of each:?U037F7TBHV3: Usually, I only run fixed preprocessing beforehand. Augmentations are always computed online. But I also have really big datasets of medical images (~ 10TB). That means, storing different versions of each image is simply not feasible for me.\n\nThe major benefits for computing beforehand and storing are:\n• Extremly fast during training\n• Reduced computational load (less RAM/CPU/GPU power required)\nThe main drawbacks there are:\n• May need a lot of storage\n• Only N fixed versions per sample\nFor online computation it’s the other way around.\nPro:\n• A lot more potential diversity (especially when combining many transforms)\n• No need to store them all (when combining many transforms, you potentially have a huge N -&gt; Need a lot of storage (~Nx your dataset)\nCon:\n• Takes significantly longer (especially if your transforms are costly. E.g. medical image computing with 3D image interpolation is very costly, whereas 2D transforms can be better parallelized and if applied with sufficient processes are only a minor bottleneck if at all)\n• Higher ressource demand (except for storage ofc)U02R4JWF0DS: As <@U037F7TBHV3> highlighted Pro and Con. In my short experience I have seen both and actually it boils down to computation cost vs training time. In experiment phase, we recently started doing on the fly augmentations to analyzing impact of other model parameters. (Use case : commercial Satellite image)U01T78HPG3H: <@U02R4JWF0DS> say you're using AWS, storage in S3 very cheap, GPU very expensive. Surely financially you'd be better off storing augmented versions rather than calculating on the fly?U024J4KM2M7: If I do online augmentation (maybe with horizontal flip, zoom range and adding noise), how can I report what and how augmentations had happened? I missing transparency. I know there is randomness. May at least the amount of transformation per image...? How do you handle it?U037F7TBHV3: My augmentation pipeline uses image classes storing the representation of applied transforms within themU02R4JWF0DS: <@U01T78HPG3H> I do agree that storing is comparatively cheaper and one can only store augmented images if they have already fixed type of augmentation strategy. 
But if you are experimenting with combination of multiple random augmentation and model parameter, you have to maintain big subsets of images containing augmented data and if you decided to test a new augmentation method, you have to first create a dir with such images, make a change in pipeline (though could be done with config file). If you have a system to manage and version images to maintain traceability, I think one can use pre augmented images in pipeline. Also during training time you have to move all these images to training instance, as streaming from S3 will increase i/o time.U01T78HPG3H: <@U02R4JWF0DS> Say you have a stable augmentation pipeline, no longer regularly experimenting with that side of things. Would you be accruing GPU costs as you transferred the data from S3? Time itself wouldn't be an issue, just the potential associated costs. Dataset of a few TB.U02R4JWF0DS: <@U01T78HPG3H> no transferring won't cost.U01T78HPG3H: <@U02R4JWF0DS> but there would be costs just for keeping the GPU live as you transferred the data?U02R4JWF0DS: <@U01T78HPG3H> yes I/O have cost factor but if you are streaming data and creating batch in RAM before training on 1st batch is still happening, you can reduce that cost. Considering you have sufficiently large enough ramU02R4JWF0DS: else moving data to local memory will also reduce costU02R4JWF0DS: as IO latency will be low in this caseU01T78HPG3H: Just revisiting this briefly, does anyone know of a package that lists the time taken for different augmentations. I'd like to understand this tradeoff specifically for horizontal and vertical flips for object detection.\n\nI'd assume they're about as low as you get in terms of computational difficulty, so wondering if there's any chance it's faster to flip in memory than read from disk.U01T78HPG3H: <@U01RN0J7M8R>?U01RN0J7M8R: <@U01T78HPG3H> you can refer to this: <https://github.com/albumentations-team/albumentations#benchmarking-results>"}]}
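Since the thread ends by linking the Albumentations benchmark, here is a minimal online (in-memory) augmentation sketch with that library, including bounding boxes for the object-detection case; the image file, boxes, and labels are placeholders.

```python
# Online augmentation: transforms are applied per sample at load time, so no
# augmented copies need to be stored in S3.
import albumentations as A
import cv2

transform = A.Compose(
    [
        A.HorizontalFlip(p=0.5),
        A.VerticalFlip(p=0.5),
        A.RandomBrightnessContrast(p=0.2),
    ],
    # For object detection, bounding boxes are flipped together with the image.
    bbox_params=A.BboxParams(format="pascal_voc", label_fields=["labels"]),
)

image = cv2.imread("sample.jpg")                      # placeholder image
augmented = transform(image=image, bboxes=[(10, 20, 100, 200)], labels=[1])
aug_image, aug_boxes = augmented["image"], augmented["bboxes"]
```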
22What are some good resources for scaling inference workloads on multiple GPUs using Kubernetes clusters or other solutions?{'summary_answer': 'Some good resources for scaling inference workloads on multiple GPUs using Kubernetes clusters or other solutions include Ray, Deepspeed, Hugging Face Accelerate, Horovod, Nvidia Triton Servers, and papers on model parallelism or data parallelism. Additionally, OpenAI has deployed gigantic models at scale and may have insights into their technology and techniques. The paper "Efficiently Scaling Transformer Inference" and OpenAI\'s inference tech paper are also recommended. The kserve documentation provides information on autoscaling with GPU in an existing Kubernetes environment. Nvidia has a good document on using multi-instance GPU (MIG) on a single and latest generation like A1000. Other options may depend on the infrastructure being used.', 'citations': [{'id': '2023-02-08 14:45:08.860339 UTC', 'message': 'U04LQ2HM9DM: Hi , are there any good resources where I can get information on how to scale inference workloads on multiple GPUs on top of kubernetes clusters. Will also appreciate resources which talks about scaling inference across multiple GPUs without kubernetes as well. We want to build capabilities that can help us deploy large models (GPT , Stable Diffusion, DALL - E) on multiple GPUs with kserve(if its possible) or any other solution works as well.U02NR6C4M29: Search for ray, deepspeed, hugging face accelerateU02NR6C4M29: HorovodU02NR6C4M29: Mmm now I\'m thinking if those are mostly focus on training and not much on inference, not sureU04LQ2HM9DM: Yeah exactly there are solutions for training but very little resources on inference. But there are players like OpenAI and such deploying gigantic models at scale at catering to a very high QPS(query per second) not sure what technology and techniques they are using internally.U022KRHMWEB: I think you need to look at model parallelism or data parallelism and how they apply to inferencing. There’s quite a few papers on those that apply to training but there just might be some that apply to inferencing as well. good luck. I’ve got some of them in my “papers to read list” but it’s about 272 items long and getting longer everytime I check this slack community ;/U022KRHMWEB: But I did find this one which might help: <https://arxiv.org/pdf/1807.05358.pdf>U04LQ2HM9DM: Thanks <@U022KRHMWEB>U04LQ2HM9DM: I was just surfing the web while I found something called as Nvidia Triton Servers they claim that it can run on multiple GPUs and multiple nodes which is exactly what I was looking for but not sure how it would work like in a kubernetes cluster or do we even need a k8s cluster in the first place has been a puzzling question. Any thoughts or opinions on those?U04HKD72GH2: I just found this paper yesterday. I think it\'s what you\'re looking for. It\'s a pretty new and advanced topic so I wouldn\'t expect simple tutorials.\n\nThe paper has Jeff Dean on it so I suppose it\'s a big deal.\n\n<https://arxiv.org/abs/2211.05102|Efficiently Scaling Transformer Inference>U022KRHMWEB: ok now 273… and counting Thanks <@U04HKD72GH2>U04HKD72GH2: Mine is 147 after adding your paper!U04HKD72GH2: I went ahead and read the paper I linked. They talk about another paper in the "related work" section, which is literally the paper describing OpenAI\'s inference tech. 
(sorry for the additional paper <@U022KRHMWEB>)\n\nThey show metrics around GPT2, so if you want to deploy that, this is your best bet!\n\n<https://arxiv.org/abs/2207.00032>U038PKGAURW: From one of my colleagues: This is general topic but for now\n• In an existing k8s env with kserver, they can follow<https://kserve.github.io/website/0.8/modelserving/autoscaling/autoscaling/#autoscaling-on-gpu| autoscaling with GPU>\n• There\'s another approach that uses multi-instance GPU (MIG) on a single and latest generations like A1000. NVidia has a good doc <https://developer.nvidia.com/blog/getting-kubernetes-ready-for-the-a100-gpu-with-multi-instance-gpu/|here>\n• On non-k8s env, there\'re other options depending on the infra, if using bare-metal EC2-like instances or other services which need more info to give good recommendations\nU038PKGAURW: Another colleague of mine asked ChatGPT :slightly_smiling_face:U02NR6C4M29: Anyway the solutions mentioned at the top for training shouldn\'t work well for inference too? Maybe if requires special response times you need something specific, but the only difference between both is the back propogation, that is an extra step, the forward happens in bothU04LQ2HM9DM: THanks <@U04HKD72GH2> <@U038PKGAURW> <@U02NR6C4M29> will definitely go through all the resources you guys shared and there are quite a lot as I see :joy: .'}]}
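Since the thread lands on NVIDIA Triton Inference Server as one option for multi-GPU serving, a client-side sketch may make it more concrete. This assumes a Triton server is already running on localhost:8000 with a model named "my_model" that takes a single FP32 input; the model name and tensor names are placeholders, not anything from the thread:
```
import numpy as np
import tritonclient.http as httpclient

# Connect to a locally running Triton HTTP endpoint (placeholder address).
client = httpclient.InferenceServerClient(url="localhost:8000")

batch = np.random.rand(8, 3, 224, 224).astype(np.float32)

# Declare the input tensor; name/shape/dtype must match the model's config.pbtxt.
infer_input = httpclient.InferInput("INPUT__0", list(batch.shape), "FP32")
infer_input.set_data_from_numpy(batch)

response = client.infer(model_name="my_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT__0").shape)  # output name is also a placeholder
```
How the GPUs behind that endpoint are used (instance groups, MIG partitions, multiple nodes behind a load balancer) is configured on the server side, which is where the kserve autoscaling and MIG docs linked above come in.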
25
23What are the pros and cons of using Dask to run data pipelines for ML compared to using Spark?{'summary_answer': 'The pros of using Dask to run data pipelines for ML include its easy integration with ML pipelines, its ability to handle batch processing of large amounts of data, and its developer-friendly environment. Additionally, deploying a Dask cluster in a Kubernetes environment on AKS can help scale up processing. \n\nOn the other hand, the cons of using Spark for the same workload include its heavyweight nature and the need for another environment like Databricks. Although it is technically possible to integrate Databricks step in an Azure ML Pipeline, it can make branching, local development, and debugging more difficult.', 'citations': [{'id': '2023-03-10 02:18:18.088119 UTC', 'message': 'U03640JDK2Q: Just wondering how many are using *Dask to run Data Pipelines for ML* !?\nI need to run Batch Processing Data Pipelines on a Data Lake with 10 to 100 GBs of data (across 1000s of Avro files). I might need to run it on a weekly or daily basis.\n\nI see dask based python code integrates well with ML pipelines. I can easily fit in a Dask-based data processing stage in my Azure ML Pipelines.\nI am considering deploying a Dask Cluster in a Kubernetes environment on AKS to scale up the processing.\n\nAs an alternative, I am considering Spark for the same workload. But, Spark seems heavy weight and it needs another environment like Databricks. Although it is technically possible to integrate Databricks step in an Azure ML Pipeline, it’s not really software developer friendly. Branching, local development, debugging, etc becomes a bit more difficult.\n\nI am kind of leaning toward using Dask for my Data Pipeline. But wanted to get your views on my thought process and see if I am missing or assuming something.'}]}
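For the Dask-vs-Spark question above, a small sketch of the kind of batch pipeline being described — reading many Avro files from a data lake, doing per-record processing, then handing the result to an ML step — assuming dask, fastavro and adlfs are installed; the cluster address, paths and field names are placeholders:
```
import dask.bag as db
from dask.distributed import Client

# Connect to an existing Dask scheduler (e.g. one deployed on AKS); the address
# is a placeholder. Client() with no arguments starts a local cluster instead.
client = Client("tcp://dask-scheduler:8786")

# Lazily read thousands of Avro files from the data lake.
records = db.read_avro("abfs://datalake/events/*.avro",
                       storage_options={"account_name": "myaccount"})

# Per-record cleaning, then convert to a Dask DataFrame for feature work.
def clean(rec):
    return {"user_id": rec["user_id"], "amount": float(rec.get("amount", 0))}

df = records.map(clean).to_dataframe()
weekly = df.groupby("user_id").amount.sum().compute()
print(weekly.head())
```
The same stage can be dropped into an Azure ML pipeline step, which is the integration advantage the poster is weighing against Spark's heavier runtime requirements.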
26
24What's the general way to answer if a model can run with a particular runtime?{'summary_answer': 'The general way to answer if a model can run with a particular runtime is to use a container-based deployment, where the runtime can be bundled in the image. Additionally, one should consider the target hardware and whether the runtime is tied to a specific framework. Using a generic runtime like ONNX runtime and a container is a good option for those who want something that works without worrying about the details. Another approach is to create a test suite that measures runtime capabilities and tests for missing operators or incorrect outputs. Simple models that are nothing more than operators can be used for individual coverage and integrated testing.', 'citations': [{'id': '2022-02-23 19:06:44.744839 UTC', 'message': 'U01KZAA85C5: One thing that I’ve been thinking about is the variety of ways that you can put models into production. With the growing number of ML runtimes and platforms that they target, is there a general way to answer the question “can I run a model with a particular runtime?”U024SSR9T5Z: I feel like docker-based deployments cover most use cases, since you can bundle pretty much whatever runtime you want in the image.U01KZAA85C5: So you match the runtime to the model you want to run?U01JFBUKMCG: So this lecture might interest you:\n\n<https://www.youtube.com/watch?v=BgP6fEIadtg>U033YB2HXLN: Some runtimes target specific hardware, like TensorRT, and some are made to be generic, like ONNX runtime. So the question you should also ask is what your target is. Also, some runtimes are tied to the framework (libtorch is for PyTorch models, and Tensorflow work great on Google TPUs). If you have the time and budget to roll your own hosting solution, then you might benefit from architecture lock-in. If you just want something to work without fretting about it, then go with something like ONNX runtime and a container.U033YB2HXLN: Oh and I just saw that you are dev advocate for TVM :slightly_smiling_face:. So it seems your strategy would be generic for all hardware with a container based solution.U01KZAA85C5: <@U01JFBUKMCG> interesting that the answer seems to be “onnx runtime for everything.” I’m in pretty strong agreement that a stable and compact runtime for inference is very important.U01KZAA85C5: <@U033YB2HXLN> yeah, I’m moving on to work on some other items related to model support in general. In the past I’ve worked on conformance programs, and have been looking at the ONNX scorecard. One thing I’ve been wondering is if there would be greater community value in creating collections of tests that measure runtime capabilities. So you can answer questions directly like “does runtime x support my model y” generically.U033YB2HXLN: A test suite would be very welcome by the community! I suppose the way to go about it would be by poking the model for operators and checking for anything missing from the runtime support. In some libraries (like Hugging Face Transformers, the supported model types have already been declared <https://huggingface.co/docs/transformers/serialization?highlight=onnx#onnx>)U01KZAA85C5: One approach that could work well is to just have very simple models that are nothing more than operators, but also provide a collection of standard models. In this way you get both individual coverage and integrated testing. 
Importantly, I’d like to separate the test format from the runtime, treating it fully as API testing.U033YB2HXLN: Having basic models for integration testing would be nice, but if you really want to answer the question "does my model work on X runtime", you\'d need to dig into the modelU01KZAA85C5: I think this is kind of how ONNX backend testing works. They treat operators as simple models and hand it off to the specific backend implementation. ONNX gets you a lot of the way there, but it ties you to ONNX specifically.U01KZAA85C5: Dig into the model in what way? For example, if I have an expected set of inputs and outputs alongside a model, is that sufficient to test? Crashing, throwing an import error, or giving the wrong answer can all be classified as answering that question as “no”.U033YB2HXLN: Yes, you are right, you can just try it out :slightly_smiling_face:. Perhaps I misunderstood your original question - what would the test suite offer? Just trying to convert the model to all runtimes and seeing if it works?U01KZAA85C5: Something like that. Using Kubernetes as an example, you can take the set of API tests provided by sonobouy and give a simple yes/no answer to “does my K8s installation conform to the API specification”. Similarly, if we think of operators and models as APIs, we can treat them as specs and ask “exactly what does my runtime support”. A large body of the work is in the specification of the tests, with implementation details of the backends left to the providers (this is similar to how ONNX runtime presents their backend scorecard, just more generic)'}]}
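The "simple models that are nothing more than operators" idea above can be prototyped directly with the ONNX helpers: build a one-op model, run it on the runtime under test, and compare against a reference. A minimal sketch for a single Relu, assuming onnx and onnxruntime are installed (any other backend could be swapped in at the session step):
```
import numpy as np
import onnx
import onnxruntime as ort
from onnx import TensorProto, helper

# A "model" that is nothing more than one operator.
x = helper.make_tensor_value_info("x", TensorProto.FLOAT, [1, 4])
y = helper.make_tensor_value_info("y", TensorProto.FLOAT, [1, 4])
node = helper.make_node("Relu", inputs=["x"], outputs=["y"])
graph = helper.make_graph([node], "relu_test", [x], [y])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 13)])
onnx.checker.check_model(model)

# Run on the runtime under test and compare against a NumPy reference.
sess = ort.InferenceSession(model.SerializeToString(),
                            providers=["CPUExecutionProvider"])
data = np.array([[-1.0, 0.0, 2.5, -3.0]], dtype=np.float32)
(out,) = sess.run(None, {"x": data})
assert np.allclose(out, np.maximum(data, 0)), "Relu output mismatch"
print("Relu: supported")
```
Crashes, missing-operator errors and wrong outputs all collapse into the same "not supported" answer, which is the yes/no conformance signal discussed in the thread.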
27
25Has anyone set up distributed multi-node GPU training in Pytorch using torch elastic? Are there any resources or performance benchmark analysis available? Any insights on using Horovod?{'summary_answer': 'Yes, someone has set up distributed multi-node GPU training in Pytorch using torch elastic. They found a blog post that helped them set it up and trained using it. They also shared a docker image for testing on CIFAR dataset. They also explored using Horovod for this purpose. There are resources and performance benchmark analysis available on this front, including the Pytorch tutorial on parallelism and the Kubeflow blog post on elastic training.', 'citations': [{'id': '2022-03-17 10:43:44.094229 UTC', 'message': 'U020ZV0UHL5: Did anyone set up distributed multi-node GPU training in Pytorch using torch elastic? Any resources or performance benchmark analysis on this front?U01RTRE8199: Did you have any more insights into this? Looks like there’s abilities to do this using Horovod\n\n<https://blog.kubeflow.org/elastic%20training/operators/2021/03/15/elastic-training.html>U020ZV0UHL5: Will check this out. Exploring this <https://pytorch.org/tutorials/beginner/former_torchies/parallelism_tutorial.html>\n\nWill let you know on the latest updatesU01RTRE8199: Thanks <@U020ZV0UHL5>! I’m going to start reading this nowU01RTRE8199: Did you make any more progress on your end?U020ZV0UHL5: Yeah David. I did. Found this blog. Set it up. Trained. <https://medium.com/pytorch/a-step-by-step-guide-to-building-a-distributed-spot-based-training-platform-on-aws-using-b54acd06ecb2>U020ZV0UHL5: Let me know if you try this. Can help you out if needed on this. I\'m trying to create a helm chart to make it easier. Setting up so many YAML\'s is very difficultU01RTRE8199: This looks great <@U020ZV0UHL5>, thanks for sharing the blog post with me.\n\nGoing to start reading it this afternoon and will look into setting it up. Were you able to get the spot instances to work the way you liked within a zone of your choice?\n\nI’ll give you a shout if I run into any issuesU020ZV0UHL5: Yeah, David. Sure.U020ZV0UHL5: Some issues with yamls in the blog are, in imagenet_trainer.YAML first line is "<http://orch.org/v1alpha1|orch.org/v1alpha1>" it\'s supposed to be "apiVersion: <http://elastic.pytorch.org/v1alpha1|elastic.pytorch.org/v1alpha1>"U01RTRE8199: Any other things you noticed? I noticed an issue building the first service yesterday, but was able to fix itU020ZV0UHL5: Were you able to create a cluster? The Yaml is so messy. I think ami: auto isn\'t working for the current kubernetes version. So created cluster from console and added nodegroups from console.U020ZV0UHL5: figuring out subnets was also tricky. Had to create manually all from AWS console.U01RTRE8199: Haven’t gotten to my cluster yet - currently on testing the original docker container and it looks like that even has issues because some of the packages have been depreciated since the article was wroteU01RTRE8199: Sounds about right though - last time I set up a kubeflow cluster off of EKS I also had a fun time solving subnets and other issuesU020ZV0UHL5: Aah okay. Kubeflow is a like a problem on another level. I\'ve created a docker image from my code and just added things that are needed for torch elasticU01RTRE8199: Yeah kf on AWS is also definitely a challenge. Let me know if you run into any issues there.\n\n By chance, did you publish your docker image public? 
I’d be interested to see it to get over some of the speedbumps I’m currently facing.U020ZV0UHL5: Yeah! Trying on CIFAR dataset: Repo url: "sachinnetbook/torchelastic-test"U01RTRE8199: Found it! Thanks a bunch :smile:U01RTRE8199: <@U020ZV0UHL5>, for your EKS cluster, what version did you use? I’m going to set up as 1.19U020ZV0UHL5: I used 1.20'}]}
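For the multi-node elastic training discussion above, the model-side script is largely independent of how the cluster YAMLs are wired up: it initialises the process group from the environment that torch elastic / torchrun injects and wraps the model in DDP. A minimal sketch, assuming it is launched with something like `torchrun --nnodes=2 --nproc_per_node=4 train.py`; the model, data and loop are placeholders:
```
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun / torch elastic sets RANK, LOCAL_RANK and WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(100):  # placeholder training loop
        inputs = torch.randn(32, 128, device=f"cuda:{local_rank}")
        targets = torch.randint(0, 10, (32,), device=f"cuda:{local_rank}")
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
The Kubernetes/EKS pieces discussed in the thread (node groups, subnets, the elastic job CRD) sit around this script and mostly decide how many workers get launched and how they rendezvous.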
28
26Is there an open-source platform/tool that allows defining on-demand scoring flows?{'summary_answer': 'Yes, Seldon Core is an open-source platform/tool that allows defining on-demand scoring flows. It has the concept of inference graphs allowing you to write custom inference logic including splitting out pre-processing and post-processing, ensembling models, routing, etc. Kubeflow Pipelines are also mentioned as a good option for building relatively complex inference graphs.', 'citations': [{'id': '2020-11-17 04:36:27.3067 UTC', 'message': 'U01DJD2PZ2M: Does anyone know a product that supports defining/executing on-demand scoring flows?\n\nJust like ML training can have multiple steps, and we have different products (like Kubeflow, MLFlow) supporting training pipelines; Inferencing can also involve various stages, and that flow can also be defined declaratively.\n\nIs there any open-source platform/tool that allows defining on-demand scoring flows?U017693U7PF: Seldon Core has the concept of inference graphs allowing you to write custom inference logic including splititng out pre-processing and post-processing, ensembling models, routing etc. Disclaimer: I work at Seldon.U01DHAF466L: Super interesting -- can you share a super simple example of a real world scenario for on-demand scoring flow? Would love to learn more :slightly_smiling_face:U01DJD2PZ2M: Thanks <@U017693U7PF> I did explore Seldon before posting this question. Can you share any link about inference graphs?U017693U7PF: I think the best place is Seldon docs: <https://docs.seldon.io/projects/seldon-core/en/v1.1.0/graph/inference-graph.html>U01BWK3R9UH: I actually think Kubeflow Pipelines are also pretty great for the inference case. You can build relatively complex inference graphs, and with the right tools to connect to training artifacts and data this has been pretty straightforward for me.U01DHABU6PK: Thanks <@U01BWK3R9UH>! Very insightful!'}]}
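To make the "inference graph" idea above concrete, Seldon Core's Python wrapper lets each node of the graph be a plain class. A minimal sketch of a predict step with custom pre-processing, following the predict/transform_input convention from the Seldon Python wrapper docs (the artifact path and the scaling logic are placeholders; check the Seldon docs linked in the thread for the exact signatures of your version):
```
import joblib
import numpy as np

class ScoringStep:
    """Wrapped by the Seldon Core Python server as one node of an inference graph."""

    def __init__(self):
        # Placeholder artifact path, baked into the image or mounted at runtime.
        self.model = joblib.load("/models/model.joblib")

    def transform_input(self, X, features_names=None):
        # Custom pre-processing before prediction (placeholder: simple scaling).
        return np.asarray(X, dtype=float) / 255.0

    def predict(self, X, features_names=None):
        return self.model.predict_proba(self.transform_input(X))
```
Pre-processing, post-processing, routers and ensembles are then composed declaratively in the SeldonDeployment graph rather than in the class itself.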
29
27Are successful DS/MLEs expected to be more proficient with coding and ops than normal SWEs?{'summary_answer': 'No, the context suggests that successful DS/MLEs are not necessarily expected to be more proficient with coding and ops than normal SWEs. However, the market is unsatisfied with DS performance in coding, and there is a trend towards non-tech DS roles that focus on YAMLs and analysis. The discussion also suggests that DS (without a SWE background) can learn engineering practices and improve their coding skills. MLEs are expected to have skills in cleaning and deploying models, but they may also need to have stats training and a quantitative mindset to be effective in the DS part of their role.', 'citations': [{'id': '2022-09-22 12:09:25.984639 UTC', 'message': "U02NR6C4M29: <https://mlops-community.slack.com/archives/C0253TQTUQ0/p1663846672701119?thread_ts=1663569734.321959&amp;cid=C0253TQTUQ0>\n&gt; Quite the opposite, the market is clearly unsatisfied with DS performance (who generally bad coders) and the market direction is going to the “non-tech DS” route meaning you will be restricted to do YAMLs and analysis.\nThat’s quite an interesting thing that merits its own thread :smile:\n\nMy unpopular opinion: As SWEs can learn ML and become MLEngineers, DS (non SWE background) can learn engineer practices and get better on what they do that is… writing code.\n\nI’m starting to see many offers with DS =&gt; you create a model, MLE =&gt; you clean it and you put it into production. But I feel MLE can be as good on the DS part as most DS, and DS that do not know SWE practices, essentially moving MLE from the modeling part to make room for DS.U015BH45ZK6: &gt; I’m starting to see many offers with DS =&gt; you create a model, MLE =&gt; you clean it and you put it into production.\nThis is pretty much the premise of my talk at PyData, that this leads to frustration because it breaks all three parts of Self Determination Theory\n\n&gt; But I feel MLE can be as good on the DS part as most DS\nYou absolutely must have stats training to be a DS (yeah, caveats, YMMV, because I start another 100 response thread). Analysis is a core part of the work because there is no specs in the SWE sense so you need to statistically validate your solutions.U015BH45ZK6: (Of course a lot of CSes will have good enough math to pick up DS level work but they do need quantitative mindsetU040G4526FK: in my view (very small one, since i am devops), MLE adds the unit tests, regression tests, adds test data job, validation data job, acc job. Adds Model monitoring and logging, and retraining model and deploys to the production/inference endpoint. Then guys like me, build the Data and ML infras + CI/CD infra.U015BH45ZK6: Ok, but which projects at which phase? If you do it too early, you might work on an otherwise infeasible project and you will definitely be inside the DS team's experimentation loop. But then you somehow need to move faster than they, and they are already supposed to move fast.U040G4526FK: “It depends”U015BH45ZK6: Here, this is how we think about this, the DS experimentation loop need to be faster than all the others and they can execute faster by improving the codeloop.U03GDD9HVR8: Completely agree <@U02NR6C4M29>. My MS in in applied stats and I've been doing data in industry since 2010. A good portion of the last 7 or so years has been just trying to get better at coding, learning git, etc. 
I thoroughly enjoy it, but I also have a healthy fear of not being able to keep up code-wise and the effect that would have on my career."}]}
30
28What model serving tools would you recommend for quickly deploying multiple ML models into production with flexibility for custom preprocessing?{'summary_answer': 'The recommended model serving tools for quickly deploying multiple ML models into production with flexibility for custom preprocessing are torchserve (for PyTorch models), TFX (for TensorFlow models), and MMS (for Mxnet models). Additionally, lambda functions can be used for workflow and scaling purposes. Other options include Algorithmia and Iguazio, which offer similar capabilities, and data orchestration products like Pachyderm for pre/post processing pipelines.', 'citations': [{'id': '2021-05-20 15:31:29.1008 UTC', 'message': 'U022CK8HXFC: Hey folks, question regarding model serving tools - we have a lots of ML models and we need to have possibility to quickly deploy them into production with flexibility of writing custom preprocessing for example. What would be your recommendation for such task?U01U5GE8ZBK: First, you need to deploy the model kubernetes cluster nodes using model serving tools like the torchserve(pytorch model), TFX(tensorflow model), MMS(Mxnet model). if you are using multi models, you would be good to put the whole inference pipeline to one container.\nSecond, you can make lambda function for the workflow so that you can easily scaling the performance and speed up the processing time.U016A3RAL5N: <@U022CK8HXFC> a number of of the ML serving tools in market will let you write direct functions for pre-processing and pos processing directly in them. Wilson mentioned one pattern with lamba to a container. We offer this as well at <http://algorithmia.com|algorithmia.com> as well as orchestrating the pipelines in between them for scaling up and down. I believe iguazio has some of these capabilities as well via their functions as a service offering. Other serving products will rely more on you building the pipelines. Alternatively you can look at data orchestration products like pachyderm for the pre/post processing pipelines (specially if data lineage is important to you)'}]}
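Since torchserve is the option mentioned above for PyTorch models and the original question is about flexibility for custom preprocessing, a minimal custom handler sketch may help. It assumes TorchServe's BaseHandler interface (preprocess/inference/postprocess); the comma-separated payload format is purely illustrative:
```
import torch
from ts.torch_handler.base_handler import BaseHandler

class CustomPreprocessHandler(BaseHandler):
    """TorchServe handler with custom preprocessing; packaged via torch-model-archiver."""

    def preprocess(self, data):
        # `data` is a list of request dicts; pull the raw payload out of each.
        rows = []
        for row in data:
            payload = row.get("data") or row.get("body")
            if isinstance(payload, (bytes, bytearray)):
                payload = payload.decode("utf-8")
            # Placeholder preprocessing: parse a comma-separated feature vector.
            rows.append([float(v) for v in payload.split(",")])
        return torch.tensor(rows, dtype=torch.float32)

    def postprocess(self, inference_output):
        # TorchServe expects one JSON-serialisable entry per request in the batch.
        return inference_output.softmax(dim=1).tolist()
```
The inherited inference() runs the loaded model on the preprocessed tensor, so only the request parsing and response shaping need to be written per model.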
31
29What are the experiences and thoughts of users heavily using or evaluating AWS SageMaker, and what tools have they found useful and where did they find value? Are there any gaps or issues requiring tools outside of the AWS ecosystem?{'summary_answer': 'The users heavily using or evaluating AWS SageMaker have shared their experiences and thoughts in a thread from 6 months ago. They have found some of the tools in the SageMaker suite to be good and useful, but there were also gaps or issues that required tools outside of the AWS ecosystem. Some users have specifically mentioned using SageMaker Monitor/Clarify. The person who initiated the conversation is looking for updates on what has changed since the previous thread. Another user is curious about issues with SageMaker that make people pick/switch to other tools like Kubeflow.', 'citations': [{'id': '2021-04-08 11:23:21.4531 UTC', 'message': 'U01C7FBKF0X: There was a great thread on this (<https://mlops-community.slack.com/archives/C015J2Y9RLM/p1601034271016700|here>) 6 months ago and I’m wondering what has changed since then. AWS released a few more products into the SageMaker suite and a bunch of updates. For those heavily using or doing a detailed evaluation of AWS Sagemaker.\n\t1. Which of the many tools there have you found are good and useful?\n\t2. Where did you get value?\n\t3. What were the gaps or issues that required tools outside of the AWS ecosystem?U01C7FBKF0X: <@U0134LJPWQ3> how was your experience?U01C7FBKF0X: <@U01NM3EG77B> saw you were evaluating SageMaker Monitor / Clarify. Curious to know if you are using it.U023LG9V6R5: <@U01C7FBKF0X> Can you share you thought on what you have learnt on the topic? I am curious about issues with Sagemaker that make people pick/switch to other tools like Kubeflow etc.U01C7FBKF0X: Hey <@U023LG9V6R5>, I did collect a bunch of info from here and external sources. Once I find the document, I’ll send youU023LG9V6R5: Thanks you <@U01C7FBKF0X>!!'}]}
32
30Has anyone used serverless GPU providers like banana.dev, slai.io or any other serverless GPU provider for production use cases? Would like to connect/chat to understand more about how you are using it.{'summary_answer': 'The context does not provide an answer: the thread contains only the original question, which asks whether anyone has used serverless GPU providers such as banana.dev or slai.io in production and invites others to connect and share how they are using them.', 'citations': [{'id': '2023-02-27 08:15:52.439509 UTC', 'message': 'U04Q6807PK8: Hey Everyone,\nHas anyone used serverless GPU providers like <http://banana.dev|banana.dev>, <http://slai.io|slai.io> or any other serverless GPU provider etc for production use cases.\nWould like to connect/chat to understand more about how you are using it.'}]}
33
31What options are available for improving a machine vision application's generic pipeline consisting of image retrieval from cameras, running algorithms, storing images, sending data to a visualization platform, etc.?{'summary_answer': 'The context does not provide a specific answer to this question.', 'citations': [{'id': '2021-07-02 10:22:15.0165 UTC', 'message': 'U01C7838VJR: Hi everyone, is there such thing as a best of breed MLOps pipeline/workflow system for computer vision powered products? Any ideas or resources are much welcomed:alert:U01RN0J7M8R: Hi Remco, could you elaborate a bit on what you mean by “pipeline/workflow system”?U01RN0J7M8R: What are you trying to achieve?U01C7838VJR: Hi David, thanks for your reply. My goal is to learn more about best practices in computer vision in production. I was wondering if there is something like SoTa platform like Michelangelo for CV/NLP powered products and what best ML infra to choose. E2E platforms or specialist ML(Ops) tools. Btw i liked the example of embedding stores given :point_down::skin-tone-3: by <@U019ABXBYET> which could be part of a best practice system as well.'}]}
34
32What should be the standard practice when deploying your model weights with Docker and what are the common mistakes people make when downloading or baking their model weights?{'summary_answer': "The standard practice for deploying model weights with Docker varies among individuals, but some common approaches include baking the weights into the Docker image or storing them in S3 and fetching them during inference. Common mistakes when downloading or baking model weights include not properly versioning them and not establishing a reliable way to track changes. Tools such as MLFlow's model registry and KubeFlow can help with versioning and deployment. Other options include using a purpose-built model registry, Git LFS, or loading the model into memory while the service is launching.", 'citations': [{'id': '2022-03-31 05:12:05.835249 UTC', 'message': "U01PQLSMBRP: Hello everyone. Quick question on deploying ML models with docker.\n\nI noticed the common trend is to bake your model weight when building your docker image while developers deploying ML models on AWS store their model weights in s3 and during inference, fetch models from s3 and serve predictions.\n\nI watched <@U7DR3F82C> video on baking vs downloading model weights. You can find the video <https://www.youtube.com/watch?v=CHttwWGdWK4|here>. Although it is still not clear to me on what the standard practice should be.\n\nWhat should be the standard practice when deploying your model weights with Docker?\nWhat are the common mistakes people make when downloading or baking their model weights?U02NR6C4M29: what's better for your use case?U02GVVBQT5L: how would you reliably version weights in S3?U01V3JDJAQ6: I prefer baking them into the Docker image and/or a Python package.\n\nS3 can be okay.\n• Pros: Simplicity\n• Cons: Since you're not using a standard way of packaging software for distribution, you need to establish some other way of version tracking, so you can reliably tell what was used for a particular run.\nI find that lots of data science tools provide custom ways of doing things (including packaging of code). I think this is a case where they're prioritizing perceived ease of use over operational goodness. :shrug:U024SSR9T5Z: I also like baking in the weights, because its simpler than using S3 and also versions things nicely, as has been said.U0282TV4UEQ: I suppose having a versioned way to keep track of the models being used is quite important. I am testing out pulling the model into the docker image from a model store(<https://modelstore.readthedocs.io/en/latest/|modelstore>), further down the road I am planning on migrating over to MLflow's model registry.U01V3JDJAQ6: We're looking at using MLFlow's model registry as well. I'm thinking (hoping) if we use it, we will store some sort of MLFlow model key in the image or package, so we'll have model traceability in the image itself.\n\nI can't escape the feeling that MLFlow and similar tools want to lock you into their way of doing things, so if you ever abandoned the tool, you'd lose traceability. Clever for them, but maybe bad for users. :slightly_smiling_face:U02Q1AW1ABU: If you're using Docker and MLFlow maybe just bite the bullet and use KubeFlow? I don't view lock-in to OSS as particularly perniciousU02Q1AW1ABU: Unless the project is abandoned or becomes poorly maintained/sustainedU01V3JDJAQ6: I agree, it's not a terrible thing. I do like to keep my options open if the effort to do so is modest. 
But sometimes it makes sense to fully commit to something and reap the convenience. :raised_hands:U033QREB56V: I agree with <@U0282TV4UEQ> and <@U01V3JDJAQ6> about pulling from MLflow model registry (or similar) instead of S3 to get versioning capabilitiesU02CY3STNSH: I’ve come across vastly different solutions (incl. in this community) each with pros/cons.\n1. Expose S3 or some other object store. Maybe on top NFS\n2. DJL’s ModelZoo, which uses a non-default file format\n3. Baking in into the Docker image (as mentioned above). A substrategy is to add the model as the last layer, provided your code changes less frequently than than model. Otherwise reverse this.\n4. Using Huggingface’s private model hub\n5. Git LFS\nThe problem dimensions expand if your case involves if the model is updated more frequently than the code, or the other way around. Another thing is if you want to ensure a point of truth when doing CI testing, as well as production and other forms of runtime environments.U019ABXBYET: We don't bake the model in during build time. We load it into memory while the service is launching, and can then reload it on demand.\n\nOne angle on this is asking whether getting a model into a service should require a build: there are pros and cons of each side. We've opted to not require this, because we felt it sped us up. And we don't read directly from s3 either, we have a registry service that manages all of that for us \n\n(Disclosure: it uses <#C0227QJCDS8|oss-modelstore>!)U01PQLSMBRP: <@U01V3JDJAQ6> Thanks alot for your contribution. Apart from MLFlow, what other tool can help with thisU01PQLSMBRP: <@U02CY3STNSH> Indeed the dimensions expand if the model is updated frequently but with the right model versioning tool, one can be able to track changes.U031F7D87RA: Baking the model into the container is nice because it provides a super simple way of versioning, plus you get easy rollbacks etc... It's a fairly standard practice to deploy a new container when you have an update, so there's lots of tooling to help you.\n\nIf you weren't going to bake the model in I'd probably start by taking a look at a purpose built model registry."}]}
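A small loader sketch may help tie together the "bake it in vs. fetch it at startup" options discussed above: the service reads a single MODEL_URI setting, uses the file baked into the image when it is a local path, and pulls from S3 at startup otherwise. The environment variable name, bucket and paths are placeholders, not anything prescribed in the thread:
```
import os

import boto3
import joblib

def load_model():
    # e.g. "/app/models/model.joblib"              (baked in at image build time)
    # or   "s3://my-bucket/models/model-v2.joblib" (fetched while the service starts)
    model_uri = os.environ.get("MODEL_URI", "/app/models/model.joblib")

    if model_uri.startswith("s3://"):
        bucket, _, key = model_uri[len("s3://"):].partition("/")
        local_path = "/tmp/model.joblib"
        boto3.client("s3").download_file(bucket, key, local_path)
        return joblib.load(local_path)

    return joblib.load(model_uri)

model = load_model()
```
Baking the file in ties the model version to the image tag and its rollback machinery; fetching it at startup means the deployed version has to be tracked somewhere else, which is where a registry such as MLflow's comes in, as several people note above.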
35
33What are some preferred methods for converting PyTorch models to TensorRT for deployment using Nvidia Triton? Are there any issues with using TensorRT for dynamic batching?{'summary_answer': 'Some preferred methods for converting PyTorch models to TensorRT for deployment using Nvidia Triton include using the torch_tensorrt library and compiling models on Nvidia PyTorch containers. However, there may be issues when trying to compile models that support dynamic batching using TensorRT with minimum and maximum batch sizes. An alternative plan is to convert PyTorch models to ONNX and then to TensorRT.', 'citations': [{'id': '2023-04-03 21:02:33.417059 UTC', 'message': "U0503457E3S: Hello, everyone! :blush: I'm curious about your preferred methods for converting PyTorch models to TensorRT, specifically for deployment using Nvidia Triton. I'm looking for a straightforward, reliable approach that can potentially be incorporated into a CI/CD pipeline. Currently, we're compiling models using torchscript and using the PyTorch backend. I'd like to explore using TensorRT more extensively.\n\nI've experimented with <https://github.com/pytorch/TensorRT|torch_tensorrt> by compiling models on Nvidia PyTorch containers, making sure the TensorRT version matches that of the Nvidia Triton container for deployment. However, I've encountered issues when trying to compile models that support dynamic batching (using TensorRT with minimum and maximum batch sizes, rather than a fixed tensor shape).\n\nIf I'm unable to resolve these issues, my alternative plan is to convert PyTorch models to ONNX and then to TensorRT. Any suggestions or experiences you can share would be greatly appreciated!"}]}
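For the dynamic-batching issue described above, torch_tensorrt lets min/opt/max shapes be declared on the input spec instead of compiling for a fixed tensor shape, and the ONNX fallback route marks the batch dimension as dynamic at export time. A sketch of both, assuming a ResNet-style image model; shapes, precisions and file names are illustrative, and the version alignment with the Triton container that the poster mentions still applies:
```
import torch
import torch_tensorrt
from torchvision.models import resnet18

model = resnet18().eval().cuda()  # placeholder model

# Route 1: compile with torch_tensorrt, declaring a dynamic batch dimension.
# ir="ts" requests the TorchScript frontend so the result can be saved with
# torch.jit.save; newer releases also offer a dynamo path.
trt_model = torch_tensorrt.compile(
    model,
    ir="ts",
    inputs=[torch_tensorrt.Input(
        min_shape=[1, 3, 224, 224],
        opt_shape=[8, 3, 224, 224],
        max_shape=[32, 3, 224, 224],
        dtype=torch.float32,
    )],
    enabled_precisions={torch.float16},
)
torch.jit.save(trt_model, "model_trt.ts")

# Route 2: export to ONNX with a dynamic batch axis, then build the TensorRT
# engine from the ONNX file (e.g. with trtexec) for Triton's tensorrt backend.
dummy = torch.randn(1, 3, 224, 224, device="cuda")
torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)
```
Either artifact can then be referenced from a Triton model repository with dynamic batching enabled in the model configuration.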
36
34How to resolve the issue of a pod remaining unschedulable despite adding tolerations?{'summary_answer': 'Tolerations only allow the pod to be scheduled onto the tainted GPU nodes; the pod stays unschedulable until the NVIDIA GPU drivers are installed on those nodes, as described in the Google Kubernetes Engine documentation linked in the thread.', 'citations': [{'id': '2021-04-07 15:59:36.4148 UTC', 'message': "U01NTK2841W: ```train_task.set_gpu_limit(1)\ntrain_task.apply(\n gcp.add_gpu_toleration()\n)\n\ntrain_task.add_toleration(\n tolerations=V1Toleration(\n effect='NoSchedule', key='<http://nvidia.com/gpu|nvidia.com/gpu>', operator='Equal', value='present'\n )\n)```\nI have added the tolerations, but the pod remains unschedulable, and still does not see the GCP gpu nodes as viable. Anyone ran into this issue?U017N5AUYUC: <https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers|https://cloud.google.com/kubernetes-engine/docs/how-to/gpus#installing_drivers>\n\nHave you tried this?U01NTK2841W: Yes! That was it! I thought I can install the drivers later. Thanks a lot. :smile:U017N5AUYUC: No problem, it’s not super obvious from the Kubeflow gcp docs, and I think it used to be included as part of the installation. Anyhow, glad it’s worked :+1:"}]}
37
35What would be the checklist for taking over an ML project for maintenance or further enhance the model between two teams with similar skill sets?{'summary_answer': 'The checklist for taking over an ML project for maintenance or further enhancing the model between two teams with similar skill sets would include meeting and decision minutes, code and data lineage, how-tos, reports and how they are made, documentation of everything, keeping anything mission-critical in version control, and automated tests.', 'citations': [{'id': '2022-03-29 07:53:27.099089 UTC', 'message': "U02TZ24T492: What would be your checklist of things which needs to be met , for taking handover ML project for maintenance / further enhancing ML model ? U015BH45ZK6: Handover from who to who?U02TZ24T492: :thinking_face: between 2 teams. U015BH45ZK6: I meant which two teams and what are their individual responsibilitiesU02TZ24T492: Sure. Two teams with or more less with similar set of skill set folks ( data scientists, data engineers , MLops engineers etc.,). First team is responsible for developing the ML application and put it to production, whereas Second Team responsible for maintaining/ enhancing the Model. \n\nHope this explanation helps in answering your question.U01R2314L02: Easy. Merge teams, leaving one team to build, productionise, and maintain / enhance the thing :slightly_smiling_face:U015BH45ZK6: Yeah, but it is weird why you need the handover. Product mindset would suggest that the team that developed the model will maintain it as they already gather a lot of knowledge about the domain/problem/solution. Especially if both teams have similar skillset.U015BH45ZK6: Ehh, <@U01R2314L02> was faster.U01R2314L02: hahaU015BH45ZK6: You snooze you loose (or brevity is king)U015BH45ZK6: If you need to hand over:\n• meeting and decision minutes (mainly rationalities of the decisions)\n• code and data lineage\n• howtos \n• reports and how you make them\nSo document everything and keep anything mission critical in version controlU02TZ24T492: I wish you were my boss/ bosses boss ( better )U02TZ24T492: I couldn't agree more , but unfortunately division of responsibilities shared more or less this way.U02TZ24T492: May be one of the point a which could be added to the list is “Automated tests”.U01R2314L02: Sounds like this case has circumstances outside your control <@U02TZ24T492>\n\nBut speaking in general, I've found that the most successful teams have full ownership of the things that they build. So that means the team that builds the thing also deploys the thing, maintains the thing, and continues adding features to it later on.\n\nIt's natural that specialist help will be needed along the way. So you might have someone who isn't attached to that team helping out with tooling, or scaling, or monitoring, for instance. But there's a core team that has ownership over their product.\n\nThat applies for *any* kind of software, not just ML models."}]}
38
36What is the best solution to manage multiple models in a system pipeline, and how to handle dynamic mapping of models when using RESTful containers?{'summary_answer': 'The best solution to manage multiple models in a system pipeline is to create a common interface for the models and create one docker image that can load the required model as a parameter for execution. However, the challenge is to handle dynamic mapping of models when using RESTful containers. One solution is to create a service that acts as a reverse proxy and dynamically generates the addresses for each model. This service can be responsible for starting the docker machines and generating the addresses. Kubernetes can be used to create each model as a service with any number of pods, each one based on an image for that model. Other solutions mentioned in the context include Seldon, Algorithmia, and mlrun.org.', 'citations': [{'id': '2021-07-13 05:21:37.4718 UTC', 'message': 'U01VD89958Q: Hello Awesome People in MLOps! I looking for a solution to manage all the models I have in my system pipeline. So basically I have a custom written service which uses multiple models for each request. There are more than 100 models as of now (mostly small models running on cpu and few runs on GPUs). The numbers will be growing upto 200-300 in a year. Currently all models are loaded in kafka consumer queues. I want them to be RESTful &amp; containerised so that they can be used by anyone not just the queue. Its very painful to create docker for each of them and manage their port for mapping and also updating them with feedback. A solution like TFserver would have been useful but that is limited to TF only while I have all sorts of operator models. I tried options like kubeflow, Bentoml but couldn\'t make a good case. Please guide me here.U01S2349UNS: If you abstract a common interface for your models (something like SKLearns Estimator API), then you can create one docker image and just pass the model they need to load as parameter for the executionU01VD89958Q: Yes True, that I am trying to do but the problem is currently I decide when to use which model based on the name of Kafka topic (which is dynamically generated). If I go for docker then each model will be on some port (if machine is same). How would I keep dynamic mapping of all models ? Like if I know i need to run price prediction model then I will send the request to "clientid-place-price" topic, for demand prediciton " clientid-place-demand" topic. Now with RESTful I need to keep a map of topic-port mapping for 100s of unique models. How to tackle that?U01S2349UNS: You can create a service that does that exactly mapping (which is basically a reverse proxy). It can be responsible for starting the docker machines dynamically and generating the addresses. If you use kubernetes, each model could be a kubernetes service and have any number of pods, each one based on an image for that model (you get really good scalability control that way).U02785LVDQC: Is <https://www.seldon.io/|Seldon> of any help?U01S2349UNS: so, basically a predict service that has an endpoint /predict/{model-name} and input on the body. 
This service holds a registry of which address each models is at (either by the models registering themselves or by the service in itself creating the containers), and routes them awayU01VD89958Q: <@U01S2349UNS> Yes that\'s what I am looking for, a service responsible for dynamically generating the address as and when a new model is coming up or when an existing model get updated (keeping same address). I am hoping I can find this service clubbed with available open source solutions like seldon core/kubeflow or something. Or do I have to write that on my own? Ingress does this job in kubern8s.\nI really appreciate your inputs, Thanks for that. :slightly_smiling_face:U01VD89958Q: <@U02785LVDQC> I am already exploring this.U01R9SZDDLY: If you want to keep it simple you can have monolith service that routes requests to specific model based on request parameter. Else if you need microservice based try aws lambda for each model and tie through api gateway. You can orchestrate them through aws step functions if requiredU015BH45ZK6: At 2-300 models I\'d definitely go for a managed solution rather than bake something at home. Also a model catalogue could be useful esp if you start updating the models.U013XV3LHCP: Also you might want to think what happens with deserializing the model artifacts. Just recently I ran in an unpickling error because of a non major version change in one of my dependencies. One docker image might be very restrictiveU01S2349UNS: I mean, there will be expections (unpickling is hell with tons of small issues). When an exception appears, you can build a different docker image on top of the general oneU01VD89958Q: One docker image wont be a great idea if I\'m going to place 100s of different models through it. Because with that I will lose the CPU parallel processing power. It will be difficult to have one docker to use all cpu capacity with internalised multi-processing.U01S2349UNS: Docker image != Docker container. you can have multiple containers with one image. Each model would still have any number of containers.U01VD89958Q: yeah, my bad. We can abstract one image and create multiple containers through it each serving different model, right? Then have a service to manage addresses or routing through dns template.U019715GGMU: <@U01VD89958Q> we built Algorithmia to solve precisely this problem. Each container is a separate and immutable object and each version becomes its own container and all the serving logic is built in so you don’t have to mess around with k8s. You don’t want to be in a situation where you are overwriting what’s in a container or stuffing multiple models into a container if you have compliance requirements. We support all the langs and frameworks you want to use. Git push and go - leave the pain of MLOps to Algorithmia….U01GQKMUQAJ: <@U01VD89958Q> I suggest you take a look our open source <http://mlrun.org|mlrun.org> which leverages another open source we have created called <https://github.com/nuclio/nuclio|Nuclio> and allows you to deploy your models as serverless functions. Nuclio is ultra-fast and can run cloud or on-prem. Each function has multiple triggers, including HTTP and Kafka. 
Also possible that each function to host multiple models (e.g., ensembles).\n\nThe code does run in a container, so of course you can specify the image, but you also optionally build it as part of the function deploy, so if there are variations, it\'s easy to create those containers ad-hoc.\n\nIt\'s much easier to leverage serverless capabilities to scale up/down as needed rather than trying to orchestrate yourself 100+ different containers.U01RZ22SB5E: <@U01VD89958Q> take a look at <http://Algorithmia.com|Algorithmia.com>, the single leader of <https://gigaom.com/report/gigaom-radar-for-evaluating-machine-learning-operations-mlops/|MLOps platform>. Let me know if you have any questions, I can walk you through with a demo.U01BUBPNZR9: Has anyone used nuclio/mlrun on singularity containers ? <@U01GQKMUQAJ> U01GQKMUQAJ: <@U01BUBPNZR9> I\'m not aware of someone who has done that, for some of the use cases singularity addresses, such as file-system access, we implemented an API that makes the mount much easier.U01VD89958Q: <@U01GQKMUQAJ> Can we have a Demo of Iguazio for our use case ?U01GQKMUQAJ: <@U01VD89958Q> sure thing. Moving to DM so we don\'t spam others'}]}
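The routing pattern sketched in the thread — one predict endpoint that looks up which container serves a given model and forwards the request — can be expressed in a few lines. A minimal sketch assuming FastAPI and httpx, with a hard-coded registry standing in for whatever service discovery (Kubernetes Services, a model catalogue, a managed platform) is actually used; the model names and addresses are placeholders taken loosely from the thread's examples:
```
import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Placeholder registry: model name -> address of the container serving it.
# In practice this would be populated from k8s Services or a model catalogue.
MODEL_REGISTRY = {
    "clientid-place-price": "http://price-model:8080/predict",
    "clientid-place-demand": "http://demand-model:8080/predict",
}

@app.post("/predict/{model_name}")
async def predict(model_name: str, payload: dict):
    address = MODEL_REGISTRY.get(model_name)
    if address is None:
        raise HTTPException(status_code=404, detail=f"unknown model {model_name}")
    # Forward the request body to the model's own container and relay the answer.
    async with httpx.AsyncClient() as client:
        response = await client.post(address, json=payload)
    return response.json()
```
At a few hundred models the thread's advice still stands: a managed serving layer or an Ingress-style router usually replaces this hand-rolled registry, but the shape of the API is the same.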
39
37Has anyone worked with AWS EKS for managing ML pipelines in combination with AWS SageMaker and CI/CD?{'summary_answer': 'Yes, <@U01T8RK505N> has some experience with using AWS EKS, AWS SageMaker, and CI/CD to manage ML pipelines.', 'citations': [{'id': '2022-04-07 04:24:03.406029 UTC', 'message': 'U02P9FHBP28: Does anyone have experience w/ AWS EKS? Particularly, using combination of CI/CD, EKS and AWS SageMaker to manage ML pipelines?U01T8RK505N: Yes, have some experience on this front. How can I help <@U02P9FHBP28> ?U02EGFTH8DV: What’s the piece of EK in the equation? I have experience with all of these tools :slightly_smiling_face:'}]}
40
38What is the preferred way to deploy multiclass classifiers?{'summary_answer': 'The preferred way to deploy multiclass classifiers is not specified in the context. Different approaches such as random forest, one-vs-rest type classifier, and ensemble of simple models have been mentioned as options, and the choice depends on factors such as the number of classes, their importance in production, and whether they are mutually exclusive and collectively exhaustive.', 'citations': [{'id': '2021-05-18 17:59:38.0462 UTC', 'message': "U01N8ERHH9T: what’s your preferred way to deploy multiclass classifiers? for example, do you like:\n• something like random forest or neural net and just return the class with the most votes?\n• a one-vs-rest type classifier with say logistic regression and return the predicted probabilities for all classes?\n• lots of simple models for each class that fit in an ensemble together at the end using some rules or decision tree splits?\n• anything else?UV92GMLF4: Hierarchical Parent Node + Soft Voting.U019ABXBYET: We’ve done the first and third approaches. I personally don’t have a preference, it depends on how many classes there are, whether we care about all of them in prod, and whether “none of these” needs to be a viable responseU01N8ERHH9T: Good points. I think the other aspect to consider that might get overlooked is whether the classes are _mutually exclusive and collectively exhaustive_. If you are missing a class from your data, the model is still probably learning it but incorrectly assigning it's weights. Likewise if your classes overlap it's really tough to know which to predict."}]}
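The second option listed in the thread — a one-vs-rest classifier that returns predicted probabilities for every class — looks roughly like this in scikit-learn; a minimal sketch on a synthetic dataset, purely illustrative:
```
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Serve the full probability vector so the caller can apply its own thresholds,
# including a "none of these" rule when no class is confident enough.
proba = clf.predict_proba(X[:3])
print(np.round(proba, 3))
print("argmax class:", proba.argmax(axis=1))
```
Returning the whole probability vector rather than a single label is what makes the "none of these" and overlapping-class concerns raised in the thread tractable downstream.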
41
39Is the 'operations' tag in MLOps limiting? How do people typically define MLOps?{'summary_answer': "Some people find the 'operations' tag in MLOps limiting because it implies a repeatable process that gets good results, whereas the pace of change in AI research and MLOps makes it an evolving problem. However, others argue that 'operations' is an ongoing workstream and accurately describes the ongoing maintenance and upgrades required for ML systems. People typically define MLOps as the set of practices and tools used to manage the lifecycle of machine learning models, including development, deployment, and maintenance.", 'citations': [{'id': '2023-02-01 16:09:52.451979 UTC', 'message': 'U02F2LJHM7X: I think the actual work in what we call and talk about in MLOps is exciting - but does anyone find the "operations" tag to be limiting? Operations typically is about executing on a strategy, right? A repeatable process that gets good results. However the pace of change in AI research and MLOps makes it an evolving problem, not a "solve it and move on" problem.\n\nWhen explaining what my team does, they are here to discover, experiment and innovate with architecture, platform and workflow ideas to deliver ML at ever increasing scale - more capabilities applied to more datastreams for more users.U01J0NVNE1G: Indeed, the best proof of that is Large Language Model. You couldn\'t innovate if the engineering side would not propose better scalability techniques.\n\nIt makes sense to build the best engineering capabilities to enable the end user, and I don\'t think one solution can be built without the other. Even if you have an end-to-end solution like Sagemaker, it\'s still valuable to have the tooling to leverage it properlyU04HBG5485P: Yeah I find the term can be misleading, but also I try not to worry too much about it… I’m not sure there’s any good way to sum up the field in less than 10 characters.\n\nTo the point about “operations” — it sounds like you’re talking about it in the context of chief operating officer, or operations research. But the ops in MLops is from DevOps, where it’s a contrast between developers who write software and IT operations who manage servers &amp; network (historically). I haven’t heard of anyone calling it IT operations anymore though, so it’s a bit outdatedU02U1H6RH53: I think AI systems is better, but not as catchyU01QFUGN2UQ: “Operations” is an on-going workstream. “Projects” are limited duration workstreams, with a clearly defined start and end time. In this context I think ML Ops is accurate…?U02F2LJHM7X: great discussion, thanks!\n&gt; the ops in MLops is from DevOps, where it’s a contrast between developers who write software and IT operations who manage servers &amp; network (historically). I haven’t heard of anyone calling it IT operations anymore though, so it’s a bit outdated\nthis is the Ops I was thinking of, from DevOps.... I would not want leaders &amp; other parts of the company to think our MLOps mission is focused exclusively on "keeping the lights on".\n&gt; AI systems is better, but not as catchy\nI like this, although Systems in some companies (my IBM background) refers to HW, servers.U04HBG5485P: ah I see what you mean. 
When I talk to PMs and other non-technical folks generally I talk about MLops as developer productivity, especially as it relates to reliable, secure servicesU03TBSSRV51: I tend to phrase it like : if you\'re on a project with a defined terminus it\'s ML-engineering, but if it is in execution/maintenance/upgrades it\'s MLOps.'}]}
42
40Is it possible to install two versions of the same pip packages to the same python environment and choose which to load? What's the least hacky workaround?{'summary_answer': 'Yes, it is possible to install two versions of the same pip packages to the same python environment and choose which to load. However, it is not a recommended practice. The least hacky workaround is to evaluate the models separately and save the evaluation results in a standard format (json, csv, etc) and load those evaluation results into a single notebook. Another workaround is to have the models on different branches and checkout the relevant branch to run the model. It is also suggested to prepare a python package, publish it to an internal pip, and pip install from wherever it is needed.', 'citations': [{'id': '2023-01-12 17:39:26.158319 UTC', 'message': 'U015DM4NM1U: Hi! is it possible to install two versions of the same pip packages to the same python environment and choose which to load?\nMy use-case is:\n• we develop model `my-model==v1`, train `model1.pkl`\n• we develop model `my-model==v2`, train `model2.pkl`\n• then we want to evaluate both in the same Jupyter notebook. But `model2.pkl` is incompatible with `my-model==v1` and vice versa.\nI know, it\'s a bad practice and we should evaluate both in different docker images, but our infra is not ready and data scientists need to be unblocked.\n\nWhat\'s the least hacky workaround?U044BKV9TNF: What about evaluating them separately, but saving the evaluation results in a standard format (json, csv, etc) and loading those evaluation results into a single notebook?U015DM4NM1U: yeah.. I guess that\'s the cleanestU044BKV9TNF: probably the fastest too :confused:U01HBEHNX16: that’s what we do too. have them on different branches. checkout v1, run, … checkout v2, runU015DM4NM1U: good point. it\'s shame that python doesn\'t encourage users to use the proper packages given this limitationU01HBEHNX16: I will do it with a `pip install -e` so checkout updates the proper python packageU015DM4NM1U: thank you for the answersU01HBEHNX16: This mode of working works well with the DVC approachU01HBEHNX16: each branch will have a different set of trackersU01HBEHNX16: it does not work well if your data is streamingU015DM4NM1U: I\'d say this approach doesn\'t scale - it\'s good for ds for eda or model development, but brings problems with automation. But I agree this is best we can do if need to evaluate different models in the same envU015DM4NM1U: having no proper solutionsU01HBEHNX16: what do you mean it doesn’t scale?U015DM4NM1U: you have to git clone the repo each time you need the package, then cd too the directory, etc. A proper way that scales is to prepare a python package, publish it to an internal pip, and pip install from wherever you need it. This way you can import it and reuse in other packagesU015DM4NM1U: I\'ll be back laterU01UXSLEL85: "Conflicting dependencies in a single environment" is an oxymoron. But as others have said you need to update the package version before loading each model. You could define (and maintain) different sets of <https://setuptools.pypa.io/en/latest/userguide/dependency_management.html#optional-dependencies|optional dependencies> on the package, update your model class (if possible) to handle the versioning logic (i.e., make a single version handle both exports), or just set up your notebook to `%pip install` the relevant version before loading/evaluation. U01HBEHNX16: in a notebook?!?! how dare you'}]}
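The workaround most people in the thread converge on — evaluate each model version in its own environment and only merge the results — can be captured in a small script run once per environment. A minimal sketch, where the `my-model` package, the metric and the file layout are placeholders:
```
import json
import pickle
import sys
from pathlib import Path

# Run once inside each environment, e.g.:
#   (env with my-model==v1) python evaluate.py model1.pkl results_v1.json
#   (env with my-model==v2) python evaluate.py model2.pkl results_v2.json
model_path, results_path = sys.argv[1], sys.argv[2]

with open(model_path, "rb") as f:
    model = pickle.load(f)  # only unpicklable with the matching my-model version

# Placeholder evaluation; swap in the real hold-out set and metrics.
X_eval, y_eval = [[0.1, 0.2], [0.3, 0.4]], [0, 1]
preds = model.predict(X_eval)
accuracy = sum(int(p == t) for p, t in zip(preds, y_eval)) / len(y_eval)

Path(results_path).write_text(json.dumps({"model": model_path,
                                          "accuracy": accuracy}))

# The comparison notebook then only loads results_v1.json and results_v2.json,
# so it never needs both package versions in one environment.
```
This keeps the notebook free of conflicting dependencies, which is the constraint the thread calls an oxymoron in a single environment.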
43
41What resources are available to help train an ML model more effectively, specifically an XGBoost model on historical data that is performing poorly? How can one avoid trial and error while retraining the model, and is there a way to understand what is causing the performance issues?{'summary_answer': 'There are several resources available to help train an ML model more effectively, specifically an XGBoost model on historical data that is performing poorly. One can avoid trial and error while retraining the model by doing a more in-depth analysis to understand what is causing the performance issues. Some suggestions include checking data processing for errors that destroy the structure in the data, exploring the data to check if there is anything to learn, checking the SHAP values to better understand what the model is doing when calculating individual predictions, and reviewing feature importance to see if it matches the knowledge one has on the domain. Additionally, one can use tools like modela, which is a kubernetes native AutoML system that comes pre-built with XGBoost parameters and ranges, as well as a dozen other ML algorithms. It supports automatic feature selection and feature engineering and allows for training hundreds of models in parallel.', 'citations': [{'id': '2022-08-09 21:33:25.491189 UTC', 'message': "U03JEEAKNA2: Hi Everyone, I wonder if you could direct me to valuable resources on how to train an ML model more effectively. I am trying to train an XGBoost model on historical data, but the performance is poor. I don’t like to retrain the model using trial and error and randomly change parameters and features. I am looking for a more depth analysis to help me understand what causes this problem and how it can be addressed. I appreciate any reply.U033T45JYAK: Hey Kiana, can you get a bit deeper at what you’re trying to achieve? I’m working on a tool that makes it much easier to add real-time personalization to ranking items, which might be useful in your case.U03R014NKUZ: Hi,\nI suspect by performance you mean the accuracy of the prediction and not training time?\nIf your predictions are really bad, I would check your data processing for errors that destroy the structure in your data.\nOf course it might be that there is nothing to learn in the data, but I hope you did some basic inspection and exploration to check for that.\nTo better understand what the model is doing when calculating the individual predictions, you can check the SHAP values (<https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_models/Basic%20SHAP%20Interaction%20Value%20Example%20in%20XGBoost.html|https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/tree_based_mo[…]%20SHAP%20Interaction%20Value%20Example%20in%20XGBoost.html>)U01J0NVNE1G: Hello. I agree with <@U03R014NKUZ> that you should first check if your features and evaluation methodology is suitable for the algorithm. If your dataset relates to time events, you should ensure that you train your algorithm using past events to predict future events. You can derive many interesting features, like the day of the week, day of the month, timestamp ...U01J0NVNE1G: Then, some hyper-parameters have more impact than others. 
I recommend this guide to better navigate your options: <https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/>U01J0NVNE1G: Finally, you can review the features importance and see if it matches the knowledge you have on the domain: <https://machinelearningmastery.com/feature-importance-and-feature-selection-with-xgboost-in-python/>U01JFBUKMCG: If you want you can try to use modela, which is a kubernetes native AutoML system (written using CRDs and operators) . It comes pre build with XGBoost parameters and ranges (as well as a dozen other ML algorithm). You can train 100's of models in parallel (across nodes, or within a node - each model is a docker container). It also supports automatic feature selection and feature engineering.\n\nI also recommend trying other algorithms, as they might perform better on your data.\n\nPlease DM me if this sounds interesting."}]}
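The SHAP and feature-importance checks suggested above look roughly like this; a minimal sketch on a synthetic dataset, assuming xgboost, shap and scikit-learn are installed:
```
import numpy as np
import shap
import xgboost as xgb
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=12, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)

# Global view: does the ranking of features match domain knowledge?
top = sorted(zip(model.feature_importances_, range(X.shape[1])), reverse=True)[:5]
print("top features (importance, index):", top)

# Local view: SHAP values explain individual predictions of the tree model.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test[:100])
print("mean |SHAP| per feature:", np.abs(shap_values).mean(axis=0).round(2))
```
If the importance ranking contradicts the domain, or the SHAP attributions look arbitrary, that usually points back at the data processing and feature construction the thread recommends checking before any hyperparameter tuning.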
44
42What are some recommended open source visualization tools for EDA?{'summary_answer': 'Some recommended open source visualization tools for EDA include Bokeh, D3.js, plotly, geoplotlib, Perspective, Facets, Altair, seaborn, and vega.', 'citations': [{'id': '2020-08-17 16:14:49.3474 UTC', 'message': "U017A2BBZBJ: Hi all. Just a quick one. Can anyone recommend their preferred open source visualisation tools for use during EDA? Thanks!U019N0N6G72: plotly, d3.js and bokeh <@U018N6MTH40>U018N6MTH40: Depends what your use case is visualisation, wise?, i Prefer to use <https://github.com/bokeh/bokeh|Bokeh> as <@U019N0N6G72> mentioned earlier, which fufils most of my use cases. other for example, being: <https://github.com/andrea-cuttone/geoplotlib|Geoplotlib>\xa0\xa0- geoplotlib is a python toolbox for visualizing geographical data and making maps, <https://github.com/finos/perspective|Perspective>\xa0\xa0Streaming pivot visualization via WebAssembly\xa0<https://perspective.finos.org/>. theres quite a varied choice you can useU018N6MTH40: D3.js is also quickly becoming better and better and more user friendly in time.U017A2BBZBJ: Thanks <@U019N0N6G72> <@U018N6MTH40> I’m new to this world so will take a look at Bokeh and D3.js . Plotly I know and is great. Out of interest why not matplotlib?U013CL3GTB3: <https://towardsdatascience.com/powerful-eda-exploratory-data-analysis-in-just-two-lines-of-code-using-sweetviz-6c943d32f34> is a cool one for easy EDAU017A2BBZBJ: Thanks <@U013CL3GTB3> - might save that for my morning coffee!U019N0N6G72: <@U018N6MTH40> matplotlib lacks aesthetics, if you want to display the visualisations on a dash and use matplotlib it won't look good.U017WUPQQ93: Facets from Google is good for quick summary statistics and visualizationsU017WUPQQ93: Altair is a good project from Jake VDP, who has done a lot for the scipy community. <https://altair-viz.github.io/>U017CCSNYDV: This post has a graph of the toolscape (incomplete) <https://www.kdnuggets.com/2019/06/pyviz-data-visualisation-python.html>\nseaborn is my usual tool (but it looks like I need to try sweetviz!)U018QM6DUTB: It depends on what you’re trying to do, but I’ve always liked: <https://seaborn.pydata.org/>U017A2BBZBJ: Thanks <@U018QM6DUTB> <@U017CCSNYDV> <@U017WUPQQ93>U018HUQTNBE: A slightly higher level approach than d3.js is vega: <https://vega.github.io/vega/|https://vega.github.io/vega/>. If you go the web route I'd go that way. Vega and d3 have some integration with Jupiter notebooks as well I think.U017WUPQQ93: ^ Altair is a python implementation of Vega!"}]}
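Of the libraries listed above, seaborn is probably the fastest way to get a first pass of EDA plots; a tiny sketch using one of its bundled example datasets, purely illustrative:
```
import matplotlib.pyplot as plt
import seaborn as sns

# Bundled example dataset; replace with your own DataFrame.
df = sns.load_dataset("penguins")

sns.pairplot(df, hue="species")  # pairwise relationships coloured by class
plt.figure()
sns.heatmap(df.select_dtypes("number").corr(), annot=True)  # quick correlation view
plt.show()
```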
43As the only data scientist for an ecommerce startup, what can I do? Any suggestions?{'summary_answer': 'As the only data scientist for an ecommerce startup, you can start by understanding the tools the dev team has in-house and asking them to set up a git repo for your code. You can focus on internal analysis and generate simple descriptive analysis to understand the data. You can also work with the business units to identify what would add value in terms of predictive analysis, such as analyzing the behavior of creators that generate the most revenue and figuring out how to retain and bring in more of those people. It is important to prioritize based on what would make a difference to the business units.', 'citations': [{'id': '2022-06-14 20:59:16.370039 UTC', 'message': "U0393538LCA: Need help on what all can I do for an ecommerce like startup as the only guy that they have at the time for data science, ML and MlOps and anything related to this field.\nIt's just me right now. <https://hypd.store|https://hypd.store> is the official website please check it out, and any suggestions on what all can be done are welcomed.\nThanks already for any help guys.U02ET0KPFTP: Do they have a dev team? If so I would suggest speaking to them and understanding the tools they have in house already. Ask them to set you up with a git repo for your code as step 1 :slightly_smiling_face:\nDoing ML alone is hard.U02KEE6B2F8: There's a ton you could do. Are you focused on internal analysis or are you focused on helping the creators that open a store?\n\nEither way I would start with the basics for internal needs. Things like all the data aggregated and make sure you understand it. Start generating simple descriptive analysis, ask the business people what will add value to them.\n\nThen identify with the business units what would add value in terms of predictive analysis like analyzing the behavior of creators that generate the most revenue, what are their characteristics, what do they do in the platform, etc. try to figure out how to retain and bring in more of those people and try to figure out how to prevent they leave.\n\nIn principle more that trying to think what you can do by yourself I would reach out to the business units and try to understand what would make a difference to them and try to prioritize.\n\nKeep it simple at this point"}]}
44When should we use Prometheus Alertmanager instead of configuring alerts in Grafana when using Prometheus and Grafana together?{'summary_answer': "We should use Prometheus Alertmanager instead of configuring alerts in Grafana when we need better deduplication of alerts, especially if we have Grafana configured with high availability. Additionally, if the connection between Grafana and Prometheus fails, we may suddenly have no alerts. However, if we have alerts coming from other sources than just Prometheus, it may be advantageous to use Grafana's alert system. Ultimately, it is considered best practice to use Prometheus Alertmanager over Grafana's alert system when we don't have any other sources of alert than Prometheus.", 'citations': [{'id': '2022-08-03 09:47:47.049759 UTC', 'message': "U03AKP34YTW: When we use prometheus and grafana together, what is the need to setup *Prometheus Alertmanager* when we can configure alerts in Grafana itself? I haven't worked with Alerts before, so please excuse if the question is silly.U01ACU6EGMV: grafana is not as good in deduplicating alerts, especially if you have it configured with high availability. Also, if for some reason the connection between grafana and prometheus fails, you suddenly have no alerts.\n\nAdvantage of using Grafana though is that you might have alerts coming from other places than just Prometheus.U03AKP34YTW: <@U01ACU6EGMV> Thanks for the insight! So, it is considered best practice to use Prometheus Alertmanager over Grafana's alert system, assuming I don't have any other sources of alert than Prometheus? Regarding high availability, you mean that we use Service Discovery and Push Gateway to send metrics to Prometheus, right?U01ACU6EGMV: I’d personally use alertmanager of grafana’s alert system in your case, yes!\n\nRegarding the HA, I meant setting up grafana with multiple replicas. Each grafana instance would trigger the alerts without any deduplication"}]}
45What is the best tool for env management of python in 2023?{'summary_answer': 'According to the conversation, the best tool for env management of python in 2023 is Poetry.', 'citations': [{'id': '2023-05-13 06:02:34.422689 UTC', 'message': "U020QJYSQGK: Give it to me straight, what is the best tool for env management of python in 2023?U03J3RZ6FSS: I like poetry very much. Has been using since 6 months. It's pretty good.U020QJYSQGK: better than venv and conda yeah?U020QJYSQGK: alright I found some write up, thanks : ) gonna do it.\n<https://blogs.sap.com/2022/05/08/why-you-should-use-poetry-instead-of-pip-or-conda-for-python-projects/>U03J3RZ6FSS: In my point of view, yes. But, poetry can't handle multiple versions of python. So you can use the comb of pyenv+poetry for projectsU020QJYSQGK: but it does set some underlying python right? or is it using the system version?U03J3RZ6FSS: Check DM onceU01J3NRJA3X: I’d say the combo pyenv and poetry U04PCMAE1C3: I've been using poetry for a year, but recently moved to PDM. In poetry I had annoying unexplainable package installation issues sometimes.\n\nboth with pyenvU04QCHR04SD: yeah seems most people motivating poetry theses days, I found it works well, used to use ancient virtualenv :wink:"}]}
46What are some strategies for incorporating more recent data into an ML model, retraining it, and validating that it has better performance than the existing model, without biased evaluation?{'summary_answer': 'One strategy is to keep the testing sets consistent across models and deliberately add additional samples to either the training or validation set. Another strategy is to test the old and new model on the same testing data, while training the new model on old data plus a fraction of the new data. Additionally, if dealing with time-sensitive data, it is recommended to train and test the model multiple times on different parts of the data using a time-dependent K-fold train/test strategy. It is important to use test data that are not used for any training to avoid biased evaluation. Shadow deployments can also be used to compare the results of the new trained model with the production model.', 'citations': [{'id': '2022-12-09 13:58:16.943109 UTC', 'message': "U03V28YU8GG: Hey ya'll! I'm trying to solve a problem that has me a bit stumped - scheduled model retraining:\n\nLet's say I have an ML model that I trained a month ago. Now, I want to incorporate more recent data, retrain the model, and validate that it has better performance than the existing model. My issue is that it is difficult to create appropriate training/testing sets because:\n\n• If I use the testing set from the existing model, it will not test performance on newer data\n• If I use the testing set from the new model, its likely that my split will include data that the existing model was trained on \nIn both cases, it seems my performance evaluation is biased towards the older, existing model (this is what I'm seeing empirically as well). Any ideas? Thanks!U01HBEHNX16: I am not aware of a way to solve this without keeping track of the exact samples included in the validation and training set. That would be the first step. Once you know exactly which samples are in training and validation, you can deliberately add additional samples to either set and reason across iterations. Beyond that, there are a lot of different strategies you might employ to figure out how to update those datasets.U01J0NVNE1G: It depends if the data distribution stays the same over time or not.\n\nIf it's time-sensitive (e.g., sales), I would recommend testing the old and new model on the same testing data, while training the new model on old + a fraction of the new data (data which will not be in the testing set). This is the only way to evaluate the data on the same ground.\n\nIf it's not time-sensitive (e.g., cat or dog), I would repeat the same process and simply compare the final metrics with the one of the old model. Even if they are not on the same data, they are from the same distribution.U01J0NVNE1G: In any case, the most important thing is used test data which are not used for any trainingU015BH45ZK6: yes, <@U01J0NVNE1G>'s point is the right one.\n\nIf you retrain you think you have a temporal drift, but then you must train on temporally split data.U015BH45ZK6: Good news is that for the final model you can use all the data and test on yet not existing data (in prod)U03V28YU8GG: Thanks for the thoughtful answers all! For the most part, we are dealing with time-sensitive data so I think it makes sense for us to keep our testing sets consistent across models.U03KY9LT411: I agree with <@U01J0NVNE1G> and <@U015BH45ZK6>. 
To add to that: If you have time dependent data, you probably want to know if your model would perform well over the whole time period. Not just the period of your test set. So you need to train and test the model multiple times on different parts of the data. You can use a (time dependent) K-fold train/test strategy.U020ZPEHESY: Agree with the above. In case your set up allows it have you also checked shadow deployments? You run your new trained model in production and compare results with the production model…"}]}
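A hedged sketch of the time-dependent K-fold idea mentioned above, using scikit-learn's TimeSeriesSplit so every fold trains only on the past and evaluates on a later window; the data and model are placeholders:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))   # placeholder features, rows ordered oldest -> newest
y = rng.normal(size=1000)

tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
    mae = mean_absolute_error(y[test_idx], model.predict(X[test_idx]))
    # Each fold trains on earlier rows and evaluates on a later window, so an
    # old and a retrained model can be compared on the same future data.
    print(f"fold={fold} train_end={train_idx.max()} test={test_idx.min()}-{test_idx.max()} mae={mae:.3f}")
```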
47Does anyone else have a platform for making sure data stays private and secure?{'summary_answer': 'The context does not provide information about whether anyone else has a platform for making sure data stays private and secure.', 'citations': [{'id': '2021-12-20 15:07:44.186 UTC', 'message': 'UP3T8K9M5: I was just reading this blog about how airbnb does data protection and they have a whole platform for it. Does anyone else have a platform for making sure data stays private and secure? I also wonder how this ties in with the greater ML platforms they have at airbnb. <https://medium.com/airbnb-engineering/automating-data-protection-at-scale-part-1-c74909328e08>U02Q1AW1ABU: The second article in the series seems to discuss.UP3T8K9M5: Opps I didnt even read that one yet!'}]}
48What are some methods and resources for building dynamic pricing models in short term rentals and how can demand be factored into these models?{'summary_answer': 'Several approaches are mentioned in the cited conversation: using a multi-arm bandit to try a range of prices and narrow down on the ones that optimize net revenue, framing the problem as a Markov decision process with a definite end state, and using gradient boosted models (GBMs), with one model developed as a multi-armed bandit. For resources, the literature on airline dynamic pricing is recommended since airlines have been doing it for a long time, along with a KDD paper from ASOS (<https://arxiv.org/abs/2207.01137>) and papers from Alibaba/TMall that came up in literature review. How demand should be factored in is not specifically addressed in the cited messages.', 'citations': [{'message': "U03UJEUDP1S: Is anyone here familiar with dynamic pricing models? I'm interviewing for a role where I'd manage a team that does this, and would love to learn more!U01N8ERHH9T: I did some of this before. We used a multi-arm bandit to try a range of pricing and narrow down on ones that optimized net revenue.\n\nThere should be a lot of info on dynamic pricing around airlines since they have been doing it for a long time.U03NHMDHRQW: I built a sort of a dynamic pricing model, framed as a markov decision process with a definite end state (context was for setting bonuses on open appointments)U02PUD51UGH: I looked after the MLOps for pricing at <http://ASOS.com|ASOS.com>. I never got super deep into the model internals, but most of them were GBMs, with one being developed as a multi-armed bandit. We had a paper at KDD: <https://arxiv.org/abs/2207.01137> you could check out. I recall papers from Alibaba/TMall came up a fair bit in literature review."}]}
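As a toy illustration of the multi-arm bandit approach cited above, here is an epsilon-greedy sketch over a fixed grid of candidate prices; the price grid and the simulated demand curve are invented:

```python
import random

random.seed(0)
prices = [79, 99, 119, 139]                    # candidate nightly rates (invented)
counts = {p: 0 for p in prices}
revenue = {p: 0.0 for p in prices}

def simulated_booking(price):                  # stand-in for real conversion feedback
    return random.random() < max(0.05, 0.9 - price / 200)

for _ in range(5000):
    if random.random() < 0.1 or not all(counts.values()):
        p = random.choice(prices)              # explore
    else:
        p = max(prices, key=lambda q: revenue[q] / counts[q])   # exploit best average revenue
    counts[p] += 1
    revenue[p] += p if simulated_booking(p) else 0.0

print({p: round(revenue[p] / counts[p], 2) for p in prices})
```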
49How can I access the recorded version of the LLMs for the production conference held on the 13th of April 2023?{'summary_answer': 'The context mentions that the person is asking for guidance on how to get the recorded version of the LLMs for the production conference held on the 13th of April 2023. Therefore, it is not clear from the context how to access the recorded version.', 'citations': [{'id': '2023-04-14 01:53:27.913159 UTC', 'message': 'U053KJNLD4H: Hey all, Could any one of you guide me on how to get the recorded version of the LLMs for production conference held on the 13th of April 2023?'}]}
50When building/evaluating a framework for ML experimentation, training pipelines, do we bake off kubeflow/mlflow/weights and biases or build our own software stack that just abstracts those implementations? Has anybody released to OSS their own code stack?{'summary_answer': 'Both options are viable, and it depends on the company environment, size, and skill sets available. Some companies, like Netflix, have released their own code stack, such as Metaflow, which abstracts the fundamental components of ML infrastructure. Others may choose to build their own software stack that abstracts those implementations to own the highest level API layer DS use.', 'citations': [{'id': '2021-03-11 18:07:33.4906 UTC', 'message': 'U01J3T8PGG3: Really interested in this groups opinions: When building/evaluating a framework for ML experimentation, training pipelines, do we; a) bake off kubeflow/mlflow/weights and biases b) build our own software stack that just abstract those implementations? Has anybody released to OSS their own code stack?U01BNPM89U1: We open sourced our stack at Netflix - it essentially abstracts the fundamental components of ML infra -data storage, compute, orchestration, experiment management etc.U01PSCJU5UZ: <@U01BNPM89U1> which Netflix project was that? Metaflow?U01PSCJU5UZ: <@U01J3T8PGG3> I think it depends on what you want the end user to do. E.g use those solutions directly, or via some API you provide, or a mix? My guess is that the answer depends on your company environment, company size, and skill sets available, etc. We have gone for (b) — we want to own the highest level API layer DS use so that we can decide whether to build a component ourselves or integrate some 3rd party solution for part of it.U01BNPM89U1: <@U01PSCJU5UZ> <https://github.com/Netflix/metaflow>U01J3T8PGG3: I have been following metaflow, thanks for the link'}]}
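For reference, a minimal Metaflow flow (the OSS stack mentioned above) looks roughly like the sketch below; step names and the toy logic are illustrative, not from the thread:

```python
from metaflow import FlowSpec, step


class TrainFlow(FlowSpec):

    @step
    def start(self):
        self.data = list(range(10))   # stand-in for loading training data
        self.next(self.train)

    @step
    def train(self):
        self.model = sum(self.data)   # stand-in for an actual training routine
        self.next(self.end)

    @step
    def end(self):
        print("artifact tracked by Metaflow:", self.model)


if __name__ == "__main__":
    TrainFlow()   # run with: python train_flow.py run
```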
51What are some options for packaging models without separating them from their underlying runtime, and has anyone had experience with using ONNX?{'summary_answer': 'Some options for packaging models without separating them from their underlying runtime are packaging everything as a docker container and providing a means of pulling the model out if needed, or using ONNX or another serialization format. Some people in the conversation have had experience with using ONNX, with one person saying it "worked out" beautifully but had some version-related bumps along the way, and another person saying it was a rough experience for them and ended up not working. However, another person found ONNX to be the most reliable/least headache for converting common computer vision models.', 'citations': [{'id': '2022-03-22 18:19:50.424269 UTC', 'message': 'U02U1H6RH53: I\'ve been thinking a fair amount about how we package our models. Separating a pickled model from its underlying runtime seems pretty fragile.\n\nOptions I\'m looking at are:\n• Packaging everything as a docker container, and providing a means of pulling the model out if needed\n• ONNX or another serialization format\nAnyone have experience with ONNX? Would love to hear how it worked outU026ZC4FACB: It "worked out" beautifully. But there were bumps along the way. Mostly related to versions -- ONNX itself, but also relative to CUDA and cuDNN. If you are able to be up to date, then it should go pretty smoothly although you may find online resources to be stale.U01FCPMHETD: ONNX was a rough experience for us and ended up not workingU02U1H6RH53: curious about what you ran into <@U01FCPMHETD>U01FCPMHETD: started with a ton of version problems were some ops weren\'t covered by the ONNX versions we were using, and bumping the version to the one that covered some of the ops broke some other opsU01FCPMHETD: finally solved that after some headaches, but then expanding this to some GNN library primitives on top of pytorch just plain didn\'t workU02U1H6RH53: okay very good to know, thanks for sharingU015BH45ZK6: how about torchscript?U02U1H6RH53: yeah torchscript is interesting but we don\'t want to be married to torchU02NR6C4M29: Did you check mlserver?U02NR6C4M29: Seldon images uses mlserver under the hoodU02U1H6RH53: yup we\'re using mlserver for our stuff, its more about how you store models though, which mlserver doesn\'t take any stance onU02NR6C4M29: Our approach right now is register models on mlflow, that means artifacts are uploaded to GCS, then pass the reference to the GCS folder to seldon\n\nWe don\'t just register the model, but all files needed by mlserver \n\nSo not building a docker image per deployment, just passing references around (although building custom images is also allowed as some deployment might require custom libraries in the future)U02U1H6RH53: Oh interesting, that sounds like a pretty good approachU033QREB56V: Compared to other common 3rd party formats (CoreML, OpenVino, TensorRT, etc), I\'ve found ONNX to be the most reliable/least headache for converting common computer vision models (purely CNN-based). That\'s not to say there weren\'t some issues, but if I had to pick a runtime graph format, ONNX feels like the best optionU026ZC4FACB: Now that ONNX has come back to the surface, I wanted to echo something that <@U01FCPMHETD> said about the Ops support across versions. This bit us as well. 
And FWIW, our use case was "common computer vision models" -- fairly run of the mill semantic segmentation and image classification. There were other pitfalls along the way. But we were ultimately successful (and needed to be, so we were motivated)U033QREB56V: <@U026ZC4FACB> I\'m curious if you recall which particular operators were troublesome?U026ZC4FACB: Not offhand, and a scan of my notes didn\'t turn up anything specific. However here is a list of operators and the supported opset versions <https://github.com/microsoft/onnxruntime/blob/master/docs/OperatorKernels.md>'}]}
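A hedged sketch of the ONNX route discussed above: export a small PyTorch model and run it with onnxruntime, decoupled from the training runtime; the model, opset version, and file name are placeholders:

```python
import numpy as np
import onnxruntime as ort
import torch

model = torch.nn.Sequential(torch.nn.Linear(4, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model.eval()

# Export with an explicit opset; version mismatches are the pain point called out above.
dummy = torch.randn(1, 4)
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"], opset_version=17)

# Inference through onnxruntime, independent of the PyTorch runtime.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
out = sess.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(out[0].shape)
```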
52How do you deal with access given to Data Scientists and ML Engineers/SE in a model registry, specifically with regards to S3, and what are some proposed solutions?{'summary_answer': "One proposed solution is to have a single registry where the Data Scientists push the model to and the ML Engineers/SE can pull the model from. MLFlow's model registry and python APIs can be used to get the model in desired frameworks. Another solution is to host an MLflow tracking server in one account and use the mlflow python API to pull the models without S3 access. Enabling CrossAccount access on the S3 bucket is not advisable according to the DevOps team. It is also suggested that organizations with a hand off from Data Scientists to ML Engineers/SE do not work well.", 'citations': [{'id': '2022-02-20 15:11:31.735139 UTC', 'message': "U0282TV4UEQ: Hi I have a question related to permissions given within a model registry.\n\nData Scientists would be developing a model and then on the other hand you have ML Engineers/ SE that are responsible for productionising the model.\n\n1. For the DS their models will exist within the `Staging` .\n2. For the MLE/SE the same model would need to be within `Staging &amp; Production`(within the CI/CD flow staging is for running the test on the endpoint before going to production). \nHow would you deal with access given to either sides?\nFor example if you are using S3 as the model registry:\nDo you enable CrossAccount access to enable the MLE to use the model in both environments? Then do you limit the DS to only have access to the `Staging` version of the model registry? If you have set one up for your organisation I would appreciate some assistance on this.U01DHAF466L: Trick question, organizations with a hand off from DS to MLE don't work wellU0282TV4UEQ: <@U01DHAF466L> so a better solution would be to allow DS to work up to creating the docker image of the prediction endpoint and then the MLE would handle the deployment from there?U01UXRN8U5T: One way would be to have a single registry where the DS will push the model to and MLEs can pull the model from.\nI am trying to propose a solution in my org where we use MLFlow's model registry and use the mlflow's python APIs to get the model in your desired frameworks.\nDepending on the use-case this can be done in the CI/CD phase when building the docker image or during application start up.U0282TV4UEQ: <@U01UXRN8U5T> yeah I was thinking about that initially but then I would need to enable CrossAccount on the S3 bucket which what my DevOps team has told isn't advisable.U01UXRN8U5T: You can host an MLflow tracking server in one account. Then if you use the mlflow python API, you should be able to pull the models without S3 accessU0282TV4UEQ: Oh ok thanks will definitely check it out then <@U01UXRN8U5T>"}]}
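A minimal sketch of the single-registry pattern proposed above, where consumers pull a model by name and stage through the MLflow Python API; the tracking URI and model name are hypothetical:

```python
import mlflow

# Point every consumer at the shared tracking/registry server (URL is hypothetical).
mlflow.set_tracking_uri("https://mlflow.example.internal")

# DS side (after logging a run): mlflow.register_model("runs:/<run_id>/model", "churn-classifier")
# MLE/SE side: pull whatever is currently in the Staging stage and serve it.
model = mlflow.pyfunc.load_model("models:/churn-classifier/Staging")
# predictions = model.predict(batch_df)
```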
53What is a good experiment management tool for non-production purposes with the ability to scale GPU on demand and compare experiment results?{'summary_answer': 'Kubeflow-based experiment tracking platforms are a good option for non-production purposes with the ability to scale GPU on demand and compare experiment results.', 'citations': [{'id': '2021-01-21 20:46:10.0451 UTC', 'message': 'U01K9GYNN5C: Hey,\nI would love to have your help.\nI am looking for a tool to manage experiments and save their results (*without production*).\nI need to have the ability:\n1. scale GPU on demand\n2. compare experiment results.\n\nI was thinking of using Kubefklow, but I\'m not sure if it\'s used manly for production.\nThanks!U018RV7QPAA: Hi Joe, you\'re in luck, there are about 3 new experiment management tools announced every week. Some of them even open source and some of them have built-in scaling. There is even a finite intersection of the two groups.\nYou might want to consider more capabilities to narrow down your search.\nAre you definitely not going to production? How many users are supposed to share the GPUs that are "always on", are you doing hybrid on-prem/cloud? is security and collaboration important? etc.U01K9GYNN5C: • definitely not going to production\n• On Cloud (GCP)\n• I want the ability to turn the GPU on and off\n• security - yes\n• collaboration - no\nU018RV7QPAA: Some more:\nstructured/unstrucuted data?\nhow do you handle data version control?\nwhere do you do dev? own box or cloud machine?\nhow important is automation\nis there going to be considerable amount of hyperparam optimization?\nare you going to use jupyter notebooks?\ndo you want to setup your own pipelines?\nwhere do you store the models?U01K9GYNN5C: • structured data\n• not need to data version control\n• dev either own box /cloud machine\n• automation is very important\n• no considerable amount of hyperparam optimization.\n• no jupyter notebooks\n• do you want to setup your own pipelines? *yes*\n• store the models on GCP bucket / my boxU018RV7QPAA: Well this is a very interesting breakdown, I daresay. Looks like you will be pretty content with Kubeflow-based experiment tracking platforms. I\'m not famililar with any that do autoscaling out-of-the-box but this shouldn\'t be a big issue because k8s has very nice tooling for that. I wonder what others think about this.U01KG0YGMJ7: <@U018RV7QPAA> that’s a great set of questions to ask MLOps users! Have you considered making a flowchart out of that info?U018RV7QPAA: <@U01KG0YGMJ7> yes, yes I have, but I will forver be accused as biased because I am an evangelist for one of the platforms. I think the ai-infrastructure alliance might do something like this in the near future, we can\'t have those lists and maps cluttering everything'}]}
54What are the downsides of models in container images and what alternatives are there?{'summary_answer': 'The downside of models in container images is that it can become annoying to manage, especially if there are multiple clients and models. The alternative suggested is to use object storage, such as minio for on-prem, and push the models there.', 'citations': [{'id': '2020-12-15 18:27:30.0936 UTC', 'message': 'U01GS962LMU: And I guess if not, would you go for models in some sort of PV or container images?U018QM6DUTB: I haven’t been a fan of models in container images, but that was mainly because I had to build an image for each of my `n` clients and `m` models: therefore, had to build an image `n` times for a release x `m_n` models per client. It became a bit annoying to manage. I went the route of using object storage (minio for on-prem) and pushing the models there (mirroring a bucket from internal object storage).'}]}
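A small sketch of the object-storage alternative described above, using boto3 against an S3-compatible endpoint such as MinIO; the endpoint, credentials, bucket, and key names are made up:

```python
import boto3

# S3-compatible client pointed at an on-prem MinIO endpoint (all values are placeholders).
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.internal:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

# Release step: push the trained artifact once, instead of baking it into n x m images.
s3.upload_file("model.pkl", "models", "churn/v3/model.pkl")

# Serving container: pull the artifact at startup.
s3.download_file("models", "churn/v3/model.pkl", "/tmp/model.pkl")
```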
55What are the best practices for multi-tenant MLOps flow in Databricks? How can multi-tenant models be trained in Databricks?{'summary_answer': 'The best practices for multi-tenant MLOps flow in Databricks depend on various factors such as whether customer data is stored together or separated, compliance protocols, and job orchestration. In multi-tenant scenarios, different customers with different data are involved. To train multi-tenant models in Databricks, one can use a combination of multi-task jobs and "meta" jobs or external linked services like Azure Data Factory or AWS Step Function (+ Lambda). Jobs can be split by customer as needed, and monitoring logic can be used to not submit too many jobs at once.', 'citations': [{'id': '2023-01-20 08:16:51.235539 UTC', 'message': 'U02ERPHBPT8: Hello folks,\n\n*Databricks, Multi-tenant models*\n\nHow does your MLOps flow look like in multi-account model training. I\'ve been using Databricks and it\'s great for single account training. This is how it looks like for single account:\n\nDelta Live Table -&gt; Feature Store -&gt; Databricks Jobs to train models and track in MLFlow -&gt; Deployment -&gt; Monitoring\n\nIt\'s a bit challenging for multi-account based modelling especially when there\'s the eligibility criteria to pick the list of accounts for modelling.\n• How are you processing the data with different window size for the accounts? \n• How does your Databricks Jobs pipelines look like? \n ◦ One job to trigger multiple child job to train account? (Databricks has the limitation on the number of jobs/hour for a given workspace) \n ◦ One job to trigger multiple jobs to model a batch of accounts (To reduce the number of jobs)\n ◦ Multiple jobs to model a batch of accounts?\n• What are the best practices for multi-account use case?\nAnswers/Discussions/Blog post links would be really be helpful! Thanks in advance!U01UXSLEL85: What does "account" mean in this context?U02ERPHBPT8: <@U01UXSLEL85> I\'m sorry for the confusion. By multi -account i meant multi-tenancy/multi-customer scenario, different customers with different data.U01UXSLEL85: Ahh, got it. I think the short answer, as usual, is that it depends.\n\nFor example, are all customer data stored together or separated by table/database/schema in some way? Are there compliance protocols which require customer data to be isolated? If so, does this only apply to data written out (as opposed to in memory)?\n\nAs for the jobs, it sounds like some or all of them will function differently (i.e., differs parameters) for different customers. In that case you have various options for how to orchestrate them. You can do it entirely within Databricks with a combination of multi-tasks jobs and "meta" jobs (a job which triggers other jobs using the Databricks CLI/API) or via an external linked service like Azure Data Factory or AWS Step Function (+ Lambda). \n\nWhen I\'ve encountered this in the past I\'ve done a combination of the above. In some cases the customer was an input and jobs were coordinated by an external orchestration service, enabling the jobs to be split by customer as needed. In other cases some jobs could be split while other, dependent, jobs could not, so we used multi-task jobs and the Databricks API to orchestrate them, with monitoring logic to not submit too many jobs at once. '}]}
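A hedged sketch of the "meta job" option mentioned above, triggering one child training run per tenant through the Databricks Jobs run-now REST endpoint; the workspace URL, token, job id, and tenant list are placeholders, and a real orchestrator would add proper throttling and run monitoring:

```python
import time
import requests

HOST = "https://adb-1234567890.12.azuredatabricks.net"   # hypothetical workspace URL
TOKEN = "<personal-access-token>"                         # hypothetical credential
CHILD_JOB_ID = 111                                        # hypothetical per-tenant training job

tenants = ["acme", "globex", "initech"]                   # invented tenant list

for tenant in tenants:
    resp = requests.post(
        f"{HOST}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"job_id": CHILD_JOB_ID, "notebook_params": {"tenant": tenant}},
        timeout=30,
    )
    resp.raise_for_status()
    print(tenant, "run_id:", resp.json()["run_id"])
    time.sleep(1)   # crude throttle; real monitoring logic would poll run state instead
```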
56What is the experience with using synthetic data in a production-ready machine learning workflow and how to deal with errors in the data?{'summary_answer': 'One user mentioned heavily using synthetic data for their NER/NEL system, which involved heuristics, partially labelled data, and majority voting by other lower quality NER systems. However, another user pointed out that they actually meant artificially labelled data, not synthetic data. The issue with errors in the data was not specifically addressed.', 'citations': [{'id': '2022-06-13 15:43:24.692559 UTC', 'message': "U03142DQP6Z: In this discussion, <@UP3T8K9M5> and Yash Sheth slightly touched on the issues with synthetic data. What is your experience with using synthetic data in a production-ready machine learning workflow?\n<https://www.youtube.com/watch?v=Z_zt87zhWbE&amp;ab_channel=MLOps.community>U015BH45ZK6: We heavily used synthetic data for our NER/NEL system essentially a bunch of heuristics, partially labelled data, and majority voting by other lower quality NER systems.U03142DQP6Z: <@U015BH45ZK6> Since synthetic data are not real, they contain a lot of errors which could make your model learn the wrong pattern, how do you deal with that?U015BH45ZK6: sorry I realised I meant artificially labelledU015BH45ZK6: it's real text machine labelled"}]}
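A toy sketch of the majority-voting idea described above, aggregating labels from several weak labelers over real text; the three labelling functions are invented examples:

```python
from collections import Counter

def labeler_regex(token):      # e.g. a heuristic rule
    return "ORG" if token.isupper() else "O"

def labeler_gazetteer(token):  # e.g. a lookup in a known-entity list
    return "ORG" if token in {"ACME", "Initech"} else "O"

def labeler_weak_model(token): # e.g. an older, lower-quality NER model
    return "ORG" if token.istitle() else "O"

def majority_vote(token):
    votes = [f(token) for f in (labeler_regex, labeler_gazetteer, labeler_weak_model)]
    return Counter(votes).most_common(1)[0][0]

print([(t, majority_vote(t)) for t in ["ACME", "bought", "Initech", "yesterday"]])
```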
57Is ML part of the apps and tools we buy or do companies also try to do ML in-house?{'summary_answer': 'It can be both. Companies will try to do ML in-house and also use tools that leverage ML. There is a trend towards more products with in-built ML features, but it does not cancel doing ML in-house completely for use cases that are harder to generalize or mission-critical to business.', 'citations': [{'id': '2022-01-19 17:22:02.0236 UTC', 'message': 'UP3T8K9M5: Interested to hear what others think about his stance? is this guy out of touch with the ecosystem and how ML is growing or does he have a point about ML being part of apps and tools we buy…? Personally I feel lik eit can be both, companies will try to do ML and also use tools that leverage ML. I mean thats not very bold of a statement and its pretty obvious that its already happening.U02B0QQR1C0: dependent on audience/market imo — if nick is referring to all corporate entities, irrespective of industry or it/tech competency, then it’s obviously trueU015BH45ZK6: my mental model to "doing ML" is the same as "doing frontend". swap that in his talk and reread it how it makes senseU02B0QQR1C0: same goes for company area — e.g. less likely for cx teams at most orgs to develop ml models in-house unless they are at a certain level of scaleU016SQ2GM6U: I think it’s both.\n\nI’d agree that there is a trend towards more products with in-built ML features (e.g. a large retailer used to build a custom marketing personalization system; now similar functionality is widely available in marketing automation/CRM tools and many smaller companies would not build it from scratch).\n\nI think it expands the number of companies that use ML, but does not cancel doing ML in-house completely (e.g. for the use cases that are harder to generalise / mission critical to business). And of course all these companies creating products with ML-enabled features are “doing ML” inside a lot…U01DHAF466L: I think business intelligence / data warehousing makes a good analogy. There have been big improvements in being able to buy technology, but you can\'t get away with not doing the actual work if you want great results. Algolia and stuff like Search / Recommendations as a service might be something to look at... seems like companies are still investing in building their own stuff mostly'}]}
58How do you operationalize training a model back and forth between notebooks and an automated pipeline?{'summary_answer': 'There is no one great way to operationalize training a model back and forth between notebooks and an automated pipeline. Notebooks are useful for exploring and iterating with data, but once the problem is understood and a solution is identified, it is recommended to move away from notebooks and use a more structured development process with pipelines to execute code against data. One approach is to migrate all codes to Python modules and use a config object to hold all parameters, then control inputs and hyperparameters using YAML files. Another approach is to write the pipeline in a shell script observing code quality principles, then load the data into a notebook for visualization. However, notebooks are considered temporal, and the script versioned in GitHub is considered the source of truth.', 'citations': [{'id': '2021-05-27 15:47:12.3522 UTC', 'message': 'U01G39150NA: How do you guys opertionalize training a model back and forth between notebooks and an automated pipeline? I can imagine there\'s going to be a lot of versioning issues in this case.U015BH45ZK6: We solved it with the radical idea of not using notebooks.U018RV7QPAA: There are two types of notebooks - those where you have to run cells 1-7 and then cell 2 again and only then cells 8-32....\nAnd those where you can "run all". The first kind are a problem, for the latter there are many approaches which I\'m sure others will be happy to suggest. Personally I\'m with <@U015BH45ZK6> U015BH45ZK6: What\'s the difference between a "run all" notebook and copying the code into a script and running it with python-fire?U015BH45ZK6: Oh, I know the second one is not a pain in the *** to version controlU018QM6DUTB: Overall, I don’t think there’s a _great_ way to do it. I think there’s a lot of value in notebooks, especially in the early stages of understanding your problem, and iterating with the data to gain some level of human understanding around what’s going on. But once you know how you’re going to try to solve your problem (at least initially), this is where it’s a good idea to move away from notebooks.\n\nPersonally, my development process is to explore my data and script in notebooks until I know what my data looks like and maybe run a couple of experiments for a proof-of-concept. At that point, I actually move my development to something that’s structured as a python (or whatever language) project in Git, with pipelines to execute my code against my data.\n\nBut basically, I always treat my notebooks as temporal and the script versioned in github as the source of truth. (working through some of this stuff for Pachyderm’s IDE rn)U018RV7QPAA: <@UP3T8K9M5> you wanted stuff to ask Jeremy Howard ^U01QFUGN2UQ: I call notebooks “sketches”. A co-worker recently called them “doodles”. Both encapsulate how I personally view them. They’re scratchpads that are best used to organize thoughts and experiments. Once that’s done then transition to a more rigorous development process (git, VS Code, etc).U01GDTVQSJE: I like the term EDA and keeping notebooks focused on that use case. Basically I agree with <@U01QFUGN2UQ> and <@U018QM6DUTB>U01G8F6E38T: 1. Notebook as scratch book\n2. Notebook as scratch book with a config object holding all params\n3. Migrate all codes to python modules and config object in a `config.py`\n4. 
Migrate params to YAML, keep static configs in config.py (e.g., key pairs, URIs, random seeds, logging, etc).\nAnd then train using the python script only, controlling the inputs and hyperparams using YAML files.U01G8F6E38T: Or, you can versioning notebooks as\nUntitled-1-copy-copy-2.ipynb\nUntitled-2.ipynb\nUntitled-2 (1).ipynb\netc.\nNo versioning issue, just keep all the runs in a separate notebookU01BNPM89U1: One pattern that has worked well for Netflix data science has been around using Notebooks as a way to inspect the output/intermediate state of ML pipelines - kind of like a quick dashboarding/analysis tool.U01G8F6E38T: Notebooks are great for exploring.\n\nCan\'t imagine operationalize notebooks though.U01BNPM89U1: It\'s the hidden state and non-linear execution of notebooks that causes a lot of headache while debugging them.U01BNPM89U1: But as an educational tool and a scratch pad, Notebooks are fantasticU016624NW3U: Notebooks/Jupyter are a format (ipynb) and a dev environment (jupyter notebook/lab). git versioning problems go away if you change the underlying format, you can use <https://github.com/mwouts/jupytext|jupytext> for that, it allows you to open scripts as notebooks; you can open a train.py file in jupyter, do some interactive development (which is extremely useful when dealing with a new dataset and you want to get some descriptive stats, plot some distributions, etc) Then you can orchestrate a production pipeline with those scriptsU015BH45ZK6: <@U01BNPM89U1> That\'s exactly what we do: write your pipeline in a shell script observing code quality principles (because the entire team will look at it in the future) then load the data into a notebook for visualisation. If there is something that needs to be permanently monitored move it to metabase or a plotly-dash app.U015BH45ZK6: But notebooks are strictly your eyes only not to be sharedU022CK8HXFC: <@U01BNPM89U1> btw where does Netflix use metaflow ? <https://metaflow.org/>U01BNPM89U1: <@U022CK8HXFC> Metaflow is used for all data science work - iterating on models, building/debugging/productizing training pipelines and integrating the models with the surrounding business ecosystem.U022YPYBZD0: Sharing Joel Grus\' presentation "I don\'t like notebooks" in case someone hasn\'t seen it yet: <https://docs.google.com/presentation/d/1n2RlMdmv1p25Xy5thJUhkKGvjtV-dkAIsUXP-AL4ffI/edit#slide=id.g362da58057_0_1>U01GDTVQSJE: I\'m on Joel\'s side but here\'s the other side <https://www.youtube.com/watch?v=9Q6sLbz37gk>U018RV7QPAA: You can\'t possibly put these specific links up without the third :pray: <https://www.fast.ai/2020/10/28/code-of-conduct/|https://www.fast.ai/2020/10/28/code-of-conduct/>U015BH45ZK6: If you have déjà vu like me there is a reason for this: <https://mlops-community.slack.com/archives/C015J2Y9RLM/p1614627111245500>UP3T8K9M5: should we make a shirt?\n&gt; I run notebooks in productionU015BH45ZK6: Today we run notebooks in productionU015BH45ZK6: One does not simply run notebooks in productionUP3T8K9M5: once i ran notebooks in productionU015BH45ZK6: (I can carry on until <@U018RV7QPAA> comes down here an twists the keyboard out of my hands)U018RV7QPAA: Well I already made this earlier because I want everyone to get along U018RV7QPAA: But I can definitely make an evil one :male_supervillain:U018RV7QPAA: Since it\'s the informal capacity, I can eschew PC a bit\nU018RV7QPAA: Got to stay current U018RV7QPAA: Okay too tired Last one! 
\nU01BNPM89U1: <@UP3T8K9M5> I would love to order one and frame it - <https://mlops-community.slack.com/archives/C015J2Y9RLM/p1622227717386100?thread_ts=1622130432.352200&amp;cid=C015J2Y9RLM>U024M075JBC: Unlike some others here, I think notebooks can be extremely useful for exploratory analysis and iterative prototyping (without which there may not be a good model to productionize). I foresee notebooks being used for data analysis tasks for many more years. Code from notebooks has to be refactored and integrated for production scale deployment. Naturally, if you have a trusted set of features such as from a feature store, this process of going from EDA and model prototypes to production becomes a bit simpler.'}]}
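A minimal sketch of the "params in YAML, code in Python modules" pattern several people describe above (assumes PyYAML is installed; the config keys and file name are illustrative):

```python
import argparse
import yaml
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="params.yaml")
    args = parser.parse_args()

    with open(args.config) as fh:
        cfg = yaml.safe_load(fh)          # e.g. {"C": 0.5, "max_iter": 200, "seed": 42}

    X, y = make_classification(random_state=cfg["seed"])
    model = LogisticRegression(C=cfg["C"], max_iter=cfg["max_iter"]).fit(X, y)
    print("train accuracy:", model.score(X, y))

if __name__ == "__main__":
    main()
```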
59What meeting framework or outline would be recommended for early discussions on developing better MLOps practices?{'summary_answer': 'A recommended framework or meeting outline for early discussions on developing better MLOps practices would be to focus on identifying the problems that need to be solved, estimating the effort required, and determining the value that solving the problem would deliver. This can be used to prioritize what should be solved first and lead into addressing skill gaps and assigning responsibilities. Additionally, it would be useful to decide as a group on where the organization is at along the MLOps Maturity Model and where they would like to be in the future. The SCAR framework can also be used to structure the meeting, focusing on the current situation, complications, actions, and expected results.', 'citations': [{'id': '2023-01-09 15:03:44.866739 UTC', 'message': "U0374S5MJGN: Hi all - My team is planning to have early discussions on developing better MLOps practices at our org. We will have infrastructure, engineers, and data scientists present. Is there a recommended framework or meeting outline people would recommend to help us navigate the discussion? I dont think it is worthwhile to dive into the technical aspects of MLOps in our first meeting but are there more general things are org should be thinking/discussing about in terms of team responsibilities, ml maturity level.U02MEN6HVRV: A naive solution would be to focus on:\n• What are the problems we are trying to solve?\n• How much effort will they take?\n• How much value will solving the problem deliver?\nUse that to figure what should be solved first and lead into who can / should work on what and addressing skill gapsUP3T8K9M5: this from <@U017CCHCW8K> might be useful?\n\n<https://ml-ops.org/content/mlops-stack-canvas>\n\n<https://ml-ops.org/content/phase-zero>U02KJLD4FM5: I think it’s useful to decide as a group on where you’re at along the MLOps Maturity Model (<https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning|Google> and <https://learn.microsoft.com/en-us/azure/architecture/example-scenario/mlops/mlops-maturity-model|Microsoft> have good descriptions of these maturity models) and decide on where you’d like to be in 3, 6, 12, etc monthsU02001GD6UW: Thank you <@UP3T8K9M5>. (for the record, I work for Astronomer and to be clear - I am not trying to promote Astronomer or airflow).\n\nThis webinar <https://www.astronomer.io/events/recaps/using-airflow-with-tensorflow-mlflow/|ML learning pipelines in prod with airflow> by Jeff gives a good overview of end to end ML pipeline. (first 6 mins highlights what can be automated vs. what has a human component). I like this since it also touches upon different tools.\nAfter 8 mins, he has a very interesting use case as well. You have access to the code repo as well.\n\nIMO, MLOps is just productionalizing the models you have already built and tested plus having a process to make changes to the models as your data changes and having a clear version control mechanism.U0374S5MJGN: Thank you all for the useful info!U04HBG5485P: In addition to what others have mentioned, I’d also say that leaders often forget to get buy-in from the team on the initiative and only learn much later that the team didn’t understand or didn’t want to do certain work. 
So like Kevin mentioned, you could try a retrospective-like approach to involve the team in surveying current challenges that might fall under the MLops umbrellaU015BH45ZK6: This is so generic one can't help you (the above ones will be little help for you, maybe apart from the ML Canvas). I recommend using the SCAR framework for now:\n\n• Situation: What's going on right now? Where are you?\n• Complication: Where do you need to be, and why can't you be there right now?\n• Action: What can you do right now? How? Who?\n• Results: What outcomes do you expect from the actions above? Who will verify and how the outcomes were achieved?\nBefore the meeting, I would enumerate the following:\n• Strategy: What does your company want to do, and how does ML/DS fits into it\n• Infrastructure: What resources do you have? How do you integrate with the rest of the company, options, fixed points etc. etc. \n• People Process Technology: Resources and existing assets"}]}
60How are other people handling the 'batch in disguise' problem which involves requests coming in for individual keys/inputs with a third-party marketing system that is largely outside of one's control?{'summary_answer': "Other people are handling the 'batch in disguise' problem by caching the requests, but some are also considering more clever solutions such as preloading some number of features into a cheaper storage layer based on which keys they think will be accessed. However, some suggest that an online solution may be better in the long term, especially if there are constraints on how many times certain items can be recommended or the set of recommendations that the customer gets. It is also noted that in a microservice-land, the requests are already in an online context.", 'citations': [{'id': '2021-05-30 14:01:10.4224 UTC', 'message': 'U017HFKSB0R: architecture question (I have some ideas/plans/solutions for this already, but interested in what others think):\n\nwhen looking at demands for model execution and feature-store lookup, there are a couple of simple cases of "batch" and "online" usage i.e.\n• "batch" requests for model-execution and feature-store come in for a large number of keys/inputs at once. typically used, for example, for offline training and evaluation. here, bias is towards $-efficient scaling to a large number of requests, and latency is not really an issue. also, request key-space frequency is typically flat, based on all possible requests\n• "online" request for same models/features, but coming in for individual keys/inputs in a live context where latency is important. request key-space is not highly structured, but does follow some sort of long-tail distribution i.e. some keys are expected to be asked for way more often than others. typically used for online evaluation of models.\nThere is a third combo I\'ve been thinking of recently which I am calling "batch in disguise". This can come about in systems like email marketing where a daily email is sent out to a large number of customers, based on some criteria. The model-execution / feature-store lookups for this may come when generating content for the email, perhaps via a tree of intermediate services.\n\nThis has some properties of batch in that it typically comes all at once over a large number of customers and very low latency is not an issue, but is also very online-like in that it\'s not a flat distribution.\n\nYou can totally service these via a system designed for "online" but it tends to have higher-$ cost than if you could somehow pre-evaluate the keys using a "batch" system.\n\nHow are other people handling this "batch in disguise" problem?U01CRVDS4NA: Is there a problem with literally doing it in batch?U017HFKSB0R: You don’t know the keys ahead of time as each comes in as a result of a tree of service calls.U017HFKSB0R: Btw, in terms of solutions for batch I mean things like using Spark to iterate over keys in bulk.U01CRVDS4NA: This sounds similar to a lot of workflows from my time at LinkedIn.\n\nWe generally just did it in batch. 
Although you don’t know “ahead of time” what the keys are, it sounds like the processing is async (ie no one is waiting for a response on the other end “now”).\n\nSo whatever the process to determine the keys, dump those to file(s) and then create a scoring workflow that scores those entities.U01CRVDS4NA: If your case is synchronous, that’s another beast and I would investigate if precomputing the predictions is a possibility U019ABXBYET: “Batch in disguise” could also be named “mini batch” - i.e., cache requests until a threshold is reached (where that threshold is # of examples or time-based)U017HFKSB0R: <@U01CRVDS4NA>: the key thing in my context is that “the process to determine the keys” is a few service hops away and/or in an external System.\n\nTo be a bit more concrete, a simple version of this flow is something like:\n• 1. every day, identify all users who meet some criteria e.g. “all people who have an interest in travelling to Germany”\n• 2. generate candidate items for those users e.g. “cheap hotels across all of Germany”\n• 3. rank items based on constraint e.g. “we only want to show the top 5 best choices for these users, so rank the hotels and select the top 5"\nIt is the 3rd step where this hits the systems owned by me, however it is step 1 that is where the “batchiness” is. Also step 1 may be a third-party marketing system, who’s architecture is largely outside my control.U017HFKSB0R: <@U019ABXBYET>: caching is a totally sensible strategy and is what I am doing currently, and I think it’ll take us far.\n\nHowever, thinking down the line to possible future constraints, I’m wondering if it can be a bit more clever. For example, it may be that caching the features doesn’t work if the features are sufficiently larger.\n\nThis could be seen as an ML optimisation problem in itself. If we had the ability to preload some number of features into a cheaper storage layer, based on which keys we think will be accessed then this may still deal with larger features whilst reducing $ costs.U017HFKSB0R: However, I am quite possibly over-thinking this; wouldn’t be the first time :wink:U021XGAN7ND: I think I faced a somewhat similar problem, and I am leaning towards an "online" solution:\nA batch of customer have to receive their recommendations, but there are some constraints on how many times certain items can be recommended (in total across customers) or the set of recommendations that the customer gets (at least X new items, Y recommendations, items are not too similar, etc...).\nThis created a complex pipeline where the ML part can be considered small in complexity and impact on the end result compared to the allocation rules, business rules, etc..\nIt also creates some frictions on who owns and maintains which part.\nI find this batch approach to be much more complex to maintain, debug and explain which is why I said above that I am leaning towards an online solution.\nBut the best reason to use the online approach, is that it enables to send the recommendation anytime (by example immediately when a new customer joins) instead of on a scheduled batch. 
This is probably more aligned with our long term strategy, and it fits more easily in the same template as other online models in terms of monitoring, lifecycle, etc....\nAlso "online" does not make all the business rules disappear, there is still a need for a rule-based post-processing on the ranked recommendations, but I find the whole pipeline more decoupled and more clear.\n\nNot exactly the same problem as yours, but maybe this perspective helpsU017HFKSB0R: Thanks <@U021XGAN7ND>!\n\nI think your example also illustrates a wider context that it is more typical for behaviours to be implemented as a chain of microservices. This tends to mean when you receive the request you’ll already be in microservice-land, and hence online.U021XGAN7ND: To summarize a bit what I said above in terms of attention points for decision making:\n• Longer term strategy: Do you want to be able to send the messages on-demand instead of in batches in the future? \n• Most common pattern in your organization: Do you have a lot of online and a few batch models, or the other way around? Could be worth re-using existing patterns for consistency, maintenance, etc.. \nThere are probably other attention points, but those were the first two I came up with in my case.U019ABXBYET: Btw <@U017HFKSB0R> I was thinking of caching the requests, not the features :)'}]}
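A toy sketch of the request-caching ("mini batch") idea mentioned above: buffer incoming keys and flush them to a batch scorer once a size or age threshold is reached; the thresholds and the scoring function are placeholders:

```python
import time

class MiniBatcher:
    def __init__(self, score_batch, max_size=256, max_age_s=5.0):
        self.score_batch = score_batch
        self.max_size = max_size
        self.max_age_s = max_age_s
        self.buffer = []
        self.oldest = None

    def submit(self, key):
        if not self.buffer:
            self.oldest = time.monotonic()
        self.buffer.append(key)
        if len(self.buffer) >= self.max_size or time.monotonic() - self.oldest >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.score_batch(self.buffer)   # one $-efficient bulk call instead of N online calls
            self.buffer, self.oldest = [], None

batcher = MiniBatcher(score_batch=lambda keys: print(f"scoring {len(keys)} keys"), max_size=3)
for k in ["user-1", "user-2", "user-3", "user-4"]:
    batcher.submit(k)
batcher.flush()   # flush any remainder at the end of the campaign run
```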
61What are the things to be learnt in MLops for someone new to it?{'summary_answer': 'Someone new to MLops should learn about managing and iterating on both code and data, as machine learning involves both. They should also learn about writing YAML files and model deployment. Resources such as madewithml.com and ml-ops.org can be helpful in developing mental models for MLops principles and best practices. However, while theoretical knowledge is important, the move to application in practice and production can be tedious and difficult.', 'citations': [{'id': '2022-04-18 08:56:24.520519 UTC', 'message': "U032Y1SAU11: What are things actually are to be learnt in MLops? I'm pretty new to MLops. I'm not sure where to actually begin at.UP3T8K9M5: i normally recomend the <http://madewithml.com|madewithml.com> resource but you can also check <#C01DEPTQ0F5|learning-resources> for more info. <@U025ZEJJLQ6> also has some cool repos that can help you get started on the journeyU039TRV2FFZ: How to write YAML files, for sure. :rolling_on_the_floor_laughing:\n\nBut in all seriousness, I'm new in my MLOps journey as well. Here's what I've come to realize about machine learning: it's not the same as software development.\n\nMainly because we have TWO moving pieces: code + data. So that means we need to manage and iterate on both of them.\n\nI've found this post from <https://ml-ops.org/content/mlops-principles#summary-of-mlops-principles-and-best-practices|ml-ops.org> to be pretty helpful in developing mental models for what we need to keep in mind.U0147G90NDA: I'd suggest to begin with model deploymentU012YQULW4X: <@UP3T8K9M5> something new for the community todo list:) 101 intro into the basics^^U0147G90NDA: <https://github.com/DataTalksClub/mlops-zoomcamp|https://github.com/DataTalksClub/mlops-zoomcamp>\n\nYou can also keep an eye on this oneU037QAS8S0H: I mean, personally I think some resources can exist (there is e.g. the book practical MLOps, or the informed company, which is more about DataOps imo). But in general, it quickly remains at the theoretical knowledge in those books. For someone relatively new to MLOps I feel like I know a lot of the theory by now, but the move to application of it into practice/production is the way more tedious and difficult :stuck_out_tongue: But definitely I could not do any of these things without knowing the theory exists (which was a process of a few weeks/months for sure)"}]}
62How does one deal with refactoring and maintainability in data science when it's more dynamic and iterative than software development?{'summary_answer': 'It depends on the company/product and scale, but usually, in data science, you develop an ML solution that is good enough and move on to the next problem. While you monitor the solution and possible drifts in the system surrounding it, it doesn\'t demand huge "refactoring and maintainability" efforts like in software development. However, the initial exploration phase of a data science problem may require bigger refactoring efforts.', 'citations': [{'id': '2021-09-10 05:40:59.3188 UTC', 'message': 'U01JFEFSQV7: in data science its harder to have design patterns and system design cause of the iterative nature of things so how does one deal with refactoring and maintainability?U01CDPSN29K: What fo you mean with "iterative nature of things"?U01JFEFSQV7: iterative nature meaning you train once deploy it and then monitor it and keep doing this in loopsU01CDPSN29K: But you could make the same case for vanilla software development where you develop, deploy, monitor features and their effects on user behaviour, and iterate. I am trying to understand first how you are drawing this distinction so I can understand the question\'s contextU01JFEFSQV7: True that but isn\'t data science/ml more dynamic, in sde you get the system design write the number of iterations in comparison to mle are lesser right?U01CDPSN29K: It *really* depends on the company/product and scale ofc but a lot of time you develop an ML solution that is good enough and move on to the next problem. Yeh sure… you monitor the solution and possible drifts in the system surrounding it, but usually this doesn’t demand huge “refactoring and maintainability” efforts like you mentioned before. U01CDPSN29K: I think that maybe you are referring to the initial exploration phase of a data science problem? e.g. the old “I devoted my model in a notebook, now what?”\n\nThat will indeed require bigger refactoring efforts U01CRVDS4NA: Yeah I largely don’t buy this “data science is iterative” argument. It sounds like a take written by people who have little if any SWE experience U01R217N30F: I think systems that try define behaviour based on observed data do add another layer of depth to the ‘the irony of automation’'}]}
63What tools can be recommended for a team to unify research and MLOps development teams, for quick deployment and management of different ML models with versioning?{'summary_answer': 'MLFlow, Verta.ai, Whylogs, and DVC can be recommended for a team to unify research and MLOps development teams, for quick deployment and management of different ML models with versioning. MLFlow does experiment tracking, model selection, and model deployment reasonably well, while Verta.ai tries to do everything other than model testing and Whylogs is super focused on just model monitoring/observability. DVC is a CLI tool that acts as an extension to Git and allows versioning of large files, hosting them on remote storage, and making the entire project accessible to the team, and more importantly - reproducible.', 'citations': [{'id': '2021-08-15 11:41:37.3479 UTC', 'message': "U02A7AGLX7G: Happy Sunday morning to everyone. \n\nIf your data science teams are using SageMaker and Databricks, what tools (versioning , model registry, deployment, etc.) would you leverage to enhance the team’s experience? U02A7AGLX7G: If you find the question too vague, we can start on a registry and deployment that may unify the workflow for my SageMaker and Databricks customers. Paid tools are welcome. U02AJT34VST: <@U02A7AGLX7G> I have used/researched a few tools. Specifically, MLFlow,\xa0<http://Verta.ai|Verta.ai>\xa0and Whylogs.\nI am not sure what kind of lifecycle you follow for your models. So let’s assume a Model lifecycle that looks like this -\nEDA -&gt; Experiment tracking -&gt; Model Selection -&gt; Model deployment -&gt; Model testing -&gt; Model monitoring/observability\n• <http://mlflow.org/|ML FLow>\xa0- Does Experiment tracking -&gt; Model Selection -&gt; Model deployment reasonably well. From Databricks. Development on github, Apache license. Available as SaaS. Integrates with Sagemaker\n• <http://Verta.ai|Verta.ai>\xa0-&gt; Tries to do everything other than Model testing. Looks a bit more comprehensive than the others. Based on the CEO’s research while at MIT. $10 mn funding till date. Looks proprietary. Available as SaaS\n• <https://whylabs.ai/|Whylogs>\xa0-&gt; They seem super focused on just Model monitoring/observability. Development on github, Apache license. About\xa0 4 mn funding. Dont think its available as SaaS. Integrates with MLFlowU01LJC2E7AT: Hi <@U02A7AGLX7G>,\nFollowing <@U02AJT34VST>'s great answer I'd like to add a large file tracking tool for data &amp; model versioning.\n<https://dvc.org/|DVC>, which stands for Data Version Control, is a CLI tool that acts as an extension to Git and allows you to version your large file, host them on remote storage and make the entire project accessible to the team, and more importantly - reproducible.\nDVC is completely open-source and has a great team that supports it (<@UCVNTNL13> :wink:)."}]}
64How do I integrate MLflow in ZenML for experiment tracking?{'summary_answer': 'There are resources available to help with integrating MLflow in ZenML for experiment tracking. One option is to check out a blog post by <@U029FHM5LPP> from ZenML on the topic, which can be found at <https://dev.to/zenml/how-to-improve-your-experimentation-workflows-with-mlflow-tracking-and-zenml-2dhe>. Another resource is a recent blog post by <@U02KVGKNKCK> from FuzzyLabs, which covers their experimentation with ZenML+MLFlow and includes implementation details.', 'citations': [{'id': '2022-04-13 19:33:07.335319 UTC', 'message': "U03BR541FG9: HI everybodey, I have a question:\nHow to integrate mlfow in zenml, if you have some materials that can help me on that.U01R2314L02: For serving, or experiment tracking? or something else?U03BR541FG9: yes for experiment tracking onlyU01R2314L02: There's a good blog by <@U029FHM5LPP> from ZenML on this <https://dev.to/zenml/how-to-improve-your-experimentation-workflows-with-mlflow-tracking-and-zenml-2dhe>\n\nWe wrote about it recently as well: <https://www.fuzzylabs.ai/blog-post/the-road-to-zen-running-experiments> - for our experimentation with ZenML+MLFlow, <@U02KVGKNKCK> did all the implementation, so he may be of help too.U03BR541FG9: Thank you so much Matt"}]}
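A rough sketch of the integration described in the linked posts. ZenML's API has changed between releases, so the blog posts above are the source of truth; the code assumes a recent ZenML version and an MLflow experiment tracker named "mlflow_tracker" already registered in the active stack.

```
# Rough sketch only -- assumes the active ZenML stack contains an MLflow
# experiment tracker registered as "mlflow_tracker", e.g. via:
#   zenml integration install mlflow
#   zenml experiment-tracker register mlflow_tracker --flavor=mlflow
# (exact CLI flags vary by ZenML version; check the linked guides)
import mlflow
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from zenml import pipeline, step


@step(experiment_tracker="mlflow_tracker")
def train_model() -> LogisticRegression:
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=500).fit(X, y)
    # Because the step is bound to the tracker, these calls go to the MLflow
    # tracking server configured in the ZenML stack.
    mlflow.log_param("max_iter", 500)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    return model


@pipeline
def training_pipeline():
    train_model()


if __name__ == "__main__":
    training_pipeline()
```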
65Is it possible to have multiple containers share a single GPU with AWS ECS Agent updates?{'summary_answer': "It is unclear from the context whether it is currently possible to have multiple containers share a single GPU with AWS ECS Agent updates. The person asking the question is unsure if the updates allow for circumventing the Agent's restrictions on GPU sharing. However, another person suggests using MIG or fractional GPU solutions as possible alternatives.", 'citations': [{'id': '2022-04-08 22:18:11.872579 UTC', 'message': "U02A66FPN8N: Is this still possible? I'm trying to get this to work but i'm not sure if ECS Agent updates don't allow the Agent to be circumvented (basically allowing multi containers to use a single GPU)\n<https://stackoverflow.com/questions/59687272/multiple-containers-sharing-single-gpu>U01V15WMFGT: I'm curious why would you want that?U02A66FPN8N: <@U01V15WMFGT> The reason behind this is because we have several ML application containers running. Some of them will require a gpu and unfortunately if you run a GPU based container via AWS ECS, it locks the container to the GPU. Traditionally, without ECS, you can share docker containers to 1 gpu.....but having the ECS control plane makes it difficult.\n\nOtherwise, we then have to have X clusters with a gpu for each Y ML application. Can be pretty expensive if ML containers are not 100% utilizing the GPU.U02RP3DR35G: <@U02A66FPN8N> there are solutions which allows you to share GPU between multiple containers.\nif you are using A100 or A40 you can use MIG to split the GPU for up to 7 slices.\nin any case you can also try to search for fractional GPU solutions.U02A66FPN8N: <@U02RP3DR35G> Yeah but does MIG work with the AWS ECS Agent control plane? which is the limiting factor here in sharing GPUs....\nI'll look into fractional gpu solutions, I have a feeling i'm close with modifying the ECS control plane and circumventing how AWS is provisioning gpu allocation"}]}
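For reference, the "traditionally, without ECS" behaviour mentioned in the thread looks like the sketch below: two containers pointed at the same physical GPU via NVIDIA_VISIBLE_DEVICES. This does not address the ECS agent's GPU bookkeeping, which is the actual blocker discussed above; it only illustrates the sharing mechanism. Assumes the docker Python SDK and the NVIDIA container runtime are installed; the image tag is illustrative.

```
import docker

client = docker.from_env()

for name in ("ml-app-a", "ml-app-b"):
    client.containers.run(
        "nvidia/cuda:12.2.0-base-ubuntu22.04",         # illustrative image tag
        command="nvidia-smi -L",
        name=name,
        detach=True,
        runtime="nvidia",                              # classic NVIDIA runtime path
        environment={"NVIDIA_VISIBLE_DEVICES": "0"},   # both containers see GPU 0
        auto_remove=True,
    )
```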
66What decision framework can be used to choose a specific ML tool from the countless options available?{'summary_answer': 'The recommended decision framework for choosing a specific ML tool from the countless options available includes two phases. In Phase 1, a short list of potential candidates (2 to 5 items) can be created by reviewing product features and online reviews. In Phase 2, each candidate can be evaluated during a PoC to assess their pros and cons. Before each phase, a list of criteria should be created to evaluate the solutions objectively, and end users should be involved to confirm the criteria and weigh their importance. It is also important to target the real use case and not go for whatever is the trendiest tool right now.', 'citations': [{'id': '2022-10-03 05:41:23.751419 UTC', 'message': 'U01JFEFSQV7: There are countless tools in the ML space, what is the decision framework template when choosing for a specific tool.U01J0NVNE1G: I recommend having 2 phases:\n• Phase 1: Create a short list of potential candidates (e.g. 2 to 5 items). This can be reviewed just by looking at the product features and online reviews (low time investment, but a lot of solutions).\n• Phase 2: Evaluate each candidate during a PoC to assess their pros and cons (high time investment, but fewer solutions).\nBefore each phase, you should come up with a list of criteria to evaluate the solutions objectively. It\'s better if you can involve end users to confirm the criteria and weigh their importance.\n\nAs general feedback, each time I went into a PoC I changed my mind 180 degrees on the best solution. There is some aspect you can only detect if you try a solution on a concrete use case.\n\nAlternatively, you can also say that the solution is always nice when it\'s showcased by a vendor :wink:U03NPJQCNJ1: &gt; As general feedback, each time I went into a PoC I changed my mind 180 degrees on the best solution. There is some aspect you can only detect if you try a solution on a concrete use case.\nSame experience hereU01HRT0R69M: <@U0187JQPZ1D> wrote a nice blog post about this which gives you some nice pointers hat you might want to consider, <https://dagshub.com/blog/solve-your-mlops-problems-with-an-open-source-data-science-stack/>\n\nI do not necessarily agree with having everything open source all the time and I think sometimes it makes sense to buy a tool to save on maintenance and staff. But it\'s highly dependent on your situation. But regardless I think Dean gives a bunch of great tips that can help you out.\n\nIn the end I think it\'s crucial to stick to _what you need for your usecase_ and not to go for whatever is the trendiest tool right now. I also agree with the approach mentioned above by <@U01J0NVNE1G>UP3T8K9M5: I was kinda asking <@U03ALCQQF8F> this a few months back when him and <@U015MM44NLW> talked about their stack. they might have some good insightsU01J0NVNE1G: I agree with <@U01HRT0R69M> about targetting the real use case. It\'s easy to deploy the most popular tool that you find on Medium, but often your users will have different needs. At least 40% of the target criteria/features come from the interaction with the user, and it will make it easier for you to onboard them sooner than later.\n\nI think you can by-pass this approach in a context where mature solutions are well-established (e.g., Spark), but MLOps is not at this stage yet :slightly_smiling_face:U015MM44NLW: Thanks for the ping <@UP3T8K9M5>! As we go through the evaluation process we start with "Why". 
Meaning, we identify who we are trying to empower, what areas of complexity this can potentially smooth out or reduce, and is this an immediate need or in preparation for what\'s coming in the future?\n\nWe use the above as a litmus test before beginning evaluations. But even before we begin them, we align it to use cases, gaps, opportunities, and create a rubric for the part of the MLOps pipeline that we are trying to solve for and use that as part of our evaluation. \n\nMy personal take, is that you ALWAYS have at least 3-4 options so that your team learns from different vendors/partners while making sure you get what you need and weigh the pros and cons of each appropriately. And finally, understand that just because you are looking at 3-4, if none of them fit the bill, then be OK with starting over again.\n\nI was encouraged to write a detailed article by a familiar name on this topic in particular, which I am working on. Stay tuned!'}]}
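A toy illustration of the "criteria + weights" rubric described above, with extra weight on use-case fit (echoing the point that a large share of the criteria should come from end users). Criteria names, weights, and scores are made up; in practice they come from user interviews and the Phase 2 PoC.

```
# Weighted-criteria scoring for a tool shortlist; all values are placeholders.
weights = {"fits_use_case": 0.4, "ease_of_adoption": 0.2,
           "maintenance_cost": 0.2, "community_support": 0.2}

# 1-5 scores per candidate, e.g. collected during the PoC phase.
scores = {
    "tool_a": {"fits_use_case": 4, "ease_of_adoption": 3,
               "maintenance_cost": 4, "community_support": 5},
    "tool_b": {"fits_use_case": 5, "ease_of_adoption": 2,
               "maintenance_cost": 3, "community_support": 3},
}

ranking = sorted(
    ((sum(weights[c] * s[c] for c in weights), tool) for tool, s in scores.items()),
    reverse=True,
)
for total, tool in ranking:
    print(f"{tool}: {total:.2f}")
```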
67What are some good resources with examples, preferably with diagrams, that show how other companies have structured their data architecture and task orchestration ecosystems?{'summary_answer': 'Suggested resources include the <https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat> repository and the book "The Informed Company" (<https://www.amazon.com/Informed-Company-Cloud-Based-Explore-Understand/dp/1119748003>), which is more focused on data engineering and analytics than on ML engineering. One respondent also mentioned using an ELT pattern with Prefect for orchestration. The rest of the discussion covers mono-repo versus multi-repo trade-offs and how to handle versioning of models and modules within a single repository.', 'citations': [{'id': '2022-03-17 16:35:25.842619 UTC', 'message': 'U037QAS8S0H: Hi everyone,\nWe are a data-driven startup invested in Data Science, I recently joined and am currently pushing to refactor/rebuild certain aspects of our ecosystem. This was because for an outsider it was very easy for me to realize how obscure and complicated certain things were setup, with a single repository often referred to as \'the pipelines\' containing preprocessing tasks, ETL, classifiers etc. and these pipelines were run with cron jobs on a Linux server.\nProbably most people here would see the pitfall in that, but now that I am trying to think about how the architecture/ecosystem should look like, I am running into a 1001 different issues/topics to tackle. For example, we are considering Argo/Prefect/Airflow/Flyte, and also looking into how our data stores should be structured, where we currently use a combination of PostgreSQL and ElasticSearch.\nWe are considering to have an (insert-only) data-lake with some kind of materialized views, a separate model registry/repository (or even deploying models as containerized apps or something along those lines), and having all other preprocessing/data loading/utility code in separate packaged repositories.\n\nSo now my main question is, _*does anyone have any good resources with examples (preferrably diagrams as well) of how other companies have structured their data architecture and task orchestration ecosystems?*_ Because it is very hard to figure out this stuff sometimes and easy to lose the forest for the trees in this huge MLOps landscape.\n\nThanks in advance!U01DHAF466L: :stuck_out_tongue: :stuck_out_tongue: :stuck_out_tongue:U037QAS8S0H: Haha so you are saying just use cronjobs :joy:? I mean I did realize at some point I need to be careful to not overdo it, but why build something if there is plenty of OSS solutions for it right xD? I would say \'buy-or-build\' debate but as far as I could tell, except for overhead computation costs, none of Argo/Airflow etc. costs anything anywayU01DHAF466L: Haha in all seriousness though, I really like this take on it: <https://github.com/jacopotagliabue/you-dont-need-a-bigger-boat>U01DHAF466L: We use ELT pattern at $JOB, and have used Prefect for orchestration in the pastU01DHAF466L: This is a great book, more focused on data engineering and analytics not so much on MLE: <https://www.amazon.com/Informed-Company-Cloud-Based-Explore-Understand/dp/1119748003>U015BH45ZK6: We specialise in refactoring DS/ML systems, but I don\'t have anything public yet on my blog. Get in touch if you are interested in a chatU02NR6C4M29: I would like to know here pros and cons of that mono repo, I\'m thinking lately if doing that for prefect tasks or notU02NR6C4M29: I guess you just have too many things in 1 place, although is not uncommon to have mono repos I believeU015BH45ZK6: I don\'t think having one or many repos is the root cause of the problem. 
You can perfectly structure your stuff in a single repo.U037QAS8S0H: But how do you deal with versioning of logically independent modules within one repository? I think for visual grouping of such related code it makes sense to use things like groups in GitLab imo, but Im not sureU015BH45ZK6: can you clarify why do you need versioning? Do you need multiple versions exist at the same time because of an external customer?U015BH45ZK6: or temporarily because of let say AB testingU015BH45ZK6: or just normal developementU037QAS8S0H: Mostly normal development, but also for example if a specific model or some functionality unintentionally regresses in practice/production and hence needs to be reverted to an older versionU037QAS8S0H: But that last part I think is specific to models mostly. So that would go along the idea of having models serving predictions as a sort of standalone \'app\'U015BH45ZK6: If you have "external" customers then you are better off pypi style packaging in a different repo but that should be relatively rare.\nIf you update models then they are different classes/objects so you need to write them that way. This is because the "regression" can happen after a long time so you need to maintain both at production level (kinda like two legs of an AB test)\nStandard dev (smallish changes) are done with gitflow/branchingU01G8F6E38T: You can set and revert model versioning using config files, decoupling software from model version.\n\nOne other way is using other build systems like pants or use something like git submodules. '}]}
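Since Prefect comes up both as an orchestration candidate and in the replies, here is a minimal sketch of moving one of the cron-style "pipelines" into an orchestrated flow. It assumes Prefect 2.x; the task bodies are placeholders for the actual PostgreSQL/ElasticSearch and data-lake steps.

```
from prefect import flow, task


@task(retries=2)
def extract() -> list[dict]:
    # e.g. pull rows from PostgreSQL / ElasticSearch
    return [{"id": 1, "value": 42}]


@task
def transform(rows: list[dict]) -> list[dict]:
    return [{**r, "value": r["value"] * 2} for r in rows]


@task
def load(rows: list[dict]) -> None:
    # e.g. append to the insert-only data lake
    print(f"loaded {len(rows)} rows")


@flow(log_prints=True)
def nightly_pipeline():
    load(transform(extract()))


if __name__ == "__main__":
    nightly_pipeline()
```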
68Are RNNs and LSTMs still used for NLP tasks or has Transformers taken over? Are these models productionized and how much resources are required monetarily?{'summary_answer': 'For NLP tasks beyond basic ones, pre-trained Transformers are usually fine-tuned. However, for other sequence tasks that cannot use pre-trained models, LSTMs are still commonly used. RNNs are not preferred over LSTMs. The models mentioned are productionized, and the monetary resources required depend on the specific use case.', 'citations': [{'id': '2021-01-16 06:41:21.0161 UTC', 'message': "U01JFEFSQV7: I am not from NLP domain but I wanted to know are RNN's, LSTM's used nowadays or is Transformers used for everything? The main reason why I asked is are these model's productionized and how much of resources do these models take in terms monetary value?U01EPRA9PLL: For NLP specifically (beyond the super basic tasks), because of how good pre-trained models are, we’ll usually start off with fine-tuning a transformerU01EPRA9PLL: But for other sequence tasks that can’t leverage a pre-trained model, LSTMs are usually what you want to start with (assuming you can/want to use DL of course). Don’t know anyone that prefers RNN over LSTMU017QUF7Z8U: If you want some hands-on examples, have a look at <https://huggingface.co/models> and the corresponding `transformers` library, which comes with some nice examples: <https://huggingface.co/transformers/task_summary.html>"}]}
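A quick illustration of the "start from a pre-trained transformer" advice, using the `transformers` library linked in the thread. The first call downloads a default pre-trained model; the printed output shown in the comment is an example of the typical result format.

```
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Deploying this model was surprisingly painless."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```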
69Is it possible to get two metrics from two runs into the same plot in Neptune?{'summary_answer': 'The question is not directly answered in the context.', 'citations': [{'id': '2023-04-21 09:12:21.809589 UTC', 'message': "U03C4BQLMME: Question for the Neptune people:\nTwo metrics for two runs in the same plot? Is it possible?\n\nI can get two metrics (eg. train and val loss) from the same run into the same plot, and I can also get one metric for two different runs into the same plot, but I can’t figure out how to get two metrics from two runs into the same plot. Is this not possible in Neptune?U03QX52GF5G: The max_epochs should be available in the namespace `training/fit_params/epochs`. You can add this field to the runs table to see the max epochs the model was trained for.\nThe current epoch number is not captured directly. However, since the training metrics are captured and updated at the end of each epoch, and stdout is captured in realtime, checking the run charts or `monitoring/stdout`will tell you which epoch is currently running.\nIf you want to see the current epoch number in the runs table, you will have to log it manually to a field and then add that field to the runs table.\nHere is our Keras integration guide for reference: <https://docs.neptune.ai/integrations/keras/>\nPlease let me know if this helps.\n\nI also noticed from your import statement that you might be using `neptune-client`&lt;1.0. I'd recommend upgrading to v1.x. Here is a guide that would help you upgrade if you choose to: <https://docs.neptune.ai/setup/neptune-client_1-0_release_changes/>"}]}
70What are the other options available to migrate from training on demand architecture, and can Celery be used to build task queues instead of handling requests on gRPC server threadpool? Also, what are the ways to handle GIL issues?{'summary_answer': 'The other options available to migrate from training on demand architecture are a queue system or designing a tool that adds training data to the database layer and polling this layer instead of training whenever a request is received. Celery can be used to build task queues that dispatch the training request to workers instead of handling requests on the gRPC server threadpool. GIL issues may be causing bottlenecks, and one way to handle them is to sidecar the extra pods needed or create a master pod that creates the pod workers.', 'citations': [{'id': '2023-02-02 17:48:51.385929 UTC', 'message': 'U04MT8A099Q: Hey all :slightly_smiling_face: We are currently facing issues when training models on pods where the labels and feature payloads are coming from gRPC requests, and are then trained on demand.\n\nWe want to migrate to another architecture here as have been finding training on demand to overload the pod on occasion. What are other options we have here? We are thinking either a queue system or designing a tool that adds training data to the database layer and polling this layer instead of training whenever we get a request - our pods have a number of services alongside training here, and overloading can degrade their performance.\n\nDo people have any advice, blogs, libraries etc. they’ve used to achieve things like adding queues for pods? We’re open to ideas, our stack is python kubernetes (ps we don’t want to use pod autoscaler)U04HKD72GH2: This is an interesting question. However, questions like this also don\'t have perfect answers. I can say when it comes to advanced queueing, your best bet is Slurm, not Kubernetes. The scheduling ability of HPC schedulers is in a different league than Kubernetes.\n\nYou could implement the training pipeline in a more traditional framework and keep your support services in K8s.\n\n"Compared to Slurm, Kubernetes enables you to better manage cloud-native technologies and container-based microservices which can be scaled more dynamically than traditional applications.\n\nIn summary: Slurm works well for batch workloads and optimizing resource usage based on a queue of jobs. Kubernetes is meant for use in ‘modern’ cloud-native microservice architectures. It optimizes scheduling and scaling live services based on available resources. So, two quite different worlds."\n<https://ubiops.com/slurm-for-running-ai-and-ml-workloads/#:~:text=In%20summary%3A%20Slurm%20works%20well,So%2C%20two%20quite%20different%20worlds|link>.U04MT8A099Q: Yeah great suggestion <@U04HKD72GH2> and thanks for the reply. \n\nWe’re a bit limited here in that we supply essentially a helm chart to our client to run on prem so think switching out the architecture to slurm might be a stretch too far here unfortunately. \n\nWe’ve been looking at celery to build task queues that dispatch the training request to workers as opposed to handling requests in the gRPC server threadpool (GIL issues may be causing bottlenecks for us here). Do you have any opinion on celery? U04HKD72GH2: Not really, we use it with Airflow, but its never been an issue, so I\'ve never really thought about it.\n\nAre these simple models you\'re training? I couldn\'t have this issue because each one of my training jobs is its own pod. 
So if my cluster is smaller than the requests, the pods hang out in the pending state until a spot opens up with enough cpu/ram/gpu.\n\nI regularly queue thousands to tens of thousands of jobs on our cluster and let K8s handle it. If each pod has dependencies, you could sidecar the extra pods needed.\n\nIf this is on-prem with a lower security requirement, you could also create a master pod that creates the pod workers. I\'ve done this before, and it works well, except for the security risk of a pod with lots of permissions.'}]}
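A hedged sketch of the Celery option discussed above: the gRPC handler only enqueues work, and separate worker processes (or pods) pull training jobs off the queue, keeping heavy work out of the gRPC server's threadpool. The broker URL, task body, and module name are placeholders.

```
# Assumes this module is saved as training_tasks.py and a Redis broker is
# reachable at the URLs below (both are placeholders).
from celery import Celery

app = Celery("training", broker="redis://redis:6379/0",
             backend="redis://redis:6379/1")


@app.task(bind=True, max_retries=3, acks_late=True)
def train_model(self, features: list, labels: list) -> str:
    # CPU/GPU-bound training runs in the worker process, not in the gRPC
    # server, which avoids the GIL contention mentioned above.
    ...
    return "model-artifact-uri"


# Inside the gRPC servicer, instead of training inline:
#   train_model.delay(request_features, request_labels)
# Workers are started separately, e.g.:
#   celery -A training_tasks worker --concurrency=2
```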
71What are the recommended experiment tracking libraries for a simple application with input parameters, and tie it to a resulting model, and is MLFlow the best option available?{'summary_answer': 'MLFlow is recommended as the best option for experiment tracking in a simple application with input parameters and tying it to a resulting model. Other options mentioned include Sacred, Weights & Biases, and ClearML.', 'citations': [{'id': '2021-10-22 06:59:16.0269 UTC', 'message': 'U01R217N30F: Hi all, looking for a very simple experiment tracking library. Is MLFlow the best option here? or there is something more lightweight? I basically just need to track input parameters/dataset and tie it to a resulting modelU018RV7QPAA: If you\'re working alone I hear that sacred is the lightest. No experience with it myself, I obviously have a favorite:wink:U02AVFU2Q6B: I\'ve used MLFlow personally just to try it out locally and found it quite easy and we\'ve set it up in our team in the past to run on Kubernetes (i.e. not very simple :wink:)U01J0NVNE1G: I would recommend MLFlow. The solution is light enough and productive in my opinion. This is the most familiar and popular option is you want to collaborate on your project.U01J0NVNE1G: The only cons are the additional management and lack of authentication/workspace if you want to scale this solution.U020ZPEHESY: For experiment tracking I don’t think it is something more easy than MLFlow it is just a pip install away :slightly_smiling_face: If you want also to add model registry then. you need to integrated with an SQL like storage . …. If you want this to run for other data scientists in your team so as to share experiments with them , then install the tracking server in a machine and set up the common tracking uri inside the training scripts ( Ok to set it up for a team it may take a couple of days depending your infra, complexity may scale depending your access policies on the experiments but nothing too scary there to handle )U01VDVCCDK3: +1 for MLflow. U01N8ERHH9T: I use mlflow, but I would actually say the most lightweight would be to log everything to a file that gets synced to S3 and/or a database.U01N8ERHH9T: You would have no graphical interface to compare experiments so you\'d be figuring that out yourself, but that could easily be a notebook or tableau or something good at that stuff.U01R217N30F: thanks all, sorry for the slow reply: MLFlow it is!U024SSR9T5Z: For experiments that are only done by a single person, I like Weights &amp; Biases. Their web app is great, the client library is really simple and has nice integrations with common ML frameworks, and its free for single person projects.U024F704Z24: Alternative is to write everything to a set of tables in your data warehouse. But yu need to think about the schema and in a way you\'d be implementing mlflow-tracking yourself. Technically not difficult, but it\' s the cruft around it that will eat up your time (everyone has to use the same keywords, tables etc.)U01RUQTSPU6: The lightest solutions are those available in a "as-a-service" manner like W&amp;B, Neptune, Comet, ClearML. All you need for the setup is API key and 1 pip install.U018RV7QPAA: <@U01RUQTSPU6> Didn\'t you forget a certain open source solution that is better than all three combined :wink: ?U01RUQTSPU6: Uh, I still had no occassion to give it a try, but yes, ClearML is definitely another option! 
:slightly_smiling_face:U018RV7QPAA: 2nd favorite open source for tracking after mlflow in the kaggle survey :wink:U01J0NVNE1G: I recently tried NNI for combining experiment tracking, automl, and distributed workflows. It\'s mostly during development, but I found it really convenient <https://github.com/microsoft/nni>'}]}
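For the "most lightweight" end of the spectrum mentioned in the thread (log everything to a file that gets synced to S3), a minimal append-only JSON-lines logger is enough to tie input parameters to a resulting model. Paths and field names are examples.

```
import json
import time
from pathlib import Path

LOG_FILE = Path("experiments.jsonl")


def log_run(params: dict, metrics: dict, model_path: str) -> None:
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "params": params,
        "metrics": metrics,
        "model_path": model_path,
    }
    with LOG_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")


log_run({"lr": 0.01, "max_depth": 6}, {"rmse": 0.42}, "s3://models/run-0042.pkl")
```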
72What do people use for renting cloud GPUs?{'summary_answer': 'People use big cloud providers like AWS and GCP, as well as Paperspace and LambdaLabs for renting cloud GPUs.', 'citations': [{'id': '2023-05-16 23:22:04.332479 UTC', 'message': "U0533K5C0E5: What do folks here use for renting cloud GPUs?U04UV1SC5ML: I've used big cloud providers (AWS, GCP) and Paperspace and LambdaLabs without many issuesU01CRVDS4NA: I just built some demos with it. The developer experience was super good IMO."}]}
73What is Vertex Feature Store and is it based on Feast?{'summary_answer': 'Vertex Feature Store is a feature store offered by Google. It is not based on Feast, but is built entirely within Google.', 'citations': [{'id': '2022-01-27 16:29:25.4445 UTC', 'message': 'U0262CGLYS0: Who is it that runs <https://www.featurestorecomparison.com/> ?U015DM4NM1U: wow looks like a competitor of <https://mlops.community/learn/feature-store/>U01994M42BF: HopsworksU0262CGLYS0: I thought Hopsworks was doing <https://www.featurestore.org/>. Did I get them backwards? (<@U01994M42BF>)U0262CGLYS0: Of the 3 main feature store comparison websites out there, mlops community is currently #3 in coverage <@U015DM4NM1U>UP3T8K9M5: dang i gotta bring up our seo… after all the work i did on our compare pageU0262CGLYS0: Your work really shows thru in the quality. While the coverage isn\'t all there (Vertex is obviously missing for example) I do think it\'s the best of the 3.UP3T8K9M5: thank you <@U0262CGLYS0> i tried to get in touch with the google team to put something up about vertex since they announced it but havent been able to get anything from them. cc <@U0134LJPWQ3>\n\nmaybe the fuzzy labs crew could help me put something together on them. <@U01TAGE0JKG> <@U01R2314L02>?U01R2314L02: Thanks for the mention <@UP3T8K9M5>. We did test out Vertex a couple of months back. Things might have moved on since then. In any case I\'m sure we can offer a hand here.U0262CGLYS0: One of my main questions is whether Vertex Feature Store is based on Feast, since that was a collaboration between Google &amp; GoJek (which also had Kubeflow integration), or whether they quietly built something new to power Vertex Feature Store?U02FKUD2XAB: <@U0262CGLYS0> <@UP3T8K9M5> Friends, I talk to Rand Xie (<https://github.com/randxie>) about this question, who was founding engineer of Vertex Feature Store.\n\nShort answer: No, Vertex Feature Store is not based on Feast. All built inside Google. Feature Serving layer done in C++.UP3T8K9M5: <@U02FKUD2XAB> do you think you could put me in touch with Rand? I want to get vertex feature store on the MLOps website!U02FKUD2XAB: <@UP3T8K9M5> let me tryU0262CGLYS0: <@U02FKUD2XAB> That was my suspicion, since no where does it say it\'s Feast. But they also can\'t say it\'s *not* Feast, or they would be undermining the Feast project which they were partnered with GoJek on (now run by Tecton ofc.)\n\nThe problem is that they are caught in between, with the result being they can\'t say *anything* so there\'s a lack of transparency around where Feast fell short, or what they built to supersede it.U02FKUD2XAB: <@U0262CGLYS0> partnership between gcp and feast doesn’t mean that vertex should be based on feast …U02FKUD2XAB: <@U0262CGLYS0> and ex-vertex engineer has told me clearly that vertex was not based on feast, since vertex predated tecton :shrug: so i don’t think it is that complicated to tellU0262CGLYS0: sure it does if Feast is suitable, which it "should" be, but clearly wasn\'tU0262CGLYS0: Vertex didn\'t predate Feast thoU0262CGLYS0: did someone say it was complicated?UP3T8K9M5: I think just about everything in this space is complicated :see_no_evil: <@U0262CGLYS0>U0262CGLYS0: Couldn\'t agree more <@UP3T8K9M5>U0262CGLYS0: Our job is seeing through complication to fundamental complexities, and then operationalizing solutions.U011D84SFFZ: I’m not sure who was first matters for anything. But certainly Feast was around before Vertex.\n\nAlso I wouldn’t trust any of those non-MLOpsC comparison sites. 
They’re not exactly independent.U01V3JDJAQ6: Sometimes, big tech companies build their own thing (even several different versions of a thing) because they\'ve got lots of eager engineers. :smile:\n\nI\'m sure there are more recent examples, but back in the day, Microsoft rolled out a new data access standard about every 2 years.\n\nI would love to know more about Vertex AI Feature Store. We\'re using GCP, so it\'s a natural, easy choice. I can\'t help but wonder, though, is this a "me too" offering from Google, or will they try and make it a competitive offering?'}]}
74Is there a model registry that can accommodate models with a combination of weights, architecture, and code for pre- and post-processing? How can MLFLow artifacts be utilized with the MLFlow model registry?{'summary_answer': 'The person asking the question mentions that they have not seen a model registry that can handle models with a combination of weights, architecture, and code for pre- and post-processing. Another person suggests using MLFlow artifacts along with the MLFlow model registry to track model metrics, plots, auto logging, and code. The code and model must go together, either as artifacts or using Docker. Dependencies can be specified on the MLmodel file in MLFlow, or a Docker image can be built with all the necessary dependencies. Alternatively, pre-built images can be used, but this may not be very flexible.', 'citations': [{'id': '2022-11-26 11:29:41.835219 UTC', 'message': 'U030DQNH89M: All of the model registries that I’ve seen assume your model can be nicely serialized as weights + architecture. Anybody know of a registry that deals with models that are a combination of weights, architecture, and some additional code such as pre- and post- processing code?U019ABXBYET: I guess that model registries need to draw the line somewhere, and sometimes that code will be different based on where/how you’re running it (eg reading data out of a data warehouse vs consuming from a stream). \n\nHow would this work ideally for you?U02NR6C4M29: We do add code as artifacts with mlflowU030DQNH89M: <@U019ABXBYET>: Very true, in my case, the code is always the same. In other words there is a consistent interface which takes a model as input and spits back a series of outputs, but inside the implementation, for a specific model (/version) is both code, which is frozen for that model, and its weights.U030DQNH89M: <@U02NR6C4M29>: Thanks for the tip! How might I use MLFlow artifacts along with the MLFlow model registry?U02NR6C4M29: this is an example of how we use it\n\nwe track the model metrics, plots, auto logging from mlflow, and code (the seldon folder) together. then you can register that run as a model, everything (all artifacts) are on the model registry (the model registry is just a pointer to a experiment run)U02NR6C4M29: we deploy the “seldon” folder through some internal tool, that contains the model binary and any other file it depends onU02NR6C4M29: “which is frozen for that model, and its weights.” that’s the intention behind this, code and model must go together (either like this as artifacts or maybe using docker)U030DQNH89M: Got it, that makes sense. How do you deal with your environment changing and dependencies? For example if you change your python version or library versions (numpy, etc)?U02NR6C4M29: So in general you have this options\n\n1. specify dependencies on the model (that goes on the MLmodel file in mlflow), whenever the model is loaded the dependencies are installed on the fly =&gt; slow start time, easier implementation/maintenance\n2. build a docker image with all you need every new version =&gt; complex setup, faster start time\n3. use some prebuild image (that’s what we do mostly, with mlserver images) and stick to those packages versions during development =&gt; easy, faster, not very flexibleU02NR6C4M29: All have pros and cons, so you have to think what works best for your use case / infrastructure'}]}
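A hedged sketch of the MLflow-artifacts approach described above: wrap the weights plus the frozen pre- and post-processing code in an `mlflow.pyfunc` model so that everything is versioned together in the registry. The weights file, transforms, and model name are examples, and registration assumes a database-backed tracking server.

```
import joblib
import mlflow
import mlflow.pyfunc


class WrappedModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        # the "weights" file is logged as an artifact alongside this class
        self.model = joblib.load(context.artifacts["weights"])

    def predict(self, context, model_input):
        x = self._preprocess(model_input)                # frozen pre-processing
        return self._postprocess(self.model.predict(x))  # frozen post-processing

    def _preprocess(self, df):
        return df.fillna(0)

    def _postprocess(self, y):
        return y.clip(min=0)


with mlflow.start_run():
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=WrappedModel(),
        artifacts={"weights": "weights.joblib"},   # local file assumed to exist
        registered_model_name="wrapped-model",     # needs a DB-backed registry
    )
```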
75What is the best practice for running FastAPI services in k8s/ECS/etc and how should uvicorn workers and containers be configured for optimal performance?{'summary_answer': 'The best practice for running FastAPI services in k8s/ECS/etc is to have one container per "atomic" process for clear separation of responsibility and easy scaling and management. Uvicorn workers should be matched to the number of CPU processes available. Python (specifically CPython) has GIL and can only use one system thread (vCPU) at most, so scaling one container for more than 1 vCPU has no sense, except for models using libraries with high performance multithread processing code. It is recommended to set real CPU requests instead of CPU limits to avoid potential problems with health checks and application metrics ingestion. In terms of container configuration, it is generally better to use fewer but larger containers (e.g. 2-4 CPU, 4-8 GB RAM) as smaller containers tend to sometimes run on underperforming CPUs and FastAPI can\'t take advantage of them. Uvicorn should be set to 1 or 2 workers for easier scaling at the k8s pod and ECS task level. However, the optimal configuration may vary depending on the specific use case and traffic patterns.', 'citations': [{'id': '2022-08-23 17:52:23.375788 UTC', 'message': "U01N8ERHH9T: For people running FastAPI services in k8s/ECS/etc., do you tend to use lots of small containers (e.g. 0.5 vCPU) or fewer but larger containers (e.g. &gt;= 2 vCPU)? I assume you probably match the number of uvicorn workers to the number of CPU processes available? I read through <https://fastapi.tiangolo.com/deployment/docker/> and it seems to suggest using 1 uvicorn worker per container and letting k8s/etc handle replication, but from my load testing it seems like I might get a little better performance using multiple workers on a slightly bigger container. For example, instead of running 12 small containers of 0.5 vCPU each, running 3 containers with 2 vCPU each. I’m not sure it makes a difference on costs since we running k8s on pretty large instance types and using fractional shares (millicpu).U03V4932SQ1: According to good practices there should be one container per “atomic” process. It gives clear separation in responsibility of the system parts - k8s is responsible for orchestration, container process needs only to handle the traffic. This model is transparent, easy to scale, manage… in general - control. On the other side, consider potential problems with health checks (liveness, readiness probes), application metrics ingestion, etc in multiprocess container environment.\n\nAbout performance issue: The thing that you probably faced is some kind of bug or weird behaviour of linux kernel `cgroups` (used heavily in containerisation) when you set CPU limits. I don’t want to elaborate too much here about that right now, but you can find tons of information searching for “kubernetes cpu limit throttling”. In short, try without setting CPU limits, but with setting real CPU requests. Don’t wory about that - CPU limits/requests behaves differently than memory limits/requests. Simply it’s different matter.\n\nAbout Python: Python (precisely most widely used CPython) has GIL and in practice can use only one system thread (vCPU in general) at most. 
So scaling one container for more than 1 vCPU has no sense.\nException: If you are deploying some model and use libraries that under the hood has some high performance multithread processing code (like `mxnet` with `intel-mkl`), then this rule not apply.U01N8ERHH9T: Thanks. I get all of that, but it doesn't seem to match real world load testingU01N8ERHH9T: I guess I'm debating whether in theory is more important than experiment resultsU01N8ERHH9T: For reference, we have an internal service scorecard that gets reviewed regularly as part of an engineering excellence practice. One of the scorecard checks is defining service CPU resources and limits. So I can try removing the CPU limit, but I need to set it to something or get dinged on our scorecard. Yes, I know that it's just vanity metrics and not indicative of actual engineering excellence.U031F7D87RA: This is a great question! I don't have the answer, but maybe I can offer some additional thoughts. If you're scaling fastapi by creating additional uvicorn workers, I'm pretty sure they will create a process per worker (thus getting around the python GIL issue), so I think you should have fairly decent performance. Additional performance bottlenecks may be due to nuances of your use case.\n\nI have to imagine that there's a little bit of overhead if you're introducing a container per process. Just to run the os for each of these containers has got to be a non-zero compute task. I guess if you're seeing a performance difference in orders of magnitude I'd be a little surprised, but I would think a single container running multiple processes is probably faster because there's less in the way to running the inference.\n\nThat said, I'd think it's a little easier to reason about a container per process and scale that at a higher level.U03CSQA1KF1: Disclaimer: I run stuff on Cloud Run but maybe it appliies. I personally use fewer but larger e..g 2-4 CPU 4-8 GB RAM containers for 3 reasons 1. smaller containers tend to sometimes run on CPUs which under perform 2. FastAPI can't really take advantage 3. Compute up time vs. cost vs. latency considerations skew towards running larger instances (I run object detection and classification models for 2-5s per image/user requirements)U01N8ERHH9T: ^ that’s been more or less my experience as well!U01N8ERHH9T: fwiw for our API we have the following chart\n```environments:\n production:\n deployment:\n resources:\n cpu: 2000m\n memory: 2000Mi\n\n replicas:\n dynamic:\n min: 3\n max: 12\n targets:\n cpuPercentage: 65```\nSo I don’t think we’re doing CPU limits now anywayU01N8ERHH9T: with 2000m CPU (or 2 vCPU since we’re on AWS), we have uvicorn set at 2 workers and getting noticeable throughput and latency benefits. It’s not night and day, but I do think it makes scaling to our traffic easier. Our traffic kind of looks like this:\n``` o \n o \n o \n o o \n o o \n o o o \n o o o o \n o o o o o o o o o o o o o o \no o o o o o o o o o o o o o o o o o o o o o o o \no o o o o o o o o o o o o o o o o o o o o o o o o```\nwe get these big spikes at irregular intervals. they are not always from things we are driving, but often from other companies launching new products and then people coming to our secondary market site to get them. 
When we suddenly get another 1k requests per minute for 5 minutes, it seems to work a bit better to scale up with fewer slightly bigger pods than more slightly smaller pods.U01P75V5QGM: Even though our load testing sometimes indicates that we would get slightly better performance with a higher number of uvicorn workers we find that it’s easier to scale at the k8s pod and ECS task level if we have uvicorn set to 1 or 2 workers. And yes, it does create a separate process for each worker."}]}
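A minimal FastAPI app run with the "1-2 uvicorn workers per slightly larger container" configuration discussed above. CPU/memory requests still live in the k8s or ECS manifests; this only shows the process side, and the module name is assumed to be main.py.

```
import uvicorn
from fastapi import FastAPI

app = FastAPI()


@app.get("/healthz")
def healthz() -> dict:
    return {"status": "ok"}


if __name__ == "__main__":
    # workers > 1 requires passing the app as an import string
    # ("main:app", assuming this file is main.py).
    uvicorn.run("main:app", host="0.0.0.0", port=8000, workers=2)
```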
76What framework did the guys from Mage use to build their UI and did they have a designer?{'summary_answer': 'The guys from Mage used React for the components and NextJS for the framework to build their UI. They also have an in-house designer and invest a lot of time and effort in the UI and UX.', 'citations': [{'id': '2022-09-09 14:50:52.915718 UTC', 'message': 'U03LPQP9LSJ: I wonder in what framework the guys from mage used to build that nice looking UI. <@U03G79Y2HU6>U040G4526FK: <https://github.com/mage-ai/mage-ai/blob/master/mage_ai/frontend/package.json>U015BH45ZK6: Thanks!U03G79Y2HU6: So sorry for the late reply <@U03LPQP9LSJ>. For some reason, I didn’t get a push notification. I spend a lot of time on Slack as well...\n\nThank you <@U040G4526FK> for sharing.\n\nWe use React for the components. We write in TypeScript. We use NextJS for the framework.\n\nAs for the nice looking UI, thank you for the compliments. We have an in-house designer. We also spend a lot of time and investment in the UI and UX.\n\nToo many engineering tools focus too little on the end-user experience and only on the tech.\n\nWe care about both!'}]}
77How can we communicate about ML without complicated things?{'summary_answer': 'We can communicate about ML by connecting it to its benefits and speaking in the language of the business, such as discussing revenue goals and building a solid cost justification. It may also help to provide a general understanding of what ML can do without getting too technical.', 'citations': [{'id': '2022-01-13 21:35:07.0304 UTC', 'message': 'UP3T8K9M5: this is an amazing questionU022KRHMWEB: <@U012YQULW4X> I think you have to consider it from a business perspective. What are the businesses key problems. Then talk about some ML solutions that have shown some advance in those areas. ML is essentially coding application functionality via data. You could consider ML as a black box. You feed in lots of data that is tagged/classified/identified and out comes a ML DNN that can do this sort of classification/tagging on other (similar) data that is untagged. How this can help the business is a more difficult question. Hope this helpsU02CY3STNSH: My take based on discussing it with boards and non technical stakeholders:\n\n1. Connect it to its benefits. This makes it closer to why as opposed to how (where ML conceptually falls into).\n\n2. Experienced entrepreneurs and business people can understood the experimental nature of data science. Coming up with working business models is also an pursuit of experimenting, gathering enough data, iteration and trade offs without knowing up front the formula to make it work.U022WCWK4BF: Similar to what\'s been said already but tying it to their specific top of mind goals and business KPI\'s. Speaking in their language "increase in revenue, revenue at risk, etc " and building a solid cost justification to back up the project.U012YQULW4X: yes, agee wholehardedly. We are already talking benefits, business case, their language. I was wondering of how to cast a wider net and have then come up with their own ideas/problems to approach us. We will never be as deep in their domain as might be necessary to have good visibility of possible problems to solve.\n\nthat’s why I thought having them gain a general understanding of what we do/what we can do for them might help.'}]}
78What are people's thoughts on AutoML and are there successful use cases? How can AutoML models be improved upon? Are there any limitations to using AutoML in regulated industries?{'summary_answer': 'People have mixed thoughts on AutoML. Some find it useful for trivial or low impact prediction tasks in mid-to-big-size companies where there may not be enough "real ML people" to do it. AutoML can also be used as a benchmark or to get a baseline model quickly. However, there are limitations to using AutoML in regulated industries due to liability and indemnification concerns. AutoML models can be improved upon by playing with augmentation, hyperparameters, and feature engineering. Limitations to using AutoML include the inability to handle all categories at once and the need for AutoML tuning to achieve good performance.', 'citations': [{'id': '2022-01-19 21:47:28.0279 UTC', 'message': 'U02V250FQTS: What do people think of AutoML? I\'m seeing lots of blog posts like <https://medium.com/captech-corner/the-truth-about-automl-and-no-code-data-science-b73f2cf50c4e|this>. And a number of offerings pop-up (Obviously AI, Roboflow, Vertex AI, Ludwig). Anyone using it successfully?U02GEFE4YEP: in my experience, there is a ton of trivial / low impact prediction tasks in a mid-to-big-size company, and not enough "real ML people" to do it. Ludwig (disclaimer: Piero is a friend) helps analysts or BI people build prediction models for their analsys without using more expensive resources. As a ML person, I also find Ludwig a good 10-minutes-baseline for starting out prototypes or even research projects.U02GEFE4YEP: [ If churn prediction is not an internal dashboard, but your core business, you\'re unlikely to use autoML for it these days, for the same reason that Amazon doesn\'t buy a recommender system, but they develop their own ]U02Q1AW1ABU: A challenge that doesn\'t seem to be well-handled/well understood is using AutoML in regulated industries can be fairly dangerous from a liability/indemnification perspective. Additionally, building a product where you\'re marketing a feature as "performant" and then not understanding/"owning" the ML components that make that feature performant seems like a slippery slope. I think its use cases are fairly limited, and Jacopo describes them well.U02Q1AW1ABU: However, I don\'t think that the AutoML systems perform particularly "Beyond" just grabbing a pre-trained benchmark model from a well documented model zoo and deploying that instead. I.E., I can\'t imagine that an AutoML object detector does significantly better than a Resnet-50 trained on COCO would prior to fine-tuning it.U01E66U1SDR: • AutoML != AutoML: there’s AutoML for computer vision tasks (e.g., IBM’s Maximo Visual Inspection; disclaimer: I’m an IBMer), for structured data (some approaches additionally support NLP features) in the form of CSV files (e.g., H2O’s driverless AI), for NAS (e.g., Kubeflow’s Katib, ...). I haven’t come across an AutoML approach that is able to do all of these categories at once\n• I mostly use AutoML as a benchmark / getting a base line model quickly. From there, I can try to “beat” the AutoML-created model in regards to some performance metric (if needed); knowing that it must be possible to perform at least as good as the base line model.U02V250FQTS: What are the techniques you use to "beat" the AutoML model? 
Better hyperparams, data augmentation, feature engineering?U01E66U1SDR: That’s clearly a “depends” answer :wink: Say the AutoML approach comes up with a classical ML model like XGBoost. Auto feature engineering and hyperparams by AutoML are often done quite well, so I’d rather play with augmentation (which often benefits from more knowledge of the application context, e.g., in computer vision I might know that the camera that takes pictures can move around in different angles, so I’d benefit from augmenting my data set by rotations). If AutoML comes up with an XGBoost model even if I’m having lots of data, I also might beat the model with a completely different architecture like a DNN…U02TN47BQ4X: For object detection and image segmentation problems I would agree with you, <@U02Q1AW1ABU>. But for simple classification/regression problems like I\'m dealing with now it\'s a really good option for me, at least in the Databricks managed ML environment. I\'ve done it before myself "by hand" and the amount of work I need to do instead using Databricks\' AutoML is a small fraction of what it was before when I was writing it all myself using scikit-learn, writing code for hyperparameter grid searching, managing/versioning the experiments, models, datasets, etc.U02Q1AW1ABU: <@U02TN47BQ4X> &amp; <@U01E66U1SDR> - Fair enough! I have worked in regulated industries since AutoML has emerged, so my experience has been more along the lines of "This might save me some time now but is unlikely to save me time later."U01E66U1SDR: I guess I should also mention that I myself am obviously a bit biased about AutoML as I know how and love to do data science manually from time to time. For some subject matter experts, AutoML can be a pretty convenient data scientist tool. For example, in the insurance sector, actuaries understand the math relevant for insurances and an AutoML tool provides them an easy access to data science features. I would even claim they will know how to interpret performance metrics and use them as feedback for AutoML tuning (yes, that will still be needed to achieve good performance in AutoML…).'}]}
79What are some resources or guidance for improving MLOps knowledge and skills specifically for computer vision models?{'summary_answer': 'The person is asking for guidance or insights on how to improve their MLOps knowledge and skills specifically for computer vision models, but they have not come across any good resources. They have an engineering background and some knowledge of ML through Andrew Ng courses and ZTM academy.', 'citations': [{'id': '2023-04-28 15:08:15.144219 UTC', 'message': 'U0556TYHK1R: Hi Everyone!\nI have been part of the mlops team for quite a long time, but my major job is only data monitoring for computer vision models, I want to learn more about mlops for computer vision cases but did come across any good stuff, I have an engineering background and I know about ML through Andrew Ng courses and some ZTM academy stuff, I was targeting learn mlops for CV model as I thought if I learn mlops this will also help my company as my company also want me to learn and lead mlops.\n if anyone has some insights/guidance on how I can improve my mlops knowledge and skills, I would really appreciate that.'}]}
80What is the best way to give a heads up to community to come and attend a Kubeflow event hosted by a company in New York City?{'summary_answer': 'The best way to give a heads up to the community to come and attend a Kubeflow event hosted by Capital One in New York City is not specified in the given context.', 'citations': [{'id': '2023-04-22 05:49:24.199709 UTC', 'message': 'U0549CY2NMC: Hey all - have my company (Capital One) hosting a Kubeflow event for the community to come sit / listen to contributors / eat some free lunch in New York City. What is the best way to give a heads up to the community to come and attend?'}]}
81What are some options for integrating a Python-trained ML model into a Matlab workflow for deployment on a rapid prototyping machine?{'summary_answer': 'There are two possible options for integrating a Python-trained ML model into a Matlab workflow for deployment on a rapid prototyping machine: \n1. Train the model in Python and then "export" the trained model into C code. Then import that code into a Matlab S-Function that should be possible to integrate into their workflow.\n2. Do the ML work in Matlab. Or possibly do most of the iterative work on Python and then the last bit of tuning/final training on Matlab.', 'citations': [{'id': '2022-09-15 09:17:09.882559 UTC', 'message': 'U03P9NQSSUW: Not sure this is the appropriate channel but here goes:\n\nI\'ve been working on an ML project in Python where the model is used as a surrogate for NOx emissions predictions in diesel engines. Turns out the deliverable (trained model) needs to be loaded onto a rapid prototyping machine. The team working with the RP machine and that\'s supposed to load code/components into the machine use Matlab for all their models/code/components and use Matlab Coder to translate the Simulink blocks into C code that can then be flashed onto the machine. My ML component would need to be integrated into that workflow somehow.\n\nGranted that I would like to avoid using Matlab as much as possible, I see two possible options:\n1. Train the model in Python and then "export" the trained model into C code. Then import that code into a Matlab S-Function that should be possible to integrate into their workflow.\n2. Do the ML work in Matlab. Or possibly do most of the iterative work on Python and then the last bit of tuning/final training on Matlab.\nHas anyone run into a similar problem? What do you suggest I do here?\nThanks!U01S2349UNS: Doesn’t MatLab support Java? You can train the models in Java and then import them as libraries. It seems there’s also some support for python: <https://www.mathworks.com/help/matlab/call-python-libraries.html>U03P9NQSSUW: Yeah the problem with this solution (as far as I understand it) is that the Matlab Coder can\'t translate the python code into C, which is ultimately what is required.U015BH45ZK6: What kind of model do you have?U03P9NQSSUW: Still iterating to find the best model, but I am only using sklearn for now. Working with 500 examples and roughly 20 featuresU01S2349UNS: I wouldn’t shy away from using MatLab either for the ML, it’s a great toolU015BH45ZK6: ok, but which model in sklearn?U03P9NQSSUW: SVR seems to be most promising for now. But still need to test othersU015BH45ZK6: You don\'t need to retrain the model in matlab? How are you going to do feature engineering in matlab?U03P9NQSSUW: I think I actually _have_ to train the model in matlab because I don\'t believe there\'s anyway to import a trained model from python to matlab (they are structured differently it seems). I don\'t know about feature engineering in Matlab, haven\'t thought about that.U03P9NQSSUW: Data is quite clean so the only preprocessing I am doing is scaling input data and transforming targetU015BH45ZK6: if it will turn out that SVM is the best solution then you are in luck because those are relatively easy to implement for prediction (if matlab doesn\'t have it already)U015BH45ZK6: its essentially sum(wi*K(xi,x)) where i\'s are the support vectors (wi!=0)U03P9NQSSUW: Matlab has that algo, so should be OK. 
But yes, it is fairly easy to implement from scratch for predictionU03P9NQSSUW: yesU03P9NQSSUW: so you\'re saying, don\'t even bother with retraining in matlab, just code the prediction equation from scratchU03P9NQSSUW: but I\'d still have the feature engineering issueU03P9NQSSUW: I am still iterating, and I haven\'t done much playing around with the features but I am bound to find something interesting there. That would need to implemented in the prediction pipeline of course...U015BH45ZK6: Well, YMMV\n\nI am giving you options, and you decide on the total cost.\n\nYou either\n• go full matlab (question is how fast you pick it up, sunk cost into python, available packages\n• python training, matlab feature engineering, rewrite model inference in matlabU03P9NQSSUW: yeah I agree <@U015BH45ZK6>. Thanks for your prompt feedback on this!U044GV99B7B: <@U03P9NQSSUW> if you\'re still working on this problem, have you checked out the ONNX converter? I have not used it for MATLAB but I think it supports sklearn --&gt; ONNX and ONNX --&gt; MATLAB\n<https://onnx.ai/sklearn-onnx/>\n<https://www.mathworks.com/help/deeplearning/ref/importonnxnetwork.html>U03P9NQSSUW: Thanks for the suggestion <@U044GV99B7B>! It seems imports of ONNX models from Matlab currently only support deep learning models, though I am not 100% sure. That being said, once the model is imported in Matlab, I am not sure it can be translated to C using the Matlab Coder tool which is the real issue :confused:U03P9NQSSUW: end product need to be C code, and most probably via the Matlab suite (so, the Matlab Coder)U044GV99B7B: Have you tried hard-coding the inference runs from C? e.g. by-passing MATLAB. If the end product is C, then you may call your model from C directly?U03P9NQSSUW: yeah this was an idea. Depending on the selected model, then coding inference could be simple (like with SVR). But with other models, it could get more complex.'}]}
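A sanity check of the "sum(wi * K(xi, x))" formulation mentioned in the thread, using a fitted sklearn SVR with an RBF kernel. This is the arithmetic you would re-implement in MATLAB/C if you skip retraining there; gamma is fixed explicitly so the hand-rolled kernel uses the same value as sklearn. The dataset is synthetic and purely illustrative.

```
import numpy as np
from sklearn.datasets import make_regression
from sklearn.svm import SVR

X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)
gamma = 0.05
svr = SVR(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)


def manual_predict(x_new: np.ndarray) -> float:
    # RBF kernel between the new point and each support vector,
    # then the weighted sum plus the bias term.
    k = np.exp(-gamma * np.sum((svr.support_vectors_ - x_new) ** 2, axis=1))
    return float(svr.dual_coef_[0] @ k + svr.intercept_[0])


x0 = X[0]
print(manual_predict(x0), svr.predict(x0.reshape(1, -1))[0])  # should match
```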
82Are there many people in the MLOps community who work as independent contractors and/or work only four days a week, and is it possible to find work with a better work-life balance than corporate jobs in the field of MLOps?{'summary_answer': 'There are several people in the MLOps community who work as independent contractors and/or work only four days a week. However, finding work with a better work-life balance than corporate jobs in the field of MLOps can be challenging. It is possible to find fractional work, but it is hard to come by. Fractional work works well for early/research stages, but becomes frustrating for the employee and employer the closer you get to operational activity. With MLOps, you would need to stay in an advisory role. Freelancing can provide flexibility in schedule, but there can be pushback from employers, and it is important to choose projects well. Consulting can also provide flexibility, but it is hard to find "cold" work, and there is a major push to turn consulting into a 40/week job. It is important to have a good network and be good at sales. Overall, it is difficult to generalize the best approach, and it takes discipline and a willingness to decline work to achieve work-life balance. In the Netherlands, 4-day workweeks are typical, and remote work is now a right.', 'citations': [{'id': '2022-09-25 18:09:56.103789 UTC', 'message': 'U03UJEUDP1S: How many people here work as independent contractors and/or work only four days a week? I\'ve been pondering finding ways to have better work life balance than the corporate world allows, and am wondering if my investment in MLOps skills can pay off in this way. U017CCSNYDV: _Fractional_ working is good, but it is hard to find. I have been working as a fractional/interim technologist for the last 6 years. 3 days a week works well from a life balance point of view, but that makes it hard to be hands on/responsible for delivery. In the DS space, fractional works well for early/research stages, but becomes frustrating for the employee and employer the closer you get to operational activity. With MLOps you would need to stay in an advisory role.U03UJEUDP1S: That\'s a great perspective Hugh, thanks!U03LQLB57L7: I\'m working as an independent ML consultant and I have a huge flexibility in my schedule. Usually I work around 30 hours per week. I find significant pushback from employers, so that\'s why I\'m freelancing. Even there, you have to choose your projects well, because if there\'s already a big team, an active implementation role won\'t work part time, so you\'d be looking for more advisory role. Where I had luck with implementation projects was in doing PoCs/first versions of a new ML project, usually for smaller companies/startups.UP3T8K9M5: <@U01G1V9K430> and <@U01N60K8F0E> are doing this i think. also <https://www.tribe.ai/> trys to do this and i think <@U015BH45ZK6> has had experience with themU01J0NVNE1G: I now work as a freelance ML Engineer. This is such an improvement compared to my previous job as a consultant for a company: I can choose my office, my computer/tools, and how I organize myself.\n\nI think the work schedule problem is similar to the trade-off between latency and throughput. Some companies want you there 5 days a week to handle requests, others only care about how much stuff you can do in a whole week. 
It\'s difficult to generalize the best approach.\n\nOverall, I prefer to work 5 days a week but I don\'t want to waste the time commuting between work to home.U01N60K8F0E: I do consulting and delivery but I\'m trying to ramp up my business so working much more than full time just now. Different objectives though.U015BH45ZK6: I work as a consultant as well. It is incredibly hard to find "cold" work so do not start without a network or a set of prospective clients who are willing to pay a retainer. Also there is a major push to turn your consulting into a 40/week job. Most companies can\'t handle external talent. Also there is a major risk of turning your "consulting" into a contracting/pseudo employment without the benefits of employment (promotion, bonus, perks, leadership, mentoring)U015BH45ZK6: Also get better at sales, job is 90% sales 10% of tech (and this tech is harder than most people\'s 100% job)U01S2349UNS: If you are considering migrating, in the Netherlands 4 days-week is a right on any company, and now so is remote (you do receive for 4 days, but that should be more than enough)U015BH45ZK6: TBH at the current stage I dissuade anyone from doing consulting. I would think that it is better to shop around and get a job at a relatively high tech place where you have a good company cultureU015BH45ZK6: I can go on about this more. You really need to know what you want what you can do and assess if it is working or even realisticU022KRHMWEB: FWIW I’ve been doing consulting (not contracting) work (not in ML space) for almost 20 yrs now. Sales and marketing typically run about 50% of my overall time spent on the job. Doing actual work for customers ~25% and the rest is admin, office stuff, training etc. I strongly second <@U015BH45ZK6> statement above to create a great network before you make the plunge AND have enough savings to last a couple of years with little to no Income (profits that can be paid to you as salary). It took me 2 yrs to break even and my guess this was unusually good.U022KRHMWEB: Life work balance is a challenge whether you work for your self or for another. It takes discipline and a willingness to decline work (say no).U022KRHMWEB: On the other hand, there’s nothing better in the world than working for yourself… good luck in what ever you decideU01N8ERHH9T: Work/life balance is pretty good at my full-time tech job. I usually only work about 4-6 hours a day. So an average day might be 2 hours in the morning, 1 hour break for lunch, 3 hours in the afternoon. Does my employer expect more? Maybe. Do I care? No. Do I get good reviews? Yes!U01N8ERHH9T: There’s a lot of people on Reddit who claim to work 2 full-time tech jobs at the same time because the workload is so low.U033QREB56V: <@U01N8ERHH9T> Curious what lets you feel comfortable with this in the review process? Personally, I\'ve been on teams where my boss and/or co-workers place much less emphasis on work/life balance, so I always feel like I\'m slacking even if things are getting doneU01N8ERHH9T: I make stuff that has a positive ROI. We measure that ROI by running A/B tests that show without our contributions the company would make a lot less money. Why should they care how many hours I work if the work I do can clearly be tied to significantly more value created?U01N8ERHH9T: That attitude is a lot harder in roles like data engineering or infrastructure. Those roles are all about enablement. You depend on other teams leveraging your work to produce value. I build recommendation systems. 
Our outputs are directly measurable and we can use marketing terms like lift. Now there might be some discussion of how much our work depends on other teams and that\'s fair, but if the ROI is high enough then those questions aren\'t even worth asking.U01N8ERHH9T: Think about it like you\'re running a trading firm. Do you care how much time your analysts put in their seats or how much money they bring in? If you are focused on time in seat, you\'re losing.U01N8ERHH9T: I still work 7-8 hour days sometimes, but usually only when I have 6 hours of meetings that day. That\'s rare and I hate my job on those days. Usually I like my job and spend most of the time coding or exploring data.U01N8ERHH9T: Or writing. Lots of writing plans and explaining things.U033QREB56V: Yeah fair enough. I\'ve been in a position of providing valuable production models (measured more by accuracy than ROI), and it is definitely empowering. Probably my current issue is that it takes a while to get to that point. Being new at a company, I\'ve usually had a few experimental projects before striking one that\'s a big successU01N8ERHH9T: Then it\'s just an issue of confidence, right? You feel like you need to prove yourself first. I feel like I\'ve already proven myself so many times at so many places that it\'s going to happen eventually if I just keep doing my thing.U01N8ERHH9T: When I worked in consulting, I had the opposite problem. Could provide great work at a great value to the client, but the firm would be upset because there weren\'t enough billable hours. Is that my fault? Bill for the project then and make it seem like there are more hours. Or charge more per hour.U033QREB56V: Currently, I think the low confidence comes from having significantly accelerated turn-around times compared to what I\'ve been used to previously. Aside from model ROI/performance, I think data engineering/infrastructure can actually provide a lot of security. If you make systems that everyone relies on and comes to you when there\'s troubleshooting, you make yourself pretty indispensableU01HRT0R69M: <@U01S2349UNS> already mentioned it, but in the Netherlands 4-day workweeks are quite typical :slightly_smiling_face: I also work 4 days a week at the moment (not as a contractor though). I think there are some other countries where it\'s also more common'}]}
85
83How can I get kfserving to work properly on a private cluster?{'summary_answer': 'To get kfserving to work properly on a private cluster, one needs to ensure proper DNS and networking setup, particularly with regards to Istio. DNS names like *.svc.cluster.local should work only from within the cluster. Additionally, an internal LB for the knative gateway may need to be created to function properly if the default ingress LB has been removed. It is also important to ensure that the k8s annotations are properly set to signal GCP to give the load balancer an internal IP.', 'citations': [{'id': '2021-03-08 11:02:58.3649 UTC', 'message': 'U01NTK2841W: Anyone managed to get kfserving running on a private cluster? We managed to deploy the model, but when we curl it we either get a timeout or a cannot resolve issue.\n\n```[ root@curl-tester:/ ]$ curl <http://knative-local-gateway.istio-system.svc.cluster.local> -H "<http://test-model-023-predictor-default.default.example.com|test-model-023-predictor-default.default.example.com>"\ncurl: (7) Failed to connect to knative-local-gateway.istio-system.svc.cluster.local port 80: Connection timed out\n[ root@curl-tester:/ ]$ curl <http://cluster-local-gateway.istio-system.svc.cluster.local>\ncurl: (6) Couldn\'t resolve host \'cluster-local-gateway.istio-system.svc.cluster.local\'```\nEtc..\n\nThis is what the service returns when creating the model:\n```"status": {\n "address": {\n "url": "<http://test-model-023.default.svc.cluster.local/v1/models/test-model-023:predict>"\n },\n "components": {\n "predictor": {\n "address": {\n "url": "<http://test-model-023-predictor-default.default.svc.cluster.local>"\n },\n "latestCreatedRevision": "test-model-023-predictor-default-00002",\n "latestReadyRevision": "test-model-023-predictor-default-00002",\n "latestRolledoutRevision": "test-model-023-predictor-default-00002",\n "traffic": [\n {\n "latestRevision": true,\n "percent": 100,\n "revisionName": "test-model-023-predictor-default-00002"\n }\n ],\n "url": "<http://test-model-023-predictor-default.default.example.com>"\n }\n },```U015DM4NM1U: I haven\'t yet played with kfserving, but it seems to me the problem lies in DNS &amp; networking setup (`istio` in your case). AFAIK the DNS names like *.svc.cluster.local should work only from within the clusterU01NTK2841W: Hey. Yes. Seems so. For the timeout issue, we had to create an internal LB for the knative gateway to function properly since we removed the default ingress LB.U015DM4NM1U: so it worked? :smiley:U01NTK2841W: Not quite yet, but I managed to make the gateway work. Now I have to figure out how to fix the internal LB.U01BNRGA5GU: <@U01NTK2841W> which knative version you are on ?U01NTK2841W: <@U01BNRGA5GU> v0.5.1. I managed to make it work, I used k8s annotations to signal GCP to give the load balancer an internal IP.'}]}
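An illustrative sketch (not from the thread) of the in-cluster request pattern being debugged above, written with Python requests instead of curl; the gateway and host names are taken from the thread, while the /v1/models/...:predict path and the {"instances": ...} payload assume the KFServing v1 protocol.

```python
# Sketch: call a KFServing InferenceService from inside the cluster by sending
# the request to the knative local gateway and setting the Host header to the
# service's virtual host (names taken from the thread above).
import requests

GATEWAY = "http://knative-local-gateway.istio-system.svc.cluster.local"
HOST = "test-model-023-predictor-default.default.example.com"

resp = requests.post(
    f"{GATEWAY}/v1/models/test-model-023:predict",
    headers={"Host": HOST},
    json={"instances": [[1.0, 2.0, 3.0]]},  # assumed v1-protocol payload shape
    timeout=10,
)
print(resp.status_code, resp.json())
```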
86
84What are the experiences of using Ray/Anyscale and how seamless is the shift from research to productization? Has anyone used Ray in Computer Vision space?{'summary_answer': 'The cited thread contains only the question and received no replies. The asker, prompted by a recent podcast with Waleed from Anyscale, is looking for first-hand accounts of using Ray/Anyscale, how seamless the shift from research to productization is, and whether anyone has used Ray in the computer vision space.', 'citations': [{'id': '2023-04-24 14:06:56.928049 UTC', 'message': 'U054Q73KRFE: Hey all, heard the recent podcast with Waleed from Anyscale and thinking of trying it out. Was wondering if anybody could describe their experience using Ray/Anyscale and could address questions regarding how seamless is the shift from research to productization? Also has anybody used Ray in the Computer vision space- if so how was the experience?'}]}
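Since the thread received no replies, here is a minimal, generic sketch of Ray's task API (an assumption of typical usage, not something from the thread) for anyone evaluating it for embarrassingly parallel CV workloads.

```python
# Minimal Ray sketch: fan per-image work out across local cores.
# On a cluster you would pass address="auto" to ray.init().
import ray

ray.init()

@ray.remote
def preprocess(image_id: int) -> int:
    return image_id * 2  # stand-in for real per-image preprocessing

futures = [preprocess.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 2, 4, 6, 8, 10, 12, 14]
ray.shutdown()
```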
87
85What is the most effective way to do scalable real-time inference of custom models in AWS? Specifically, between using Lambda to call a model hosted on SageMaker or calling an ELB that routes to an autoscaling ECS service hosted on Fargate?{'summary_answer': 'The most effective way to do scalable real-time inference of custom models in AWS depends on the specific use case. Some options mentioned in the context include using Lambda to call a model hosted on SageMaker, using Lambda to call an ELB that routes to an autoscaling ECS service hosted on Fargate, using Ray.io combined with EKS and their operator, or having a Ray cluster running on top of k8s. It was also suggested to consider using an inference cluster on top of EKS or a fully managed solution like cnvrg.io. Ultimately, the best approach will depend on factors such as the complexity of the model, the size and duration of the load, and the level of technical expertise and resources available for configuration and maintenance.', 'citations': [{'id': '2022-02-02 19:26:18.048419 UTC', 'message': "U024SSR9T5Z: Hi all! I'd love to ask all the smart people here (which is all of you) this question:\n\n*TL;DR*\nWhat is the most effective way you've found to do scalable real-time inference of custom models in AWS?\n\n*Details*\nWe're migrating from GCP to AWS, and are considering two options for real-time inference:\n1. Lambda as a fronting web service, which calls the model hosted on SageMaker inference\n2. Lambda as a fronting web service, which calls an ELB, which routes to an autoscaling ECS service hosted on Fargate\nDo you do one of these things? Is there another approach you use that you really like? Any weaknesses you see with either of these two approaches?U02FT777093: If your load is not too heavy it might be an option to do the inference in Lambda as well.U02T5C1HEQ0: A lot of the conversations I've had in the last few months about this are either lambda/sagemaker if you are conforming to that model, and if you need more flexibility I think <http://Ray.io|Ray.io> is getting some impressive stuff going, and can be very effective when combined with EKS and their operator. More to manage that route, which isn't ideal, but that's what you pay for flexibility.U01S5RHF8PK: We are also currently looking at having a Ray cluster running on top of k8s (using spot instances) to perform inference. In parallel, we are looking at having a series of ec2 instances clustered under ECS and paired with EFS (instead of s3 to reduce latency) to host model files.\nFor smaller loads, lambda works too, but for us, it did not due to size/duration limitations.U02FT777093: I'm currently running a ray cluster for inference as well (on bare ec2, no k8s), it has been a pretty good experience compared to alternatives IMO :slightly_smiling_face:\nThat said, I think the answer to this question will definitely depend on the specifics of the usecaseU01N8ERHH9T: Both options seem fine, but it really depends on the complexity of the model. Sagemaker inference is nice because it does auto scaling for you and has some runtime optimizations built in, but you pay for those niceties.U01S5RHF8PK: <@U02FT777093> wondering if you have multiple ec2 instances or running nodes on different cores. Also, spot or regular ec2?U02FT777093: I'm using multiple ec2 instances <@U01S5RHF8PK>. Currently we do on-demand because of some constraints we have.U01S5RHF8PK: <@U02FT777093> and no ECS, right?U024SSR9T5Z: Thank you all for your great answers. 
I agree that the answer to this question depends on your use case. Thank you for sharing a variety here. Our use case is primarily large models that thus have a slower cold start time, and also require at least e.g. 2 vCPU and 4 GB RAM.U01S5RHF8PK: <@U024SSR9T5Z> any reasons (other that cost) that you are not considering an inference cluster on top of EKS? That is genuinely what keeps me awake at night for my own case.U024SSR9T5Z: Hi <@U01S5RHF8PK> :wave:! The only reason is that I don't have k8s experience, and am looking for a more fully managed solution. From what I hear, EKS/k8s requires a fair amount of configuration and maintenance, and I didn't want to have to deal with ongoing technical debt/maintenance there, since I would be owning it. Do you guys use EKS+OSS tools for inference? What benefits do you see over a more fully managed solution?U01S5RHF8PK: For us, being a small shop, it was a cost issue that prevented us from going fully DevOps on it ourselves.\nWe are currently using a company called `Shakudo` to run Dask clusters on EKS (I won’t get paid if you work with them :slightly_smiling_face: ) but if you want a fully managed experience, you might wanna check <http://cnvrg.io|cnvrg.io> as well. I know them (again, won’t get paid any commission) so if you want a demo, DM me and I’ll ping them.U024SSR9T5Z: Thank you Arman for the pointers. I may look into one of those.U02UQ4VLWE5: cortex.dev is an abstractive layer over K8s/EKS, very simple to use, and has a great Slack community. Have been using it in production for a year now.U024SSR9T5Z: Thanks Aleks, I've heard of Cortex before. That's good to know someone who's been using it in production recommends it!"}]}
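A hedged sketch of option 1 from the question (a Lambda fronting a SageMaker real-time endpoint) using boto3; the endpoint name and payload schema are hypothetical.

```python
# Sketch: Lambda handler that forwards a JSON request to a SageMaker endpoint.
# Endpoint name and payload schema are placeholders, not from the thread.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")
ENDPOINT_NAME = "my-model-endpoint"  # hypothetical

def handler(event, context):
    payload = json.loads(event["body"])
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps(payload),
    )
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(prediction)}
```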
88
86Which feature store solutions offer feature versioning? Can AWS Sagemaker FS offer feature versioning? How does SMFS eliminate the need for Glue jobs?{'summary_answer': 'The person asking the question is looking for feature store solutions that offer feature versioning. AWS Sagemaker FS does not seem to have first-class support for feature versioning, but it does save every historical version of each feature in the offline store, allowing for point-in-time dataset recreation. However, it may be possible to add a version column to each feature record to accomplish versioning. SMFS eliminates the need for Glue jobs by storing the offline store in S3 and building a Glue data catalog automatically, allowing for Athena queries against it. Other feature store solutions that offer feature versioning include Tecton, which has the concept of variants for feature versions.', 'citations': [{'id': '2022-12-15 21:28:01.905989 UTC', 'message': 'U04AQ0V99U3: Hi! Does anyone know which feature store solutions offer feature versioning? We are having troubles at my team because we want to fix some bugs in some ML features we have at our custom feature store, but we can\'t push the fix because that would break one ML model that can\'t be retrained easily.\nWe are thinking of migrating to a feature store solution (ideally AWS Sagemaker FS because we are hosted in AWS), but I couldn\'t find reference to feature versioning in AWSU03L6RTPTB3: I remember from the <https://github.com/featureform/featureform|Feature Form> tutorial a couple of months ago that it had something called variants. See <https://docs.featureform.com/getting-started/transforming-data|here>:\n\n"Transformation definitions are versioned and stored immutably."\n\nExample:\n```@postgres.register_transformation(variant="quickstart")\ndef fare_per_family_member():\n """ The average fare paid per family member in a family\n """\n return "SELECT PassengerId, Fare / Parch FROM {{titanic.kaggle}}"```\n(Maybe <@U01994M42BF> can answer this better.)U04AQ0V99U3: Thanks <@U03L6RTPTB3> I\'ll have a look at feature form!U024SSR9T5Z: I don\'t know that SageMaker Feature Store (SMFS) has first class support for allowing you to specify your own versions on each feature, but it does save every historical version of each feature in the offline store, so you can do point-in-time dataset recreation.U024SSR9T5Z: SMFS isn\'t the most feature complete feature store out there, and doesn\'t have the most beautiful API, but it was really easy to get started with and does everything we need for our use case, which is primarily offline point-in-time training dataset creation and recreation.U04AQ0V99U3: Thanks for your answer Evan! Point in time dataset recreation probably won\'t be enough for us, because we have to be able to maintain two versions of the same feature (some models will use v1 for inference and others v2 for inference)U024SSR9T5Z: It sounds like you could add your own version column to each feature record to accomplish that.U04AQ0V99U3: In fact, at the moment we are hosting our feature store in Athena (without SMFS) and we could add a feature_version partition to all the tables. We would then have two Glue jobs, one populating v1 of the feature and another for v2. We were wondering if SMFS would help us doing that in a more automated way, but it doesn\'t have that featureU024SSR9T5Z: Got it. SMFS does eliminate the need for glue jobs i.e. it does that under the hood for you, so that\'s one thing at least. 
But yes there could be a more elegant solution out there for you.U04AQ0V99U3: How does it eliminate the need for Glue jobs?U024SSR9T5Z: Under the hood SMFS stores the offline store in S3, and builds a Glue data catalog of it automatically for you, so you can run Athena queries against it. Athena is the API for the offline store. <https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-athena-glue-integration.html|Source>U020ZPEHESY: well how you have implemented your current feature store. I think you are after for time travel queriers.\n\nAside AWS feature store, there are a couple of technologies that they can support you (Delta files ,hopswork…) however you could perhaps continue use your custom solution by enabling your storage to support versioning look at this SDC type 2 (<https://en.wikipedia.org/wiki/Slowly_changing_dimension>)U02F5P5UH42: hi <@U04AQ0V99U3>! i work at tecton, and we have the concepts of variants, essentially feature versions, when leveraging our spark backend compute. (<https://docs.tecton.ai/docs/running-in-production/production-best-practices/#use-variants-when-updating-feature-views-and-services|link>). we version both feature views (consisting of 1 to n features) as well as feature services (consisting of 1 to n feature views). these feature services are the point of consumption for a training data set, or inference set, be it batch or real time. this groups together features in the gui and allows one to specify what to consume and where. this makes it easy to run a/b tests or have a model that keeps an older definition while a new model leverages a more current one. we also have some plans developing a tighter integration with athena going forward as well.U04AQ0V99U3: Thanks <@U02F5P5UH42>, I\'m gonna check out Tecton!'}]}
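A hedged sketch of the versioning approach discussed above (a feature_version partition in the Athena-backed offline store so old and new models can pin different versions); the awswrangler call, database, and table names are assumptions for illustration.

```python
# Sketch: each model pins the feature_version it reads from the offline store.
# Table/database names are hypothetical; assumes awswrangler is configured.
import awswrangler as wr
import pandas as pd

def load_features(version: str) -> pd.DataFrame:
    sql = f"""
        SELECT customer_id, avg_fare, event_time
        FROM customer_features
        WHERE feature_version = '{version}'
    """
    return wr.athena.read_sql_query(sql, database="ml_features")

legacy_df = load_features("v1")  # model that cannot be retrained keeps v1
fixed_df = load_features("v2")   # newly trained models consume the bug-fixed v2
```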
89
87Is there a link provided for virtual workshops in the mlops world conference? If so, where can it be found?{'summary_answer': 'Two attendees could not find the link for the virtual workshops at the MLOps World conference. An organizer reached out to both via DM and asked them to check for an email containing all the information, which one attendee confirmed receiving. The thread does not give a public location for the link; it appears to be distributed by email.', 'citations': [{'id': '2022-06-07 13:34:35.015239 UTC', 'message': "U03JDUTAD62: Hey, anyone here attending mlops world conference?\nIf so then where is a link provided for virtual workshops, pl helpU039A8UCMHV: I'm actually struggling with the same issueUP3T8K9M5: sending you both DMs nowU039A8UCMHV: thanks, we are waiting thenUP3T8K9M5: can you check that you did not get an email with all the info?U039A8UCMHV: Yes, I got it. Thank you a lot!"}]}
90
88What MLOps technologies and projects would be recommended for deploying PyTorch models to Kubernetes containers for non-trivial orchestration needs involving monitoring drift, automatic retraining, data validation, and model and data versioning? Additionally, what distributed training tools and orchestration and DAG tools would pair well with Seldon if automation of its deployment is desired?{'summary_answer': "Seldon is recommended for deploying PyTorch models to Kubernetes containers, particularly for its drift-monitoring capabilities, and a Pachyderm + Seldon tutorial is shared for data versioning and deployment. For orchestration and DAGs, Pachyderm is suggested (the respondent works at Pachyderm and finds its approach to DAGs more flexible), while ClearML is recommended for distributed training. For automating Seldon's deployment, the thread also suggests talking to another community member (U02CJDP5T8X).", 'citations': [{'id': '2021-09-09 01:15:42.2986 UTC', 'message': "U02DUEEJY2F: Hey everybody! Just popping over from the <#C010A328X38|introduce-yourself> channel to ask a couple of questions about recommended MLOps stacks.\nMy use case is that I need to create some flexible training, deployment and monitoring pipelines for deploying pytorch models to kubernetes containers. (I have been creating some Kubeflow pipelines on a minikube cluster, but that experience has been lackluster, to say the least.)\nIt's a small team, so I don't have the option of just brute forcing it, but we do need to do some non-trivial orchestration. (Monitoring drift, Automatic retraining, Data validation, Model and Data versioning, etc).\nAnyways, given my use case, what technologies and projects would folks recommend, and what do you all like to use as your MLOps stack?U018QM6DUTB: I've been a big fan of the stuff Seldon is doing. The drift stuff is pretty cool. Some folks did a data versioning and deployment thing potentially similar to what you're looking for here if it helps: <https://github.com/winderresearch/pachyderm-seldon/blob/main/repo/tutorial.ipynb|https://github.com/winderresearch/pachyderm-seldon/blob/main/repo/tutorial.ipynb>U02DUEEJY2F: Thanks for the reply! I'll check that out.U02DUEEJY2F: <@U018QM6DUTB> another set of questions if you're willing: What kind of orchestration and DAG tools would pair well with Seldon, if I wanted to automated its deployment? And what kind of distributed training tools should I be keeping an eye on? (Again thanks for your help)U025VH7V85N: You might want to talk with <@U02CJDP5T8X> \nIf I tagged the right one that is.U018QM6DUTB: <@U02DUEEJY2F> definitely have some thoughts on that - quick disclosure that I'm at Pachyderm, so I prefer their approach to DAGs (they are more flexible to me). On the distributed training front, I've always been a huge fan of ClearML, myself.U02DUEEJY2F: Thanks a whole bunch! That gives me some good directions to look."}]}
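A minimal, hedged sketch of Seldon Core's Python model-wrapper pattern for serving a PyTorch model; this illustrates the Seldon recommendation above rather than reproducing anything from the thread, the model path is a placeholder, and the packaging step plus the SeldonDeployment manifest are not shown.

```python
# Sketch: Seldon Core Python wrapper. Once packaged into an image and referenced
# from a SeldonDeployment, Seldon calls predict() on incoming requests.
import numpy as np
import torch

class PyTorchModel:
    def __init__(self):
        # Placeholder: load a TorchScript artifact baked into the image.
        self.model = torch.jit.load("model.pt")
        self.model.eval()

    def predict(self, X, features_names=None):
        with torch.no_grad():
            batch = torch.from_numpy(np.asarray(X, dtype=np.float32))
            return self.model(batch).numpy()
```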
91
89What are some examples of effective data wrangling to model building pipelines for NLP preprocessing of clinical notes, cleaning/standardizing structured data, and training patient-level embeddings for downstream tasks?{'summary_answer': "The context mentions that the academic clinical lab is looking to develop processes around sharing code for common tasks like NLP preprocessing of clinical notes, cleaning/standardizing structured data to common ontologies, training and deploying patient-level embeddings for downstream tasks. The context also mentions the use of PyTorch for handling datasets in an extensible way and the use of Ray for ML pipelines and performance improvements. The in-memory object store of Ray was useful for complicated pipelines with many steps and dependencies. However, Ray may be overkill for the lab's needs.", 'citations': [{'id': '2021-10-11 18:52:42.4693 UTC', 'message': "U01QEJWEZDY: Does anyone have good examples of effective data wrangling -&gt; model building pipelines? I work in an academic clinical lab and we’re looking to develop some processes around sharing code for common tasks like NLP preprocessing of clinical notes, cleaning/standardizing structured data to common ontologies, training and deploying patient-level embeddings for downstream tasks, etc. Would love some inspiration + best practices!U01QEJWEZDY: I really like the way that PyTorch handles Datasets in an extensible way, so something similar?U01EPLVF25T: In the past, I worked on some heavy-duty ML pipelines at <https://primer.ai/|Primer>, and we spent a lot of time thinking about code re-use for ML tasks. We also went through a refactor to use <https://www.ray.io/|Ray> for performance improvements (reasons not too dissimilar to <https://www.anyscale.com/blog/why-third-generation-ml-platforms-are-more-performant|this>).\n\nHappy to chat, Vimig — feel free to DM me.U015BH45ZK6: <@U01EPLVF25T> What's your experience with ray?U01EPLVF25T: Not an expert in Ray, and a lot has changed in the years since I used it. High-level thoughts from my exposure to it:\n• It was a convenient way to distribute tasks to multiple processes across multiple nodes (an easier way than using `multiprocessing` , and across multiple machines, but still a bit low-level)\n• For us, it was possible to make small changes to our Python pipelines and have them be orchestrated by Ray, and we didn’t have to spin up queues/other infra. \n• Ray had an <https://ray-project.github.io/2017/08/08/plasma-in-memory-object-store.html|in-memory object store> that made sharing large objects across tasks “easy”. We had some complicated pipelines (many steps, many dependencies), and it was useful for us. \nTakes us away from the original question from <@U01QEJWEZDY>, but have you used it/tried it out <@U015BH45ZK6>?U0282TV4UEQ: <@U01EPLVF25T> I have never used Ray before, how does it work with Kubernetes? Does it gain even more of a speed up or is it sort of redundant?U015BH45ZK6: <@U01EPLVF25T> <@U0282TV4UEQ> maybe we should start a new thread on RayU01EPLVF25T: <@U0282TV4UEQ> I don’t believe there was Kubernetes support when I last used it (~3 years ago), but looks like it <https://docs.ray.io/en/latest/cluster/kubernetes.html|can be deployed on K8s> now.U0282TV4UEQ: <@U01EPLVF25T> thanks looks very promising.U01QEJWEZDY: <@U01EPLVF25T> thanks for the feedback, Ray looks interesting, but maybe a bit overkill of our needs, would love to chat more over DM!"}]}
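A hedged sketch of the extensible-Dataset pattern praised above, applied to clinical notes; the directory layout and the naive whitespace tokenizer are placeholders for whatever shared preprocessing the lab standardizes on.

```python
# Sketch: wrap clinical-note preprocessing in a torch Dataset so the same
# cleaning/tokenization code is reused across projects. Paths are hypothetical.
from pathlib import Path
from torch.utils.data import Dataset

class ClinicalNotesDataset(Dataset):
    def __init__(self, notes_dir: str, transform=None):
        self.paths = sorted(Path(notes_dir).glob("*.txt"))
        # Default transform is a naive placeholder for real NLP preprocessing.
        self.transform = transform or (lambda text: text.lower().split())

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        text = self.paths[idx].read_text(encoding="utf-8")
        return self.transform(text)

notes = ClinicalNotesDataset("data/clinical_notes")  # hypothetical directory
print(len(notes))
```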
92
90What is the most common language used for developing computer vision models and how does deployment on edge devices or cloud affect the choice of language?{'summary_answer': 'The most common languages used for developing computer vision models are Python and C++. The choice of language for deployment on edge devices or cloud depends on the specific requirements and hardware of the device. If high performance or specific hardware is needed, the model may need to be converted to the desired binaries afterward. However, there are also model converters and open standards available that allow for easy conversion of models trained in Python to the target platform/device with little to no loss in performance.', 'citations': [{'id': '2022-06-23 09:25:18.974749 UTC', 'message': "U01BP1L847K: Hi all!\nQuestion about computer vision, hopefully you guys know the answer to this! :blush: When I google on it I see it's quite a hot topic..\n\nWhat language is most often used (or preferred) to develop CV models? I thought at first it would be python, but I hear a lot of C++ as well. What are your thoughts on this? And any pros and cons?U01BP1L847K: <@U02CMHKR8G7>U029FHM5LPP: I think the language will have a lot to do with what you plan on doing with your model. I.e. do you intend to deploy it on an edge device like Google Glass? Or is it just doing some batch inference once in a while?U01BP1L847K: That's interesting; how would that affect the choice of language?\nAnd let's say, it's mainly for batch or express inferencing on the cloud (where there's sufficient compute capacity available), not on edge devices.U01T78HPG3H: <@U029FHM5LPP> surely where it's deployed doesn't impact how it's developed? Can optimize/reconfigure the model however you need to after it's trained. Or are you thinking of models that need to retrain as part of their deployment?U029FHM5LPP: I've spoken to a couple of companies that prototype in Python, but the moment they have some kind of validation of the approach or the kinds of behaviour and models they want to use, they switch everything over to C++.U02CMHKR8G7: Yes to me the language is more a deployment issue than development, if you're in need of high performances or specific hardware you will certainly convert your model to the desired binaries afterward (though it might not be trivial) U033QREB56V: We develop in PyTorch and export to CoreML which is then deployed to iPhones running Swift/C++/Objective-C. So target deployment device affects the language translations / graph format conversions we have to do to make the final leap into production. Also FYI there's <#C026ED0PZEZ|computer-vision>U01RLNFHEU9: <@U01BP1L847K> to add to all the great contributions here, nowadays you also have access to a lot of great model converters/open standards, that allows you to convert your model trained in Python to the target platform/device with little to no loss in performance.\n\nPython model converters\n• Tensorflow -&gt; <https://www.tensorflow.org/lite/models/convert>\n• PyTorch -&gt; <https://pytorch.org/docs/stable/jit.html>\nOpen standard for ML\n• ONNX -&gt; <https://onnx.ai/>U029FHM5LPP: <@U01RLNFHEU9> yeah those are great. I'd add MMDeploy to that list (which should probably be called 'MMConvert' since that's basically what it does) -&gt; <https://mmdeploy.readthedocs.io/en/latest/>U033QREB56V: <@U029FHM5LPP> Is there a particular conversion pathway that you've found MMDeploy helpful for where alternative options are non-existent? 
Most of the major formats (TF, ONNX, PyTorch) seem to have sufficient compatibility, so I'm curious if there are more niche caches you run intoU029FHM5LPP: <@U033QREB56V> I’ve trained a bunch of MMDetection models, so when you want to convert from that then often MMDeploy is the only easy way.U029FHM5LPP: Adding a recent conversation I had on this very topic for the podcast <https://mlops-community.slack.com/archives/C018E4N2H9V/p1656578980457369>"}]}
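A hedged sketch of the Python-side conversion paths listed above (torch.jit for TorchScript and the built-in ONNX exporter); the ResNet-18 model is a stand-in, and the weights= argument assumes a recent torchvision release.

```python
# Sketch: prototype in PyTorch, then export to TorchScript and ONNX
# for whatever runtime the deployment target needs.
import torch
import torchvision

model = torchvision.models.resnet18(weights=None).eval()  # stand-in model
dummy = torch.randn(1, 3, 224, 224)

# TorchScript export (torch.jit, linked above)
scripted = torch.jit.trace(model, dummy)
scripted.save("resnet18_scripted.pt")

# ONNX export (open standard, linked above)
torch.onnx.export(
    model, dummy, "resnet18.onnx",
    input_names=["input"], output_names=["logits"],
)
```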
93
91What tools are available for data observability, monitoring drift in predictions, and overall data health before consumption into a feature store and/or models?{'summary_answer': 'Some tools mentioned in the context for data observability include elementary-lineage (an open-source tool), Grafana + TICK (realtime observability dashboards), BQ+Data Studio (for more involved data quality metrics and KPIs), and oss-evidently.', 'citations': [{'id': '2022-01-19 21:16:21.2533 UTC', 'message': "U02CCK9694L: Hi all, curious: What tools(if any) are people using for data observability? e.g. monitoring drift in predictions, overall data health before consumption into a feature store and/or modelsU016624NW3U: I just discovered this open-source tool. Haven't used it yet but looks great! <https://github.com/elementary-data/elementary-lineage>U02KEE6B2F8: <@U02CCK9694L> sharing a tool with some details in this blog <https://proud-botany-7dd.notion.site/Reducing-Data-Drift-874cb46061aa4984b31ceccc271e6e30|https://proud-botany-7dd.notion.site/Reducing-Data-Drift-874cb46061aa4984b31ceccc271e6e30>\n\nFull disclosure I am the CEOU02CCK9694L: Thanks or the replies!U01CZGD4PM2: Realtime observability dashboards (Grafana + TICK on prem, data flow things in the cloud) for the basic health metrics. BQ+Data Studio for more involved data quality metrics and KPIs. Prettiness and interactivity is lacking for the latter, but it gets the job done.U01HF46CURG: <#C026KJWSLSW|oss-evidently>"}]}
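A hedged sketch of a data-drift check with Evidently (referenced above via its community channel); Evidently's API has changed across releases, so this assumes the Report/DataDriftPreset interface of recent versions, and the dataframes are synthetic placeholders.

```python
# Sketch: compare a reference window against current data for drift.
# API assumed from Evidently ~0.3.x; it differs in older releases.
import numpy as np
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

rng = np.random.default_rng(0)
reference = pd.DataFrame({"feature_a": rng.normal(0.0, 1.0, 1000)})
current = pd.DataFrame({"feature_a": rng.normal(0.5, 1.0, 1000)})

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")
```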
94
92When prototyping infrastructure on multiple local machines, should one use docker containers or VMs?{'summary_answer': 'One should use docker containers when prototyping infrastructure on multiple local machines.', 'citations': [{'id': '2020-08-26 15:37:47.0491 UTC', 'message': 'U013YS7NE5B: For prototyping infrastructure on multiple local machines: docker containers or VMs?U013CL3GTB3: docker containersssU013YS7NE5B: How are docker containers an improvement over a virtual machine image?U013CL3GTB3: for me its about quick experimentation. I can containerize some code pretty quickly and try it in a bunch of places - other local machines, running on a cloud instance, etc.U013CL3GTB3: not sure about the pros and cons though. would love to hear some thoughts on thisU016SA6SLJ1: I agree with David - I havn’t used VMs since I started working with containers - just thinking how long it takes to start a VM :weary:\nJust today I tested a service of mine and changed it several times so I rebuiled the container many times today and uploaded it to the cloud a few more.. all was really fast, seconds on my local machine for every rebuild and restartU017CCSNYDV: <@U013YS7NE5B> can you expand a bit more on "prototyping infrastructure" and what you are trying to show?U013YS7NE5B: We\'re building some ML/analytics infrastructure and are doing everything locally for now. standardizing database / api/service configuration is becoming a pain point and the thing to do is put everything into a container /vm.UP3T8K9M5: <@U013YS7NE5B> im gonna talk with the folks from adobe next week in the other community I run dok.community about when they use one or the other. might be a good idea for you tro drop in and ask them a few questions\n\nTuesday at 5 pm UK / 9 am PST\n<https://go.dok.community/register>UV92GMLF4: +1 for Docker. I worked in the both worlds, and Docker it\'s a bliss in comparison with VMs (I worked with Virtual Box in their golden days and vmware).\n\nSome arguments in favor of Docker:\n• Lightweight deployment and super lean environments, \n• Docker is hassle-free from managing hardware resources\n• Scale-out, IMHO, way easier with tools like Mesos\n• No Xen/Hypervisor nightmares\n• IMHO, a way better for orchestration with compose \nBut of course there\'s some concerns about Docker containers like security (environment isolation), OS compliance in some cases and so on, but depends of the case.U017CCSNYDV: So, you should be working as close to the production deployment as you can; if your prod is VMs then do that - otherwise you are not de-risking anything and you might have to do it all again. If you are defining what will go into production then consider k8s + containers.U0116PEG352: Agree with <@U013CL3GTB3> - you can\'t beat containers for quick CICD. VMs have configuration and management overhead of managing the OS, and hence are larger, and slower to burndown/startup. For quick experimentation pipelines for ML training, stay away from VMs unless you are doing bleeding-edge stuff that requires kernel configuration.U015CHWG25B: +1 to Docker containers. One the reasons is that after you prototype your infrastructure on local machines with Docker, it will be much easier to scale it to production (e.g. in some cloud) than with VMs.'}]}
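A hedged sketch of the quick containerized-experiment loop described above, driven from Python via the Docker SDK; the image and command are placeholders and a local Docker daemon is assumed.

```python
# Sketch: run a throwaway command in a container from Python.
# Assumes the `docker` package and a running Docker daemon.
import docker

client = docker.from_env()
logs = client.containers.run(
    "python:3.11-slim",               # disposable environment
    'python -c "print(40 + 2)"',
    remove=True,                      # clean the container up afterwards
)
print(logs.decode().strip())          # -> 42
```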
95
93What are some recommendations for a small team trying to implement MLOps, specifically around toolset, Python programming, CI/deployment, project scaffolding, documentation, project registry, data exploration/preparation, tests, feature store, data versioning, experimenting, and training?{'error': [{'body': 'Transformation prompt_completion with name prompt_completion failed on state {"input":{"prompt":"Context:[\\"U01P5SV0Q9G: \\\\n*Toolset*\\\\n\\\\nWith this premises, I have two different and opposed concerns or even fears.\\\\nFear of not using enough tools and good practices and arriving in a couple of years to a state where we cannot manage our own code, project and models.\\\\nFear of using so many tools that they impose a burden our small team cannot bear.\\\\n\\\\nIt’s clear that we need some MLOps, but how much, I don’t know. I will review some things I have been reading, and I hope you can help me choosing the right tools.\\\\n\\\\n*Python Programming*\\\\n\\\\nIt looks like we will program using Visual Studio. We will use a remote interpreter, because we will run things on the Cloudera Nodes, although we will program locally and integrate the code with a SVN repository.\\\\nDo we need tools for standardizing our code, like PyLint, Flake8, MyPy or Black? Would you recommend any of those?\\\\n\\\\n*CI and Deployment*\\\\n\\\\nWe will use Jenkins. For deployment of our code, is Docker a n with error Error: Prompt is too long. Please reduce prompt in length.', 'step_name': 'prompt_completion', 'raw': 'Error: Prompt is too long. Please reduce prompt in length.'}]}
96
94How can data mesh and feature stores coexist? Should there be a centralized team to maintain a feature store with multiple data products or should a data product use a feature store and serve the online and historical data as different ways of consuming the features?{'summary_answer': 'The cited thread contains only the question and received no replies, so the trade-offs below are general considerations rather than community answers. A centralized team maintaining a shared feature store across data products favors consistency and quality but can become a bottleneck; letting each data product own a feature store and expose online and historical access as two ways of consuming its features favors autonomy and faster iteration but risks duplication and inconsistency. The right balance depends on the organization and should be revisited as it evolves.', 'citations': [{'id': '2023-03-29 14:22:09.718219 UTC', 'message': "U03LV7TMNG3: Hi :wave: a question related with data mesh and feature stores. How it's supposed to live both concepts at the same time? Should a centralised team maintain a feature store with multiple data products? Should a data product use (if its the case) a feature store and serve the online and historical data as different ways of consume the features? Other way?"}]}
97
95What is the best method to use for deployment in MLOps, MLflow + Seldon or Kubeflow? And can MLflow and Kubeflow or Seldon be used together for deployment?{'summary_answer': 'According to the conversation, it depends on the specific needs and preferences of the user. If the user wants a more customized approach and wants to track machine learning experiments, they should use Kubeflow. If they want a simpler approach and want to deploy to managed platforms, they should use MLFlow. MLFlow and Kubeflow can be used together, but it may be more complicated. Seldon Core comes installed with Kubeflow, and people generally use Kubeflow for training and deploy onto Seldon. One user suggested using Seldon + MLFlow for deployment.', 'citations': [{'id': '2022-07-06 12:17:16.100059 UTC', 'message': "U03BR541FG9: Hi all,\nI have a question about the deloyment part in Mlops. What is the best method to use for deployment Mlflow + seldon or Kubeflow?U03KBAPV7MH: Hi <@U03BR541FG9> Use Kubeflow if you want to track your machine learning experiments and deploy your solutions in a more customized way, backed by Kubernetes. Use MLFlow if you want a simpler approach to experiment tracking and want to deploy to managed platforms such as Amazon Sagemaker.\nYou can also use MLFlow + Kubeflow together but I would feel it would be burden of mangling two things.\nSame with Seldon -- Seldon Core comes installed with Kubeflow. People generally use kubeflow for training and deploy onto seldon. So I guess you can stick with Kubeflow from end to end if your are looking to deploy models onto Kubernetes.U01J3NRJA3X: Use of deployment I'd say Seldon + MLFlow. I'd go as far as saying you can't put Kubeflow and easy in the same sentence :sweat_smile:. \n\nOn our side we installed seldon core directly on our K8s cluster and then starting deploying different models U03KBAPV7MH: <@U01J3NRJA3X> I agree that easy and kubeflow do not have a place in a single sentence. :joy:U03BR541FG9: <@U01J3NRJA3X> Thank you for your answer. Do you have an example of how to use mlflow+seldon for deployment it will be very helpful if you have an example or links that can help me to setup the deployment phase using these tools.U01J3NRJA3X: What do you use MLFLow for? Only the tracking part or you also build MLFlow server?U03BR541FG9: yes I'm using mlflow for tracking, now I started the deployment part and I want to ask what tools I need for the deplyment (seldon+mlflow or just seldon).U01J3NRJA3X: MLFlow has an example if you want :smile: <https://mlflow.org/docs/latest/tutorials-and-examples/tutorial.html#deploy-the-model-to-seldon-core-or-kserve-experimental>"}]}
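A hedged sketch of the MLflow half of the MLflow + Seldon combination suggested above: train, track, and log a model so that Seldon's prepackaged MLflow server can later point at the logged artifact (the SeldonDeployment side is covered by the tutorial linked in the thread and is not reproduced here). The dataset and parameters are placeholders.

```python
# Sketch: track an experiment and log the model with MLflow; the resulting
# artifact URI is what a Seldon MLflow-server deployment would reference.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet

X, y = load_diabetes(return_X_y=True)

with mlflow.start_run():
    model = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)
    mlflow.log_param("alpha", 0.5)
    mlflow.log_metric("train_score", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")
    print("artifact URI:", mlflow.get_artifact_uri("model"))
```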