Unveiling the Magic: Scaling Large Language Models to Serve Millions
A single short prompt can exhaust your GPU resources. Learn how a custom proxy and clever rate-limiting can serve large language models to millions of users.
#1 · about 3 minutes
Understanding the benefits of self-hosting large language models
Self-hosting LLMs gives you greater control over data privacy, compliance, and cost than third-party services, and avoids vendor lock-in.
#2 · about 4 minutes
Architectural overview for a scalable LLM serving platform
A scalable LLM service requires key components for model acquisition, inference, storage, billing, security, and request routing.
#3 · about 7 minutes
Choosing an inference engine and model storage strategy
Storing models on a shared network file system (NFS) is crucial for reducing startup times and enabling fast horizontal scaling when new model instances are deployed.
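The shared-storage idea can be sketched as a simple cache-or-fetch lookup. This is a minimal sketch, not the talk's implementation: `ensure_model`, the model name, and the stub downloader are hypothetical, and in production `cache_dir` would be an NFS mount shared by all replicas so only the first replica pays the download cost.

```python
import os
import tempfile

def ensure_model(model_id: str, cache_dir: str, fetch) -> str:
    """Return the on-disk directory for model_id inside cache_dir.

    In production cache_dir is a shared NFS mount: every replica sees
    the same files, so a model is downloaded at most once and later
    replicas start serving immediately. `fetch` is a hypothetical
    downloader (e.g. pulling weights from a model registry).
    """
    path = os.path.join(cache_dir, model_id)
    if not os.path.isdir(path):
        os.makedirs(path, exist_ok=True)
        fetch(model_id, path)  # slow path: cold cache, download once
    return path

# Demo with a stub downloader: the second lookup is a cache hit.
downloads = []
cache = tempfile.mkdtemp()
ensure_model("llama-3-8b", cache, lambda mid, path: downloads.append(mid))
ensure_model("llama-3-8b", cache, lambda mid, path: downloads.append(mid))
```

The second call returns without invoking the downloader, which is what makes scale-out of a new replica fast once the volume is warm.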
#4 · about 5 minutes
Building an efficient token-based billing system
Aggregate token usage with tools like Redis before sending data to a payment provider to manage rate limits and improve system efficiency.
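The aggregate-then-flush pattern can be illustrated in a few lines. This is an in-memory sketch of my own, not the talk's code: in production the counters would live in Redis (e.g. `INCRBY` on a per-user key) so multiple proxy replicas share one view, and `flush` would forward one batched usage record per user to the payment provider instead of returning a dict.

```python
from collections import defaultdict
from threading import Lock

class TokenUsageAggregator:
    """Buffers per-user token counts and releases them in batches,
    so the payment provider sees one record per user per flush
    instead of one API call per LLM request."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = Lock()

    def record(self, user_id: str, prompt_tokens: int, completion_tokens: int):
        # Hot path: one cheap counter increment per request.
        with self._lock:
            self._counts[user_id] += prompt_tokens + completion_tokens

    def flush(self) -> dict:
        # Cold path (e.g. on a timer): hand back the batch and reset.
        with self._lock:
            batch, self._counts = dict(self._counts), defaultdict(int)
        return batch

agg = TokenUsageAggregator()
agg.record("user-1", prompt_tokens=120, completion_tokens=380)
agg.record("user-1", prompt_tokens=200, completion_tokens=300)
agg.record("user-2", prompt_tokens=50, completion_tokens=50)
batch = agg.flush()
```

Batching this way keeps you well under the payment provider's rate limits regardless of request volume, since outbound billing traffic scales with the number of active users per flush interval, not with the number of requests.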
#5 · about 3 minutes
Implementing robust rate limiting for shared LLM systems
Prevent abuse of shared resources by combining request-based and token-based rate limiting, estimating output token counts up front since actual completion lengths are unknown until generation finishes.
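One common way to realize token-based limiting is a token bucket that charges an estimated cost before the request runs. The sketch below is a minimal illustration under my own assumptions (it is not the talk's implementation): the pessimistic estimate used here is simply the request's `max_output_tokens`, and a real proxy would also layer a per-user request-count limit on top.

```python
import time

class TokenRateLimiter:
    """Token-bucket limiter over a per-user tokens-per-minute budget.

    Because the real completion length is unknown when the request
    arrives, we reserve a pessimistic estimate (prompt tokens plus the
    requested max_output_tokens) and reject the request if the bucket
    cannot cover it."""

    def __init__(self, tokens_per_minute: int):
        self.capacity = tokens_per_minute
        self.available = float(tokens_per_minute)
        self.refill_rate = tokens_per_minute / 60.0  # tokens per second
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.refill_rate)
        self.last = now

    def try_acquire(self, prompt_tokens: int, max_output_tokens: int) -> bool:
        self._refill()
        cost = prompt_tokens + max_output_tokens
        if cost <= self.available:
            self.available -= cost
            return True
        return False

limiter = TokenRateLimiter(tokens_per_minute=6000)
ok = limiter.try_acquire(prompt_tokens=500, max_output_tokens=1000)
```

A refinement, once generation completes, is to refund the difference between the estimate and the actual completion length so well-behaved clients are not over-charged.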
#6 · about 3 minutes
Selecting the right authentication and authorization strategy
Bearer tokens offer a flexible solution for managing authentication and fine-grained authorization, such as restricting access to specific models.
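Per-model authorization on top of bearer tokens can be as simple as attaching a model scope to each token. The following is a hedged sketch with hypothetical token values, model names, and a `TOKEN_SCOPES` lookup table standing in for whatever store the real system uses:

```python
# Hypothetical token store: maps bearer tokens to the models they may use.
TOKEN_SCOPES = {
    "sk-alice-123": {"models": {"llama-3-8b", "mistral-7b"}},
    "sk-bob-456":   {"models": {"llama-3-8b"}},
}

def authorize(authorization_header, requested_model: str):
    """Return an (http_status, reason) pair for an inference request.

    401 = authentication failed (no/unknown token),
    403 = authenticated but not authorized for this model."""
    if not authorization_header or not authorization_header.startswith("Bearer "):
        return 401, "missing or malformed bearer token"
    token = authorization_header[len("Bearer "):].strip()
    scope = TOKEN_SCOPES.get(token)
    if scope is None:
        return 401, "unknown token"
    if requested_model not in scope["models"]:
        return 403, "token not authorized for this model"
    return 200, "ok"
```

Keeping the check in the proxy means individual model servers never need to know about users at all; they only ever see pre-authorized traffic.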
#7 · about 2 minutes
Scaling inference with Kubernetes and smart routing
Use tools like KServe or Knative on Kubernetes for intelligent autoscaling and canary deployments based on custom metrics like queue size.
#8 · about 3 minutes
Summary of best practices for scalable LLM deployment
Key strategies for success include robust rate limiting, modular design, continuous benchmarking, and using canary deployments for safe production testing.