
Updates and recent posts about Slurm.
Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

The Power of Asymmetric Experiments @ Meta

Meta's bold move to crank up control group sizes (sometimes 21 times larger) while shrinking test groups by half keeps those cherished confidence intervals intact. Asymmetric experiments shine when you've got low experiment bandwidth, recruitment costs peanuts, and test interventions drain the budget. .. read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

Unleashing the Power of Model Context Protocol (MCP): A Game-Changer in AI Integration

Model Context Protocol (MCP) is the AI world's version of USB-C. It lets models snag live data and tango with APIs, juicing up their powers like never before. Microsoft's Azure OpenAI Services uses MCP to catapult GPT models out of their static halls of knowledge, mixing in real-time tool hookups for o.. read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

On How We Moved to Kubernetes

Migrating from AWS ECS to AWS EKS? Beats the bark out of those pesky spot instance disruptions, but introduces a new player: the complexity monster named Kubernetes. Bigger, faster, cheaper, if you know the dance steps. Juggling CPUs in Kubernetes feels like herding caffeinated cats. Enter Karpenter to sav.. read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

The Production-Ready Kubernetes Service Checklist

Running Kubernetes in production isn’t just a button-click. Start with 3 master nodes to dodge disasters. Dish out load balancing to smash single points of failure. Skew your node sizing for peak workload muscle. Automate scaling with Cluster Autoscaler, your new best friend. Keep your setup a fortress with.. read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

Uber’s Journey to Ray on Kubernetes: Ray Setup

Uber enhanced its machine learning platform by migrating workloads to Kubernetes in early 2024. The migration aimed to solve pain points such as manual resource management, inefficient resource utilization, and inflexible capacity planning. The architecture designed included federated resource manag.. read more  

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

CNPG Recipe 17 - PostgreSQL In-Place Major Upgrades

CloudNativePG 1.26 storms the scene, making PostgreSQL upgrades a breeze inside Kubernetes. It slashes the usual chaos. Minimal downtime threatens, but what's life without a little thrill?.. read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

Automated Testing for Terraform, Docker, Packer, Kubernetes, and More

Automated tests crush infrastructure anxiety. Use tools like Terratest to deploy, validate, and clean up, all without a stealth deployment... read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

Recycling a OnePlus 6T into a Kubernetes Node

Connected a 7-year-old OnePlus 6T as a Kubernetes node in my homelab (tagged on "8" cores, 6GB RAM), but the postmarketOS kernel didn’t have nftables' numgen! Wrestled with manual kernel compilation and untangled DNS snafus, but now the project's chugging along mighty fine... read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

Introducing kro: Kube Resource Orchestrator

The Kube Resource Orchestrator (kro) dreams big by letting you turn complex Kubernetes APIs into elegant, single ResourceGroup CRDs. Think of it as Kubernetes without the migraines, with dependencies and configurations quietly managed in the background. An AWS experiment still cooking, it's not quite ready fo.. read more

Link
@faun shared a link, 1 year, 1 month ago
FAUN.dev()

How autoscaling took down my application..!!

A glitch in the autoscaling settings skewed the NEGs, cramming them into a single AZ. Boom. Next thing you know, pods flounder and the app goes belly-up... read more  

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.
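
To make the user-facing commands more concrete, here is a minimal Python sketch that summarizes queue state from `squeue` output. It assumes `squeue` is on the PATH and that job names contain no `|` character. The helper names (`parse_squeue`, `count_by_state`, `query_squeue`) and the sample rows are hypothetical and used only for illustration.

```python
import subprocess
from collections import Counter

# squeue format specifiers: job id, partition, job name, user, job state.
SQUEUE_FORMAT = "%i|%P|%j|%u|%T"
FIELDS = ("job_id", "partition", "name", "user", "state")


def parse_squeue(text):
    """Turn pipe-delimited squeue output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split("|")
        if len(parts) != len(FIELDS):
            # Skip rows that do not match the expected layout
            # (for example, a job name that itself contains "|").
            continue
        jobs.append(dict(zip(FIELDS, parts)))
    return jobs


def count_by_state(jobs):
    """Count jobs per state, e.g. RUNNING or PENDING."""
    return Counter(job["state"] for job in jobs)


def query_squeue():
    """Run squeue on a machine with Slurm installed and parse the result."""
    result = subprocess.run(
        ["squeue", "--noheader", f"--format={SQUEUE_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


# Hypothetical sample output, so the parser can be tried without a cluster.
SAMPLE = """\
101|gpu|train|alice|RUNNING
102|gpu|eval|bob|PENDING
103|cpu|prep|alice|RUNNING
"""

if __name__ == "__main__":
    print(count_by_state(parse_squeue(SAMPLE)))
```

On a real cluster, `query_squeue()` would replace the sample text. Parsing is kept separate from the subprocess call so it can be tested offline.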

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
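
As a rough illustration of how partitions and GRES show up in a job request, the sketch below renders a batch script with `#SBATCH` directives. The function name `render_batch_script`, the partition name `gpu`, and the command are hypothetical. Actual partition names and GPU availability depend on how a given cluster is configured.

```python
def render_batch_script(job_name, partition, nodes, gpus_per_node, time_limit, command):
    """Build the text of a Slurm batch script (a sketch, not a full template)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",   # which partition (queue) to use
        f"#SBATCH --nodes={nodes}",           # number of nodes requested
        f"#SBATCH --time={time_limit}",       # wall-clock limit, e.g. 01:00:00
    ]
    if gpus_per_node > 0:
        # GPUs are requested per node through Generic Resources (GRES).
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    # srun launches the command inside the allocation.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 2 nodes with 4 GPUs each on a partition named "gpu".
    print(render_batch_script("train", "gpu", 2, 4, "01:00:00", "python train.py"))
```

The resulting text could be saved to a file and submitted with `sbatch`. Whether the request is accepted depends on the limits and policies set for the target partition.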

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.