Updates and recent posts about Slurm.

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

The Power of Asymmetric Experiments @ Meta

Meta's bold move to crank up control group sizes (sometimes 21 times larger) while shrinking test groups by half keeps those cherished confidence intervals intact. Asymmetric experiments shine when you've got low experiment bandwidth, recruitment costs peanuts, and test interventions drain the budget. .. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Unleashing the Power of Model Context Protocol (MCP): A Game-Changer in AI Integration

Model Context Protocol (MCP) is the AI world's version of USB-C. It lets models snag live data and tango with APIs, juicing up their powers like never before. Microsoft's Azure OpenAI Services uses MCP to catapult GPT models out of their static halls of knowledge, mixing in real-time tool hookups for o.. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

On How We Moved to Kubernetes

Migrating from AWS ECS to AWS EKS? Beats the bark out of those pesky spot instance disruptions, but introduces a new player: the complexity monster named Kubernetes. Bigger, faster, cheaper, if you know the dance steps. Juggling CPUs in Kubernetes feels like herding caffeinated cats. Enter Karpenter to sav.. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

The Production-Ready Kubernetes Service Checklist

Running Kubernetes in production isn’t just a button-click. Start with 3 master nodes to dodge disasters. Dish out load balancing to smash single points of failure. Skew your node sizing for peak workload muscle. Automate scaling with Cluster Autoscaler, your new best friend. Keep your setup a fortress with.. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Uber’s Journey to Ray on Kubernetes: Ray Setup

Uber enhanced its machine learning platform by migrating workloads to Kubernetes in early 2024. The migration aimed to solve pain points such as manual resource management, inefficient resource utilization, and inflexible capacity planning. The architecture designed included federated resource manag.. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

CNPG Recipe 17 - PostgreSQL In-Place Major Upgrades

CloudNativePG 1.26 storms the scene, making PostgreSQL upgrades a breeze inside Kubernetes. It slashes the usual chaos. Minimal downtime threatens, but what's life without a little thrill?.. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Automated Testing for Terraform, Docker, Packer, Kubernetes, and More

Automated tests crush infrastructure anxiety. Use tools like Terratest to deploy, validate, and clean up, all without a stealth deployment... read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Recycling a OnePlus 6T into a Kubernetes Node

Connected a 7-year-old OnePlus 6T as a Kubernetes node in my homelab (tagged on "8" cores, 6GB RAM), but the postmarketOS kernel didn’t have nftables' numgen! Wrestled with manual kernel compilation and untangled DNS snafus, but now the project's chugging along mighty fine... read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Introducing kro: Kube Resource Orchestrator

The Kube Resource Orchestrator (kro) dreams big by letting you turn complex Kubernetes APIs into elegant, single ResourceGroup CRDs. Think of it as Kubernetes without the migraines, with dependencies and configurations quietly managed in the background. An AWS experiment still cooking, it's not quite ready fo.. read more

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

How autoscaling took down my application..!!

A glitch in the autoscaling settings skewed the NEGs, cramming them into a single AZ. Boom. Next thing you know, pods flounder and the app goes belly-up... read more

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.
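
To make the command-line side more concrete, here is a minimal Python sketch of how a script might summarize the queue reported by squeue. It assumes squeue is asked for a pipe-delimited layout of job ID, state, and partition (the `%i`, `%T`, and `%P` format fields). The helper names, the sample text, and the partition names are hypothetical, and `query_squeue` only works on a machine where Slurm is installed.

```python
import subprocess

# Assumed layout: job id | job state | partition.
SQUEUE_FORMAT = "%i|%T|%P"

# Hypothetical sample of what squeue might print with the format above.
SAMPLE_OUTPUT = "1001|RUNNING|gpu\n1002|PENDING|gpu\n1003|RUNNING|debug\n"


def parse_squeue(text):
    """Turn pipe-delimited squeue output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, state, partition = line.split("|", 2)
        jobs.append({"job_id": job_id, "state": state, "partition": partition})
    return jobs


def count_by_state(jobs):
    """Count how many jobs are in each state."""
    counts = {}
    for job in jobs:
        counts[job["state"]] = counts.get(job["state"], 0) + 1
    return counts


def query_squeue():
    """Run squeue and parse its output (requires a Slurm installation)."""
    result = subprocess.run(
        ["squeue", "--noheader", f"--format={SQUEUE_FORMAT}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


if __name__ == "__main__":
    # Uses the canned sample so the sketch runs without a cluster.
    print(count_by_state(parse_squeue(SAMPLE_OUTPUT)))
```

On a real cluster, `query_squeue()` would replace the canned sample. Keeping the parsing separate from the subprocess call makes the logic easy to check without access to a controller.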

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
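
As a rough illustration of how partitions and GRES show up in day-to-day use, the Python sketch below renders a batch script that requests a partition, a node count, a time limit, and GPUs per node through `#SBATCH` directives. The function name, the partition name `gpu`, and the `python train.py` command are hypothetical, and a real site would have its own partition names and limits.

```python
def render_sbatch_script(job_name, partition, nodes, gpus_per_node, time_limit, command):
    """Build the text of a Slurm batch script (illustrative sketch only)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",  # which partition to queue in
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",      # wall-clock limit, e.g. HH:MM:SS
    ]
    if gpus_per_node > 0:
        # GPUs are requested as a generic resource (GRES), counted per node.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append("")
    lines.append(f"srun {command}")  # launch the job step under Slurm
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: partition name and command are placeholders.
    script = render_sbatch_script(
        job_name="demo",
        partition="gpu",
        nodes=2,
        gpus_per_node=4,
        time_limit="01:00:00",
        command="python train.py",
    )
    print(script)
```

The resulting text could be saved to a file and handed to sbatch. Whether the request is accepted depends on the partition's configured policies.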

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.