Updates and recent posts about Slurm
@kaptain shared a link, 1 month, 1 week ago
FAUN.dev()

Kubernetes Optimization: In-Place Pod Resizing, Zone-Aware Routing

Halodoc cut EC2 costs and shaved latency by leaning into two Kubernetes tricks: In-place pod resizing (v1.33) lets them dial pod resources up or down on the fly, especially handy during off-peak hours. Zone-aware routing via topology-aware hints keeps inter-service traffic close to home (same AZ), skipp.. read more

@kaptain shared a link, 1 month, 1 week ago

Avoiding Zombie Cluster Members When Upgrading to etcd v3.6

etcd v3.5.26 patches a nasty upgrade bug. It now syncs v3store from v2store to stop zombie nodes from corrupting clusters during the jump to v3.6. The core issue: Older versions let stale store states bring removed members back from the dead... read more

@kala shared a link, 1 month, 1 week ago

Chinese AI in 2025, Wrapped

Chinese AI milestones in 2025: Big models from DeepSeek and others, AGI discussions at Alibaba, US-China chip war swings, Beijing's AI Action plan, and more. DeepSeek led the way with an open-source model, setting off a wave of Chinese companies going open-source. China's push for AGI and involvemen.. read more  

@kala shared a link, 1 month, 1 week ago

Review of Deep Seek OCR

DeepSeek-OCR flips the OCR script. Instead of feeding full image tokens to the decoder, it leans on an encoder to compress them up front, trimming down input size and GPU strain in one move. That context diet? It opens the door for way bigger windows in LLMs. Why it matters: Shoving compression earlie.. read more

@kala shared a link, 1 month, 1 week ago

Evaluating AI Agents in Security Operations

Cotool threw frontier LLMs at real-world SecOps tasks using Splunk’s BOTSv3 dataset. GPT-5 topped the chart in accuracy (62.7%) and gave the best results per dollar. Claude Haiku-4.5 blazed through tasks fastest, just 240 seconds on average, maxing out tool integrations. Gemini-2.5-pro flopped on both acc.. read more

@kala shared a link, 1 month, 1 week ago

AI agents are starting to eat SaaS

AI coding agents are eating the lunch of low-complexity SaaS. Teams with a bit of dev muscle are skipping subscription logins and spinning up dashboards, pipelines, even decks, using Claude, Gemini, whoever’s fastest that day. Build vs. buy? Tilting back toward build. The kicker: build now takes min.. read more  

@kala shared a link, 1 month, 1 week ago

Everything to know about Google Gemini’s most recent AI updates

Google jammed a full no-code AI workshop into Gemini. The browser now bakes in Opal, a drag-and-drop app builder with a shiny new visual editor. You can chain prompts, preview apps, and feed it text, voice, or images, without touching code. They also dropped the Gemini 3 Flash model, built for dual rea.. read more

@devopslinks shared a link, 1 month, 1 week ago

From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store

Airbnb just rewired Mussel, its key-value store, with a smarter, layered QoS system. Out go the rigid QPS caps. In come resource-aware rate control, criticality-based load shedding, and real-time hot-key mitigation. Dispatchers now speak the language of backend cost - rows, bytes, latency - not just raw.. read more

@devopslinks shared a link, 1 month, 1 week ago

Agent-Driven SRE Investigations: A Practical Deep Dive into Multi-Agent Incident Response

A sandboxed setup dropped multiple Claude-powered agents into Docker containers to run a full incident response drill. Each agent played a role: probing Kubernetes clusters, sniffing out root causes, and shipping remediation PRs straight to GitHub. Out of 7 test incidents, they nailed the diagnoses .. read more  

@devopslinks shared a link, 1 month, 1 week ago

How We Saved 70% of CPU and 60% of Memory in Refinery’s Go Code, No Rust Required.

Refinery 3.0 cuts CPU by 70% and slashes RAM by 60%. The trick: selective field extraction from serialized spans. No full deserialization. Fewer heap allocations. Way less waste. It also recycles buffers, handles metrics smarter, and is gearing up to parallelize its core decision loop... read more  

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.
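
As a rough illustration of the visibility these commands provide, the sketch below parses pipe-delimited `squeue` output into Python records. The `Job` class, the helper names, and the sample text are made up for this example; the `-h` (no header) and `-o` (output format) options and the `%i`, `%P`, `%j`, `%u`, `%T`, `%M` field codes are standard `squeue` options. It assumes job names contain no `|` character, and the query helper returns an empty list when `squeue` is not installed.

```python
import shutil
import subprocess
from dataclasses import dataclass

# Pipe-delimited fields: job id, partition, name, user, state, elapsed time.
SQUEUE_FORMAT = "%i|%P|%j|%u|%T|%M"


@dataclass
class Job:
    job_id: str
    partition: str
    name: str
    user: str
    state: str
    elapsed: str


def parse_squeue(text: str) -> list[Job]:
    """Parse header-less, pipe-delimited squeue output into Job records."""
    jobs = []
    for line in text.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 6:
            continue  # skip blank lines and lines that do not match the layout
        jobs.append(Job(*fields))
    return jobs


def list_jobs() -> list[Job]:
    """Query the scheduler if squeue is on PATH; otherwise return no jobs."""
    if shutil.which("squeue") is None:
        return []
    result = subprocess.run(
        ["squeue", "-h", "-o", SQUEUE_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_squeue(result.stdout)


# Made-up sample output, for illustration only.
sample = "101|debug|train|alice|RUNNING|5:12\n102|debug|eval|bob|PENDING|0:00\n"
pending = [job.job_id for job in parse_squeue(sample) if job.state == "PENDING"]
print(pending)
```

Keeping the parser separate from the subprocess call means the parsing logic can be exercised on captured text without access to a cluster.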

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
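
To make partitions and GRES a little more concrete, here is a minimal sketch that renders a batch script requesting GPUs. The function name, the partition name `gpu`, and the `train.py` command are hypothetical; the `#SBATCH` options shown (`--job-name`, `--partition`, `--nodes`, `--time`, `--gres`) are standard `sbatch` options, and `--gres=gpu:N` requests N GPUs per node. Which partitions and GRES types actually exist depends on the site's configuration.

```python
def render_batch_script(job_name, partition, nodes, gpus_per_node, time_limit, command):
    """Return the text of a batch script with #SBATCH directives."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node > 0:
        # GRES request: GPUs are counted per allocated node.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    # srun launches the command as a job step inside the allocation.
    lines += ["", f"srun {command}", ""]
    return "\n".join(lines)


# Hypothetical values: the "gpu" partition and train.py are placeholders.
script = render_batch_script(
    job_name="train",
    partition="gpu",
    nodes=1,
    gpus_per_node=2,
    time_limit="01:00:00",
    command="python train.py",
)
print(script)
```

The resulting text could be saved to a file and submitted with `sbatch`; the scheduler then applies the partition's policies when deciding where and when the job runs.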

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.