
Updates and recent posts about Slurm.
@kaptain shared a link, 3 weeks, 2 days ago
FAUN.dev()

Streamline your containerized CI/CD with GitLab Runners and Amazon EKS Auto Mode

GitLab Runners now work with Amazon EKS Auto Mode. That means hands-off infra, smarter scaling, and built-in AWS security. Runners spin up on EC2 Spot Instances, so teams can cut CI/CD compute costs by as much as 90% - without hacking together flaky pipelines...

@kaptain shared a link, 3 weeks, 2 days ago

Implementing assurance pipeline for Amazon EKS Platform

AWS released a full-stack CI/CD validation pipeline for Amazon EKS. It pulls in six layers of testing, Terraform, Helm, Locust load testing, and even AWS Fault Injection for pushing resilience to the edge. The goal: bake policy checks, functional tests, and brutal load tests right into pre-deployment. Fewe..

@kaptain shared a link, 3 weeks, 2 days ago

From Deterministic to Agentic: Creating Durable AI Workflows with Dapr

Dapr dropped Durable Agents - a mashup of classic workflows and LLM-driven agents that can actually get things done and survive rough edges. They track reasoning steps, tool calls, and chat states like a champ. If things crash, no problem: Dapr Workflows and Diagrid Catalyst bring it all back...

@kaptain shared a link, 3 weeks, 2 days ago

Kubernetes GPU Management Just Got a Major Upgrade

Kubernetes 1.34 dropped Dynamic Resource Allocation (DRA) - think persistent volumes, but for GPUs and custom hardware. Vendors can now plug in drivers and schedulers for their devices, and workloads can pick exactly what they need. Coming in 1.35: a new workload abstraction that speaks the language of ..

@kaptain shared a link, 3 weeks, 2 days ago

v1.35: New level of efficiency with in-place Pod restart

Kubernetes 1.35, as you may know, introduced in-place Pod restarts (alpha). It's a real reset: all containers, init and sidecars included - without killing the Pod or kicking off a reschedule. Think restart without the cloud drama. Big win for workloads with heavy inter-container dependencies or massi..

@kaptain shared a link, 3 weeks, 2 days ago

1.35: Enhanced Debugging with Versioned z-pages APIs

Kubernetes 1.35 makes a quiet-but-crucial upgrade: z-pages debugging endpoints now return structured, machine-readable JSON. That means tools - not just tired humans - can parse control plane state directly. The responses are versioned, backward-compatible, and tucked behind feature flags for now...

@kaptain shared a link, 3 weeks, 2 days ago

v1.35: Watch Based Route Reconciliation in the Cloud Controller Manager

Kubernetes v1.35 sneaks in an alpha feature gate that flips the CCM route controller from "check every X minutes" to "watch and react." It now uses informers to trigger syncs when nodes change - plus a light periodic check every 12–24 hours...

@kala shared a link, 3 weeks, 2 days ago

The 2026 Data Engineering Roadmap: Building Data Systems for the Agentic AI Era

Data engineering’s getting flipped. AI agents and LLMs aren’t just tagging along anymore - they’re the main users now. That means engineers need to build context-aware, machine-readable data systems that don’t just store info but actually make sense of it. Think: vector databases, knowledge graphs, semantic ..

@kala shared a link, 3 weeks, 2 days ago

2025: The year in LLMs

2025 was the year LLMs stopped just answering questions and started building things. Reasoning models like OpenAI’s o-series and Claude Code took over tool-driven workflows. Asynchronous coding agents broke out. These models didn’t just write code - they ran it, debugged it, then did it again. That loo..

@kala shared a link, 3 weeks, 2 days ago

Streamlining Security Investigations with Agents

Slack broke down how it's threading AI into its product without torching user trust. Slack AI leans hard on tenant-specific data isolation and zero data retention - no leftover crumbs from LLM interactions. Instead of piping user data through someone else’s APIs, Slack runs LLMs on its own infra where it ca..

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components like slurmdbd and slurmrestd extend Slurm with accounting and REST APIs. A rich set of commands—such as srun, squeue, scancel, and sinfo—gives users and administrators full visibility and control.
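
As a rough illustration of the visibility that squeue offers, the sketch below summarizes queue state from Python. It assumes squeue is called with -h (no header) and -o "%i|%T|%P" (job ID, long-form state, partition). The helper names parse_squeue, count_states, and read_queue, and the SAMPLE text, are hypothetical and exist only for this example.

```python
import shutil
import subprocess
from collections import Counter

# squeue format string: %i = job ID, %T = job state (long form), %P = partition.
SQUEUE_FORMAT = "%i|%T|%P"


def parse_squeue(text):
    """Turn pipe-separated squeue output into a list of dicts."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        job_id, state, partition = line.split("|")
        jobs.append({"job_id": job_id, "state": state, "partition": partition})
    return jobs


def count_states(jobs):
    """Count how many jobs are in each state (RUNNING, PENDING, ...)."""
    return Counter(job["state"] for job in jobs)


def read_queue():
    """Query the live queue if squeue is on PATH; return [] otherwise."""
    if shutil.which("squeue") is None:
        return []
    result = subprocess.run(
        ["squeue", "-h", "-o", SQUEUE_FORMAT],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_squeue(result.stdout)


# Hypothetical sample output, so the parser can be tried without a cluster.
SAMPLE = """\
1001|RUNNING|gpu
1002|PENDING|gpu
1003|RUNNING|debug
"""

if __name__ == "__main__":
    print(count_states(parse_squeue(SAMPLE)))
```

On a machine with Slurm installed, read_queue() would return the same structure for the live queue. Keeping the parser separate from the subprocess call makes it easy to test against canned text.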

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
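
To make partitions and GRES a little more concrete, here is a minimal sketch that builds a batch script requesting GPUs through --gres. The function build_batch_script, the partition name "gpu", and the command "python train.py" are assumptions for illustration. The #SBATCH options shown (--job-name, --partition, --nodes, --time, --gres) are standard sbatch options, but which partitions and GRES types exist depends entirely on the site's configuration.

```python
def build_batch_script(job_name, partition, command,
                       nodes=1, gpus_per_node=0, time_limit="00:10:00"):
    """Return the text of a Slurm batch script as a string.

    gpus_per_node maps to --gres=gpu:N, which requests N GPUs on each node.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node > 0:
        # Only ask for GPUs when the job needs them.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append("")
    # srun launches the command inside the allocation.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical job: 2 nodes with 4 GPUs each on a partition named "gpu".
    print(build_batch_script("train", "gpu", "python train.py",
                             nodes=2, gpus_per_node=4))
```

The resulting text could be written to a file and submitted with sbatch. Generating it from a function keeps resource requests in one place when many similar jobs are submitted.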

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.