Updates and recent posts about Slurm.

Posts
Description

Link

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Optimize Gemma 3 Inference: vLLM on GKE 🏎️💨

GKE Autopilot's GPU means business: AI inference tasks don’t stand a chance. Just two arguments and, bam, you’ve unleashed Google's beastly Gemma 3 27B model, which chugs a massive 46.4GB of VRAM. ⚡️ Meanwhile, vLLM squeezes the models with bf16 precision, though optimization requires wrestling with algor..

Link

@faun shared a link, 1 year, 1 month ago

FAUN.dev()

Kubernetes 1.33 – What you need to know

Kubernetes 1.33 shakes things up with game-changing updates. LIST streaming encoding trims down API Server memory like a chef with a sharp knife. Deliberate deletion orders lock down security tighter than a drum. And get this: in-place updates for Pod resources ditch those annoying restarts! Finally, us..

Link

@anjali shared a link, 1 year, 1 month ago

Customer Marketing Manager, Last9

Observability vs APM: What’s the Real Difference?

Observability goes beyond APM—it's not just about metrics, it's about understanding why things break, not just that they did.

Link

@anjali shared a link, 1 year, 1 month ago

Customer Marketing Manager, Last9

Logging vs Monitoring: What’s the Real Difference?

Logging and monitoring work together, but they’re not the same. Here’s how they help you understand, fix, and improve your systems.

Link

@anjali shared a link, 1 year, 1 month ago

Customer Marketing Manager, Last9

Debug Logging: A Comprehensive Guide for Developers

A clear guide to debug logging—what it is, how to use it well, and why it matters when you're trying to understand what your code is doing.

Story

@shurup shared a post, 1 year, 1 month ago

@palark

Nelm, a new alternative to Helm, is GA

#Helm #Cloud N... #werf #Nelm #kuberne...

werf, a CNCF Sandbox project, announced Nelm as a new tool for deploying Helm charts.

Story

@laura_garcia shared a post, 1 year, 1 month ago

Software Developer, RELIANOID

GTS 2025 - here we go

🚀 RELIANOID is heading to Atlanta, Georgia for the Georgia Technology Summit 2025 on April 16th! We’re thrilled to be part of this premier event hosted by the Technology Association of Georgia (TAG)—a day dedicated to innovation, collaboration, and the future of tech. 💡🌐 With 1,200+ tech and busines..


Story

@viktoriiagolovtseva shared a post, 1 year, 1 month ago

Difference between Agile and Scrum

Agile is a methodology that helps teams build products through iterative development, continuous feedback, and adaptability. Scrum is a framework within Agile that provides a structured way to manage work using fixed-length iterations called sprints.

Link

@anjali shared a link, 1 year, 1 month ago

Customer Marketing Manager, Last9

How to Use Prometheus for APM

Learn how to turn Prometheus into a powerful APM tool—track app performance, reduce guesswork, and get real visibility into your systems.

Link

@anjali shared a link, 1 year, 1 month ago

Customer Marketing Manager, Last9

Logstash Grok Examples: A Detailed Guide to Pattern Matching

Learn how to use Logstash Grok with simple examples. Match and parse logs easily using patterns that are easy to understand.

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system widely used in high-performance computing (HPC). It requires no kernel modifications and coordinates thousands of compute nodes by allocating resources to users, launching and monitoring jobs on the allocated nodes, and arbitrating contention for resources through a queue of pending work.

At its core, Slurm uses a centralized controller daemon (slurmctld) to track cluster state and assign work, while a lightweight daemon (slurmd) on each compute node executes tasks and communicates hierarchically for fault tolerance. Optional components extend this core: slurmdbd records accounting data in a database, and slurmrestd exposes a REST API. Command-line tools give users and administrators visibility and control: srun launches jobs, squeue lists queued and running jobs, scancel cancels them, and sinfo reports node and partition state.
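
As a rough illustration of how these user-facing commands can be scripted, the sketch below parses squeue output in Python. It assumes squeue is called with --noheader and a pipe-delimited --format string; the helper names (parse_squeue, list_jobs) and the sample job data are hypothetical and not part of Slurm itself.

```python
import subprocess

# Pipe-delimited squeue format: job ID, partition, name, user, state, elapsed time.
SQUEUE_FORMAT = "%i|%P|%j|%u|%T|%M"
FIELDS = ("job_id", "partition", "name", "user", "state", "elapsed")


def parse_squeue(text):
    """Turn pipe-delimited squeue output into a list of dicts (one per job)."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.split("|")
        if len(parts) != len(FIELDS):
            continue  # skip lines that do not match the expected layout
        jobs.append(dict(zip(FIELDS, parts)))
    return jobs


def list_jobs(user=None):
    """Run squeue and return parsed jobs. Needs a Slurm client installation."""
    cmd = ["squeue", "--noheader", f"--format={SQUEUE_FORMAT}"]
    if user:
        cmd.append(f"--user={user}")
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_squeue(result.stdout)


# Hypothetical sample output, so the sketch runs without a cluster.
SAMPLE = """\
1001|gpu|train|alice|RUNNING|1:02:03
1002|cpu|prep|bob|PENDING|0:00
"""

jobs = parse_squeue(SAMPLE)
pending = [j["job_id"] for j in jobs if j["state"] == "PENDING"]
print(pending)
```

On a real cluster, list_jobs() would replace the sample string. A job name containing a pipe character would be skipped by this simple parser, so a different delimiter may be a better choice in practice.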

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
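
To make the partition and GRES ideas concrete, here is a minimal sketch that renders a batch script requesting GPUs. The #SBATCH options it emits (--job-name, --partition, --nodes, --time, --gres) are standard sbatch options; the function name, the partition name "gpu", and the training command are hypothetical placeholders for whatever a given site configures.

```python
def render_batch_script(job_name, partition, gpus_per_node, nodes, time_limit, command):
    """Build the text of an sbatch script (hypothetical helper, not a Slurm API)."""
    if gpus_per_node < 0 or nodes < 1:
        raise ValueError("nodes must be >= 1 and gpus_per_node >= 0")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node:
        # GRES requests are per node, e.g. gpu:2 means two GPUs on each node.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append("")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


# Assumed example values: a one-hour, single-node job with two GPUs.
script = render_batch_script("train", "gpu", 2, 1, "01:00:00", "python train.py")
print(script)
```

The resulting text could be written to a file and submitted with sbatch. Whether the request is accepted depends on the limits and GRES definitions the cluster's administrators have configured.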

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.