Slurm

Slurm is an open-source workload manager and job scheduler for Linux clusters, providing resource allocation, job execution, and queue management for large-scale high-performance computing environmen…


Content

Updates and recent posts about Slurm.

Posts
Description

Story

@laura_garcia shared a post, 1 year, 9 months ago

Software Developer, RELIANOID

Falcon Reliable Transport

🚀 Introduction to Falcon Reliable Transport 🚀 Recently, at Netdev 0x18, speakers Yadong Li, Jay Bhat, and others introduced Falcon, Google's innovative, hardware-offloaded reliable transport. Designed for high-bandwidth, low-latency workloads like AI/ML training and HPC, Falcon brings new capabiliti..

Link

@prathamesh-sonpatki shared a link, 1 year, 9 months ago

SRE, Last9.io

Identify root spans in Otel Collector

Link

@prathamesh-sonpatki shared a link, 1 year, 9 months ago

SRE, Last9.io

Golang logging guide for developers

Story

@laura_garcia shared a post, 1 year, 9 months ago

Software Developer, RELIANOID

Joining Info Tech Las Vegas

✈️ We’re thrilled to join Info-Tech LIVE 2024 in Las Vegas from Sept 17-19! With 2,000+ IT leaders, this event will explore the theme "Exponential IT in Motion" through keynotes, workshops, and networking. Don’t miss out on the future of IT! #InfoTechLIVE2024 #ITLeadership #TechInnovation #Network..

Story

@laura_garcia shared a post, 1 year, 9 months ago

Software Developer, RELIANOID

mTLS - Mutual Transport Layer Security

Mutual Transport Layer Security (mTLS), or Two-Way TLS, adds an extra layer of security by ensuring both the client and server authenticate each other using digital certificates. Building on the principles of Transport Layer Security (TLS), mTLS has seen increased adoption due to rising cybersecurit..

Story

@squadcast shared a post, 1 year, 9 months ago

Choosing the Best SRE Tools for Your Business: A Buyer’s Guide

This guide is here to help you choose the best SRE tools for your enterprise team.

We'll dive into the types of SRE tools, how to pick them, and the best practices for using them. By the end, you'll know exactly what works best for your team. We'll also highlight key factors to consider when choosing tools.

Story

@squadcast shared a post, 1 year, 9 months ago

Enterprise-Grade ITSM: Scaling Incident Response with ServiceNow & Squadcast

Integrating ServiceNow with Squadcast creates a robust IT Service Management (ITSM) solution, designed for enterprise teams dealing with complex systems and high-stakes environments. ServiceNow’s comprehensive ITSM capabilities—ranging from incident management to asset tracking—combined with Squadcast’s advanced incident response automation, intelligent alerting, and on-call management, enable teams to reduce downtime, streamline operations, and improve collaboration. This integration is a game-changer for scaling teams, helping them maintain reliability, minimize downtime, and respond to incidents swiftly.

Story

@squadcast shared a post, 1 year, 9 months ago

Optimizing Incident Management: Effective Stakeholder Communication with Squadcast

In critical incidents, keeping stakeholders informed is just as important as resolving the issue. Squadcast simplifies this by offering tools like stakeholder notifications, StatusPages, and Service Graphs. These features ensure clear communication with both internal and external stakeholders, allowing teams to focus on resolving incidents without worrying about constant updates. Squadcast enables real-time updates, automated notifications, and visual service representations to keep everyone aligned and minimize confusion during crises.

Story

@squadcast shared a post, 1 year, 9 months ago

The Engineer's Roadmap to Building Resilient Systems in High Growth Environments

Resilience engineering is essential for modern software systems, ensuring they recover quickly from disruptions and provide seamless user experiences. This blog explores the key concepts of resilience engineering, including the 4 R's: Robustness, Redundancy, Resourcefulness, and Rapidity. It also outlines a roadmap for engineers to build resilient systems in high-growth environments, from defining resilience goals to implementing scalable infrastructure and continuous monitoring.

Story

@squadcast shared a post, 1 year, 9 months ago

Maximizing Uptime: Four Essential System Monitoring Best Practices

System uptime is critical for organizations, directly impacting revenue, customer satisfaction, and internal operations. Downtime can result in significant financial losses and reputational damage. Proactive system monitoring is essential to mitigate these risks by enabling early detection, faster resolution, and performance optimization. Best practices for modern monitoring include defining KPIs, continuous monitoring, data analysis, and automation to reduce alert fatigue and improve system resilience.

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and arbitrating contention for resources through a queue of pending work.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while a lightweight daemon (slurmd) on each node executes tasks; the slurmd daemons communicate hierarchically for fault tolerance. Two optional components extend this core: slurmdbd records accounting data, and slurmrestd exposes a REST API. A rich set of commands, such as srun, squeue, scancel, and sinfo, gives users and administrators visibility into and control over the cluster.
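As a rough illustration of how the four commands named above divide the work, the following Python sketch only builds their argument lists; it does not contact a cluster. The helper names (`sinfo_cmd`, `squeue_cmd`, `srun_cmd`, `scancel_cmd`) are hypothetical wrappers introduced here for illustration, and the partition, user, and job ID values are placeholders.

```python
import shlex
from typing import List, Optional


def sinfo_cmd(partition: Optional[str] = None) -> List[str]:
    """sinfo reports the state of nodes and partitions."""
    cmd = ["sinfo"]
    if partition:
        cmd.append(f"--partition={partition}")
    return cmd


def squeue_cmd(user: Optional[str] = None) -> List[str]:
    """squeue lists pending and running jobs, optionally for one user."""
    cmd = ["squeue"]
    if user:
        cmd.append(f"--user={user}")
    return cmd


def srun_cmd(program: List[str], ntasks: int = 1) -> List[str]:
    """srun launches a program as parallel tasks on allocated resources."""
    return ["srun", f"--ntasks={ntasks}", *program]


def scancel_cmd(job_id: int) -> List[str]:
    """scancel signals or cancels a job by its numeric ID."""
    return ["scancel", str(job_id)]


if __name__ == "__main__":
    # Placeholder values; print the command lines instead of running them.
    for cmd in (
        sinfo_cmd("debug"),
        squeue_cmd("alice"),
        srun_cmd(["hostname"], ntasks=4),
        scancel_cmd(1234),
    ):
        print(shlex.join(cmd))
```

On a machine where the Slurm client tools are installed, each list could be passed to `subprocess.run(cmd, check=True)`; building lists rather than shell strings avoids quoting problems.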

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
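To make partitions, GRES, and exclusivity more concrete, here is a minimal Python sketch that renders a batch script whose `#SBATCH` directives request a partition, a GPU count per node, and optionally exclusive node access. The `JobSpec` class and `render_batch_script` function are hypothetical helpers, and the partition names are assumptions; real partition and GRES names depend on how a given cluster is configured. The rendered script is intended for `sbatch`, a Slurm submission command not listed above.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    name: str
    partition: str                # assumed partition name, site-specific
    nodes: int = 1
    gpus_per_node: int = 0        # requested through GRES as gpu:<count>
    time_limit: str = "00:10:00"  # HH:MM:SS
    exclusive: bool = False       # do not share allocated nodes with other jobs


def render_batch_script(spec: JobSpec, command: str) -> str:
    """Render a batch script with #SBATCH directives for the given spec."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={spec.name}",
        f"#SBATCH --partition={spec.partition}",
        f"#SBATCH --nodes={spec.nodes}",
        f"#SBATCH --time={spec.time_limit}",
    ]
    if spec.gpus_per_node > 0:
        lines.append(f"#SBATCH --gres=gpu:{spec.gpus_per_node}")
    if spec.exclusive:
        lines.append("#SBATCH --exclusive")
    lines.append("")
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    spec = JobSpec(name="train", partition="gpu", gpus_per_node=2, exclusive=True)
    print(render_batch_script(spec, "python train.py"), end="")
```

Keeping the request in a small data structure makes it easy to see which policy knobs a job touches before anything is submitted.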

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.