Updates and recent posts about Slurm.
Story
@laura_garcia shared a post, 1 year, 9 months ago
Software Developer, RELIANOID

Falcon Reliable Transport

🚀 Introduction to Falcon Reliable Transport 🚀 Recently, at Netdev 0x18, speakers Yadong Li, Jay Bhat, and others introduced Falcon, Google's innovative, hardware-offloaded reliable transport. Designed for high-bandwidth, low-latency workloads like AI/ML training and HPC, Falcon brings new capabiliti..

Link
@prathamesh-sonpatki shared a link, 1 year, 9 months ago
SRE, Last9.io

Identify root spans in Otel Collector

Link
@prathamesh-sonpatki shared a link, 1 year, 9 months ago
SRE, Last9.io

Golang logging guide for developers

Story
@laura_garcia shared a post, 1 year, 9 months ago
Software Developer, RELIANOID

Joining Info Tech Las Vegas

✈️ We’re thrilled to join Info-Tech LIVE 2024 in Las Vegas from Sept 17-19! With 2,000+ IT leaders attending, this event will explore the theme "Exponential IT in Motion" through keynotes, workshops, and networking. Don’t miss out on the future of IT! #InfoTechLIVE2024 #ITLeadership #TechInnovation #Network..

Story
@laura_garcia shared a post, 1 year, 9 months ago
Software Developer, RELIANOID

mTLS - Mutual Transport Layer Security

Mutual Transport Layer Security (mTLS), or Two-Way TLS, adds an extra layer of security by ensuring both the client and server authenticate each other using digital certificates. Building on the principles of Transport Layer Security (TLS), mTLS has seen increased adoption due to rising cybersecurit..

Story
@squadcast shared a post, 1 year, 9 months ago

Choosing the Best SRE Tools for Your Business: A Buyer’s Guide

This guide helps you choose the best SRE tools for your enterprise team.

We'll cover the types of SRE tools, the key factors to weigh when choosing them, and best practices for using them. By the end, you'll know what works best for your team.

Story
@squadcast shared a post, 1 year, 9 months ago

Enterprise-Grade ITSM: Scaling Incident Response with ServiceNow & Squadcast

Integrating ServiceNow with Squadcast creates a robust IT Service Management (ITSM) solution designed for enterprise teams dealing with complex systems and high-stakes environments. ServiceNow’s comprehensive ITSM capabilities, ranging from incident management to asset tracking, combine with Squadcast’s advanced incident response automation, intelligent alerting, and on-call management to help teams reduce downtime, streamline operations, and improve collaboration. For scaling teams, the integration helps maintain reliability and respond to incidents swiftly.

Story
@squadcast shared a post, 1 year, 9 months ago

Optimizing Incident Management: Effective Stakeholder Communication with Squadcast

In critical incidents, keeping stakeholders informed is just as important as resolving the issue. Squadcast simplifies this by offering tools like stakeholder notifications, StatusPages, and Service Graphs. These features ensure clear communication with both internal and external stakeholders, allowing teams to focus on resolving incidents without worrying about constant updates. Squadcast enables real-time updates, automated notifications, and visual service representations to keep everyone aligned and minimize confusion during crises.

Story
@squadcast shared a post, 1 year, 9 months ago

The Engineer's Roadmap to Building Resilient Systems in High Growth Environments

Resilience engineering is essential for modern software systems, ensuring they recover quickly from disruptions and provide seamless user experiences. This blog explores the key concepts of resilience engineering, including the 4 R's: Robustness, Redundancy, Resourcefulness, and Rapidity. It also outlines a roadmap for engineers to build resilient systems in high-growth environments, from defining resilience goals to implementing scalable infrastructure and continuous monitoring.

Story
@squadcast shared a post, 1 year, 9 months ago

Maximizing Uptime: Four Essential System Monitoring Best Practices

System uptime is critical for organizations, directly impacting revenue, customer satisfaction, and internal operations. Downtime can result in significant financial losses and reputational damage. Proactive system monitoring is essential to mitigate these risks by enabling early detection, faster resolution, and performance optimization. Best practices for modern monitoring include defining KPIs, continuous monitoring, data analysis, and automation to reduce alert fatigue and improve system resilience.

Slurm Workload Manager is an open-source, fault-tolerant, and highly scalable cluster management and scheduling system widely used in high-performance computing (HPC). Designed to operate without kernel modifications, Slurm coordinates thousands of compute nodes by allocating resources, launching and monitoring jobs, and managing contention through its flexible scheduling queue.

At its core, Slurm uses a centralized controller (slurmctld) to track cluster state and assign work, while lightweight daemons (slurmd) on each node execute tasks and communicate hierarchically for fault tolerance. Optional components extend Slurm: slurmdbd records accounting data in a database, and slurmrestd exposes a REST API. A rich set of commands, such as srun, squeue, scancel, and sinfo, gives users and administrators full visibility and control.
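
As a rough illustration of how the squeue command can be scripted, the sketch below asks squeue for header-less, pipe-delimited output and parses it into records. The `Job` type, the function names, and the sample data are hypothetical and chosen only for this example; the parser also assumes job names contain no `|` character.

```python
import subprocess
from typing import NamedTuple


class Job(NamedTuple):
    # Hypothetical record type for this sketch, not part of Slurm.
    job_id: str
    partition: str
    name: str
    user: str
    state: str


# squeue format string: job ID, partition, job name, user, job state.
SQUEUE_FORMAT = "%i|%P|%j|%u|%T"


def parse_squeue(text: str) -> list[Job]:
    """Parse header-less, pipe-delimited squeue output into Job records."""
    jobs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) != 5:
            continue  # skip lines that do not match the expected layout
        jobs.append(Job(*fields))
    return jobs


def list_jobs() -> list[Job]:
    """Run squeue (requires a host with Slurm installed) and parse the result."""
    result = subprocess.run(
        ["squeue", "--noheader", f"--format={SQUEUE_FORMAT}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_squeue(result.stdout)


# Made-up sample output, so the parser can be tried without a cluster.
SAMPLE = "101|gpu|train|alice|RUNNING\n102|cpu|sim|bob|PENDING\n"
jobs = parse_squeue(SAMPLE)
pending = [job.job_id for job in jobs if job.state == "PENDING"]
```

Keeping the parsing separate from the subprocess call means the logic can be exercised on captured output, and `list_jobs()` only needs to be called on a machine where squeue is available.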

Slurm’s modular plugin architecture supports nearly every aspect of cluster operation, including authentication, MPI integration, container runtimes, resource limits, energy accounting, topology-aware scheduling, preemption, and GPU management via Generic Resources (GRES). Nodes are organized into partitions, enabling sophisticated policies for job size, priority, fairness, oversubscription, reservation, and resource exclusivity.
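
To make partitions and GRES concrete, here is a minimal sketch that renders a batch script requesting GPUs in a given partition. The function name, the partition name `gpu`, and the `train.py` command are assumptions made for illustration; real partition names and GRES types depend on how a given cluster is configured.

```python
def render_batch_script(
    job_name: str,
    partition: str,
    nodes: int,
    gpus_per_node: int,
    time_limit: str,
    command: str,
) -> str:
    """Build the text of an sbatch script (hypothetical helper for this sketch)."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --time={time_limit}",
    ]
    if gpus_per_node > 0:
        # GRES requests are counted per node.
        lines.append(f"#SBATCH --gres=gpu:{gpus_per_node}")
    lines.append("")
    # srun launches the command as a job step inside the allocation.
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


# Example values are made up; adjust them to the target cluster.
script = render_batch_script(
    job_name="train",
    partition="gpu",
    nodes=2,
    gpus_per_node=4,
    time_limit="01:00:00",
    command="python train.py",
)
```

The resulting text could be saved to a file and submitted with sbatch; whether the request is accepted depends on the limits the administrators have set for that partition.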

Widely adopted across academia, research labs, and enterprise HPC environments, Slurm serves as the backbone for many of the world’s top supercomputers, offering a battle-tested, flexible, and highly configurable framework for large-scale distributed computing.