Updates and recent posts about Kueue.
@faun shared a link, 1 month, 3 weeks ago

Inside NVIDIA GPUs: Anatomy of high performance matmul kernels

NVIDIA Hopper packs serious architectural tricks. At the core: **Tensor Memory Accelerator (TMA)**, **tensor cores**, and **swizzling**—the trio behind async, cache-friendly matmul kernels that flirt with peak throughput. But folks aren't stopping at cuBLAS. They're stacking new tactics: **warp-gro..

@faun shared a link, 1 month, 3 weeks ago

Shai-Hulud npm Supply Chain Attack

Malicious npm packages just leveled up: this one dropped a self-spreading worm that hijacks repos and leaks secrets the moment it lands. It abuses `postinstall` scripts to run TruffleHog and swipe tokens straight from your codebase. Then it uses GitHub Actions to exfiltrate the loot and auto-publis..

@faun shared a link, 1 month, 3 weeks ago

How FinOps Drives Value for Every Engineering Dollar

Duolingo’s FinOps crew didn’t just track cloud costs—they wired up sharp, automated observability across 100+ microservices. Real-time alerts now catch AI and infra spend spikes before they torch the budget. They sliced TTS costs by 40% with in-memory caching. Dumped pricey CloudWatch metrics for P..

@faun shared a link, 1 month, 3 weeks ago

Observability for the Invisible: Tracing Message Drops in Kafka Pipelines

When an event drops silently in a distributed system, it is not a bug, it is an architectural blind spot. Detect, debug, and prevent message loss in Kafka-based streaming pipelines using tools like OpenTelemetry, Fluent Bit, Jaeger, and dead-letter queues. Make sure observability gaps in event strea..

@faun shared a link, 1 month, 3 weeks ago

Introducing DigitalOcean Organizations, a new and comprehensive account layer

DigitalOcean just dropped **Organizations**—a real upgrade for anyone juggling multiple Teams. Think one top-level account to rule them all: centralized user control, one invoice to track, and org-wide settings for taxes, credits, and permissions...

@faun shared a link, 1 month, 3 weeks ago

Demystifying Log Retention in Azure

Azure logs come in three flavors: **Activity Logs**, **Diagnostic Logs**, and **Log Analytics**, each with its own rules for retention and billing. The catch? Those differences aren’t quirks—they’re baked in...

@faun shared a link, 1 month, 3 weeks ago

Top 30 Argo CD Anti-Patterns to Avoid When Adopting Gitops

A teardown of Argo CD anti-patterns calls out 28 common misfires—stuff like skipping Git for Application CRDs or stuffing Helm/Kustomize config right into Argo CD manifests. Yikes. It pushes for a cleaner setup: use **ApplicationSets** instead of rolling your own YAML, turn on **auto-sync/self-heal..

@faun shared a link, 1 month, 3 weeks ago

What are Error Budgets? A Guide to Managing Reliability

OneUptime shows how to put **error budgets** to work—keeping feature velocity in check without tanking reliability. The goal: ship fast, stay within SLOs. They do it by tracking **burn rates**, syncing across teams, and tuning SLOs to match how users actually use the product. Less guesswork, more s..

@faun shared a link, 1 month, 3 weeks ago

Top 10 Kubernetes Deployment Errors: Causes and Fixes (And Tips)

Misconfigured YAML. Broken image refs. Botched resource settings. Most Kubernetes deploys don't fail mysteriously—they fail predictably. This guide breaks down the top 10 culprits: things like `CrashLoopBackOff`, bad image pulls, and `OOMKills`. More importantly, it shows how to dodge them with bet..

@faun shared a link, 1 month, 3 weeks ago

Intelligent Kubernetes Load Balancing at Databricks

Databricks replaced default Kubernetes load balancing with a **proxyless, client-side gRPC setup**, wired up through a custom control plane. No more **CoreDNS**. No more **kube-proxy**. Clients now get live endpoint discovery through **xDS**, plus smarter routing tricks like **Power of Two Choices** ..

Kueue is a Kubernetes-native job queueing and workload management system designed for large-scale, mixed compute environments such as AI/ML training, batch workloads, and HPC workflows. Instead of scheduling individual Pods, Kueue operates at the job level, deciding when a job should run based on resource quotas, fair-sharing policies, cluster availability, and workload priorities.

Kueue integrates tightly with Kubernetes, working alongside the default scheduler rather than replacing it. It provides features such as all-or-nothing (gang) admission, workload preemption, quota-based sharing across teams or tenants, and support for advanced frameworks like JobSet and Ray. Its goal is to help Kubernetes clusters run efficiently under heavy load while ensuring that critical, latency-sensitive, or large training jobs receive the resources they need without starving lower-priority workloads.
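
To make the job-level, all-or-nothing admission idea concrete, here is a minimal Python sketch. It is a toy model and not Kueue's implementation or API: the `Job` and `Queue` classes, the single CPU quota, and the "skip jobs that do not fit" ordering are all assumptions made for illustration, and preemption and fair sharing are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    """Hypothetical job: a group of identical pods that must start together."""
    name: str
    pods: int
    cpu_per_pod: int
    priority: int = 0

    @property
    def total_cpu(self) -> int:
        # The whole job is sized as one unit, not pod by pod.
        return self.pods * self.cpu_per_pod


@dataclass
class Queue:
    """Hypothetical queue with a single CPU quota (illustrative only)."""
    cpu_quota: int
    used: int = 0
    admitted: list = field(default_factory=list)
    pending: list = field(default_factory=list)

    def submit(self, job: Job) -> None:
        self.pending.append(job)

    def admit_pending(self) -> list:
        """Admit jobs by descending priority; a job is admitted only if
        all of its pods fit in the remaining quota (all-or-nothing)."""
        still_pending = []
        for job in sorted(self.pending, key=lambda j: -j.priority):
            if self.used + job.total_cpu <= self.cpu_quota:
                self.used += job.total_cpu
                self.admitted.append(job.name)
            else:
                # Never start a partial job: it waits in full.
                still_pending.append(job)
        self.pending = still_pending
        return list(self.admitted)


# Usage: a high-priority training job takes the whole quota,
# so the smaller batch job stays queued instead of running partially.
queue = Queue(cpu_quota=16)
queue.submit(Job("batch", pods=2, cpu_per_pod=2, priority=1))
queue.submit(Job("train", pods=4, cpu_per_pod=4, priority=10))
result = queue.admit_pending()
pending_names = [job.name for job in queue.pending]
```

The point of the sketch is the admission decision: the quota check uses the job's total request, so a job either gets everything it asked for or keeps waiting, while pod placement itself would still be left to the regular Kubernetes scheduler.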