
Updates and recent posts about kueue.
Link
@faun shared a link, 2 months, 1 week ago
FAUN.dev()

Why "What Happened First?" Is One of the Hardest Questions in Large-Scale Systems

Logical clocks track event order in distributed systems, with no need for synced wall clocks. Each node keeps a counter. On every event: tick it. On every message: tack on your counter. When you receive one? Merge and bump. This flips the script. Instead of chasing global time, distributed systems lean int...
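The tick, tack-on, and merge-and-bump rules in this summary can be sketched as a Lamport-style clock. This is a minimal illustration, not code from the linked article; the class and method names (`LamportClock`, `tick`, `send`, `receive`) are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    """Hypothetical per-node logical clock; names are illustrative."""
    time: int = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.time += 1
        return self.time

    def send(self) -> int:
        # Sending counts as an event; the returned value travels with the message.
        return self.tick()

    def receive(self, msg_time: int) -> int:
        # Merge: take the larger of the two counters, then bump by one.
        self.time = max(self.time, msg_time) + 1
        return self.time


a, b = LamportClock(), LamportClock()
a.tick()               # a: 1
stamp = a.send()       # a: 2, the message carries 2
b.tick()               # b: 1
b.receive(stamp)       # b: max(1, 2) + 1 = 3
print(a.time, b.time)  # 2 3
```

The receive step is what guarantees that a message's arrival is always ordered after its send, even though the two nodes never compare wall clocks.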


The Hidden AWS Cost Traps No One Warns You About (and How I Avoid Them)

Calling out five sneaky AWS cost traps: the kind that creep in through overlooked defaults and quiet misconfigs, then blow up your bill while no one's watching.


Kubernetes Primer: Dynamic Resource Allocation (DRA) for GPU Workloads

Kubernetes 1.34 brings serious heat for anyone juggling GPUs or accelerators. Meet Dynamic Resource Allocation (DRA), a new way to schedule hardware like you mean it. DRA adds ResourceClaims, DeviceClasses, and ResourceSlices, slicing device management away from pod specs. It replaces the old device plu...


Kubernetes right-sizing with metrics-driven GitOps automation

AWS just dropped a GitOps-native pattern for tuning EKS resources, built to run outside the cluster. It’s wired up with Amazon Managed Service for Prometheus, Argo CD, and Bedrock to automate resource recommendations straight into Git. Here’s the play: it maps usage metrics to templated manifests, then sp...


Amazon EKS Enables Ultra-Scale AI/ML Workloads with Support for 100K Nodes per Cluster

Amazon EKS just cranked its Kubernetes cluster limit to 100,000 nodes, a 10x jump. The secret sauce? A reworked etcd with an internal journal system and in-memory storage. Toss in tight API server tuning and network tweaks, and the result is wild: 500 pods per second, 900K pods, 10M+ objects, no sweat, even un...


Lucidity turns spotlight onto Kubernetes storage costs

Lucidity has upgraded its AutoScaler. It now handles persistent volumes on AWS-hosted Kubernetes, automatically scaling storage and reducing waste. The upgrade brings pod-level isolation, fault tolerance, and bulk Linux onboarding. Azure and GCP are next on the list.


The Quiet Revolution in Kubernetes Security

Nigel Douglas discusses the challenges of security in Kubernetes, particularly with traditional base operating systems. Talos Linux offers a different approach with a secure-by-default, API-driven model specifically for Kubernetes. CISOs play a critical role in guiding organizations through the shif...


Kubernetes DNS Exploit Enables Git Credential Theft from ArgoCD

A new attack chain messes with Kubernetes DNS resolution and ArgoCD’s certificate injection to swipe GitHub credentials. With the right permissions, a user inside the cluster can reroute GitOps traffic to a fake internal service, sniff auth headers, and quietly walk off with tokens. What’s broken: GitOp...


Kubernetes VPA: Limitations, Best Practices, and the Future of Pod Rightsizing

Kubernetes' Vertical Pod Autoscaler (VPA) tries to be helpful by tweaking CPU and memory requests on the fly. Problem is, it needs to bounce your pods to do it. And if you're also running Horizontal Pod Autoscaler (HPA) on the same metrics? Now they're fighting over control. VPA sees a narrow slice of ...


Rethinking Efficiency for Cloud-Native AI Workloads

AI isn’t just burning compute; it's torching old-school FinOps. Reserved Instances? Idle detection? Cute, but not built for GPU bottlenecks and model-heavy pipelines. What’s actually happening: Infra teams are ditching cost-first playbooks for something smarter: business-aligned orchestration that chas...

Kueue is a Kubernetes-native job queueing and workload management system designed for large-scale, mixed compute environments such as AI/ML training, batch workloads, and HPC workflows. Instead of scheduling individual Pods, Kueue operates at the job level, deciding when a job should run based on resource quotas, fair-sharing policies, cluster availability, and workload priorities.

Kueue integrates tightly with Kubernetes, working alongside the default scheduler rather than replacing it. It provides features such as all-or-nothing (gang) admission, workload preemption, quota-based sharing across teams or tenants, and support for advanced frameworks like JobSet and Ray. Its goal is to help Kubernetes clusters run efficiently under heavy load while ensuring that critical, latency-sensitive, or large training jobs receive the resources they need without starving lower-priority workloads.
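To make the idea of job-level, all-or-nothing admission against a quota concrete, here is a toy sketch. It is not Kueue's actual algorithm or API: the `Job` fields, the `admit` function, and the single CPU quota are all assumptions made for illustration, and real Kueue behavior (multiple resource types, preemption, fair sharing) is far richer.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    priority: int
    pods: int          # number of pods the job needs
    cpu_per_pod: int   # CPU units requested by each pod


def admit(queue: list[Job], cpu_quota: int) -> tuple[list[str], list[str]]:
    """Toy all-or-nothing admission against a single CPU quota."""
    admitted, pending = [], []
    remaining = cpu_quota
    # Higher priority first; sorted() is stable, so ties keep queue order.
    for job in sorted(queue, key=lambda j: -j.priority):
        need = job.pods * job.cpu_per_pod
        if need <= remaining:
            # The whole job fits: reserve quota for every pod at once.
            remaining -= need
            admitted.append(job.name)
        else:
            # Never admit part of a job; it waits with nothing reserved.
            pending.append(job.name)
    return admitted, pending


queue = [
    Job("etl", priority=1, pods=2, cpu_per_pod=2),     # needs 4
    Job("train", priority=10, pods=4, cpu_per_pod=4),  # needs 16
    Job("sweep", priority=5, pods=3, cpu_per_pod=2),   # needs 6
]
admitted, pending = admit(queue, cpu_quota=20)
print(admitted, pending)  # ['train', 'etl'] ['sweep']
```

In this sketch a smaller, lower-priority job may pass a blocked one when it fits in the leftover quota. That is a design choice of the example; a stricter ordering would stop at the first job that does not fit.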