Pelagia is a Kubernetes controller that provides all-in-one management for Ceph clusters installed by Rook. It delivers two main features:

Aggregates all Rook Custom Resources (CRs) into a single CephDeployment resource, simplifying the management of Ceph clusters.
Provides automated lifecycle management (LCM) of Rook Ceph OSD nodes for bare-metal clusters. Automated LCM is managed by the special CephOsdRemoveTask resource.

It is designed to simplify the management of Ceph clusters in Kubernetes installed by Rook.

As heavy Rook users, we had dozens of Rook CRs to manage. So we decided to create a single resource that aggregates all Rook CRs and delivers a smoother LCM experience. This is how Pelagia was born.

It supports almost the entire Rook CR API, including CephCluster, CephBlockPool, CephFilesystem, CephObjectStore, and others, and aggregates these resources into a single specification. We continuously work on improving Pelagia's API, adding new features, and enhancing existing ones.
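To make the aggregation idea concrete, here is a minimal Python sketch of how one combined specification could be expanded into separate Rook CR manifests. It is an illustration only, not Pelagia's implementation or its actual CephDeployment schema: the input keys (`name`, `namespace`, `cluster`, `blockPools`, `filesystems`, `objectStores`) and the function names are hypothetical. The only parts taken from Rook itself are the CR kinds and the `ceph.rook.io/v1` API version.

```python
from typing import Any

ROOK_API_VERSION = "ceph.rook.io/v1"

# Hypothetical mapping from a section of the aggregated spec to a Rook CR kind.
SECTION_TO_KIND = (
    ("blockPools", "CephBlockPool"),
    ("filesystems", "CephFilesystem"),
    ("objectStores", "CephObjectStore"),
)


def make_manifest(kind: str, name: str, namespace: str, spec: dict[str, Any]) -> dict[str, Any]:
    """Build a plain Kubernetes manifest dict for one Rook CR."""
    return {
        "apiVersion": ROOK_API_VERSION,
        "kind": kind,
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


def expand_aggregate(aggregate: dict[str, Any]) -> list[dict[str, Any]]:
    """Expand one aggregated spec (hypothetical layout) into Rook CR manifests."""
    namespace = aggregate.get("namespace", "rook-ceph")
    # The cluster section always yields exactly one CephCluster manifest.
    manifests = [
        make_manifest("CephCluster", aggregate["name"], namespace, aggregate.get("cluster", {}))
    ]
    # Every other section is a list of named items, one CR per item.
    for section, kind in SECTION_TO_KIND:
        for item in aggregate.get(section, []):
            manifests.append(make_manifest(kind, item["name"], namespace, item.get("spec", {})))
    return manifests


if __name__ == "__main__":
    example = {
        "name": "ceph",
        "blockPools": [{"name": "kubernetes", "spec": {"replicated": {"size": 3}}}],
        "filesystems": [{"name": "shared-fs"}],
    }
    for manifest in expand_aggregate(example):
        print(manifest["kind"], manifest["metadata"]["name"])
```

The point of the sketch is the shape of the problem: one document in, many Rook CRs out, so that operators edit a single specification instead of dozens of resources.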

Pelagia collects the Ceph cluster state and the statuses of all Rook CRs into a single CephDeploymentHealth CR. This resource highlights any issues with the Ceph cluster or the Rook APIs.
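The status roll-up can be pictured with a short Python sketch. It assumes, purely for illustration, that the inputs are a Ceph health string (such as `HEALTH_OK`, `HEALTH_WARN`, or `HEALTH_ERR`) and a mapping from resource name to a reported phase where `Ready` means healthy. The function name, the input layout, and the output fields are hypothetical and do not reflect the real CephDeploymentHealth schema.

```python
def summarize_health(ceph_health: str, cr_phases: dict[str, str]) -> dict:
    """Fold a Ceph health string and per-CR phases into one summary (hypothetical format)."""
    issues = []
    if ceph_health != "HEALTH_OK":
        issues.append(f"ceph cluster health is {ceph_health}")
    # Sort by resource name so the issue list is stable across runs.
    for resource, phase in sorted(cr_phases.items()):
        if phase != "Ready":
            issues.append(f"{resource} is in phase {phase}")
    return {"state": "Failed" if issues else "Ok", "issues": issues}


if __name__ == "__main__":
    summary = summarize_health(
        "HEALTH_WARN",
        {"CephBlockPool/kubernetes": "Ready", "CephFilesystem/shared-fs": "Failure"},
    )
    print(summary["state"])
    for issue in summary["issues"]:
        print("-", issue)
```

A single summary like this is what lets an operator check one resource rather than inspecting every Rook CR individually.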

Another important feature of Pelagia is the automated lifecycle management of Rook Ceph OSD nodes for bare-metal clusters. This feature is delivered by the CephOsdRemoveTask resource, which automates the process of removing OSD disks and nodes from the cluster. We use this feature in our daily day-2 operations.