Architecting Resilient Connectivity for Ray Clusters on Kubernetes: A Troubleshooting Guide
Executive Summary
In distributed computing, Ray has become a de facto standard for scaling AI and Python workloads. However, exposing a Ray Cluster (Dashboard, Client, and Serve) via an NGINX Ingress Controller often surfaces "Heisenbugs"—errors that appear and disappear across cluster restarts.
This technical report documents a real-world scenario where a Ray Ingress setup failed due to security-hardened webhooks, missing Layer 4 mappings, and Node IP drift. It provides a "Best Practice" blueprint for a production-ready, static-entry configuration.
1. The Challenge: "It worked yesterday, but not today."
A common symptom in Kubernetes is a configuration that works during the initial session but fails after a node reboot or pod reschedule. In this case, three distinct layers failed:
- The Security Layer: Modern NGINX Ingress controllers (v1.10+) block configuration-snippet annotations by default to prevent code injection.
- The Protocol Layer: Ray requires both HTTP (Dashboard/Serve) and raw TCP (Ray Client/GCS) traffic, while a standard Ingress only handles HTTP.
- The Infrastructure Layer: Hardcoded externalIPs in the Service became invalid as the underlying Node IPs changed during cluster lifecycle events.
2. Technical Deep-Dive: Root Cause Analysis
A. The Admission Webhook Conflict
While attempting to fix header issues, adding nginx.ingress.kubernetes.io/configuration-snippet to the Ingress triggered a BadRequest:
Error: admission webhook "validate.nginx.ingress.kubernetes.io" denied the request: risky annotation.
Cause: Security hardening in NGINX Ingress blocks custom snippets by default.
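For illustration, the kind of annotation that trips this webhook looks like the sketch below (the proxy_set_header directive is just an example). Recent ingress-nginx releases ship with allow-snippet-annotations set to "false" in the controller ConfigMap, so any Ingress carrying such an annotation is rejected at admission time.

metadata:
  annotations:
    # Rejected at admission when allow-snippet-annotations is "false"
    # (the default in recent ingress-nginx releases):
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header X-Forwarded-Proto $scheme;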
B. Missing Endpoints for TCP
The NGINX logs showed: Service does not have any active Endpoint for TCP port 9000.
Cause: While the Ingress handled HTTP traffic on port 80, the NGINX controller was never told to listen on or route the Ray-specific TCP ports (10001, 9000, 6379); that requires a dedicated tcp-services ConfigMap, covered in Step 2 below.
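One quick way to narrow down this class of failure is to inspect the Endpoints object behind the head Service (the namespace and Service name here match the ConfigMap in Step 2):

kubectl -n raycluster get endpoints ray-cluster-kuberay-head-svc
# An empty ADDRESSES column, or missing ports, means NGINX has
# nothing to forward the TCP traffic to.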
C. The Static IP Problem
Listing multiple hardcoded externalIPs on the Service made the architecture fragile: whenever the node hosting the Ingress Controller moved, the traffic entry point broke.
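For context, this is the anti-pattern that was removed (a sketch with placeholder addresses):

# Fragile: node addresses hardcoded into the Service
spec:
  type: ClusterIP
  externalIPs:
    - 192.168.1.31   # silently breaks if this node is drained or renumbered
    - 192.168.1.32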
3. The Best-Practice Solution: "Pinned Gateway" Architecture
To resolve this, we implemented a Pinned Gateway Architecture. This ensures the Ingress Controller always lands on a specific, reliable node (here, 192.168.1.33) and uses the host network for maximum stability.
Step 1: Node Labeling (Persistence)
First, we mark our target gateway node to ensure Kubernetes always schedules the Ingress Controller there.
kubectl label node <your-node-name> ingress-ready=true
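Verify the label took effect before rolling out the controller:

kubectl get nodes -l ingress-ready=true
# Should list exactly one node: the intended .33 gateway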
Step 2: Protocol Bridging (TCP ConfigMap)
We create a dedicated mapping for Ray's internal TCP protocols. This allows NGINX to handle non-HTTP traffic.
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  "10001": "raycluster/ray-cluster-kuberay-head-svc:10001"
  "9000": "raycluster/ray-cluster-kuberay-head-svc:9000"
  "6379": "raycluster/ray-cluster-kuberay-head-svc:6379"
Step 3: Hardened Ingress Controller Deployment
We update the Ingress Controller to use hostNetwork and nodeSelector. This "pins" the controller to IP 192.168.1.33.
spec:
  template:
    spec:
      hostNetwork: true            # Use the physical Node IP directly
      dnsPolicy: ClusterFirstWithHostNet
      nodeSelector:
        ingress-ready: "true"      # Only run on our labeled .33 node
      containers:
        - args:
            - /nginx-ingress-controller
            - --tcp-services-configmap=$(POD_NAMESPACE)/tcp-services
            # ... other args
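Assuming the standard deployment name ingress-nginx-controller (adjust if your install differs), roll the change out and confirm the pod landed on the pinned node:

kubectl -n ingress-nginx rollout status deployment/ingress-nginx-controller
kubectl -n ingress-nginx get pods -o wide   # the NODE column should show the .33 node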
Step 4: Clean, Snippet-Free Ingress Spec
By using pathType: ImplementationSpecific and the rewrite-target annotation, we avoid using "risky" snippets while maintaining full functionality for the Dashboard and Ray Serve.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ray-head-ingress
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /ray(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: ray-cluster-kuberay-head-svc
                port:
                  number: 8265
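A quick sanity check against the pinned gateway (note the trailing slash, explained in Section 4):

curl -I http://192.168.1.33/ray/
# Expect an HTTP 200 from the Ray Dashboard on port 8265 behind the rewrite.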
4. Operational Best Practices
To prevent future "Day 2" failures, follow these rules:
- Prefer ingressClassName over Annotations: modern Kubernetes clusters require the spec.ingressClassName field for proper routing.
- Trailing Slashes Matter: when using rewrite-target, always access the dashboard via http://<IP>/ray/. Omitting the final / causes CSS/JS assets to fail to load.
- Avoid Hardcoded External IPs: instead of listing IPs in the Service, use nodeSelector and hostNetwork for bare-metal clusters, or a proper LoadBalancer (such as MetalLB) for virtual clusters.
- Security Over Snippets: if you need custom NGINX headers, enable them globally in the Controller ConfigMap rather than locally in the Ingress object to pass the security webhook (see the sketch below).
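As an example of the global approach, ingress-nginx supports a proxy-set-headers key in the controller ConfigMap that points at a second ConfigMap holding the headers. A minimal sketch, assuming the standard controller ConfigMap name (the custom-headers name and header value are placeholders):

apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-headers            # hypothetical name
  namespace: ingress-nginx
data:
  X-Forwarded-Proto: "https"      # example header only
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller  # the controller's main ConfigMap
  namespace: ingress-nginx
data:
  proxy-set-headers: "ingress-nginx/custom-headers"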
5. Conclusion
Connectivity issues in Kubernetes are rarely about a single broken line of code; they are usually about the interaction between security webhooks, network protocols, and scheduling logic. By pinning the Ingress Controller and properly mapping TCP services, we transform a fragile setup into a robust, production-ready AI gateway.
Tools Used: Kubernetes, KubeRay, NGINX Ingress, YAML Engineering.
This documentation is part of my DevOps Portfolio, showcasing my ability to resolve complex cloud-native networking challenges.