ORCA - Orchestration for Research Cloud Access¶
ORCA enables research institutions to seamlessly burst workloads from on-premises Kubernetes clusters to AWS, with native support for GPU-intensive AI/ML computing.
What is ORCA?¶
ORCA (Orchestration for Research Cloud Access) is a Kubernetes Virtual Kubelet provider that allows research Kubernetes clusters to dynamically extend capacity to AWS when local resources are exhausted.
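For orientation, this is the shape of the pod-lifecycle contract that a Virtual Kubelet provider such as ORCA implements, as defined in the upstream virtual-kubelet `node` package. The comments describing ORCA's behavior are inferred from this page, not taken from ORCA's source:

```go
// Shape of the pod-lifecycle contract a Virtual Kubelet provider implements,
// as defined in the upstream virtual-kubelet "node" package. ORCA backs these
// callbacks with EC2 API calls instead of a local container runtime.
package provider

import (
	"context"

	corev1 "k8s.io/api/core/v1"
)

type PodLifecycleHandler interface {
	// CreatePod fires when a pod is scheduled onto the virtual node;
	// for ORCA this is where an EC2 instance launch would begin.
	CreatePod(ctx context.Context, pod *corev1.Pod) error
	// UpdatePod handles changes to pods already assigned to the node.
	UpdatePod(ctx context.Context, pod *corev1.Pod) error
	// DeletePod releases backing resources (e.g., terminates the instance).
	DeletePod(ctx context.Context, pod *corev1.Pod) error
	// The getters let the kubelet machinery reconcile cluster state
	// with the provider's view of its pods.
	GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error)
	GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)
	GetPods(ctx context.Context) ([]*corev1.Pod, error)
}
```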
Key Features¶
- 🎓 Research-First Design - Built for academic and research workloads
- 🖥️ AI/ML Accelerators - Support for NVIDIA GPUs (P6, P5, P4d, G6e), AWS Trainium, Inferentia, and FPGAs
- 🎯 Explicit Control - Users specify exact instance types rather than relying on guesswork (sketched after this list)
- 💰 Cost-Aware - Budget controls, cost tracking, spot instance support
- 🔓 Open Source - Apache 2.0 licensed, community-driven
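To make "explicit control" concrete, here is a sketch of a pod that targets the ORCA virtual node and names its exact instance type. The annotation key, node-selector label, and image are illustrative stand-ins, not ORCA's documented API; consult the User Guide for the real keys:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// buildTrainingPod sketches a pod that bursts to AWS through ORCA.
// The annotation key and node-selector label are hypothetical stand-ins.
func buildTrainingPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "llm-train",
			Annotations: map[string]string{
				// Explicit control: name the exact EC2 instance type.
				"orca.dev/instance-type": "p5.48xlarge", // hypothetical key
			},
		},
		Spec: corev1.PodSpec{
			NodeSelector: map[string]string{
				"type": "virtual-kubelet", // hypothetical label for the ORCA node
			},
			Tolerations: []corev1.Toleration{{
				// Tolerate the taint virtual-kubelet nodes typically carry.
				Key:      "virtual-kubelet.io/provider",
				Operator: corev1.TolerationOpExists,
			}},
			Containers: []corev1.Container{{
				Name:  "trainer",
				Image: "ghcr.io/example/trainer:latest", // placeholder image
				Resources: corev1.ResourceRequirements{
					Limits: corev1.ResourceList{
						// p5.48xlarge exposes 8x H100 GPUs.
						"nvidia.com/gpu": resource.MustParse("8"),
					},
				},
			}},
		},
	}
}

func main() {
	pod := buildTrainingPod()
	fmt.Println(pod.Name, "->", pod.Annotations["orca.dev/instance-type"])
}
```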
Quick Links¶
- Getting Started - Get ORCA up and running in minutes
- User Guide - Learn how to use ORCA for your workloads
- Architecture - Understand how ORCA works
- Development - Contribute to ORCA development
Architecture Overview¶
```mermaid
graph TB
    subgraph "Research Cluster"
        K8S[Kubernetes API]
        VK[ORCA Virtual Kubelet]
        POD[Pod with GPU Request]
    end
    subgraph "AWS"
        EC2[EC2 Instance<br/>p5.48xlarge<br/>8x H100 GPUs]
        SPOT[Spot Instances]
        CR[Capacity Reservations]
    end
    POD -->|Schedule| VK
    VK -->|Register| K8S
    VK -->|Launch| EC2
    VK -.->|Optional| SPOT
    VK -.->|Preferred| CR
    style VK fill:#4285f4,stroke:#333,stroke-width:2px,color:#fff
    style EC2 fill:#ff9900,stroke:#333,stroke-width:2px
    style POD fill:#326ce5,stroke:#333,stroke-width:2px,color:#fff
```

Use Cases¶
AI/ML Training¶
Burst large model training to AWS GPUs, Trainium, or Inferentia when local clusters are full.
Cost-Optimized Computing¶
Use Trainium for up to 50% lower training costs, or Inferentia for up to 70% lower inference costs, compared to comparable GPU instances.
Research Computing¶
Access specialized hardware on-demand: FPGAs for genomics, latest GPUs for deep learning.
Multi-Tenant Research¶
Support multiple departments with separate budgets and cost tracking.
Why ORCA?¶
vs. Elotl Kip¶
- Kip is EOL (last updated 2021) - stuck on K8s 1.18, AWS SDK v1
- ORCA is modern - K8s 1.34, AWS SDK v2, Go 1.25, latest instance types (P6, G6e)
- ORCA prioritizes explicit control - users know their requirements
vs. AWS Fargate Virtual Kubelet¶
- Fargate provider is unmaintained and doesn't support GPUs
- ORCA is GPU-first - built for AI/ML research
vs. Building on Managed K8s¶
- ORCA extends existing Kubernetes clusters - research institutions already have K8s
- No migration needed - burst workloads, keep existing infrastructure
Project Status¶
Current Phase: Alpha Development
⚠️ ALPHA SOFTWARE - ORCA is under active development. Container execution is not yet implemented (Issue #8). Pods will be scheduled and EC2 instances will launch, but containers will not run.
What Works Today ✅¶
- Virtual Kubelet node registration and heartbeat
- EC2 instance lifecycle (create, terminate, query status) - see the sketch after this list
- Instance selection (explicit, template, auto) - fully tested
- HTTP server with /healthz, /readyz, /metrics endpoints
- Configuration validation and AWS SDK integration
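As a concrete picture of the EC2 lifecycle item above, here is a minimal launch/query/terminate sketch using the AWS SDK for Go v2. The AMI ID is a placeholder, error handling is trimmed, and ORCA's real provider adds tagging, retries, and pod-to-instance bookkeeping:

```go
package main

import (
	"context"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := ec2.NewFromConfig(cfg)

	// Launch one instance for a pod. The AMI ID is a placeholder.
	out, err := client.RunInstances(ctx, &ec2.RunInstancesInput{
		ImageId:      aws.String("ami-0123456789abcdef0"), // placeholder
		InstanceType: types.InstanceType("p5.48xlarge"),
		MinCount:     aws.Int32(1),
		MaxCount:     aws.Int32(1),
	})
	if err != nil {
		log.Fatal(err)
	}
	id := *out.Instances[0].InstanceId
	log.Printf("launched %s", id)

	// Query status: the "query status" half of the lifecycle.
	desc, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{id},
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("state: %s", desc.Reservations[0].Instances[0].State.Name)

	// Terminate when the pod is deleted.
	if _, err := client.TerminateInstances(ctx, &ec2.TerminateInstancesInput{
		InstanceIds: []string{id},
	}); err != nil {
		log.Fatal(err)
	}
}
```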
What Doesn't Work Yet ❌¶
- Container execution (Issue #8) - 🔴 CRITICAL BLOCKER
- kubectl logs (Issue #9) - requires container runtime
- kubectl exec (Issue #10) - requires container runtime
- GPU workloads - requires container runtime
- Pod networking - requires container runtime
- Volume mounting - requires container runtime
- Metrics collection (Issue #11)
Next Steps¶
- 🎯 Priority 1: Container Runtime Integration (Issue #8)
- 🎯 Priority 2: kubectl logs/exec (Issues #9, #10)
- 🎯 Priority 3: GPU Support and Capacity Reservations (Issue #12)
Roadmap¶
ORCA development follows a phased approach aligned with research computing needs. Track our progress on the GitHub project board.
Phase 1: MVP ✅ Complete¶
Months 1-3
Core Virtual Kubelet provider with basic pod-to-EC2 mapping and explicit instance selection. Simple lifecycle management.
Status: Implementation complete, metrics in progress
Phase 2: Production Features 🚧 In Progress¶
Months 4-6
Production-ready features, including:

- GPU support for all NVIDIA instance types (P6, P5, P4d, G6e)
- Container runtime integration with containerd
- kubectl logs and exec via CloudWatch and Systems Manager (sketched below)
- Spot instance support for cost optimization
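The logs path deserves a sketch. Assuming container stdout/stderr are shipped to a CloudWatch log group (the group and stream naming below is hypothetical), serving kubectl logs reduces to reading events back through the AWS SDK for Go v2:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudwatchlogs"
)

// fetchPodLogs sketches the kubectl-logs backend: read a pod's container
// output from CloudWatch Logs. Group/stream naming here is hypothetical.
func fetchPodLogs(ctx context.Context, namespace, pod, container string) error {
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		return err
	}
	client := cloudwatchlogs.NewFromConfig(cfg)

	stream := fmt.Sprintf("%s/%s/%s", namespace, pod, container)
	out, err := client.GetLogEvents(ctx, &cloudwatchlogs.GetLogEventsInput{
		LogGroupName:  aws.String("/orca/pods"), // hypothetical group name
		LogStreamName: aws.String(stream),
		StartFromHead: aws.Bool(true),
	})
	if err != nil {
		return err
	}
	for _, e := range out.Events {
		fmt.Println(aws.ToString(e.Message))
	}
	return nil
}

func main() {
	if err := fetchPodLogs(context.Background(), "default", "llm-train", "trainer"); err != nil {
		log.Fatal(err)
	}
}
```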
Phase 3: NRP Integration ⏳ Planned¶
Months 7-9
National Research Platform integration:

- Automatic Ceph storage mounting
- NRP namespace awareness and identity
- Multi-tenancy with per-namespace quotas
- Cost tracking and budget enforcement (see the sketch after this list)
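None of the budget machinery exists yet; as a rough sketch of what a per-namespace check could look like (all names and the hourly rate are hypothetical):

```go
package main

import (
	"fmt"
	"sync"
)

// BudgetTracker is a hypothetical per-namespace budget gate: a launch is
// admitted only if its estimated cost fits under the namespace's monthly cap.
type BudgetTracker struct {
	mu    sync.Mutex
	caps  map[string]float64 // namespace -> monthly cap (USD)
	spent map[string]float64 // namespace -> month-to-date spend (USD)
}

func NewBudgetTracker(caps map[string]float64) *BudgetTracker {
	return &BudgetTracker{caps: caps, spent: map[string]float64{}}
}

// Admit records the spend and returns an error if it would exceed the cap.
func (b *BudgetTracker) Admit(namespace string, estimatedCost float64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	limit, ok := b.caps[namespace]
	if !ok {
		return fmt.Errorf("namespace %q has no budget configured", namespace)
	}
	if b.spent[namespace]+estimatedCost > limit {
		return fmt.Errorf("namespace %q over budget: $%.2f spent of $%.2f cap",
			namespace, b.spent[namespace], limit)
	}
	b.spent[namespace] += estimatedCost
	return nil
}

func main() {
	bt := NewBudgetTracker(map[string]float64{"physics": 5000})
	// A 10-hour job at an illustrative on-demand rate of $98/hr.
	if err := bt.Admit("physics", 10*98.0); err != nil {
		fmt.Println("rejected:", err)
	} else {
		fmt.Println("admitted")
	}
}
```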
Phase 4: Advanced Features ⏳ Future¶
Months 9+
Enterprise and advanced capabilities:

- Intelligent scheduling algorithms
- Capacity planning and forecasting
- Compliance features (HIPAA, FedRAMP)
- Multi-region support
Community¶
- Website: orcapod.dev
- GitHub: scttfrdmn/orca
- Issues: Report bugs or request features
- Discussions: Questions and ideas
- License: Apache 2.0
Getting Help¶
For questions and ideas, use GitHub Discussions; for bugs and feature requests, open an issue on scttfrdmn/orca.
Built with 🌊 for research computing