ORCA - Orchestration for Research Cloud Access

ORCA enables research institutions to seamlessly burst workloads from on-premises Kubernetes clusters to AWS, with native support for GPU-intensive AI/ML computing.

What is ORCA?

ORCA (Orchestration for Research Cloud Access) is a Virtual Kubelet provider that lets research Kubernetes clusters dynamically extend their capacity into AWS when local resources are exhausted.
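
A minimal sketch of what bursting looks like from the user's side, assuming the ORCA node carries the conventional Virtual Kubelet taint and a `type: virtual-kubelet` label (both conventions vary by deployment):

```yaml
# Hypothetical pod that opts in to bursting via the ORCA virtual node.
# The taint key and node label follow common Virtual Kubelet conventions
# and may differ in a real ORCA deployment.
apiVersion: v1
kind: Pod
metadata:
  name: burst-job
spec:
  nodeSelector:
    type: virtual-kubelet            # assumed label on the ORCA node
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
      effect: NoSchedule             # standard Virtual Kubelet taint
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo hello from AWS"]
```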

Key Features

  • 🎓 Research-First Design - Built for academic and research workloads
  • 🖥️ AI/ML Accelerators - Support for NVIDIA GPUs (P6, P5, P4d, G6e), AWS Trainium, Inferentia, and FPGAs
  • 🎯 Explicit Control - Users specify exact instance types instead of relying on guesswork (see the sketch after this list)
  • 💰 Cost-Aware - Budget controls, cost tracking, spot instance support
  • 🔓 Open Source - Apache 2.0 licensed, community-driven
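
For illustration, explicit instance selection might look like the pod below; the `orca.research/instance-type` annotation key is hypothetical, not a documented ORCA API:

```yaml
# Sketch only: the annotation key below is a placeholder, not a
# documented ORCA API; check the project docs for the real key.
apiVersion: v1
kind: Pod
metadata:
  name: explicit-instance-demo
  annotations:
    orca.research/instance-type: "g6e.xlarge"  # exact EC2 type, no guessing
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```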

Architecture Overview

```mermaid
graph TB
    subgraph "Research Cluster"
        K8S[Kubernetes API]
        VK[ORCA Virtual Kubelet]
        POD[Pod with GPU Request]
    end

    subgraph "AWS"
        EC2[EC2 Instance<br/>p5.48xlarge<br/>8x H100 GPUs]
        SPOT[Spot Instances]
        CR[Capacity Reservations]
    end

    POD -->|Schedule| VK
    VK -->|Register| K8S
    VK -->|Launch| EC2
    VK -.->|Optional| SPOT
    VK -.->|Preferred| CR

    style VK fill:#4285f4,stroke:#333,stroke-width:2px,color:#fff
    style EC2 fill:#ff9900,stroke:#333,stroke-width:2px
    style POD fill:#326ce5,stroke:#333,stroke-width:2px,color:#fff
```

Use Cases

AI/ML Training

Burst large model training to AWS GPUs, Trainium, or Inferentia when local clusters are full.
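
A sketch of such a burst job, reusing the hypothetical instance-type annotation from above together with the standard `nvidia.com/gpu` device plugin resource:

```yaml
# Sketch of a burst training pod; the annotation key and image are
# placeholders, nvidia.com/gpu is the standard NVIDIA device plugin resource.
apiVersion: v1
kind: Pod
metadata:
  name: llm-train
  annotations:
    orca.research/instance-type: "p5.48xlarge"
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: my-registry/llm-train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8                 # all eight H100s on a p5.48xlarge
```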

Cost-Optimized Computing

Use AWS Trainium for up to 50% lower training costs, or Inferentia for up to 70% lower inference costs, compared with GPU instances (per AWS's published figures).
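
Assuming the launched instance runs the AWS Neuron device plugin (which advertises the `aws.amazon.com/neuron` resource), an Inferentia pod might look like this sketch:

```yaml
# Sketch of an inference pod on Inferentia; assumes the launched instance
# runs the AWS Neuron device plugin, which advertises aws.amazon.com/neuron.
# The annotation key and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: inference-job
  annotations:
    orca.research/instance-type: "inf2.xlarge"
spec:
  containers:
    - name: server
      image: my-registry/neuron-serve:latest
      resources:
        limits:
          aws.amazon.com/neuron: 1          # one Inferentia2 accelerator
```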

Research Computing

Access specialized hardware on-demand: FPGAs for genomics, latest GPUs for deep learning.

Multi-Tenant Research

Support multiple departments with separate budgets and cost tracking.
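
ORCA-native budget enforcement lands in Phase 3 (see the roadmap below); in the meantime, a standard Kubernetes ResourceQuota per department namespace is one way to cap burst usage. A sketch with arbitrary limits:

```yaml
# Plain Kubernetes quota as a stopgap; ORCA-native budget enforcement
# is planned for Phase 3. Limits here are arbitrary examples.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: burst-quota
  namespace: dept-genomics              # one namespace per department
spec:
  hard:
    requests.nvidia.com/gpu: "16"       # cap concurrent GPU requests
    pods: "50"
```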

Why ORCA?

vs. Elotl Kip

  • Kip is EOL (last updated in 2021) - stuck on K8s 1.18 and AWS SDK v1
  • ORCA is modern - K8s 1.34, AWS SDK v2, Go 1.25, latest instance types (P6, G6e)
  • ORCA prioritizes explicit control - users know their requirements

vs. AWS Fargate Virtual Kubelet

  • Fargate provider is unmaintained and doesn't support GPUs
  • ORCA is GPU-first - built for AI/ML research

vs. Building on Managed K8s

  • ORCA extends existing Kubernetes clusters - research institutions already have K8s
  • No migration needed - burst workloads, keep existing infrastructure

Project Status

Current Phase: Alpha Development

⚠️ ALPHA SOFTWARE - ORCA is under active development. Container execution is not yet implemented (Issue #8). Pods will be scheduled and EC2 instances will launch, but containers will not run.

What Works Today ✅

  • Virtual Kubelet node registration and heartbeat
  • EC2 instance lifecycle (create, terminate, query status)
  • Instance selection (explicit, template, auto) - fully tested
  • HTTP server with /healthz, /readyz, and /metrics endpoints (probe sketch after this list)
  • Configuration validation and AWS SDK integration
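
Those endpoints slot naturally into Kubernetes probes on the ORCA deployment itself; in this sketch the image name and port are assumptions, only the endpoint paths come from the list above:

```yaml
# Hypothetical deployment fragment for ORCA itself; image and port are
# assumptions, only the /healthz and /readyz paths come from the docs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orca-virtual-kubelet
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orca
  template:
    metadata:
      labels:
        app: orca
    spec:
      containers:
        - name: orca
          image: ghcr.io/example/orca:alpha   # placeholder
          ports:
            - containerPort: 8080             # assumed listen port
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
```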

What Doesn't Work Yet ❌

  • Container execution (Issue #8) - 🔴 CRITICAL BLOCKER
  • kubectl logs (Issue #9) - requires container runtime
  • kubectl exec (Issue #10) - requires container runtime
  • GPU workloads - requires container runtime
  • Pod networking - requires container runtime
  • Volume mounting - requires container runtime
  • Metrics collection (Issue #11)

Next Steps

  • 🎯 Priority 1: Container Runtime Integration (Issue #8)
  • 🎯 Priority 2: kubectl logs/exec (Issues #9, #10)
  • 🎯 Priority 3: GPU Support and Capacity Reservations (Issue #12)

Roadmap

ORCA development follows a phased approach aligned with research computing needs. Track our progress on the GitHub project board.

Phase 1: MVP ✅ Complete

Months 1-3

Core Virtual Kubelet provider with basic pod-to-EC2 mapping and explicit instance selection. Simple lifecycle management.

Status: Implementation complete, metrics in progress

Phase 2: Production Features 🚧 In Progress

Months 4-6

Production-ready features, including:

  • GPU support for all NVIDIA instance types (P6, P5, P4d, G6e)
  • Container runtime integration with containerd
  • kubectl logs and exec via CloudWatch and Systems Manager
  • Spot instance support for cost optimization

View Phase 2 issues →

Phase 3: NRP Integration ⏳ Planned

Months 7-9

National Research Platform integration:

  • Automatic Ceph storage mounting
  • NRP namespace awareness and identity
  • Multi-tenancy with per-namespace quotas
  • Cost tracking and budget enforcement

View Phase 3 issues →

Phase 4: Advanced Features ⏳ Future

Months 9+

Enterprise and advanced capabilities:

  • Intelligent scheduling algorithms
  • Capacity planning and forecasting
  • Compliance features (HIPAA, FedRAMP)
  • Multi-region support

View Phase 4 issues →

View all milestones →

Community

Getting Help


Built with 🌊 for research computing