ORCA - Orchestration for Research Cloud Access

ORCA enables research institutions to seamlessly burst workloads from on-premises Kubernetes clusters to AWS, with native support for GPU-intensive AI/ML computing.

What is ORCA?

ORCA (Orchestration for Research Cloud Access) is a Virtual Kubelet provider that lets research Kubernetes clusters dynamically extend their capacity into AWS when local resources are exhausted.
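
A minimal sketch of what bursting looks like from the user's side, assuming the ORCA node carries the conventional Virtual Kubelet taint and a `type: virtual-kubelet` label (both conventions vary by deployment):

```yaml
# Hypothetical pod that opts in to bursting via the ORCA virtual node.
# The taint key and node label follow common Virtual Kubelet conventions
# and may differ in a real ORCA deployment.
apiVersion: v1
kind: Pod
metadata:
  name: burst-job
spec:
  nodeSelector:
    type: virtual-kubelet            # assumed label on the ORCA node
  tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
      effect: NoSchedule             # standard Virtual Kubelet taint
  containers:
    - name: worker
      image: busybox:1.36
      command: ["sh", "-c", "echo hello from AWS"]
```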

Key Features

  • 🎓 Research-First Design - Built for academic and research workloads
  • 🖥️ AI/ML Accelerators - Support for NVIDIA GPUs (P6, P5, P4d, G6e), AWS Trainium, Inferentia, and FPGAs
  • 🎯 Explicit Control - Users specify exact instance types instead of relying on guesswork (see the sketch after this list)
  • 💰 Cost-Aware - Budget controls, cost tracking, spot instance support
  • 🔓 Open Source - Apache 2.0 licensed, community-driven
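
For illustration, explicit instance selection might look like the pod below; the `orca.research/instance-type` annotation key is hypothetical, not a documented ORCA API:

```yaml
# Sketch only: the annotation key below is a placeholder, not a
# documented ORCA API; check the project docs for the real key.
apiVersion: v1
kind: Pod
metadata:
  name: explicit-instance-demo
  annotations:
    orca.research/instance-type: "g6e.xlarge"  # exact EC2 type, no guessing
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sleep", "3600"]
```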

Architecture Overview

```mermaid
graph TB
    subgraph "Research Cluster"
        K8S[Kubernetes API]
        VK[ORCA Virtual Kubelet]
        POD[Pod with GPU Request]
    end

    subgraph "AWS"
        EC2[EC2 Instance<br/>p5.48xlarge<br/>8x H100 GPUs]
        SPOT[Spot Instances]
        CR[Capacity Reservations]
    end

    POD -->|Schedule| VK
    VK -->|Register| K8S
    VK -->|Launch| EC2
    VK -.->|Optional| SPOT
    VK -.->|Preferred| CR

    style VK fill:#4285f4,stroke:#333,stroke-width:2px,color:#fff
    style EC2 fill:#ff9900,stroke:#333,stroke-width:2px
    style POD fill:#326ce5,stroke:#333,stroke-width:2px,color:#fff
```

Use Cases

AI/ML Training

Burst large model training to AWS GPUs, Trainium, or Inferentia when local clusters are full.
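
A sketch of such a burst job, reusing the hypothetical instance-type annotation from above together with the standard `nvidia.com/gpu` device plugin resource:

```yaml
# Sketch of a burst training pod; the annotation key and image are
# placeholders, nvidia.com/gpu is the standard NVIDIA device plugin resource.
apiVersion: v1
kind: Pod
metadata:
  name: llm-train
  annotations:
    orca.research/instance-type: "p5.48xlarge"
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: my-registry/llm-train:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 8                 # all eight H100s on a p5.48xlarge
```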

Cost-Optimized Computing

Use AWS Trainium for up to 50% lower training costs, or Inferentia for up to 70% lower inference costs, compared with GPU instances (per AWS's published figures).
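
Assuming the launched instance runs the AWS Neuron device plugin (which advertises the `aws.amazon.com/neuron` resource), an Inferentia pod might look like this sketch:

```yaml
# Sketch of an inference pod on Inferentia; assumes the launched instance
# runs the AWS Neuron device plugin, which advertises aws.amazon.com/neuron.
# The annotation key and image are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: inference-job
  annotations:
    orca.research/instance-type: "inf2.xlarge"
spec:
  containers:
    - name: server
      image: my-registry/neuron-serve:latest
      resources:
        limits:
          aws.amazon.com/neuron: 1          # one Inferentia2 accelerator
```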

Research Computing

Access specialized hardware on-demand: FPGAs for genomics, latest GPUs for deep learning.

Multi-Tenant Research

Support multiple departments with separate budgets and cost tracking.
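
ORCA-native budget enforcement lands in Phase 3 (see the roadmap below); in the meantime, a standard Kubernetes ResourceQuota per department namespace is one way to cap burst usage. A sketch with arbitrary limits:

```yaml
# Plain Kubernetes quota as a stopgap; ORCA-native budget enforcement
# is planned for Phase 3. Limits here are arbitrary examples.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: burst-quota
  namespace: dept-genomics              # one namespace per department
spec:
  hard:
    requests.nvidia.com/gpu: "16"       # cap concurrent GPU requests
    pods: "50"
```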

Why ORCA?

vs. Elotl Kip

  • Kip is EOL (last updated in 2021) - stuck on K8s 1.18 and AWS SDK v1
  • ORCA is modern - K8s 1.34, AWS SDK v2, Go 1.25, latest instance types (P6, G6e)
  • ORCA prioritizes explicit control - users know their requirements

vs. AWS Fargate Virtual Kubelet

  • Fargate provider is unmaintained and doesn't support GPUs
  • ORCA is GPU-first - built for AI/ML research

vs. Building on Managed K8s

  • ORCA extends existing Kubernetes clusters - research institutions already have K8s
  • No migration needed - burst workloads, keep existing infrastructure

Project Status

Current Phase: Alpha Development

⚠️ ALPHA SOFTWARE - ORCA is under active development. Container execution is not yet implemented (Issue #8). Pods will be scheduled and EC2 instances will launch, but containers will not run.

What Works Today ✅

  • Virtual Kubelet node registration and heartbeat
  • EC2 instance lifecycle (create, terminate, query status)
  • Instance selection (explicit, template, auto) - fully tested
  • HTTP server with /healthz, /readyz, and /metrics endpoints (probe sketch after this list)
  • Configuration validation and AWS SDK integration
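
Those endpoints slot naturally into Kubernetes probes on the ORCA deployment itself; in this sketch the image name and port are assumptions, only the endpoint paths come from the list above:

```yaml
# Hypothetical deployment fragment for ORCA itself; image and port are
# assumptions, only the /healthz and /readyz paths come from the docs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orca-virtual-kubelet
spec:
  replicas: 1
  selector:
    matchLabels:
      app: orca
  template:
    metadata:
      labels:
        app: orca
    spec:
      containers:
        - name: orca
          image: ghcr.io/example/orca:alpha   # placeholder
          ports:
            - containerPort: 8080             # assumed listen port
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
```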

What Doesn't Work Yet ❌

  • Container execution (Issue #8) - 🔴 CRITICAL BLOCKER
  • kubectl logs (Issue #9) - requires container runtime
  • kubectl exec (Issue #10) - requires container runtime
  • GPU workloads - requires container runtime
  • Pod networking - requires container runtime
  • Volume mounting - requires container runtime
  • Metrics collection (Issue #11)

Next Steps

  • 🎯 Priority 1: Container Runtime Integration (Issue #8)
  • 🎯 Priority 2: kubectl logs/exec (Issues #9, #10)
  • 🎯 Priority 3: GPU Support and Capacity Reservations (Issue #12)

Roadmap

ORCA development follows a phased approach aligned with research computing needs. Track our progress on the GitHub project board.

Phase 1: MVP ✅ Complete

Months 1-3

Core Virtual Kubelet provider with basic pod-to-EC2 mapping and explicit instance selection. Simple lifecycle management.

Status: Implementation complete, metrics in progress

Phase 2: Production Features 🚧 In Progress

Months 4-6

Production-ready features, including:

  • GPU support for all NVIDIA instance types (P6, P5, P4d, G6e)
  • Container runtime integration with containerd
  • kubectl logs and exec via CloudWatch and Systems Manager
  • Spot instance support for cost optimization

View Phase 2 issues →

Phase 3: NRP Integration ⏳ Planned

Months 7-9

National Research Platform integration:

  • Automatic Ceph storage mounting
  • NRP namespace awareness and identity
  • Multi-tenancy with per-namespace quotas
  • Cost tracking and budget enforcement

View Phase 3 issues →

Phase 4: Advanced Features ⏳ Future

Months 9+

Enterprise and advanced capabilities:

  • Intelligent scheduling algorithms
  • Capacity planning and forecasting
  • Compliance features (HIPAA, FedRAMP)
  • Multi-region support

View Phase 4 issues →

View all milestones →

Community

Getting Help


Built with 🌊 for research computing