AWS Custom Silicon and FPGA Support¶
ORCA supports AWS custom silicon accelerators and FPGAs for specialized AI/ML and compute workloads.
AWS Trainium (AI Training)¶
AWS Trainium is purpose-built for deep learning training, offering cost-effective training for large language models and other AI workloads.
Trainium Instance Types (2025)¶
- Trn2.48xlarge: 16x Trainium2 chips, 192 vCPUs, 2TB RAM
    - ~50% cost reduction vs P5 for training
    - Optimized for LLM training
    - NeuronLink interconnect for distributed training
- Trn2.24xlarge: 8x Trainium2 chips, 96 vCPUs, 1TB RAM
- Trn1.32xlarge: 16x Trainium1 chips, 128 vCPUs, 512GB RAM (previous generation)
- Trn1n.32xlarge: 16x Trainium1 chips with enhanced networking
Example: LLM Training on Trainium¶
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-training-trainium
  annotations:
    orca.research/instance-type: "trn2.48xlarge"
    orca.research/launch-type: "on-demand"
spec:
  nodeSelector:
    orca.research/provider: "aws"
  tolerations:
    - key: orca.research/burst-node
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: trainer
      image: your-trainium-image:latest
      resources:
        requests:
          aws.amazon.com/neuron: "16" # Request all 16 Trainium2 chips (Neuron devices)
        limits:
          aws.amazon.com/neuron: "16"
```
Trainium Benefits¶
- Cost Optimization: ~50% lower cost-to-train compared to comparable GPU instances
- Purpose-Built: Optimized for transformer models and LLMs
- Scale: NeuronLink provides high-bandwidth interconnect
- PyTorch Support: AWS Neuron SDK with PyTorch integration (see the training sketch after this list)
- JAX Support: Native JAX/Flax support for research
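For PyTorch users, the Neuron SDK's torch-neuronx package plugs into torch-xla, so the training loop looks like standard XLA-device PyTorch. Below is a minimal sketch of that flow; the model, data, and hyperparameters are illustrative placeholders (not part of ORCA), and it assumes the container image has torch-neuronx installed, such as the Neuron training image shown later in this page.

```python
# Minimal torch-neuronx / torch-xla training sketch (placeholder model and data).
import torch
import torch.nn as nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()  # resolves to a NeuronCore on Trn1/Trn2 nodes
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 2)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

for step in range(10):
    # Synthetic batch; a real job would read from a DataLoader.
    x = torch.randn(32, 512).to(device)
    y = torch.randint(0, 2, (32,)).to(device)

    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    xm.mark_step()  # flush the lazily traced graph to the Neuron compiler/runtime
```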
When to Use Trainium¶
✅ Good for:

- Large language model training (BERT, GPT, LLaMA, etc.)
- Transformer-based models
- Cost-sensitive training workloads
- Long-running training jobs

❌ Not ideal for:

- Models requiring CUDA-specific code
- Workloads requiring NVIDIA-specific libraries
- Inference (use Inferentia instead)
- Short exploratory experiments
AWS Inferentia (AI Inference)¶
AWS Inferentia is optimized for high-performance, cost-effective ML inference.
Inferentia Instance Types (2025)¶
- Inf2.48xlarge: 12x Inferentia2 chips, 192 vCPUs, 384GB RAM
    - Best price/performance for inference
    - Up to 4x throughput vs Inf1
- Inf2.24xlarge: 6x Inferentia2 chips, 96 vCPUs, 192GB RAM
- Inf2.8xlarge: 2x Inferentia2 chips, 32 vCPUs, 64GB RAM
- Inf1.24xlarge: 16x Inferentia1 chips (previous generation, still supported)
Example: Model Inference on Inferentia¶
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: llm-inference
  annotations:
    orca.research/instance-type: "inf2.24xlarge"
    orca.research/launch-type: "on-demand"
spec:
  nodeSelector:
    orca.research/provider: "aws"
  tolerations:
    - key: orca.research/burst-node
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: inference
      image: your-inferentia-image:latest
      resources:
        requests:
          aws.amazon.com/neuron: "6" # Request all 6 Inferentia2 chips (Neuron devices)
        limits:
          aws.amazon.com/neuron: "6"
```
Inferentia Benefits¶
- Cost Effective: Up to 70% lower cost per inference vs GPU
- High Throughput: Optimized for batched inference
- Low Latency: Purpose-built for production inference
- Model Support: Broad framework support (PyTorch, TensorFlow, ONNX)
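On the PyTorch path, models are typically compiled ahead of time with the Neuron SDK's `torch_neuronx.trace` API and served as the resulting TorchScript artifact. A minimal sketch follows; the model, shapes, and file name are illustrative placeholders, not part of the ORCA configuration.

```python
# Ahead-of-time compilation for Inferentia with torch-neuronx (placeholder model).
import torch
import torch.nn as nn
import torch_neuronx

# Placeholder classifier standing in for a real (e.g. Hugging Face) model in eval mode.
model = nn.Sequential(nn.Embedding(30522, 64), nn.Flatten(), nn.Linear(128 * 64, 2)).eval()

# Neuron graphs are shape-static: compile at the batch size you intend to serve.
example = torch.zeros(8, 128, dtype=torch.long)

neuron_model = torch_neuronx.trace(model, example)  # compile to a Neuron-executable TorchScript module
torch.jit.save(neuron_model, "model_neuron.pt")     # cache the compiled artifact for serving pods
```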
When to Use Inferentia¶
✅ Good for:

- Production inference endpoints
- High-throughput batch inference
- Cost-sensitive deployments
- Latency-critical applications
- LLM serving (LLaMA, BERT, T5, etc.)

❌ Not ideal for:

- Training workloads (use Trainium or GPUs)
- Interactive model development
- Models requiring CUDA
AWS FPGAs (Custom Acceleration)¶
FPGAs provide customizable hardware acceleration for specialized compute workloads.
FPGA Instance Types (2025)¶
- F2.48xlarge: 8x Xilinx Alveo U250 FPGAs, 192 vCPUs, 2TB RAM
    - Latest generation (F1 retired in 2025)
    - PCIe Gen 4 support
    - Higher memory bandwidth
- F2.16xlarge: 4x Xilinx Alveo U250 FPGAs, 64 vCPUs, 1TB RAM
- F2.4xlarge: 1x Xilinx Alveo U250 FPGA, 16 vCPUs, 122GB RAM
- F2.2xlarge: 1x Xilinx Alveo U250 FPGA, 8 vCPUs, 61GB RAM
Example: FPGA Workload¶
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fpga-acceleration
  annotations:
    orca.research/instance-type: "f2.16xlarge"
    orca.research/launch-type: "on-demand"
spec:
  nodeSelector:
    orca.research/provider: "aws"
  tolerations:
    - key: orca.research/burst-node
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: fpga-app
      image: your-fpga-image:latest
      resources:
        requests:
          aws.amazon.com/fpga: "4" # Request FPGAs
        limits:
          aws.amazon.com/fpga: "4"
```
FPGA Use Cases¶
✅ Good for:

- Custom hardware acceleration
- Financial modeling and risk analysis
- Genomics and bioinformatics
- Video transcoding and processing
- Network security and cryptography
- Custom ML accelerators
- High-frequency trading

❌ Not ideal for:

- General-purpose computing
- Workloads without FPGA expertise
- Short-lived jobs (FPGA programming overhead)
FPGA Development¶
FPGAs require specialized development:
- AWS FPGA Developer AMI: Pre-configured development environment
- Xilinx Vitis: FPGA development tools
- AFI (Amazon FPGA Image): Pre-built or custom FPGA images
- OpenCL Support: Higher-level FPGA programming
Comparison Matrix¶
| Feature | Trainium | Inferentia | NVIDIA GPU | FPGA |
|---|---|---|---|---|
| Primary Use | Training | Inference | Training/Inference | Custom Acceleration |
| Cost | Low | Very Low | High | Medium |
| Performance | High (Training) | High (Inference) | Highest | Customizable |
| Flexibility | Medium | Medium | High | Highest |
| Development | PyTorch/JAX | PyTorch/TF | CUDA/PyTorch | Xilinx/OpenCL |
| Time to Deploy | Fast | Fast | Fast | Slow (FPGA dev) |
| Availability | Good | Good | Limited | Good |
ORCA Configuration¶
Instance Selection Examples¶
```yaml
instances:
  templates:
    # Training templates
    llm-training-gpu:
      instanceType: p6.48xlarge    # NVIDIA B200
      launchType: spot
    llm-training-trainium:
      instanceType: trn2.48xlarge  # AWS Trainium2
      launchType: on-demand

    # Inference templates
    inference-gpu:
      instanceType: g6.xlarge      # NVIDIA L4
      launchType: on-demand
    inference-inferentia:
      instanceType: inf2.24xlarge  # AWS Inferentia2
      launchType: on-demand

    # FPGA templates
    fpga-acceleration:
      instanceType: f2.16xlarge    # 4x FPGAs
      launchType: on-demand

  # Allowed instance types
  allowedInstanceTypes:
    # Trainium
    - trn2.48xlarge
    - trn2.24xlarge
    - trn1.32xlarge
    - trn1n.32xlarge
    # Inferentia
    - inf2.48xlarge
    - inf2.24xlarge
    - inf2.8xlarge
    - inf1.24xlarge
    # FPGA
    - f2.48xlarge
    - f2.16xlarge
    - f2.4xlarge
    - f2.2xlarge
```
Cost Comparison (Approximate 2025 Pricing)¶
Training Workloads¶
- P6.48xlarge (8x B200): ~$115/hour
- P5.48xlarge (8x H100): ~$98/hour
- Trn2.48xlarge (16x Trainium2): ~$50/hour ✅ 50% savings
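At these list rates, trn2.48xlarge comes to roughly $50 / $98 ≈ 0.51 of the P5 hourly price, i.e. about 49% lower per hour; the realized savings for a full training run still depend on the model's time-to-train on each platform.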
Inference Workloads¶
- G6.xlarge (1x L4): ~$1.20/hour
- Inf2.24xlarge (6x Inferentia2): ~$8/hour ✅ Better throughput/cost
FPGA Workloads¶
- F2.16xlarge (4x FPGAs): ~$22/hour
Best Practices¶
Trainium¶
- Use for Large Models: Best ROI for models >1B parameters
- Batch Training: Optimize batch sizes for Trainium
- Distributed Training: Use NeuronLink for multi-node
- Model Compilation: Pre-compile models with Neuron compiler
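As a practical note on the last point, the Neuron SDK provides a `neuron_parallel_compile` utility intended to pre-populate the compilation cache by running a short trial pass of the training script, so the long-running job does not pay graph-compilation time inside its main loop.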
Inferentia¶
- Batch Inference: Optimize for throughput over latency
- Model Optimization: Use Neuron compiler optimizations
- Right-Sizing: Choose instance size based on throughput needs
- Model Caching: Pre-compile and cache models
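Continuing the compilation sketch from the Inferentia Benefits section, a serving pod can load the cached artifact instead of recompiling (the file name and batch size are the same illustrative assumptions):

```python
# Load a pre-compiled Neuron model and run batched inference.
# Assumes model_neuron.pt was compiled at batch size 8; Neuron graphs are shape-static,
# so the serving batch must match the compiled batch size.
import torch

neuron_model = torch.jit.load("model_neuron.pt")

batch = torch.zeros(8, 128, dtype=torch.long)
with torch.no_grad():
    logits = neuron_model(batch)
print(logits.shape)  # torch.Size([8, 2]) for the placeholder classifier
```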
FPGA¶
- Long-Running Jobs: Amortize FPGA programming time
- Reuse AFIs: Use pre-built Amazon FPGA Images
- Custom Acceleration: Only when general compute insufficient
- Development Time: Budget for FPGA development expertise
AWS Neuron SDK¶
Both Trainium and Inferentia require the AWS Neuron SDK:
```dockerfile
# Example Dockerfile for Neuron workloads
FROM public.ecr.aws/neuron/pytorch-training-neuronx:2.1.0-neuronx-py310

# Install dependencies
RUN pip install transformers datasets

# Copy training code
COPY train.py /app/
WORKDIR /app

# Launch the training script (the Neuron SDK and torch-neuronx ship with the base image)
CMD ["python3", "train.py"]
```
Resource Requests¶
Trainium/Inferentia¶
```yaml
resources:
  requests:
    aws.amazon.com/neuron: "16" # Number of Neuron devices (chips), e.g. 16 on trn2.48xlarge
  limits:
    aws.amazon.com/neuron: "16"
```
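Note that `aws.amazon.com/neuron` counts whole Neuron devices (chips). Depending on the version of the Neuron device plugin deployed on the cluster, a per-core resource (`aws.amazon.com/neuroncore`) may also be available when finer-grained sharing of a chip's NeuronCores is needed.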
FPGA¶
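```yaml
resources:
  requests:
    aws.amazon.com/fpga: "4" # Number of FPGAs
  limits:
    aws.amazon.com/fpga: "4"
```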
Future Support¶
ORCA will continue to support AWS custom silicon as new generations are released:

- Trainium3 (expected 2026)
- Inferentia3 (expected 2026)
- Next-gen FPGAs
Last updated: October 2025