Documentation - Blamphs.ai

1. Getting Started

Welcome to Blamphs.ai! This guide will walk you through setting up autonomous GPU monitoring for your AWS infrastructure.

1.1 Sign Up

Visit blamphs.ai/signup and create an account using Google, GitHub, LinkedIn, or email/password
Your free trial starts immediately — no credit card required
You'll be redirected to the onboarding flow

1.2 Connect Your AWS Account

Blamphs.ai uses a read-only IAM role to monitor your GPU infrastructure. We never have write access to your AWS resources.

Create an IAM Role: In the AWS Console, go to IAM → Roles → Create Role
Select Trusted Entity: Choose "Another AWS account" and enter our account ID (provided in the onboarding flow)
Attach Policies: Attach one of the following read-only policies:
- ReadOnlyAccess (AWS managed policy)
- A custom policy with minimal permissions: ec2:Describe*, cloudwatch:GetMetricStatistics, logs:GetLogEvents
Copy the Role ARN: Save the ARN (it looks like arn:aws:iam::123456789012:role/BlamphsReadOnly)
Enter ARN in Blamphs: Paste the Role ARN into the Settings page and click "Connect"
Verify Connection: Blamphs will scan your infrastructure and display detected GPU clusters
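
If you prefer to script the role setup, the two policy documents described in the steps above can be sketched as follows. This is a minimal illustration, not an official Blamphs template: the account ID `111111111111` is a placeholder for the ID shown in the onboarding flow, and the function names are hypothetical.

```python
import json

# Placeholder: replace with the Blamphs account ID shown in the onboarding flow.
BLAMPHS_ACCOUNT_ID = "111111111111"


def trust_policy(trusted_account_id: str) -> dict:
    """Trust policy that lets another AWS account assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": f"arn:aws:iam::{trusted_account_id}:root"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def minimal_read_only_policy() -> dict:
    """Custom policy containing only the permissions listed in this guide."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "ec2:Describe*",
                    "cloudwatch:GetMetricStatistics",
                    "logs:GetLogEvents",
                ],
                "Resource": "*",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(trust_policy(BLAMPHS_ACCOUNT_ID), indent=2))
    print(json.dumps(minimal_read_only_policy(), indent=2))
```

The printed JSON can be pasted into the IAM console's JSON editor when creating the role and its custom policy.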

1.3 Configure Your First Cluster

Once connected, Blamphs automatically discovers GPU instances across all regions:

Auto-discovery: We detect EC2 instances in GPU instance families (p3, p4d, g4dn, g5, etc.)
Set Constraints: Define monitoring rules in natural language (e.g., "Never exceed $50k/month" or "Keep utilization above 80%")
Enable Autonomous Actions: Choose which actions Blamphs can take automatically (scaling, node cordoning, rebooting)
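
To make the auto-discovery step concrete, here is a rough sketch of filtering a list of instances down to GPU instance families. It is an assumption about how such a filter might look, not Blamphs's actual discovery logic. The family set covers only the families named in this guide (with the p4 entry read as `p4d`, the family that contains `p4d.24xlarge`), and the dictionaries borrow the `InstanceId` and `InstanceType` field names from EC2's describe-instances output.

```python
# Families named in this guide. The "etc." in the text means this set is
# illustrative, not the full list Blamphs recognizes.
GPU_FAMILIES = {"p3", "p4d", "g4dn", "g5"}


def instance_family(instance_type: str) -> str:
    """Return the family part of an instance type, e.g. 'g5' for 'g5.xlarge'."""
    return instance_type.split(".", 1)[0]


def is_gpu_instance(instance_type: str) -> bool:
    """True when the instance type belongs to one of the listed GPU families."""
    return instance_family(instance_type) in GPU_FAMILIES


def gpu_instances(instances: list[dict]) -> list[dict]:
    """Keep only the instances whose type is in a listed GPU family."""
    return [i for i in instances if is_gpu_instance(i["InstanceType"])]
```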

1.4 Dashboard Overview

Your dashboard provides real-time insights:

Cluster Health: Live status of all GPU nodes
Utilization Metrics: GPU usage, memory, temperature, power draw
Cost Tracking: Current spend vs. budget, savings delivered
Event Log: All autonomous actions taken by Blamphs

2. Product Features

Blamphs.ai is an autonomous control plane for AWS GPU workloads. It runs 24/7, monitoring your infrastructure and taking action to optimize costs and prevent failures.

2.1 Autonomous Scaling

How it works: Blamphs continuously monitors GPU utilization across all nodes. When a GPU is idle (0% utilization) for longer than your configured threshold, Blamphs automatically scales it down.

Idle Detection: Tracks GPU usage, CUDA processes, and training job status
Smart Scale-Down: Waits for safe moments (between training epochs, after checkpoint saves)
Instant Scale-Up: Detects workload spikes and provisions capacity ahead of demand
Cost Savings: Customers average a 40% reduction in GPU spend
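
The idle rule described above (0% utilization for longer than a configured threshold) can be sketched as follows. This is a simplified illustration under stated assumptions: it looks only at utilization samples and ignores the CUDA-process and training-status signals, and the function names are hypothetical rather than part of any Blamphs API.

```python
from datetime import datetime, timedelta


def idle_duration(samples: list[tuple[datetime, float]]) -> timedelta:
    """Length of the trailing run of 0% utilization samples.

    `samples` holds (timestamp, utilization_percent) pairs sorted by time.
    The run is measured from its first idle sample to the latest sample.
    """
    if not samples:
        return timedelta(0)
    latest = samples[-1][0]
    idle_since = None
    # Walk backwards until the most recent non-idle sample is found.
    for ts, util in reversed(samples):
        if util > 0:
            break
        idle_since = ts
    return latest - idle_since if idle_since is not None else timedelta(0)


def should_scale_down(
    samples: list[tuple[datetime, float]], threshold: timedelta
) -> bool:
    """True when the GPU has been idle for longer than the threshold."""
    return idle_duration(samples) > threshold
```

With samples taken every five minutes, a GPU whose first idle sample is 35 minutes older than its latest one would be flagged under a 30-minute threshold, while one that reported any non-zero utilization in its latest sample would not.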

2.2 Self-Healing Infrastructure

How it works: Blamphs parses CUDA logs, system metrics, and process health to detect failures before they cascade.

CUDA Error Detection: Identifies memory errors, driver crashes, and GPU lock-ups
Zombie Process Cleanup: Finds stuck training jobs hogging resources
Automatic Cordoning: Marks unhealthy nodes as unavailable and drains workloads
Node Recovery: Reboots failed nodes or replaces them with healthy capacity
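
As a rough illustration of the log-parsing step, the sketch below sorts log lines into error categories with regular expressions. The patterns and category names are assumptions chosen for the example; real driver and framework messages vary, and Blamphs's actual detection rules are not documented here.

```python
import re
from typing import Optional

# Illustrative patterns only. Order matters: the first match wins, so an
# out-of-memory message is reported as "oom" rather than "cuda_error".
ERROR_PATTERNS = {
    "oom": re.compile(r"out of memory", re.IGNORECASE),
    "xid": re.compile(r"\bXid\b"),
    "cuda_error": re.compile(r"CUDA error", re.IGNORECASE),
}


def classify_line(line: str) -> Optional[str]:
    """Return the first matching error category for a log line, or None."""
    for category, pattern in ERROR_PATTERNS.items():
        if pattern.search(line):
            return category
    return None


def count_errors(lines: list[str]) -> dict[str, int]:
    """Count how many lines fall into each error category."""
    counts: dict[str, int] = {}
    for line in lines:
        category = classify_line(line)
        if category is not None:
            counts[category] = counts.get(category, 0) + 1
    return counts
```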

2.3 Predictive Cost Management

How it works: Blamphs learns your training patterns and forecasts spend based on historical data.

Budget Guardrails: Set monthly or weekly spending limits
Trend Analysis: Predicts next month's bill based on current usage
Savings Reports: Shows exactly how much Blamphs has saved you
No Surprise Fees: Get alerted before you hit your budget cap
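
A budget alert of the kind described above needs some estimate of where the month will end up. The sketch below uses a plain run-rate projection (average daily spend so far, extended to the end of the month). This is a deliberately simple stand-in, not the forecasting model Blamphs uses, and the function names are hypothetical.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date: float, today: date) -> float:
    """Run-rate projection: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def over_budget(spend_to_date: float, today: date, monthly_budget: float) -> bool:
    """True when the projected month-end spend exceeds the monthly budget."""
    return projected_month_end_spend(spend_to_date, today) > monthly_budget
```

For example, $30,000 spent by June 10 projects to $90,000 for the 30-day month, which would trip an $80,000 monthly limit.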

2.4 Natural Language Configuration

How it works: Configure Blamphs by describing your constraints in plain English. No YAML, no config files.

Example Constraints:
- "Scale down GPUs idle for more than 30 minutes"
- "Never exceed $80k/month in GPU spend"
- "Keep at least 4 p4d.24xlarge instances warm at all times"
- "Reboot nodes that show CUDA errors twice in 10 minutes"
Real-time Validation: Blamphs confirms it understands your constraints before applying them
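
Once a constraint such as "Reboot nodes that show CUDA errors twice in 10 minutes" has been understood, it still has to be evaluated against events. One way to do that is a sliding time window, sketched below. The class and its parameters are hypothetical and only illustrate the rule's semantics; they do not describe how Blamphs represents constraints internally.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorWindow:
    """Tracks error timestamps and reports when a count is reached in a window."""

    def __init__(self, max_errors: int, window: timedelta) -> None:
        self.max_errors = max_errors
        self.window = window
        self._events: deque = deque()

    def record(self, ts: datetime) -> bool:
        """Record an error at `ts`; return True if the rule now triggers.

        Timestamps are expected in non-decreasing order.
        """
        self._events.append(ts)
        # Drop errors that fall outside the window ending at `ts`.
        while self._events and ts - self._events[0] > self.window:
            self._events.popleft()
        return len(self._events) >= self.max_errors
```

With `ErrorWindow(2, timedelta(minutes=10))`, two errors 15 minutes apart do not trigger the rule, but a third error 5 minutes after the second does.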

2.5 What Blamphs Monitors

GPU Metrics: Utilization, memory usage, temperature, power draw
System Health: CPU, RAM, disk I/O, network throughput
CUDA Logs: Driver errors, OOM events, kernel timeouts
Training Status: Checkpoint saves, epoch completion, loss curves
Cost Data: EC2 on-demand pricing, spot pricing, reserved instance usage
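
For a sense of what the GPU metrics above look like at the source, the sketch below parses one line of output from `nvidia-smi --query-gpu=utilization.gpu,memory.used,temperature.gpu,power.draw --format=csv,noheader,nounits`. Whether Blamphs collects metrics this way is an assumption; the `GpuSample` type and `parse_sample` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GpuSample:
    utilization_pct: float
    memory_used_mib: float
    temperature_c: float
    power_draw_w: float


def parse_sample(csv_line: str) -> GpuSample:
    """Parse one CSV line with four numeric fields in the query order above.

    Raises ValueError if a field is not numeric (nvidia-smi can report
    placeholders such as "[N/A]" for unsupported fields).
    """
    util, mem, temp, power = (float(field) for field in csv_line.split(","))
    return GpuSample(util, mem, temp, power)
```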

3. Security & Privacy

Blamphs.ai is built with security as a core principle. We understand you're trusting us with access to your critical infrastructure.

3.1 AWS Credential Security

Read-Only Access: Blamphs never has write permissions to your AWS infrastructure
IAM Role-Based: Uses AWS IAM roles (no long-lived access keys)
Encrypted Storage: All credentials stored with AES-256-GCM encryption at rest
TLS 1.3: All data transmitted over encrypted HTTPS connections
Revocable Access: Delete the IAM role anytime to instantly revoke access

3.2 Data Protection

What we collect:

GPU utilization metrics (%, memory, temperature)
EC2 instance metadata (instance type, region, availability zone)
CloudWatch logs (CUDA errors, system logs)
Cost and usage data (billing information)

What we DON'T collect:

Training data or model weights
Source code or application logic
Customer data processed by your workloads
SSH keys or database credentials

3.3 Infrastructure Security

Hosted on AWS: Blamphs runs on secure, SOC 2-compliant infrastructure
Network Isolation: Customer data is isolated using VPC segmentation
Access Controls: Role-based access with principle of least privilege
Audit Logs: All API requests and autonomous actions are logged
Regular Audits: Quarterly security reviews and penetration testing

3.4 Compliance

GDPR: Full compliance for European Economic Area users
CCPA: California Consumer Privacy Act compliance
SOC 2 Type II: Available for Enterprise plans
Australian Privacy Principles: Compliance for Australian users

3.5 Your Data Rights

Access: Request a copy of all data we've collected
Deletion: Delete your account and all associated data
Portability: Export your metrics and logs in JSON format
Revocation: Revoke AWS access instantly by deleting the IAM role

For security inquiries, contact security@blamphs.ai.

4. Billing & Pricing

Blamphs.ai offers transparent, predictable pricing with a free trial to get started.

4.1 Pricing Tiers

We offer three pricing tiers based on the size of your GPU infrastructure:

Starter: Up to 10 GPU nodes, $99/month
Growth: Up to 50 GPU nodes, $399/month
Enterprise: Unlimited nodes, custom pricing (contact sales)

All plans include:

Autonomous scaling and self-healing
Real-time monitoring and alerts
Cost tracking and savings reports
Natural language configuration
Email support (24-hour response time)

Enterprise plans add:

SOC 2 Type II compliance
Dedicated Slack channel
Custom integrations
SLA with 99.9% uptime guarantee
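
The tier boundaries in this section can be expressed as a small lookup, shown here as an illustrative sketch. The node limits and prices come from the tier list above; the function name is hypothetical, and Enterprise returns no price because its pricing is custom.

```python
from typing import Optional

# (plan name, maximum GPU nodes, monthly price in USD) from the tier list.
TIERS = [("Starter", 10, 99), ("Growth", 50, 399)]


def plan_for(node_count: int) -> tuple[str, Optional[int]]:
    """Return the smallest plan covering `node_count` and its monthly price.

    Enterprise pricing is custom, so its price is returned as None.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    for name, max_nodes, price in TIERS:
        if node_count <= max_nodes:
            return name, price
    return "Enterprise", None
```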

4.2 Free Trial

14-day free trial — no credit card required to start. The trial includes:

Full access to all Starter plan features
Monitor up to 10 GPU nodes
Real-time savings tracking
Email support

After the trial: You'll be prompted to select a paid plan. If you don't upgrade, your account is paused, and your data is retained for at least 30 days.

4.3 How Billing Works

Monthly Billing: Charged on the same day each month (e.g., if you sign up on March 15, you're billed on the 15th of each month)
Automatic Renewal: Subscriptions renew automatically unless canceled
Proration: If you upgrade mid-month, we prorate the difference
Payment Methods: Credit card (Visa, Mastercard, Amex), ACH transfer (Enterprise only)
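
This guide does not spell out the proration formula, so the following is only one plausible reading: the price difference between plans, scaled by the share of the billing period that remains. Treat the formula and the function name as assumptions and confirm the exact charge with support.

```python
def prorated_upgrade_charge(
    old_price: float, new_price: float, days_remaining: int, days_in_period: int
) -> float:
    """Assumed proration: price difference times the remaining share of the period."""
    if not 0 <= days_remaining <= days_in_period:
        raise ValueError("days_remaining must be between 0 and days_in_period")
    return round((new_price - old_price) * days_remaining / days_in_period, 2)
```

Under this assumption, upgrading from Starter ($99) to Growth ($399) with 15 days left in a 30-day period would cost $150 for the remainder of the period.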

4.4 "No Surprise Fees" Guarantee

We hate surprise charges. Here's our promise:

Fixed Monthly Price: Your subscription cost never changes without notice
No Usage Fees: Pricing is based on node count, not on API calls or data processed
30-Day Notice: If we raise prices, you'll get 30 days' notice
Grandfathered Rates: Existing customers keep their current price for 12 months after a price increase

4.5 Savings Model

How much can I save? Our customers average a 40% reduction in GPU costs:

Idle GPU detection saves 25-35% on average
Spot instance optimization saves an additional 10-15%
Self-healing prevents costly downtime and manual intervention

Example: If you currently spend $10,000/month on GPU compute, Blamphs typically saves you $4,000/month. The $399 Growth plan pays for itself 10x over.
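
The arithmetic behind this example can be checked with a few lines of code. The 40% figure is the customer average quoted above, so actual savings will differ; the function names are illustrative.

```python
def monthly_savings(gpu_spend: float, reduction: float = 0.40) -> float:
    """Estimated monthly savings at a given fractional reduction in GPU spend."""
    return gpu_spend * reduction


def payback_multiple(
    gpu_spend: float, plan_price: float, reduction: float = 0.40
) -> float:
    """How many times over the estimated savings cover the plan price."""
    return monthly_savings(gpu_spend, reduction) / plan_price


if __name__ == "__main__":
    print(monthly_savings(10_000))                 # estimated savings in USD
    print(round(payback_multiple(10_000, 399), 1)) # savings relative to plan price
```

At $10,000/month of GPU spend, the estimate is $4,000 in savings, which is roughly 10 times the $399 Growth plan price.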

4.6 Cancellation & Refunds

Cancel Anytime: No long-term contracts or cancellation fees
Effective Date: Cancellation takes effect at the end of your current billing period
Data Retention: We keep your data for 30 days after cancellation in case you reactivate
Refund Policy: We generally don't offer refunds, but contact support if you have issues

4.7 Enterprise & Custom Plans

Need more than 50 nodes? Have unique requirements? Contact our sales team:

Email: sales@blamphs.ai
We offer volume discounts, custom contracts, and on-prem deployment options

5. Support & Resources

Need help? Here's how to reach us:

Email Support: support@blamphs.ai (24-hour response time)
Documentation: Full guides at blamphs.ai/docs
Status Page: Check system status at blamphs.ai (look for "Systems nominal")
Security Issues: security@blamphs.ai
Sales Inquiries: sales@blamphs.ai

6. Frequently Asked Questions

Q: Can Blamphs accidentally shut down critical workloads?

A: No. Blamphs only has read-only access to your AWS infrastructure. We can recommend scaling actions, but you control which actions we can execute. You can configure constraints like "never scale down nodes running training jobs" to add safety guardrails.

Q: What if I already use Kubernetes or AWS Auto Scaling?

A: Blamphs complements existing tools. We integrate with Kubernetes to detect pod-level GPU usage and work alongside AWS Auto Scaling Groups. Think of us as an intelligent layer on top that understands GPU-specific workloads.

Q: How long does setup take?

A: Most customers are up and running in under 10 minutes. Creating the IAM role takes about 5 minutes, and Blamphs auto-discovers your infrastructure immediately afterward.

Q: Do you support multi-cloud (GCP, Azure)?

A: Not yet. We're AWS-only for now but plan to support GCP and Azure in Q3 2026. Join the waitlist in your dashboard settings.

Q: What happens if Blamphs goes down?

A: Your infrastructure keeps running normally: we're monitoring and optimizing, not operating your workloads. If Blamphs is unavailable, you simply lose autonomous management temporarily. We offer a 99.9% uptime SLA for Enterprise customers.