My data science team needed GPUs.
I set one up.
I'm on the infra team. My data science homies are out here building real models — diffusion experiments, fine-tuning runs, big batch jobs — all of it choked by laptops that tap out at 16GB RAM. So they kept sliding into my DMs: “bhai, can you set up GPU compute?”
Standard move. I spun up a GPU server, shipped the credentials, and felt like I'd handled it.
per day. every day. whether anyone was training or not.
The GPU was idle
most of the time.
Here's what nobody tells you when you sign the GPU rental contract: the GPU only does useful work while someone's actively training, but the meter runs around the clock. Between jobs? Idle. When training wraps at 2am? Still running. Monday morning before anyone logs in? Billed. Full rate.
The DS team developed this whole coordination ritual. Someone finishes, they send the message — “GPU's free, who wants it next?” — in Slack. If nobody responds fast enough, the meter just keeps going.
I ran the numbers. For roughly 65% of the hours we paid for, the GPU sat idle. The org was spending serious money on a machine that was, most hours of the day, doing absolutely nothing. That's not a utilization problem. That's a billing model problem.
Renting compute per day when you need it per job is like hiring a full-time delivery driver because you order food three times a week.
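To make the idle-time math concrete, here is a minimal sketch. Only the 65% idle share comes from the numbers above; the daily rate and the 30-day window are made-up placeholders, not our actual bill.

```python
# Rough idle-cost arithmetic for a GPU rented at a flat daily rate.
# ASSUMPTIONS: DAILY_RATE_USD and DAYS are hypothetical placeholders.
# Only IDLE_FRACTION (about 65%) comes from the post.

DAILY_RATE_USD = 100.0   # hypothetical flat per-day rental price
IDLE_FRACTION = 0.65     # share of paid hours with no job running
DAYS = 30                # hypothetical billing window


def rental_breakdown(daily_rate, idle_fraction, days):
    """Split a flat per-day bill into the idle part and the useful part."""
    total = daily_rate * days
    idle = total * idle_fraction
    return {"total": total, "idle": idle, "useful": total - idle}


def cost_per_useful_hour(daily_rate, idle_fraction):
    """What one hour of actual training costs once idle hours are priced in."""
    useful_hours_per_day = 24 * (1 - idle_fraction)
    return daily_rate / useful_hours_per_day


breakdown = rental_breakdown(DAILY_RATE_USD, IDLE_FRACTION, DAYS)
print(f"Total bill:          ${breakdown['total']:.2f}")
print(f"Paid for idle time:  ${breakdown['idle']:.2f}")
print(f"Paid for real work:  ${breakdown['useful']:.2f}")
print(f"Nominal hourly rate: ${DAILY_RATE_USD / 24:.2f}")
print(f"Per useful hour:     ${cost_per_useful_hour(DAILY_RATE_USD, IDLE_FRACTION):.2f}")
```

Whatever the real daily rate is, the ratio holds: at 65% idle, each hour of actual training costs about 2.9x the nominal hourly rate, because 1 / (1 - 0.65) is roughly 2.86.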
Before anyone trains a model,
someone has to set it all up.
That someone was me. Every time. CUDA. cuDNN. GPU drivers — and god forbid the CUDA version doesn't match the driver, because then you're on Stack Overflow reading posts from 2019. Then Python environments. Then SSH keys for the whole team. Then shared storage so outputs don't vanish when a job ends. Then security groups. Then someone messages you that it's broken again.
It took roughly 3 days to get everything working. And “working” is generous — it broke regularly and someone had to fix it every time.
The data science team was blocked on infra. The infra team was pulled into maintenance instead of building. The whole org was paying in time, headcount, and daily GPU rental, plus the mental overhead of running a Slack thread about who gets the GPU next.
Two problems. One billing model problem: you pay per day, you need compute per job. One infrastructure problem: someone has to set it up, and someone has to keep it running. I wanted to eliminate both.
SkyPilot. Modal. Good tools.
Didn't fit the problem.
SkyPilot automates a lot, but it needs a controller VM running in your AWS account at all times — even when zero jobs are active, you're paying for the controller. Idle cost with extra YAML.
Modal is clean, but it's their infrastructure. Their pricing layer on top of AWS. Your code moves through their system. For teams with compliance requirements, or anyone who just wants their data in their own account — that's a hard no.
The gap wasn't a better dashboard. It was a tool that got out of the way completely — one command, GPU running, job done, machine gone.
And deeper than billing: why should running a GPU job require an AWS account at all? Most AI engineers don't want to deal with IAM users, access keys, and VPC configs just to fine-tune a model. The real problem isn't just how you pay — it's how much you have to set up before you can run anything.
No AWS. No setup.
Just run.
I built crunr on one principle: the job is the unit, not the day. Run a script, a GPU fires up, the job runs, outputs land on your laptop, and the machine disappears. Between jobs: $0.00. Not rounded. Exactly zero.
And I took it further. You shouldn't need an AWS account to do any of this.
crunr cloud: get an API key at cloud.crunr.com, run crunr cloud login --key <key>, then crunr run train.py --gpu. Under 60 seconds from zero to a running GPU job. No AWS. No CUDA install. No IAM policy. No infra engineer.
Behind that one command, crunr cloud provisions the cheapest matching GPU, uploads your code, installs your requirements.txt, runs your script, streams output to your terminal, downloads results, and terminates the instance. Termination runs in a finally: block, so whether the job finishes, crashes, or gets a Ctrl+C, the terminate call still runs and the machine doesn't stay up.
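The terminate-in-finally pattern is easy to show in a few lines. This is a minimal sketch of the idea, not crunr's actual code: FakeInstance, run_job, and the provision callback are hypothetical names invented for illustration, and a real version would call a cloud provider API instead of flipping a flag.

```python
# Sketch of "always terminate" using try/finally.
# ASSUMPTION: FakeInstance, run_job and provision are illustrative names,
# not crunr's real internals.


class FakeInstance:
    """Stand-in for a cloud GPU instance."""

    def __init__(self):
        self.running = True

    def terminate(self):
        self.running = False


def run_job(job, provision=FakeInstance):
    """Provision an instance, run the job on it, and always tear it down."""
    instance = provision()
    try:
        return job(instance)
    finally:
        # Runs on normal return, on any exception, and on
        # KeyboardInterrupt (Ctrl+C), before the error propagates.
        instance.terminate()


if __name__ == "__main__":
    created = []

    def tracked_provision():
        inst = FakeInstance()
        created.append(inst)
        return inst

    def crashing_job(instance):
        raise RuntimeError("training blew up")

    try:
        run_job(crashing_job, tracked_provision)
    except RuntimeError:
        pass

    print("still running after crash?", created[-1].running)
```

One caveat on the pattern itself: a finally block does not run if the local process is killed outright (SIGKILL, power loss), so this sketch only covers the three cases named above.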
For teams already on AWS who want their code to go directly from their laptop to their own account — no relay, not even crunr in the path — BYOC is there. Your keys never leave your machine. Full audit trail in CloudTrail. But you don't need it to start. Most people don't.
Two paths, one command: crunr run train.py --gpu. Switch backends any time with crunr use cloud or crunr use aws.
No more
“GPU is free, who wants it?”
The data science team doesn't coordinate GPU access anymore. Everyone runs their own jobs whenever they need to — their own instance, their own bill, no shared machine. No Slack thread. No pressure to keep a GPU busy to justify the daily cost.
New engineers get an API key and run in under a minute. No AWS account. No setup doc. No infra engineer to ping. The engineers with strict compliance requirements use BYOC — direct to their own account, nothing in between.
Cloud compute should cost what you used, not what you could have used. And running a GPU job shouldn't require three days of setup. The tool should get out of the way. The job output is the point.
