Per-Second GPU Billing: 40-60% Inference Savings for India

Every Indian AI founder I've spoken to in the last six months has the same complaint: "Our cloud bill is killing us, and most of it isn't even the work we're doing - it's the idle time we're paying for."

They're right. And the math, once you actually run it, is brutal.

If you're running inference on hourly-billed GPUs, you are very likely paying 2x to 3x more than you need to. Not because you picked the wrong GPU. Not because your model is inefficient. But because the billing model itself is wrong for your workload.

This post shows you exactly where the money leaks, with four real workload patterns and rupee-level math. By the end, you'll know whether per-second billing actually applies to your stack - or whether hourly is fine.

No hand-waving. No "up to 90% savings!" marketing. Just numbers.

The Core Problem: GPU Inference Is Bursty, But Hourly Billing Isn't

Training is a long, predictable workload. You spin up a GPU, run for 6–48 hours, spin it down. Hourly billing works fine here because your utilization is close to 100%.

Inference is the opposite. Inference traffic is spiky, unpredictable, and full of dead air:

A user sends a prompt. Your model runs for 3 seconds. Then nothing for 47 seconds.
A scheduled batch job runs for 8 minutes at 2 AM. The GPU sits idle the other 23 hours and 52 minutes.
A RAG pipeline embeds a document in 12 seconds, then waits for the next query.

In every one of these cases, hourly billing forces you to pay for the slowest unit of time the cloud will charge you: a full hour. If your job runs for 90 seconds, you pay for 3,600 seconds. That's a 40x markup on the actual compute you used.

Per-second billing fixes this by charging you for the seconds you actually used. The savings aren't theoretical — they show up the moment you switch.

Let's prove it with four workloads Indian startups actually run.

Workload 1: Bursty Inference API (the most common pattern)

Scenario: You're a SaaS startup running a Llama 3 8B inference endpoint for B2B customers. Traffic is moderate - about 200 requests per hour during Indian business hours, near-zero overnight. Average inference time: 4 seconds per request.

Compute footprint per day:

200 requests/hour × 10 active business hours = 2,000 requests
2,000 requests × 4 seconds = 8,000 seconds of actual GPU work
That's 2 hours and 13 minutes of real compute per day

Hourly billing reality on an A100 80GB at ₹189/hour:

You can't actually only pay for 2.2 hours, because traffic is spread across 10 hours.
To serve traffic across the day, you keep the GPU running for at least 10 hours.
Cost: 10 × ₹189 = ₹1,890/day = ~₹56,700/month
Actual GPU utilization: ~22%
You're paying for 78% idle time.

Per-second billing on the same A100 80GB at the same effective rate (₹189/hour ÷ 3,600 = ₹0.0525/second):

8,000 seconds × ₹0.0525 = ₹420/day = ~₹12,600/month
Cost reduction: ~78%

Even if you assume per-second pricing is slightly higher per second to offset the flexibility (say, 20% premium), you still land at around ₹15,120/month - roughly 73% lower than hourly.

This is the single biggest win, and it's why bursty inference is the canonical per-second billing use case.

Workload 2: RAG Pipeline with Embedding + Generation

Scenario: You're building a document-Q&A product. A user uploads a PDF, you embed it, store the vectors, and answer questions. Embedding takes 8–15 seconds. Generation takes 3–6 seconds per question. Average user: 1 upload + 4 questions per session. About 60 sessions per day. Compute footprint per day:

Embedding: 60 × 12 sec = 720 seconds
Generation: 60 × 4 questions × 4 sec = 960 seconds
Total: 1,680 seconds = 28 minutes of actual GPU work
But sessions are spread across ~12 hours of the day Hourly billing on an L40S at ~₹120/hour:
GPU runs for 12 hours to cover the spread = ₹1,440/day = ₹43,200/month
Actual utilization: 28 min / 720 min = ~3.9%
You're paying for 96% idle time

Per-second billing at the same effective rate (₹0.033/second):

1,680 seconds × ₹0.033 = ₹55/day = ₹1,650/month
Cost reduction: ~96%

This sounds insane, but the math is right - and this is exactly why RAG-heavy startups burn through pre-seed funding faster than they expect on hourly clouds. RAG is extremely spiky, and most founders don't see the bill problem until month three.

In practice, savings will land closer to 85–90% once you factor in cold starts, model loading, and minor per-second pricing premiums. Still life-changing.

Workload 3: Scheduled Batch Jobs (overnight processing)

Scenario: You run a content moderation service. Every night at 2 AM, you batch-process the day's flagged content through a vision model. The job takes 18 minutes on an A100.

Hourly billing:

A100 at ₹189/hour, billed for full hour minimum = ₹189/day = ₹5,670/month
Actual utilization for that "hour": 18/60 = 30%

Per-second billing:

18 minutes × 60 = 1,080 seconds × ₹0.0525 = ₹56.7/day = ₹1,701/month
Cost reduction: ~70%

Batch jobs are the quiet, unsexy win. They look small per day, but they compound across a year. ₹5,670 vs ₹1,701 per month is ₹47,628 in annual savings - for one job. Most ML teams have 5–10 of these.

Workload 4: Serverless API with Cold Starts

Scenario: You expose a Stable Diffusion XL endpoint that runs about 80 requests/day, unpredictably distributed. Each request takes ~7 seconds of actual GPU time, but the model takes ~12 seconds to cold-start when the GPU goes idle.

Hourly billing reality:

To avoid cold starts on every request, you keep the GPU warm for 16 hours/day
16 × ₹120 (L40S) = ₹1,920/day = ₹57,600/month
Actual GPU work: 80 × 7 sec = 560 sec/day = ~9 minutes
Utilization: 0.9%. Yes, less than one percent.

Per-second billing with smart warm-pool management:

Pay only for the 9 minutes of actual work + occasional cold-start overhead
Realistic monthly cost: ₹3,500–5,500/month depending on how aggressive your scale-to-zero is
Cost reduction: ~90–94%

This is the workload pattern where per-second billing isn't just cheaper - it's the only economically viable option. Running serverless-style APIs on hourly billing is roughly equivalent to keeping a taxi parked outside your house all day in case you need a 5-minute ride

The Honest Caveats

A few things this post is NOT claiming:

1. Per-second billing always wins. It doesn't. For long-running training jobs (6+ hours of continuous GPU usage), per-second offers no real advantage - you're going to use the full hour anyway, so per-hour or committed-use pricing is often cheaper.

2. The savings are pure profit. They're not. If your workload is genuinely bursty, per-second billing reveals that you've been over-provisioning, and the savings are real. But you'll also need to invest in scale-to-zero logic, warm-pool management, and cold-start optimization to capture them fully.

3. Cold starts are free. They're not. A model that takes 30 seconds to load costs you 30 seconds of billing every time it cold-starts. If your traffic is just spiky enough to trigger constant cold starts but just dense enough to need warmth, you can actually end up worse off. The fix is provider-level cold-start optimization (FlashBoot, model caching, etc.) - which good per-second clouds offer.

4. Per-second pricing sometimes carries a small premium per second. A few providers charge 10–20% more per second of compute than the equivalent per-hour rate, on the theory that flexibility has a price. Even with this premium, the workloads above still come out massively ahead - but you should run the math on your specific traffic pattern.

When Per-Second Billing Wins (And When It Doesn't)

A simple decision rule:

Use per-second billing if:

Your GPU utilization is below 60%
Your workload is bursty, request-driven, or scheduled
You're running RAG, real-time inference APIs, or batch jobs under an hour
You're early-stage and your traffic is unpredictable

Stick with per-hour or committed pricing if:

You're training a model for 8+ continuous hours
Your inference traffic is dense enough to keep utilization above ~70%
You can commit to 1-3 months of usage and want maximum discount

For 80%+ of Indian AI startups we saw - RAG-based, inference-heavy, traffic-spiky, pre-PMF - per-second is the rational choice. For the other 20%, hourly committed pricing wins.

How PodStack Approaches This

PodStack bills per-second by default. There's no "you must run for at least 1 hour" minimum, no rounding up, no surprises on the invoice. You pay for the seconds the GPU actually ran your code.

We pair this with two things that make per-second billing actually work in practice:

Fractional GPU allocation. If your inference job only needs 25% of an A100, you can rent 25%. Combined with per-second billing, this means small workloads pay genuinely small bills - not "small fraction of a big bill."

INR-denominated, no-egress pricing. Per-second savings get destroyed if your provider charges 9¢/GB egress. PodStack bills entirely in INR with zero egress, so the savings you see in our math actually land in your bank account.

The result: most Indian startups moving from AWS/Azure hourly inference to PodStack per-second see 40–60% reductions on inference workloads, with RAG and serverless workloads sometimes hitting 70–90%. These aren't marketing numbers - they're what falls out of the math when bursty workloads stop paying for idle time.

Run Your Own Numbers

Before you switch anything, do this exercise tonight:

Pull last month's GPU bill.
Open your monitoring dashboard.
Calculate: (Total GPU hours billed) vs (Total GPU hours actually computing).
The ratio is your utilization.

If it's below 60%, you're a per-second billing candidate, full stop. If it's below 30%, you're losing money every single day you stay on hourly.

The Indian AI infrastructure market is finally giving you the tools to fix this. Use them.

Why Per-Second GPU Billing Saves Indian Startups 40–60% on Inference (With Real Math)

The Core Problem: GPU Inference Is Bursty, But Hourly Billing Isn't

Workload 1: Bursty Inference API (the most common pattern)

Workload 2: RAG Pipeline with Embedding + Generation

Workload 3: Scheduled Batch Jobs (overnight processing)

Workload 4: Serverless API with Cold Starts

The Honest Caveats

When Per-Second Billing Wins (And When It Doesn't)

How PodStack Approaches This

Related posts

How To Blur Faces in Videos Using a Jupyter Notebook on Podstack

Podstack vs. Runpod vs. CoreWeave: Which Cloud GPU Platform Should You Choose in 2026?

Subscribe to the Podstack blog