Podstack Blog

How to Generate Multilingual Video Ads with ComfyUI, Wan 2.2, and Sarvam AI

Saurav Kumar — Mon, 18 May 2026 09:00:00 GMT

Running a video ad campaign across India means shipping the same creative in six or more languages. Studios solve this by shooting once and dubbing later — the visuals stay constant, only the voiceover changes. This tutorial reproduces that exact workflow with open-source video diffusion and a single API for Indic language generation.

By the end of this guide you'll have a command-line tool that takes one English brand prompt and produces six 30-second vertical MP4s, one each in Bengali, Odia, Telugu, Tamil, Hindi, and Marathi, sharing identical visuals but with a native voiceover in each language.

What you'll build

A reproducible ComfyUI workflow plus a TypeScript SDK that:

Uses Sarvam-M to draft a 30-second ad script from your brand prompt.
Generates six 5-second clips with Wan 2.2 (480×832) and stitches them into a vertical 30-second video.
Translates the script via Sarvam's translate endpoint and produces native TTS for each language.
Muxes audio and video and writes one MP4 per language to disk.

The hardware we used

We ran the entire pipeline on a single Podstack ComfyUI template with the following configuration:

GPU: 1× NVIDIA L40S, 48 GB VRAM
vCPU: 110
RAM: 241 GB
Persistent storage: 100 GB NFS volume mounted at /data (used for model weights and custom nodes - see Step 0)
Software image: ComfyUI 0.20.1 on PyTorch 2.6 / CUDA 12.4 (Conda env comfyui)

End-to-end runtime for six languages from a cold prompt: 22 minutes. That includes Sarvam script generation, six Wan 2.2 video passes (the latents are reused across languages; only the voiceover changes), translation, six TTS calls, and the final mux.

Cost: Podstack vs the hyperscalers

Podstack bills per minute at ₹2.96/min for this configuration. The 22-minute run cost us ₹65.12 (~$0.78 USD) - one chai's worth of compute for a six-language ad batch.

Here's how that lines up against on-demand pricing on the hyperscalers for comparable hardware (single L40S with enough CPU/RAM to handle the video stitching, list price, US regions, mid-2026):

Podstack (110 vCPU / 241 GB / 1× L40S): ₹2.96/min → ₹65 for 22 min
AWS g6e.12xlarge (48 vCPU / 384 GB / 1× L40S): ~₹14.5/min → ~₹320 for 22 min (≈5× Podstack)
Azure NVads L40S v5 (single-GPU SKU, 36 vCPU / 220 GB): ~₹11/min → ~₹245 for 22 min (≈3.8× Podstack)
GCP g2-standard-96 (L4, not L40S — closest CPU/RAM match): ~₹9/min, but L4 ≠ L40S so you'd run longer and lose VRAM headroom

Hyperscaler numbers are list/on-demand prices and don't include egress, persistent disk, NAT, or India-region surcharges, all of which push real-world bills 20–40% higher. Podstack's INR-denominated billing also avoids FX markup on your card statement.

The qualitative gap is bigger than the raw multiplier: the hyperscalers will happily charge you for the whole hour you spent waiting for the spot quota, the IAM role, and the VPC peering. Podstack's template is one click and bills the 22 minutes you actually used.
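
The per-run figures in the comparison above are plain per-minute arithmetic. A minimal Python sketch that reproduces them from the quoted rates; the helper name `run_cost` is illustrative, and the hyperscaler rates are the approximate list prices quoted above, so the results are estimates rather than billed amounts:

```python
# Per-minute rates in INR, copied from the comparison above.
# The hyperscaler figures are approximate list prices.
RATES_INR_PER_MIN = {
    "Podstack": 2.96,
    "AWS g6e.12xlarge": 14.5,
    "Azure NVads L40S v5": 11.0,
}

RUN_MINUTES = 22  # measured end-to-end runtime for six languages


def run_cost(rate_per_min: float, minutes: int = RUN_MINUTES) -> float:
    """Cost of one pipeline run at a flat per-minute rate, in INR."""
    return round(rate_per_min * minutes, 2)


baseline = run_cost(RATES_INR_PER_MIN["Podstack"])
for name, rate in RATES_INR_PER_MIN.items():
    cost = run_cost(rate)
    ratio = cost / baseline
    print(f"{name}: INR {cost:.2f} for {RUN_MINUTES} min ({ratio:.1f}x Podstack)")
```

The unrounded rates give INR 319 and INR 242 for the two hyperscalers, slightly below the rounded ~₹320 and ~₹245 quoted in the list, which is why the multipliers print as 4.9x and 3.7x rather than 5x and 3.8x.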

Prerequisites

Before you start, make sure you have:

A Podstack account with billing enabled (you'll deploy the ComfyUI template in Step 0).
A Sarvam AI API key. Sign up at sarvam.ai and copy the key from the dashboard.
Node.js 20+ and pnpm on your local machine.
Optional: a kubectl config if you prefer shell-over-kube to the web terminal.

Step 0 - Launch a ComfyUI pod on Podstack

From the Podstack console, choose Deploy → Templates → ComfyUI. Pick the L40S 48 GB flavour. Then - and this is the part most first-time users miss - attach a persistent volume and mount it at /data.

The /data mount is not optional. The ComfyUI image's entrypoint symlinks /data/custom_nodes → /opt/ComfyUI/custom_nodes and /data/models/<subdir>/* → /opt/ComfyUI/models/<subdir>/ on every container start. Without /data mounted, your custom nodes and the 14 GB of Wan 2.2 weights are wiped on every pod restart, and you burn 20 minutes re-downloading them each time.
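
To make the restart behaviour concrete, here is a minimal Python sketch of the kind of re-linking described above. It is not the image's actual entrypoint: the function name `relink` and its arguments are hypothetical, and the real script may handle existing directories differently.

```python
from pathlib import Path


def _absent(path: Path) -> bool:
    """True if nothing, not even a dangling symlink, exists at path."""
    return not (path.exists() or path.is_symlink())


def relink(data_root: Path, comfy_root: Path) -> None:
    """Point ComfyUI at custom nodes and model files kept on the persistent volume."""
    comfy_root.mkdir(parents=True, exist_ok=True)

    # Whole-directory link: <comfy_root>/custom_nodes -> <data_root>/custom_nodes
    nodes_src = data_root / "custom_nodes"
    nodes_dst = comfy_root / "custom_nodes"
    if nodes_src.is_dir() and _absent(nodes_dst):
        nodes_dst.symlink_to(nodes_src.resolve(), target_is_directory=True)

    # Per-file links: <data_root>/models/<subdir>/* -> <comfy_root>/models/<subdir>/
    models_src = data_root / "models"
    if not models_src.is_dir():
        return
    for subdir in sorted(models_src.iterdir()):
        if not subdir.is_dir():
            continue
        target_dir = comfy_root / "models" / subdir.name
        target_dir.mkdir(parents=True, exist_ok=True)
        for weight in sorted(subdir.iterdir()):
            link = target_dir / weight.name
            if _absent(link):  # leave existing files and links untouched
                link.symlink_to(weight.resolve())
```

The point of the sketch is that everything under the ComfyUI tree is only a link: the actual bytes live under the volume root, so they are still there after the container's own filesystem is recreated.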

Suggested volume size: 100 GB - enough for Wan 2.2 + VAE + CLIP + future model swaps. Smaller volumes will fill up the first time you try a second model.

Click Deploy. The pod is reachable in ~60 seconds. Note the assigned URL (e.g. https://zcr59-8188.cloud.podstack.ai) - that's both the ComfyUI UI and the SDK target.

Step 1 - Install the custom nodes on the pod

Open the Podstack web terminal (or kubectl exec) and clone the project under /opt/multilingual-ad. The install script downloads the Wan 2.2 fp8 UNet checkpoints, the matching VAE, the CLIP vision encoder, and pip-installs three custom Sarvam nodes (SarvamScriptNode, SarvamTranslateNode, SarvamTTSNode) into the ComfyUI conda environment.

The script also writes your Sarvam API key to /data/env.sh (mode 0600), and the entrypoint sources that file on every restart, so the key survives pod restarts the same way your models do. The install takes about 20 minutes - most of that is the ~14 GB Wan 2.2 weights download landing on /data.

Once it finishes, restart ComfyUI. There's no supervisorctl in the default image; killing python main.py is enough to trigger a clean pod restart and re-symlink the new custom nodes from /data.

Step 2 - Build the workflow graph once

Open the ComfyUI UI at your pod URL and build the graph from scratch - or import the starter graph from workflows/multilingual_ad.json. The shape is:

SarvamScriptNode → (script text) → six parallel KSampler branches (one per 5-second clip) → VAE Decode → Image Batch Concatenate → VHS_VideoCombine. In parallel, SarvamTranslateNode → SarvamTTSNode feeds the audio input of VHS_VideoCombine.

Export the finished graph as API JSON (Save (API Format)) and commit it to workflows/multilingual_ad.json. Then map the input node IDs in sdk/src/nodes.json so the SDK knows which nodes to overwrite per run.

Step 3 - Run the SDK from your laptop

The SDK targets the ComfyUI HTTP API. It queues the workflow six times: the first run captures the generated English script and the latent seeds; the remaining five reuse them, swapping only the language code passed to SarvamTranslateNode and SarvamTTSNode. This is what guarantees identical visuals across languages.
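
The project's SDK is written in TypeScript; the sketch below is a Python illustration of the same idea, not the SDK's code. It builds one copy of an API-format workflow per language, pinning the sampler seed and changing only the language inputs, then posts each copy to ComfyUI's `/prompt` endpoint. The node-ID mapping, the `target_language` input name, and the two-letter language codes are assumptions made for illustration, and the sketch omits the step where the first run's English script is captured and reused.

```python
import copy
import json
import urllib.request

# Assumed short codes for Bengali, Odia, Telugu, Tamil, Hindi, Marathi.
LANGUAGES = ["bn", "od", "te", "ta", "hi", "mr"]


def build_runs(workflow, node_ids, seed, languages=LANGUAGES):
    """Return (lang, workflow_copy) pairs that differ only in their language inputs."""
    runs = []
    for lang in languages:
        wf = copy.deepcopy(workflow)  # never mutate the exported graph
        code = f"{lang}-IN"
        # "target_language" is a hypothetical input name on the Sarvam nodes.
        wf[node_ids["translate"]]["inputs"]["target_language"] = code
        wf[node_ids["tts"]]["inputs"]["target_language"] = code
        for sampler_id in node_ids["samplers"]:
            wf[sampler_id]["inputs"]["seed"] = seed  # same seed in every run
        runs.append((lang, wf))
    return runs


def queue_prompt(base_url, workflow, client_id="multilingual-ad"):
    """POST one workflow to ComfyUI's /prompt endpoint and return the parsed reply."""
    body = json.dumps({"prompt": workflow, "client_id": client_id}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/prompt",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Because every copy carries the same seed and the same visual inputs, the sampler branches see identical inputs in all six runs; only the translate and TTS nodes receive something different.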

Outputs land in sdk/outputs/<lang>/ad_<lang>-IN_<seed>.mp4 - six files, all sharing the same seed and prompt, differing only in the voiceover track. End-to-end wall-clock on the spec above: ~22 minutes.

Step 4 - Iterate on prompts and languages

Restrict the run to a subset of languages while you iterate on visual quality, then expand once you're happy.

A fixed seed pins the entire batch to the same visual roll, which is the common case for ad QA - you want every reviewer looking at the same shot.

Gotchas worth knowing

Sarvam-M is a reasoning model. Its raw completions include a <think> preamble before the final answer. The SarvamScriptNode strips it before downstream nodes see the text - if you wire Sarvam-M into your own graph, do the same or your script will leak chain-of-thought into the voiceover.
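
If you do wire Sarvam-M into your own graph, the stripping step can be as small as the sketch below. It assumes the reasoning is wrapped in a `<think>...</think>` pair; the function name `strip_think` is illustrative and this is not the SarvamScriptNode's actual code.

```python
import re

# Non-greedy match so several reasoning blocks are each removed separately.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_think(completion: str) -> str:
    """Remove <think>...</think> reasoning from a raw completion."""
    cleaned = _THINK_BLOCK.sub("", completion)
    # If generation was cut off mid-thought there is no closing tag:
    # drop everything from the dangling opening tag onwards.
    cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()
```

Run the function on the completion before it reaches the translate or TTS nodes, so the reasoning text is never translated or spoken.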

Wan 2.2 ships as separate high-noise and low-noise UNets. Use the fp8 variants - they fit in 48 GB with room for the VAE, while the bf16 variants do not.

VHS_VideoCombine produces a silent MP4, without any warning, if the audio input isn't wired. If your output has no voiceover, re-check that connection before anything else.

If you forgot the /data mount in Step 0 and the install script appeared to succeed, you'll find out the hard way on the first pod restart when MISSING: SarvamScriptNode shows up in the verify script. Re-deploy the pod with the volume attached — there is no in-place rescue.

Conclusion

You now have a one-command pipeline that turns a brand prompt into a campaign-ready set of localized video ads, running on an Indian-billed L40S for under a dollar per batch. The interesting part isn't that any one model is doing magic - it's that the production pattern (generate visuals once, swap only the voiceover) maps cleanly onto a ComfyUI graph, a small SDK on top of it, and a pod template that bills per minute instead of per hour.

Next steps: extend the language list (Sarvam covers 11 Indic languages), swap Wan 2.2 for a higher-resolution model when your VRAM budget allows, or wire the SDK into a queue so a marketing team can self-serve through a form. The graph stays the same; only the inputs change.

Watch the output

Here's one of the actual MP4s the pipeline produced - the Hindi voiceover variant from a "Premium chai brand, cozy monsoon vibe" prompt. Visuals are identical across the other five languages; only the audio track changes.

Sample output: sample-output-hindi.mp4 (1.3 MB, hosted on the Podstack-ai/example repo)

Try it yourself

Spin up the same ComfyUI template we used at podstack.ai - new accounts get a joining bonus that covers more than enough credits to run this six-language pipeline end-to-end and have room left over to experiment.

All the code in this post - the install script, the custom Sarvam nodes, the workflow JSON, and the TypeScript SDK - is open source at github.com/Podstack-ai/example. Clone it, point it at your pod, and you should have your first multilingual ad batch in under 30 minutes.

How To Blur Faces in Videos Using a Jupyter Notebook on Podstack

Saurav Kumar — Tue, 12 May 2026 18:30:00 GMT

Introduction

When building video datasets that contain real people - such as stock footage, surveillance clips, or user-generated content - protecting the privacy of individuals is critical. Faces must be anonymised before any dataset can be responsibly published or shared.

In this tutorial, you will walk through a Jupyter notebook, faceblur_opencv.ipynb, that runs entirely on Podstack.ai using the PyTorch CUDA 12 + OpenCV template. The notebook is organised into self-contained cells, each building on the last. By the time you reach the final cell, it will have:

Streamed and filtered the WebVid-10M dataset to find videos containing people
Downloaded those videos using Python's requests library
Verified each file is readable with OpenCV
Detected every face in every frame using MTCNN (Multi-task Cascaded Convolutional Networks) on GPU
Blurred each detected face using OpenCV's Gaussian blur
Written the anonymised frames to new video files
Archived everything into a single zip for download

The notebook produced these results on a single Podstack GPU pod: 550 videos processed, 256,326 frames read, 171,480 faces blurred - in approximately 92 minutes.

Prerequisites

Before you begin, you will need:

A Podstack.ai account - sign up and claim your joining bonus to receive free GPU credits
A pod launched from the PyTorch CUDA 12 + OpenCV template, which comes with torch, torchvision, opencv-python, and CUDA 12 pre-installed
The notebook file faceblur_opencv.ipynb, which you can upload directly to your pod's Jupyter environment

The following additional packages are installed inside the notebook itself in the first cell, so no manual setup is required:

datasets - for streaming WebVid-10M from Hugging Face
requests - for downloading video files
facenet-pytorch - for MTCNN face detection
tqdm - for progress tracking

Step 1 - Launching Your Podstack Pod and Opening the Notebook

Log in to Podstack.ai and click New Pod. From the template gallery, select the PyTorch CUDA 12 + OpenCV template. This template ships with:

Python 3.10
PyTorch with CUDA 12 support
OpenCV pre-built with video codec support
JupyterLab accessible directly from your browser

Once your pod is running, click Open JupyterLab from the pod dashboard. In the JupyterLab file browser, upload faceblur_opencv.ipynb using the upload button, then double-click it to open it.

Note: The Podstack PyTorch CUDA 12 + OpenCV template pre-configures all CUDA environment variables. You do not need to set CUDA_HOME or install GPU drivers manually - the pod handles this for you.

Step 2 - Cell 1: Exploring the Dataset

The first cell loads the WebVid-10M dataset in streaming mode and prints the very first entry to confirm the connection is working.

Cell output:

The dataset is loaded in streaming mode - no data is cached locally. The next(iter(ds)) call fetches only the first entry over the network, confirming the dataset is accessible without downloading all 10 million records.

Note: streaming=True means each next() call fetches one entry from the Hugging Face servers in real time. This is ideal for large datasets where you only need a subset.

Step 3 - Cell 2: Filtering for Videos That Contain People

The second cell adds a keyword filter on the video caption (name field) to find clips likely to contain human faces. Only videos whose captions include words like "woman", "man", "person", or "face" are kept.

Cell output:

This keyword approach is a fast, cheap heuristic - it will not catch every video containing a face, but it dramatically narrows the candidate pool before any expensive GPU inference runs.
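
A minimal sketch of such a caption filter, written against any iterable of dictionaries with a `name` field so it works on a streamed dataset without loading it fully. It is not the notebook's cell: the helper names are illustrative, and matching on whole words (so "man" does not match "mango") is a choice made here, not something the notebook is known to do.

```python
import re
from itertools import islice

KEYWORDS = ("woman", "man", "person", "face")

# \b anchors restrict matches to whole words, case-insensitively.
_PATTERN = re.compile(r"\b(" + "|".join(KEYWORDS) + r")\b", flags=re.IGNORECASE)


def mentions_person(entry: dict) -> bool:
    """True if the caption in entry['name'] contains one of the keywords."""
    return bool(_PATTERN.search(entry.get("name", "")))


def first_matches(stream, limit: int) -> list:
    """Lazily pull entries from the stream until `limit` matches are found."""
    return list(islice(filter(mentions_person, stream), limit))
```

Because `filter` and `islice` are both lazy, the stream is only consumed as far as needed to collect `limit` matches.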

Step 4 - Cell 3: Downloading a Sample Video

The third cell downloads the filtered video using requests in streaming mode. Chunked downloading avoids loading the entire file into memory at once, which matters when working with many files.
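
A minimal sketch of a chunked download with `requests`, shown for illustration rather than as the notebook's actual cell. The function names, the 1 MiB chunk size, and the 30-second timeout are assumptions.

```python
def write_chunks(chunks, dest) -> int:
    """Write an iterable of byte chunks to dest; return the number of bytes written."""
    written = 0
    with open(dest, "wb") as handle:
        for chunk in chunks:
            if chunk:  # skip empty keep-alive chunks
                handle.write(chunk)
                written += len(chunk)
    return written


def download(url: str, dest, chunk_size: int = 1 << 20, timeout: float = 30) -> int:
    """Stream url to dest without holding the whole file in memory."""
    import requests  # third-party; imported here so write_chunks has no dependencies

    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()  # fail early on 4xx/5xx instead of saving an error page
        return write_chunks(response.iter_content(chunk_size=chunk_size), dest)
```

With `stream=True` the response body is not fetched until `iter_content` is iterated, so at most one chunk is held in memory at a time.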

Cell output:

Step 5 - Cell 4: Verifying the Video with OpenCV

Before committing GPU time to a file, this cell checks that OpenCV can open it and successfully read at least one frame. This guards against corrupt downloads and codec-incompatible files - both of which appear in real-world datasets.

Cell output:

The tuple (316, 600, 3) represents height, width, and the three BGR colour channels OpenCV uses by default. If opened returns False, the file is either missing, corrupt, or using an unsupported codec.
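
The check described above can be sketched as follows. This is an illustration, not the notebook's cell: the function name `verify_video` and the `capture_factory` parameter (added so the check can be exercised without a real video file) are assumptions.

```python
def verify_video(path: str, capture_factory=None):
    """Return (opened, first_frame_shape); the shape is None if no frame could be read."""
    if capture_factory is None:
        import cv2  # opencv-python; imported lazily

        capture_factory = cv2.VideoCapture
    cap = capture_factory(path)
    try:
        opened = cap.isOpened()
        # Only attempt a read if the container could be opened at all.
        ok, frame = cap.read() if opened else (False, None)
    finally:
        cap.release()  # always free the decoder handle
    return opened, (tuple(frame.shape) if ok else None)
```

A file that opens but yields no frame returns `(True, None)`, which is worth treating as a failure too, since a truncated download can still have a readable header.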

Step 6 - Cell 5: Running the Face Detection and Blur Loop

This is the core cell of the notebook. It processes every downloaded video file frame by frame - using MTCNN for face detection and OpenCV's Gaussian blur for anonymisation - then writes each modified frame to a new output file.

Cell output:

The tqdm progress bar updates live in the notebook output area as each video is processed. Some videos emitted codec warnings to stderr ("Unable to read codec parameters" and "moov atom not found"), but these were caught by the try/except block and did not interrupt the run.

Understanding the Key Parameters

keep_all=True tells MTCNN to return bounding boxes for every face in the frame, not just the highest-confidence one. This is essential for crowd scenes or any frame with more than one person.

cv2.GaussianBlur(face, (51, 51), 30) applies a Gaussian blur with a 51×51 kernel and standard deviation of 30. A larger kernel produces a heavier blur. The kernel dimensions must always be odd integers. This setting renders faces unrecognisable without leaving a visually jarring black rectangle over the region.

cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) is required on every frame because OpenCV reads video as BGR by default, while MTCNN was trained on RGB images. Skipping this conversion leads to noticeably degraded detection accuracy.

cv2.VideoWriter_fourcc(*'mp4v') uses the MPEG-4 codec for output, which is broadly compatible across platforms. If you need H.264 output, replace 'mp4v' with 'avc1', though availability depends on your OpenCV build.

Warning: If MTCNN returns a bounding box that partially falls outside the frame boundaries, slicing frame[y1:y2, x1:x2] can produce an empty array. The face.size > 0 guard prevents a crash in this case.
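
An alternative to skipping such boxes is to clip them to the frame first, so a face at the edge is still blurred. The sketch below is one way to do that under stated assumptions: boxes are `(x1, y1, x2, y2)` in pixel coordinates, and the helper names are illustrative rather than taken from the notebook.

```python
def clamp_box(box, frame_width: int, frame_height: int):
    """Clip an (x1, y1, x2, y2) box to the frame; return None if nothing is left."""
    x1, y1, x2, y2 = (int(round(float(v))) for v in box)
    x1, x2 = max(0, x1), min(frame_width, x2)
    y1, y2 = max(0, y1), min(frame_height, y2)
    if x2 <= x1 or y2 <= y1:
        return None  # box lies entirely outside the frame
    return x1, y1, x2, y2


def blur_faces(frame, boxes, kernel=(51, 51), sigma=30):
    """Blur each box in place on a BGR frame (NumPy array of shape H x W x 3)."""
    import cv2  # opencv-python; imported lazily so clamp_box stays dependency-free

    height, width = frame.shape[:2]
    for box in boxes:
        clamped = clamp_box(box, width, height)
        if clamped is None:
            continue
        x1, y1, x2, y2 = clamped
        frame[y1:y2, x1:x2] = cv2.GaussianBlur(frame[y1:y2, x1:x2], kernel, sigma)
    return frame
```

A negative `x1` or `y1` is the usual culprit behind the empty slice: NumPy interprets it as an offset from the end of the axis, so the start of the slice can land past its end.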

Step 7 - Cell 6: Archiving the Output

The final cell zips the entire blurred_videos directory into a single archive for easy download.

Cell output:

This produces blurred_videos.zip in the notebook's working directory. You can download it directly from the JupyterLab file browser by right-clicking the file and selecting Download.
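
A minimal sketch of the archiving step using only the standard library; the function name and default arguments are illustrative, and the notebook's own cell may differ.

```python
import shutil


def archive_outputs(src_dir: str = "blurred_videos", archive_name: str = "blurred_videos") -> str:
    """Zip the contents of src_dir into <archive_name>.zip and return the archive path."""
    # make_archive appends the ".zip" extension itself.
    return shutil.make_archive(archive_name, "zip", root_dir=src_dir)
```

Keep the archive outside the directory being zipped (as the defaults do), otherwise the half-written zip can end up inside itself.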

Conclusion

In this tutorial, you walked through faceblur_opencv.ipynb cell by cell - a Jupyter notebook running on Podstack's PyTorch CUDA 12 + OpenCV template. Across six cells, the notebook streamed a large video dataset, filtered for human-containing clips, downloaded and verified them, ran GPU-accelerated face detection with MTCNN, applied Gaussian blur to every detected face, and packaged the results for download. No local environment setup, no driver installation, and no infrastructure management was needed.

What to Try Next

To extend the notebook further, consider these improvements directly in new cells:

Skip frames. At 30fps, consecutive frames are nearly identical. Running MTCNN every 3rd or 5th frame and reusing bounding boxes in between cuts inference time significantly.
Use MTCNN's batch mode. Pass a list of frames instead of one at a time to better saturate the GPU.
Try a faster detector. YOLOv8-face and RetinaFace offer higher throughput than MTCNN if processing speed is the priority.
Process multiple videos in parallel. Wrap the video loop in a ThreadPoolExecutor to process several files concurrently.
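
The frame-skipping idea from the list above can be sketched as a small generator. The `detect` callable stands in for whatever detector you use (for example a wrapper around MTCNN that returns a list of boxes); that wrapper and the name `detect_with_stride` are assumptions.

```python
def detect_with_stride(frames, detect, stride: int = 3):
    """Yield (frame, boxes), running `detect` only on every `stride`-th frame."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    boxes = []
    for index, frame in enumerate(frames):
        if index % stride == 0:
            boxes = detect(frame)  # fresh detection on this frame
        yield frame, boxes  # frames in between reuse the most recent boxes
```

The trade-off is that a fast-moving face can drift out of a stale box, so for anonymisation it is safer to keep the stride small or to pad the reused boxes by a few pixels.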

Run This Notebook on Podstack

This notebook was built and executed entirely on Podstack.ai using the PyTorch CUDA 12 + OpenCV template - no local CUDA setup, no driver headaches, no dependency conflicts. The pod was live and the notebook was running in under a minute.

To run it yourself:

Go to podstack.ai and create a free account
Claim your joining bonus - new users receive free GPU credits on sign-up
Launch a new pod using the PyTorch CUDA 12 + OpenCV template
Upload faceblur_opencv.ipynb via JupyterLab and run each cell from top to bottom

The Podstack template gallery also includes pre-built example notebooks for common computer vision tasks - object detection, image segmentation, video processing, and more - so you can explore and adapt them without starting from scratch.

Don't forget to claim your joining bonus when you sign up - it gives you free GPU credits to try out the notebook examples immediately, at no cost.

How To Fine-Tune an LLM with Unsloth Studio on Podstack

Saurav Kumar — Thu, 07 May 2026 18:30:00 GMT

Introduction

Fine-tuning a large language model (LLM) lets you adapt a general-purpose base model - like Llama, Mistral, or Qwen - to a specific task, domain, or style. Until recently, this required writing custom training scripts, managing CUDA dependencies, and renting expensive GPUs by the hour with no visibility into what your training run was actually doing.

Unsloth is an open-source library that makes LLM fine-tuning roughly 2x faster and uses 50–70% less memory than vanilla Hugging Face training, thanks to hand-written Triton kernels and aggressive memory optimizations. Unsloth Studio is the visual interface built on top of that library: a browser-based UI where you select a model, pick a dataset, configure hyperparameters, and watch training progress in real time - no Python scripts required.

In this tutorial, you will deploy an Unsloth Studio instance on Podstack using a one-click template, configure a QLoRA fine-tuning run on TinyLlama, and monitor the training process through the Studio dashboard. By the end, you will have a fine-tuned model adapter that you can export, deploy, or chat with directly inside the Studio.

Prerequisites

Before you begin, make sure you have:

A Podstack account. You can sign up at cloud.podstack.ai. New accounts come with credits to get started.
Basic familiarity with LLM concepts. You should understand what a base model, dataset, and fine-tuning are at a conceptual level. You do not need to know PyTorch or Hugging Face APIs.
An SSH key pair (optional). Only needed if you want command-line access to your pod. Notebook and Studio access work entirely through the browser.
A Hugging Face account (optional). Required only if you want to use private datasets or push your trained adapter to the Hugging Face Hub.

Step 1 — Spinning Up an Unsloth Studio Pod on Podstack

Podstack provides one-click templates for popular ML environments, including a pre-configured Unsloth Studio image with CUDA 13, the Unsloth library, and the Studio UI ready to go.

To deploy your pod, log in to https://cloud.podstack.ai and navigate to the Pods section in the left sidebar. Click Create Pod, then select the Unsloth Studio (CUDA 13) template from the marketplace.

Next, choose your GPU. Podstack offers a range of options - for this tutorial, an NVIDIA L40S (48 GB VRAM) is a good balance of price and capability. The L40S handles QLoRA fine-tuning on models up to 13B parameters comfortably, and it costs significantly less per hour than an H100. If you plan to fine-tune larger models or use full fine-tuning instead of QLoRA, consider an H100 or A100 instead.

After confirming the configuration, click Deploy. Your pod will be running within about 5 seconds.

Step 2 — Accessing Your Pod

Once your pod is running, click on it from the Pods list to open the pod detail page. You will see four ways to access your environment:

Web Terminal - an interactive shell directly in the browser, useful for running shell commands or installing extra packages.

SSH Access - a standard SSH command in the format ssh -i ~/.ssh/<key_name> podstack@ssh-<id>.cloud.podstack.ai. Use this if you want to connect editors like VS Code Remote or Cursor to your pod.

Notebook Access - a JupyterLab instance on a dedicated subdomain, protected by an auto-generated password.

Studio URL - the Unsloth Studio web interface, also on its own subdomain.

The URL you want for this tutorial is the Studio URL, which will look something like https://<id>-8888.cloud.podstack.ai/studio.

Important: The auto-generated password for notebook and Studio access is shown only once on the pod detail page. Copy it to a password manager before you leave the page. If you lose it, you can regenerate it from the same page.

The right side of the pod detail page also shows live infrastructure stats - CPU, memory, storage, and GPU utilization. Keep this tab open during training to monitor your pod's health independently of the Studio's own GPU monitor.

Step 3 — Configuring Your First Fine-Tuning Run

Open the Studio URL in your browser. You will see the Unsloth Studio interface with a left sidebar containing New Chat, Compare, Search, Train, Recipes, and Export.

Click Train to enter the Fine-tuning Studio. The main panel has three tabs: Configure, Current Run, and History. Start with Configure.

Selecting a Base Model

In the model selector, choose unsloth/tinyllama-bnb-4bit. This is a pre-quantized 4-bit version of TinyLlama (1.1B parameters) - small enough to train quickly while still being a real LLM, which makes it ideal for learning the workflow.

For production work, Unsloth maintains pre-quantized 4-bit versions of most popular open models under the unsloth/ prefix on Hugging Face, including unsloth/llama-3.1-8b-bnb-4bit, unsloth/qwen2.5-7b-bnb-4bit, and unsloth/mistral-7b-v0.3-bnb-4bit. All of these fit comfortably on an L40S with QLoRA.

Choosing a Dataset

For the dataset, use HuggingFaceH4/ultrachat_200k. This is a standard supervised fine-tuning (SFT) dataset of 200,000 multi-turn conversations, widely used as a baseline for instruction tuning. Studio handles the chat template formatting automatically - you don't need to manually concatenate user and assistant turns or insert special tokens.

If you have your own dataset, you can either upload it directly to the pod or reference it by Hugging Face dataset ID. The expected format is either a list of conversation turns or a prompt/completion pair structure.

Selecting a Training Method

Choose QLoRA as your training method. QLoRA quantizes the base model to 4-bit and trains small low-rank adapter matrices on top of it. This gives you most of the quality of full fine-tuning at a fraction of the memory cost - for TinyLlama, QLoRA uses under 4 GB of VRAM where full fine-tuning would need 20+ GB.

The other options are:

LoRA - same idea as QLoRA but without the 4-bit quantization. Use this if you have plenty of VRAM and want slightly higher quality.

Full fine-tune - updates every parameter in the base model. Rarely worth it unless you have a specific reason and a lot of VRAM.
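
A rough back-of-envelope calculation shows where the memory gap between these methods comes from. The sketch below counts weight and optimizer memory only and ignores activations, quantization overhead, and framework buffers, so treat the outputs as lower bounds. The 16 bytes per parameter for full fine-tuning is the common mixed-precision Adam rule of thumb (2 for weights, 2 for gradients, 12 for fp32 master weights and the two Adam moments); it is an assumption here, not a figure measured in Studio.

```python
TINYLLAMA_PARAMS = 1.1e9  # parameter count quoted above


def gigabytes(params: float, bytes_per_param: float) -> float:
    """Memory in GB (10^9 bytes) for `params` values at a given width."""
    return params * bytes_per_param / 1e9


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out weight matrix."""
    # Two low-rank factors: (d_in x rank) and (rank x d_out).
    return rank * (d_in + d_out)


print(f"4-bit base weights:   {gigabytes(TINYLLAMA_PARAMS, 0.5):.2f} GB")
print(f"16-bit base weights:  {gigabytes(TINYLLAMA_PARAMS, 2):.2f} GB")
print(f"full fine-tune state: {gigabytes(TINYLLAMA_PARAMS, 16):.1f} GB")
print(f"LoRA params, one 2048x2048 matrix, rank 16: {lora_params(2048, 2048, 16)}")
```

The full fine-tune estimate comes to about 17.6 GB before any activations are counted, which is consistent with the 20+ GB figure above, while the 4-bit base weights take roughly half a gigabyte and each adapted matrix adds only tens of thousands of trainable parameters.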

Setting Hyperparameters

The defaults Unsloth Studio picks are sensible for most fine-tuning runs:

Learning rate: 2e-4 - standard for QLoRA. Lower it to 1e-4 if your loss is unstable.
LoRA rank: 16 - controls adapter capacity. Increase to 32 or 64 for more domain shift.
LoRA alpha: 16 - typically set equal to rank.
Dropout: 0 - Unsloth's optimizations work best with no dropout.
Batch size: 2 with gradient accumulation: 4 - gives an effective batch size of 8.
Epochs: 1 - UltraChat is large enough that one epoch is plenty.

For your first run, accept the defaults. You can always tune later once you understand how a baseline run behaves.
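
The batch-size arithmetic in the list above, and the step count it implies, can be sketched as follows. The helper names are illustrative, and the step formula assumes one example per sequence with no packing and no dropped final batch, so the number Studio shows may differ.

```python
import math


def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_gpus: int = 1) -> int:
    """Examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


def optimizer_steps(num_examples: int, effective_batch: int, epochs: int = 1) -> int:
    """Optimizer steps needed to see every example `epochs` times."""
    return math.ceil(num_examples / effective_batch) * epochs


batch = effective_batch_size(per_device_batch=2, grad_accum_steps=4)
print(batch)                          # 8
print(optimizer_steps(20_000, batch)) # 2500
```

With the defaults, a 20,000-example subsample works out to 2,500 optimizer steps for one epoch.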

Step 4 — Starting the Run and Reading the Dashboard

Click Start Training. Studio switches to the Current Run tab and the dashboard goes live.

Across the top, you'll see the live training stats:

Step counter showing current step out of total (e.g., Step 15 / 19476)
Loss - the training loss for the current batch
LR - current learning rate
Grad Norm - gradient magnitude after clipping
Model and Method for reference
Throughput - steps per second and total tokens processed
ETA - estimated time remaining

Below that are four real-time charts: Training Loss, Gradient Norm, Learning Rate schedule, and Eval Loss.

On the right is the GPU Monitor, which shows live values for utilization, VRAM usage, temperature, and power draw. After a minute of training on the L40S, you should expect to see something like:

Utilization: 95–100% - the GPU is the bottleneck, which is what you want
VRAM: ~4 / 45 GB for TinyLlama in 4-bit (you have plenty of headroom for larger models or batch sizes)
Temperature: 60–75°C - normal operating range
Power: 320–340 / 350 W - near the power cap, indicating the kernels are fully loaded

If utilization is sitting at 40%, your dataloader is the bottleneck - increase the number of workers or batch size. If VRAM is at 95%, you're one OOM crash away from a failed run - reduce the batch size, and raise gradient accumulation if you want to keep the same effective batch size.
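
These two rules of thumb are easy to encode if you are polling GPU stats yourself. A minimal sketch; the function name and the default thresholds (50% utilization, 95% VRAM) are assumptions chosen to match the guidance above, not values Studio exposes.

```python
def diagnose(gpu_util_pct: float, vram_used_gb: float, vram_total_gb: float,
             low_util_pct: float = 50.0, high_vram_frac: float = 0.95) -> list:
    """Return human-readable warnings for a single GPU reading."""
    warnings = []
    if gpu_util_pct < low_util_pct:
        warnings.append("GPU underused: check the dataloader (more workers or a larger batch)")
    if vram_used_gb / vram_total_gb >= high_vram_frac:
        warnings.append("VRAM nearly full: reduce the batch size before the run hits OOM")
    return warnings
```

For the healthy reading described above (about 98% utilization and 4 of 45 GB used) the function returns an empty list.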

What a Healthy Loss Curve Looks Like

For a healthy QLoRA run, you should see:

Training loss trending down smoothly. For TinyLlama on UltraChat, expect to start around 1.4–1.5 and drop to 1.2–1.3 within the first 50 steps, then continue declining slowly.
Gradient norm stable in the 0.1–0.5 range, with occasional spikes that get clipped. Persistent climbing grad norm means the run is diverging — stop and lower the learning rate.
Learning rate following whatever schedule you configured (linear warmup then cosine decay by default).
Eval loss tracking training loss with a small gap. A growing gap means overfitting; if it appears, stop early or reduce epochs.

Step 5 — Iterating, Comparing, and Exporting

Long fine-tuning runs are usually wasteful. The honest workflow is:

Subsample your dataset to 5,000–20,000 examples
Train for 30–60 minutes
Evaluate the result in Studio's chat interface (left sidebar → New Chat)
If the model behaves as you want, scale up to the full dataset

Studio's Recipes tab (left sidebar) lets you save configurations and re-run them later - useful for hyperparameter sweeps. The Compare tab lets you diff two runs side by side, which is far easier than digging through Weights & Biases logs.

When you're happy with a run, go to Export. You have three options:

Push to Hugging Face Hub - uploads your adapter to a private or public repo
Save as GGUF - quantizes the merged model for use with llama.cpp and Ollama
Merge LoRA into base - produces a standalone fine-tuned model in safetensors format

For most production deployments, GGUF is what you want - it runs efficiently on CPUs and consumer GPUs.

Conclusion

In this tutorial, you deployed Unsloth Studio on a Podstack GPU pod, configured a QLoRA fine-tuning run on TinyLlama, and learned how to read the Studio dashboard to evaluate training health. You also saw how to iterate on runs, compare configurations, and export your fine-tuned model.

The broader takeaway: LLM fine-tuning has gone from a multi-day infrastructure project to a workflow you can complete in an afternoon. Unsloth made the kernels fast, QLoRA made the memory cheap, pre-quantized models made setup trivial, and Studio removed the training script entirely. Paired with per-minute GPU pricing on Podstack, the cost of trying a fine-tuning idea is now small enough that the only real bottleneck is whether you have a dataset worth training on.

Ready to train your own model? Head to https://podstack.ai - claim your joining bonus, spin up an Unsloth Studio instance using our one-click template, and you'll be fine-tuning within seconds.

For deeper exploration, see:

Unsloth documentation for advanced configuration and supported models
QLoRA paper for the theoretical background on 4-bit fine-tuning
Hugging Face datasets hub for finding training data

Is AI Infrastructure the New EV - Already Obsolete?

Saurav Kumar — Wed, 06 May 2026 18:30:00 GMT

Every generation, a new technology earns the label of "stranded asset." Electric vehicles rewrote the calculus for gas-station networks. Streaming obsoleted the DVD rental industry overnight. Now the question is whether the billions being poured into AI data centers share the same fate.

The analogy to EV technology is seductive. Both sectors feature hardware generations that compress years of improvement into months. Both have seen incumbents scramble to retrofit old infrastructure for new demands. And both carry the uncomfortable possibility that today's flagship investment is tomorrow's liability.

But the analogy frays under scrutiny - and understanding exactly where it breaks down is key.

Where the Comparison Holds

Nvidia's GPU roadmap is the clearest proxy. The A100 (2020) was the undisputed training champion. Within two years, the H100 made it economically inferior. The H200 followed. The B200 is now state-of-the-art - on an 18-month cadence that shows no signs of slowing.

For organizations that purchased AI compute on a CapEx basis - building private data centers, buying physical racks - this pace represents genuine obsolescence risk. Not because the hardware stops working, but because training next-generation models demands architectures it cannot support.

Power infrastructure compounds this. As model training pushes toward gigawatt-scale consumption, data centers built for earlier power densities face retrofit costs that rival the original investment.

Where the EV Analogy Breaks Down

Unlike an EV battery pack - a physical assembly that cannot be firmware-updated into a different chemistry - AI infrastructure is fundamentally software-defined. The same GPU cluster that trained a model in 2023 can run inference workloads in 2026. The compute doesn't expire; it gets redeployed.

More importantly, the dominant model of AI infrastructure consumption is cloud-based OpEx, not on-premises CapEx. When AWS or Azure absorbs the hardware refresh cycle, the enterprise customer is insulated from the churn. What was existential for the gas station owner is a rounding error for the driver paying at the pump.

The Hidden Obsolescence: Economic, Not Physical

The subtler and more dangerous form of obsolescence is economic, not physical. Older chips don't stop functioning - they become uncompetitive. H100 clusters get repriced into inference workloads where their cost-per-token still makes sense.

But for organizations sized around CapEx models - sovereign AI initiatives, startups that bought rather than rented - the math is unforgiving. The next model generation requires 5–10x the compute of the previous one. If your infrastructure can't scale, you're running yesterday's AI at today's prices.

This is the gas-station-in-2015 problem. The station wasn't broken. The economics were broken.

Who Is Actually Exposed?

The exposure is concentrated. Cloud-native organizations consuming AI as a service carry minimal direct risk - their providers absorb the generational churn. The exposed parties are those who made large, fixed bets on specific hardware: governments building sovereign AI capacity, enterprises running private AI clouds, and the hyperscalers themselves.

The hyperscalers aren't naive about this. Microsoft, Google, and Amazon structure hardware procurement as a rolling refresh - never fully committed to one generation, always hedging into the next.

The Correct Mental Model

The most accurate frame isn't EV batteries - it's semiconductor fabs. A fab built for 28nm didn't break when 7nm arrived. It became economically inferior for some workloads and superior for others. The asset didn't disappear; it found its level in a tiered market.

AI infrastructure will follow the same pattern. H100 clusters become inference workhorses. A100s handle fine-tuning. The cutting edge advances; the previous generation reprices rather than vanishing.

The risk is not obsolescence in the EV sense. The risk is being caught holding a CapEx position the market reprices faster than your depreciation schedule. That's a financial risk masquerading as a technology risk - and it requires sophisticated hedging, not panic and not complacency.

Podstack vs. Runpod vs. CoreWeave: Which Cloud GPU Platform Should You Choose in 2026?

Saurav Kumar — Tue, 05 May 2026 18:30:00 GMT

The cloud GPU market has matured rapidly, and "just rent an H100" is no longer the simple decision it sounds like. Pricing models, regional availability, data residency, deployment workflows, and target audiences all vary dramatically between providers - and the right choice depends as much on who you are as on what you're building.

Three platforms come up repeatedly in conversations about GPU infrastructure: Runpod, CoreWeave, and Podstack. They're often grouped together, but they're aimed at very different users. Runpod is the developer-first, globally available GPU cloud built for fast iteration. CoreWeave is the AI-native hyperscaler powering frontier labs like OpenAI and Mistral. Podstack is India's sovereign GPU cloud, purpose-built for teams that need INR billing, data residency inside Indian data centers, and DPDP compliance.

This guide breaks down how each platform compares across GPU performance, pricing, deployment experience, customizability, integrations, compliance, and community - so you can pick the one that actually fits your workload.

Platform Overview

Runpod launched in 2022 and has become the go-to choice for AI developers, researchers, and startups who want to spin up GPUs in seconds and only pay for what they use. It runs across 30+ global regions through a mix of Secure Cloud (professional data centers) and Community Cloud (vetted providers), supports more than 30 GPU SKUs from RTX 4090s up to B200s, and bills per minute. The whole platform is designed to remove infrastructure friction so you can focus on shipping models.

CoreWeave has been in the GPU game since 2017 and now positions itself as an "AI-native" hyperscaler. It's the infrastructure of choice for OpenAI, Mistral AI, and IBM's Granite models. CoreWeave runs a Kubernetes-native platform optimized for massive multi-node training clusters, with bare-metal infrastructure, InfiniBand networking, and managed lifecycle services. It's powerful - and clearly aimed at enterprises and frontier labs that have predictable, large-scale compute needs and procurement teams to match.

Podstack is the newest entrant of the three, founded in 2024 by ex-Oracle engineers (with backgrounds spanning IIT Kharagpur, IIM Lucknow, and IIIT Bengaluru, including a Docker Captain, CNCF Ambassador, and Google Developer Expert). Podstack positions itself explicitly as "the RunPod of India" - a sovereign GPU cloud running entirely inside Indian data centers, with INR billing, zero egress fees, ISO 27001 certification, and DPDP (Digital Personal Data Protection Act) compliance. It offers NVIDIA L40S, A100, and H100 GPUs from ₹92/hour with pay-per-second billing, plus a proprietary PodVirt platform for fractional GPU allocation.

In short: Runpod is built for global developer agility, CoreWeave for enterprise-scale AI infrastructure, and Podstack for teams that need compliance, data residency and currency-native billing.

GPU Selection and Performance

All three platforms run NVIDIA's serious AI hardware, but the depth of catalog and target SKUs differ.

Runpod offers the broadest spread - 30+ GPU types covering everything from RTX 4000 (16GB) for tiny inference jobs up through 4090s, L40/L40S, A6000, A100 (40GB and 80GB), H100, H200, and B200. That range matters because not every workload needs an H100. A Stable Diffusion fine-tune runs beautifully on a 4090 at a fraction of the price. A 7B model inference job is wasted on an H100. Runpod lets you right-size, and new GPUs roll out as soon as NVIDIA ships them.

CoreWeave focuses on data-center-grade GPUs: A40, A6000, A100, H100, H200, GB200, and B200, typically in multi-GPU SXM configurations with NVLink and InfiniBand interconnects. If you're training a 70B+ parameter model across 64+ GPUs and need every byte of bisection bandwidth, this is the kind of stack you want. Consumer GPUs like the 4090 generally aren't part of the catalog - CoreWeave isn't trying to serve hobbyists.

Podstack offers a focused but practical catalog: NVIDIA L40S (48GB), A100 (40/80GB), and H100. The L40S in particular is interesting for Indian teams - it's a strong inference and fine-tuning GPU at a much lower price point than an H100, and it's available with INR billing and no cross-border data transfer concerns. Podstack's PodVirt platform also enables fractional GPU allocation, so you can rent a slice of an A100 or H100 instead of paying for a whole card when you don't need it - useful for cost-sensitive experimentation.

For raw single-GPU performance, an A100 is an A100 regardless of provider. The differentiator is which GPUs are available, how quickly you can get one, and where the silicon physically sits.

Pricing and Cost Efficiency

Pricing is where the three platforms diverge most clearly, because they're priced for different customers.

Runpod publishes transparent on-demand rates and bills by the millisecond with no commitments. Recent published rates include H100 PRO around $1.90/hr, A100 80GB around $1.79/hr, L40S around $1.22/hr, RTX 4090 around $0.69/hr, and entry-level cards under $0.40/hr. There's no charge for ingress or egress, no minimum spend, and you can stop a pod the moment your job finishes. For bursty workloads — generating a few hundred images, fine-tuning a small model overnight, prototyping a new architecture — this is hard to beat.

CoreWeave's pricing is largely contract-driven for serious customers. They offer reserved capacity, multi-month commitments, and SLA-backed pricing that can be very competitive at scale (especially against AWS, GCP, and Azure for the same hardware). On-demand rates exist but the platform is really optimized for teams that know they'll burn millions of GPU-hours and want to lock in capacity. If you're doing a one-off weekend project, CoreWeave isn't structured to make that easy or cheap.

Podstack's rates depend on the GPU tier and start from ₹100/hour for the L40S (roughly $1.10 USD at current exchange rates), with pay-per-second billing and zero egress fees. For Indian teams, the bigger story is currency and compliance: paying in INR removes FX volatility from your cost forecasting, GST is handled cleanly for Indian tax purposes, and there's no surprise bill from data leaving the country. Fractional GPU allocation through PodVirt also means you can run development workloads on a slice of an A100 for a fraction of the full-card cost.

The right comparison really depends on your situation. A US-based startup doing iterative ML work will almost always find Runpod cheapest in practice. A frontier lab doing a 6-month pre-training run will probably get the best per-hour rate from CoreWeave on a reserved contract. An Indian team building a domestic AI product where customer data can't leave the country will find Podstack uniquely positioned - the others can't legally serve that workload the same way.

Deployment Experience and Ease of Use

Runpod's deployment model is built around speed. You pick a GPU, pick a template (or bring your own container), and click deploy. FlashBoot technology gets pods cold-starting in seconds, and the Hub has pre-configured templates for PyTorch, TensorFlow, ComfyUI, vLLM, Stable Diffusion Web UI, and dozens of other common frameworks. JupyterLab, SSH, and persistent volumes are all available with no infrastructure work on your part. Someone with no Kubernetes experience can have a model running in under five minutes.

CoreWeave is Kubernetes-native by design. That's a feature if you're an infrastructure team that already speaks Kubernetes - you get fine-grained control over scheduling, networking, storage classes, and orchestration, and you can integrate cleanly with existing GitOps workflows. It's a steeper learning curve if you don't. There's no one-click "launch Stable Diffusion" button on CoreWeave; you're expected to bring your own manifests, Helm charts, or container definitions. The payoff is total control and production-grade orchestration.

Podstack sits closer to Runpod on the ease-of-use spectrum. It offers Pods, VMs, and serverless inference, plus a Python SDK and CLI for programmatic control. The platform is explicitly designed to feel familiar to developers coming from Runpod (the comparison is right in their marketing), with quick provisioning and pre-built environments. The fractional GPU allocation through PodVirt is the standout proprietary capability - useful when you want development workloads on a slice of an expensive card without spinning up a full instance.

For most individual developers and small teams, the experience hierarchy is: Runpod and Podstack are smooth and fast; CoreWeave is powerful but requires real DevOps muscle.

Compliance, Data Residency, and Sovereignty

This is the dimension that has changed the most in the last two years, and it's where Podstack carves out its clearest advantage.

Runpod operates Secure Cloud data centers across 31 global regions and is SOC 2 Type II compliant. It's a strong fit for international teams and offers HIPAA-aligned options. Data can be pinned to specific regions, but the platform is fundamentally a global network — appropriate for most international AI workloads but not specifically tailored to any single country's data sovereignty requirements.

CoreWeave operates a growing footprint of US and European data centers and offers enterprise-grade compliance (SOC 2, HIPAA, etc.). It's well-suited to large US and EU enterprises but doesn't currently market itself as a sovereign cloud for any specific non-Western jurisdiction.

Podstack runs entirely inside Indian data centers and is built around Indian regulatory requirements - DPDP Act compliance, ISO 27001 certification, GST-compliant INR invoicing, and zero cross-border data transfer for compute and storage. For Indian banks, healthcare companies, government-adjacent AI projects, or any team working with data that must remain within Indian borders, this isn't a nice-to-have - it's the whole reason to pick a provider. The IndiaAI mission and the broader push toward sovereign compute make this category increasingly important.

If your data has no jurisdictional constraints, Runpod or CoreWeave will serve you well. If you're building for the Indian market and need data residency, Podstack is in a category of one among the three. Podstack also plans to expand into Dubai and South Asian countries, which would extend the same sovereign-cloud model to those regions.

Customizability and Integration

Runpod ships REST APIs, SDKs, and serverless inference endpoints designed specifically for AI workflows. You can launch pods programmatically, deploy models as auto-scaling endpoints, integrate with Hugging Face for model pulls, hook into Weights & Biases for experiment tracking, and build full applications around its API surface without ever touching Kubernetes. The platform is opinionated toward AI-specific workflows - if you want to deploy a Stable Diffusion endpoint and call it via HTTP from your app, that's a documented path with templates.

CoreWeave gives you infrastructure primitives. APIs, Terraform providers, Kubernetes operators, and bare-metal access let you build whatever you want — including production-grade multi-region inference platforms with custom networking and storage. The trade-off is that you're building it. There's no managed "Stable Diffusion API" service; you're constructing those higher-level abstractions yourself on top of CoreWeave's compute and orchestration.

Podstack provides a Python SDK, CLI, and REST APIs for managing pods, VMs, and serverless inference. It's designed to be programmatically accessible from the start, with templates for ComfyUI, Unsloth fine-tuning, vLLM, and other common AI tooling. The serverless inference offering pairs nicely with the L40S for cost-effective production endpoints. For Indian developers integrating AI into domestic products, the combination of INR billing, local-currency invoicing, and AI-native APIs reduces friction substantially.

Community and Support

Runpod has cultivated an active, public community - a large Discord server where users swap tips and templates, public tutorials and blog posts, and 24/7 support across all tiers. The community feel is one of the platform's quieter strengths; if you hit a weird issue with a specific model, someone has probably already posted about it.

CoreWeave's support model is enterprise-style: dedicated account contacts for major customers, ticket-based support, and deep technical engineering relationships for the labs they serve. There's no widely-known public community forum, because their customer base mostly doesn't need one. Documentation is solid; community discussion is mostly absent.

Podstack is younger and its community is still forming. The founding team's public profile (Docker Captain, CNCF Ambassador, Google Developer Expert) brings credibility and visibility in the cloud-native and Indian developer communities, and the company is active in the local AI ecosystem. For Indian developers, having founders who understand the local market and respond directly to feedback is genuinely useful.

Which Platform Fits Which Workload

A quick decision guide:

If you're an individual developer, researcher, hobbyist, or a startup anywhere in the world that wants the broadest GPU selection, the fastest deployment, the most transparent pay-as-you-go pricing, and an active community to lean on, Runpod is the natural choice. It scales from your first weekend project up through production inference at thousands of requests per second, and you don't need a procurement department to use it.

If you're a frontier AI lab, a large enterprise running sustained training workloads, or any team that needs Kubernetes-native orchestration, multi-thousand-GPU clusters, InfiniBand networking, and SLA-backed contract pricing - and you have the infrastructure team to operate at that level - CoreWeave is built for you.

If you're an Indian AI team, an enterprise that must keep data inside Indian borders, a regulated industry working under DPDP requirements, or a startup that wants INR billing and zero FX risk, Podstack is the only one of the three actually designed for your constraints. The L40S availability, fractional GPU allocation, and ISO 27001 + DPDP compliance package addresses real, jurisdictionally-specific problems that the global platforms can't solve as cleanly.

These three aren't really competing for the same customer - they're each strongest where the others are weakest. Runpod owns global developer agility. CoreWeave owns enterprise scale. Podstack owns Indian sovereignty. Pick the one whose center of gravity matches yours.

FAQ

Q: Can I run Stable Diffusion or fine-tune an LLM on all three platforms? A: Yes. All three offer GPUs with sufficient VRAM (24GB+) for image generation and small-to-medium LLM fine-tuning. Runpod has the most pre-built templates for these workflows, Podstack has L40S options that are particularly good for inference and fine-tuning at lower cost, and CoreWeave will handle them at scale though with more setup work.

Q: Which platform is cheapest for a one-off project? A: Runpod's per-second billing and broad GPU range typically make it cheapest for short, intermittent jobs globally. For Indian teams, Podstack's INR billing and zero egress can come out lower in practice once FX and bandwidth costs are included. CoreWeave is generally not optimized for one-off projects.

Q: Do I need Kubernetes knowledge? A: For Runpod, no - the platform abstracts containers behind a simple UI and templates. For Podstack, no - the SDK and CLI are designed for direct developer use. For CoreWeave, yes - Kubernetes fluency is essentially required to use the platform effectively.

Q: What if my data has to stay in India? A: Podstack is the only one of these three that runs entirely inside Indian data centers with DPDP compliance, INR billing, and zero cross-border data transfer. Runpod and CoreWeave both have global footprints but aren't structured around Indian data sovereignty requirements.

Q: Which is best for production inference at scale? A: All three can do it. Runpod's serverless inference with auto-scaling endpoints is the fastest path for most teams. CoreWeave is best for very large-scale production deployments where you need fine-grained orchestration. Podstack's serverless inference with L40S GPUs is excellent for production inference targeting Indian users with low latency.

Q: What about fractional GPUs? A: Podstack's PodVirt platform offers fractional GPU allocation as a core feature, useful for cost-controlled development. Runpod offers smaller GPUs and community cloud options that achieve similar cost-efficiency. CoreWeave is generally focused on full-GPU and multi-GPU configurations rather than fractional allocation.

Why Per-Second GPU Billing Saves Indian Startups 40–60% on Inference (With Real Math)

Saurav Kumar — Mon, 04 May 2026 09:51:00 GMT

Every Indian AI founder I've spoken to in the last six months has the same complaint: "Our cloud bill is killing us, and most of it isn't even the work we're doing - it's the idle time we're paying for."

They're right. And the math, once you actually run it, is brutal.

If you're running inference on hourly-billed GPUs, you are very likely paying 2x to 3x more than you need to. Not because you picked the wrong GPU. Not because your model is inefficient. But because the billing model itself is wrong for your workload.

This post shows you exactly where the money leaks, with four real workload patterns and rupee-level math. By the end, you'll know whether per-second billing actually applies to your stack - or whether hourly is fine.

No hand-waving. No "up to 90% savings!" marketing. Just numbers.

The Core Problem: GPU Inference Is Bursty, But Hourly Billing Isn't

Training is a long, predictable workload. You spin up a GPU, run for 6–48 hours, spin it down. Hourly billing works fine here because your utilization is close to 100%.

Inference is the opposite. Inference traffic is spiky, unpredictable, and full of dead air:

A user sends a prompt. Your model runs for 3 seconds. Then nothing for 47 seconds.
A scheduled batch job runs for 8 minutes at 2 AM. The GPU sits idle the other 23 hours and 52 minutes.
A RAG pipeline embeds a document in 12 seconds, then waits for the next query.

In every one of these cases, hourly billing forces you to pay in the coarsest unit of time the cloud bills: a full hour. If your job runs for 90 seconds, you pay for 3,600 seconds. That's a 40x markup on the actual compute you used.

Per-second billing fixes this by charging you for the seconds you actually used. The savings aren't theoretical — they show up the moment you switch.
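
The rounding effect described above can be sketched in a few lines of Python. This is an illustrative model only: the function name billed_seconds and the idea of a single billing "granularity" are assumptions made for the example, not any provider's actual invoicing logic.

```python
import math

SECONDS_PER_HOUR = 3600

def billed_seconds(job_seconds: float, granularity_seconds: int) -> int:
    """Round a job's runtime up to the next billing increment (hypothetical model)."""
    return math.ceil(job_seconds / granularity_seconds) * granularity_seconds

job = 90  # seconds of actual GPU work, as in the example above

hourly = billed_seconds(job, SECONDS_PER_HOUR)  # billed in whole hours
per_second = billed_seconds(job, 1)             # billed in whole seconds

# Markup = billed time divided by time actually used
print(hourly, per_second, hourly / job)  # prints: 3600 90 40.0
```

The shorter the job relative to the billing increment, the larger the markup, which is why short bursty requests are hit hardest.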

Let's prove it with four workloads Indian startups actually run.

Workload 1: Bursty Inference API (the most common pattern)

Scenario: You're a SaaS startup running a Llama 3 8B inference endpoint for B2B customers. Traffic is moderate - about 200 requests per hour during Indian business hours, near-zero overnight. Average inference time: 4 seconds per request.

Compute footprint per day:

200 requests/hour × 10 active business hours = 2,000 requests
2,000 requests × 4 seconds = 8,000 seconds of actual GPU work
That's 2 hours and 13 minutes of real compute per day

Hourly billing reality on an A100 80GB at ₹189/hour:

You can't actually only pay for 2.2 hours, because traffic is spread across 10 hours.
To serve traffic across the day, you keep the GPU running for at least 10 hours.
Cost: 10 × ₹189 = ₹1,890/day = ~₹56,700/month
Actual GPU utilization: ~22%
You're paying for 78% idle time.

Per-second billing on the same A100 80GB at the same effective rate (₹189/hour ÷ 3,600 = ₹0.0525/second):

8,000 seconds × ₹0.0525 = ₹420/day = ~₹12,600/month
Cost reduction: ~78%

Even if you assume per-second pricing is slightly higher per second to offset the flexibility (say, 20% premium), you still land at around ₹15,120/month - roughly 73% lower than hourly.
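
For readers who want to plug in their own traffic, here is a minimal sketch of the Workload 1 arithmetic. The function names are hypothetical, the 30-day month matches the figures above, and the 20% premium is the same illustrative assumption used in the text.

```python
def always_on_monthly(hours_on_per_day: float, rate_per_hour: float, days: int = 30) -> float:
    """Cost when the GPU stays up for the whole traffic window."""
    return hours_on_per_day * rate_per_hour * days

def per_second_monthly(busy_seconds_per_day: float, rate_per_hour: float,
                       premium: float = 0.0, days: int = 30) -> float:
    """Cost when only busy seconds are billed, with an optional per-second premium."""
    rate_per_second = rate_per_hour / 3600 * (1 + premium)
    return busy_seconds_per_day * rate_per_second * days

# 200 requests/hour x 10 business hours x 4 seconds/request = 8,000 busy seconds
busy = 200 * 10 * 4

hourly = always_on_monthly(10, 189)
flat = per_second_monthly(busy, 189)
with_premium = per_second_monthly(busy, 189, premium=0.20)

for label, cost in [("hourly", hourly), ("per-second", flat), ("per-second +20%", with_premium)]:
    print(f"{label}: INR {cost:,.0f}/month, {1 - cost / hourly:.0%} below hourly")
```

With the inputs above this reproduces the ₹56,700, ₹12,600, and ₹15,120 monthly figures.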

This is the single biggest win, and it's why bursty inference is the canonical per-second billing use case.

Workload 2: RAG Pipeline with Embedding + Generation

Scenario: You're building a document-Q&A product. A user uploads a PDF, you embed it, store the vectors, and answer questions. Embedding takes 8–15 seconds. Generation takes 3–6 seconds per question. Average user: 1 upload + 4 questions per session. About 60 sessions per day.

Compute footprint per day:

Embedding: 60 × 12 sec = 720 seconds
Generation: 60 × 4 questions × 4 sec = 960 seconds
Total: 1,680 seconds = 28 minutes of actual GPU work
But sessions are spread across ~12 hours of the day

Hourly billing on an L40S at ~₹120/hour:

GPU runs for 12 hours to cover the spread = ₹1,440/day = ₹43,200/month
Actual utilization: 28 min / 720 min = ~3.9%
You're paying for 96% idle time

Per-second billing at the same effective rate (₹0.033/second):

1,680 seconds × ₹0.033 = ₹55/day = ₹1,650/month
Cost reduction: ~96%

This sounds insane, but the math is right - and this is exactly why RAG-heavy startups burn through pre-seed funding faster than they expect on hourly clouds. RAG is extremely spiky, and most founders don't see the bill problem until month three.

In practice, savings will land closer to 85–90% once you factor in cold starts, model loading, and minor per-second pricing premiums. Still life-changing.
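
To see how overhead pulls the ideal 96% down toward the 85–90% range, here is a rough sketch. The per-session overhead values (40 and 60 seconds for cold start plus model loading) and the 20% premium are assumptions chosen for illustration, not measurements; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RagDay:
    sessions: int = 60
    embed_seconds: float = 12        # middle of the 8-15 s range
    questions_per_session: int = 4
    generate_seconds: float = 4      # per question

    def busy_seconds(self, overhead_per_session: float = 0.0) -> float:
        """GPU seconds per day, plus any assumed per-session overhead."""
        per_session = (self.embed_seconds
                       + self.questions_per_session * self.generate_seconds
                       + overhead_per_session)
        return self.sessions * per_session

def savings_vs_always_on(busy_seconds: float, hours_on: float, premium: float = 0.0) -> float:
    """Fractional saving vs keeping the GPU up for hours_on (the hourly rate cancels out)."""
    return 1 - busy_seconds * (1 + premium) / (hours_on * 3600)

day = RagDay()
print(f"ideal: {savings_vs_always_on(day.busy_seconds(), 12):.1%}")
for overhead in (40, 60):  # assumed seconds of cold start + model load per session
    saving = savings_vs_always_on(day.busy_seconds(overhead), 12, premium=0.20)
    print(f"{overhead}s overhead, 20% premium: {saving:.1%}")
```

Under these assumptions the saving falls from about 96% to somewhere between 85% and 89%, in line with the range quoted above.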

Workload 3: Scheduled Batch Jobs (overnight processing)

Scenario: You run a content moderation service. Every night at 2 AM, you batch-process the day's flagged content through a vision model. The job takes 18 minutes on an A100.

Hourly billing:

A100 at ₹189/hour, billed for full hour minimum = ₹189/day = ₹5,670/month
Actual utilization for that "hour": 18/60 = 30%

Per-second billing:

18 minutes × 60 = 1,080 seconds × ₹0.0525 = ₹56.7/day = ₹1,701/month
Cost reduction: ~70%

Batch jobs are the quiet, unsexy win. They look small per day, but they compound across a year. ₹5,670 vs ₹1,701 per month is ₹47,628 in annual savings - for one job. Most ML teams have 5–10 of these.
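
A short sketch of how this compounds across several nightly jobs. It assumes a full-hour minimum on the hourly side and the article's own convention of 30-day months (so a "year" is 12 × 30 days); the function names and the five-job example are hypothetical.

```python
import math

def monthly_hour_minimum(job_minutes: float, rate_per_hour: float, days: int = 30) -> float:
    """Each nightly run is rounded up to whole hours."""
    return math.ceil(job_minutes / 60) * rate_per_hour * days

def monthly_per_second(job_minutes: float, rate_per_hour: float, days: int = 30) -> float:
    """Each nightly run is billed for exactly the seconds it used."""
    return job_minutes * 60 * (rate_per_hour / 3600) * days

def annual_savings(job_minutes_list: list[float], rate_per_hour: float) -> float:
    """Twelve 30-day months of savings summed over all nightly jobs."""
    return 12 * sum(
        monthly_hour_minimum(m, rate_per_hour) - monthly_per_second(m, rate_per_hour)
        for m in job_minutes_list
    )

one_job = annual_savings([18], 189)        # the moderation job above
five_jobs = annual_savings([18] * 5, 189)  # assumed: five similar jobs
print(f"INR {one_job:,.0f} per year for one job, INR {five_jobs:,.0f} for five")
```

For the single 18-minute job this gives the ₹47,628 annual figure from the text.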

Workload 4: Serverless API with Cold Starts

Scenario: You expose a Stable Diffusion XL endpoint that runs about 80 requests/day, unpredictably distributed. Each request takes ~7 seconds of actual GPU time, but the model takes ~12 seconds to cold-start when the GPU goes idle.

Hourly billing reality:

To avoid cold starts on every request, you keep the GPU warm for 16 hours/day
16 × ₹120 (L40S) = ₹1,920/day = ₹57,600/month
Actual GPU work: 80 × 7 sec = 560 sec/day = ~9 minutes
Utilization: ~0.97% (560 of 57,600 seconds). Yes, less than one percent.

Per-second billing with smart warm-pool management:

Pay only for the 9 minutes of actual work + occasional cold-start overhead
Realistic monthly cost: ₹3,500–5,500/month depending on how aggressive your scale-to-zero is
Cost reduction: ~90–94%

This is the workload pattern where per-second billing isn't just cheaper - it's the only economically viable option. Running serverless-style APIs on hourly billing is roughly equivalent to keeping a taxi parked outside your house all day in case you need a 5-minute ride.
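
One way to reason about the "realistic monthly cost" range is a simple worst-case scale-to-zero model: every request pays for a cold start, its own work, and an idle timeout before the GPU shuts down. The idle_timeout_seconds parameter and its 30 to 45 second values are assumptions for illustration; real warm-pool behaviour depends on the provider and on how requests cluster.

```python
def scale_to_zero_monthly(requests_per_day: int, work_seconds: float,
                          cold_start_seconds: float, idle_timeout_seconds: float,
                          rate_per_hour: float, days: int = 30) -> float:
    """Worst case: every request arrives after the GPU has already scaled to zero."""
    billed_per_request = cold_start_seconds + work_seconds + idle_timeout_seconds
    return requests_per_day * billed_per_request * (rate_per_hour / 3600) * days

always_on = 16 * 120 * 30  # 16 warm hours/day on an L40S at INR 120/hour

for timeout in (30, 45):  # assumed idle timeouts, in seconds
    cost = scale_to_zero_monthly(80, 7, 12, timeout, 120)
    print(f"idle timeout {timeout}s: INR {cost:,.0f}/month, "
          f"{1 - cost / always_on:.0%} below always-on")
```

Under these assumptions the monthly cost works out to roughly ₹3,900 to ₹5,100, inside the range quoted above. A longer timeout costs more per request but avoids cold starts when requests arrive close together.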

The Honest Caveats

A few things this post is NOT claiming:

1. Per-second billing always wins. It doesn't. For long-running training jobs (6+ hours of continuous GPU usage), per-second offers no real advantage - you're going to use the full hour anyway, so per-hour or committed-use pricing is often cheaper.

2. The savings are pure profit. They're not. If your workload is genuinely bursty, per-second billing reveals that you've been over-provisioning, and the savings are real. But you'll also need to invest in scale-to-zero logic, warm-pool management, and cold-start optimization to capture them fully.

3. Cold starts are free. They're not. A model that takes 30 seconds to load costs you 30 seconds of billing every time it cold-starts. If your traffic is just spiky enough to trigger constant cold starts but just dense enough to need warmth, you can actually end up worse off. The fix is provider-level cold-start optimization (FlashBoot, model caching, etc.) - which good per-second clouds offer.

4. Per-second pricing sometimes carries a small premium per second. A few providers charge 10–20% more per second of compute than the equivalent per-hour rate, on the theory that flexibility has a price. Even with this premium, the workloads above still come out massively ahead - but you should run the math on your specific traffic pattern.
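
To put the premium in caveat 4 into perspective, here is a small sketch of the break-even point. It assumes the only difference between the two models is the per-second premium and ignores cold starts and engineering overhead, so real-world thresholds will sit lower than this upper bound.

```python
def break_even_utilization(premium: float) -> float:
    """Utilization below which per-second billing is cheaper than an always-on GPU.

    Always-on cost is proportional to 1 (the whole window is billed);
    per-second cost is proportional to utilization * (1 + premium).
    Setting the two equal gives utilization = 1 / (1 + premium).
    """
    return 1 / (1 + premium)

for premium in (0.10, 0.20):
    print(f"{premium:.0%} premium: break-even at {break_even_utilization(premium):.0%} utilization")
```

Even a 20% premium only tips the balance toward hourly billing once the GPU is busy more than about 83% of the time it is kept running.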

When Per-Second Billing Wins (And When It Doesn't)

A simple decision rule:

Use per-second billing if:

Your GPU utilization is below 60%
Your workload is bursty, request-driven, or scheduled
You're running RAG, real-time inference APIs, or batch jobs under an hour
You're early-stage and your traffic is unpredictable

Stick with per-hour or committed pricing if:

You're training a model for 8+ continuous hours
Your inference traffic is dense enough to keep utilization above ~70%
You can commit to 1–3 months of usage and want maximum discount

For 80%+ of Indian AI startups we saw - RAG-based, inference-heavy, traffic-spiky, pre-PMF - per-second is the rational choice. For the other 20%, hourly committed pricing wins.

How Podstack Approaches This

Podstack bills per-second by default. There's no "you must run for at least 1 hour" minimum, no rounding up, no surprises on the invoice. You pay for the seconds the GPU actually ran your code.

We pair this with two things that make per-second billing actually work in practice:

Fractional GPU allocation. If your inference job only needs 25% of an A100, you can rent 25%. Combined with per-second billing, this means small workloads pay genuinely small bills - not "small fraction of a big bill."

INR-denominated, no-egress pricing. Per-second savings get destroyed if your provider charges 9¢/GB egress. Podstack bills entirely in INR with zero egress, so the savings you see in our math actually land in your bank account.

The result: most Indian startups moving from AWS/Azure hourly inference to Podstack per-second see 40–60% reductions on inference workloads, with RAG and serverless workloads sometimes hitting 70–90%. These aren't marketing numbers - they're what falls out of the math when bursty workloads stop paying for idle time.

Run Your Own Numbers

Before you switch anything, do this exercise tonight:

Pull last month's GPU bill.
Open your monitoring dashboard.
Calculate: (Total GPU hours billed) vs (Total GPU hours actually computing).
The ratio is your utilization.

If it's below 60%, you're a per-second billing candidate, full stop. If it's below 30%, you're losing money every single day you stay on hourly.
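
The exercise above fits in a few lines. This is a sketch with hypothetical function names and made-up input numbers; the 30% and 60% thresholds are the ones stated in this post.

```python
def utilization(billed_gpu_hours: float, computing_gpu_hours: float) -> float:
    """Share of billed GPU time that was spent actually computing."""
    if billed_gpu_hours <= 0:
        raise ValueError("billed_gpu_hours must be positive")
    return computing_gpu_hours / billed_gpu_hours

def verdict(u: float) -> str:
    """Map a utilization ratio to this post's rule of thumb."""
    if u < 0.30:
        return "losing money every day on hourly billing"
    if u < 0.60:
        return "per-second billing candidate"
    return "hourly or committed pricing may be fine; run the detailed math"

# Example inputs (made up): 300 GPU hours billed, 66 hours of real compute
u = utilization(billed_gpu_hours=300, computing_gpu_hours=66)
print(f"utilization {u:.0%}: {verdict(u)}")
```

Billed hours come from the invoice and computing hours from your monitoring dashboard, so both inputs are numbers you already have.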

The Indian AI infrastructure market is finally giving you the tools to fix this. Use them.