Modern AI programs have moved from exploratory pilots to production pipelines with strict SLAs and cost targets. As clusters scale, two pain points show up repeatedly: stranded capacity and unpredictable queue time. When fractional GPUs are scattered across nodes, multi-GPU training jobs cannot fit; when placement ignores interconnect topology, expensive accelerators stall on I/O. Intelligent GPU scheduling solves both by aligning job shapes with the right bins, honoring topology, and feeding back runtime signals into placement and autoscaling. The payoff is higher throughput, steadier iteration cycles, and a more predictable cost per experiment.
Keep the target simple and measurable: Streaming Multiprocessor utilization on active GPUs, median and 95th-percentile queue time under peak load, and cost per successful run. Tie these to SLOs and review them weekly; treat regressions as incidents.
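As a concrete illustration, here is a minimal Python sketch of that weekly rollup, assuming hypothetical job records with queue, start, success, and cost fields (the field names are not from any ZStack or Kubernetes API):

```python
# Minimal sketch: weekly SLO rollup over hypothetical job records.
from dataclasses import dataclass
from statistics import median, quantiles

@dataclass
class JobRecord:
    queued_at: float          # epoch seconds when the job entered the queue
    started_at: float         # epoch seconds when all ranks were running
    succeeded: bool
    gpu_hours: float
    price_per_gpu_hour: float

def slo_report(jobs: list[JobRecord]) -> dict:
    waits = sorted(j.started_at - j.queued_at for j in jobs)
    spend = sum(j.gpu_hours * j.price_per_gpu_hour for j in jobs)
    successes = sum(1 for j in jobs if j.succeeded)
    return {
        "queue_time_p50_s": median(waits),
        "queue_time_p95_s": quantiles(waits, n=20)[18],  # 95th-percentile cut point
        "cost_per_successful_run": spend / max(successes, 1),
    }
```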
Fragmentation arises when fractional slices are spread thinly; “almost fits” becomes “cannot start.” Memory bandwidth and interconnect mismatches silently throttle training. Cold artifact fetch and dataset stalls waste precious GPU minutes. Intelligent schedulers and warmed caches close these gaps.
Topology-aware placement and explicit queueing policies push more experiments through the same hardware. Developers ship faster, finance sees steadier spend, and platform teams can justify capacity with hard numbers.
An effective AI Infrastructure Stack layers cleanly and treats scheduling as a contract between the control plane and the data plane. The key is to abstract heterogeneity without hiding the signals that matter to placement.
At the base: accelerators, hosts, and interconnects. Above them: virtualization for safe tenancy, then containers for packaging jobs. Data and artifact services feed training and inference. On top: serving, observability, and governance. Each layer should surface just enough metadata—GPU model and memory, NVLink island, dataset locality, policy tags—so the scheduler can make high-fidelity decisions.
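For illustration, here is a sketch of the node metadata a scheduler might consume; the field names and sample values are assumptions, not a ZStack or Kubernetes schema:

```python
# Illustrative per-node metadata surfaced to the scheduler.
from dataclasses import dataclass, field

@dataclass
class NodeInfo:
    name: str
    gpu_model: str                  # e.g. "A100-80GB"
    gpu_memory_gb: int
    free_gpus: int
    nvlink_island: str              # GPUs on one island share fast links
    numa_node: int
    cached_datasets: set[str] = field(default_factory=set)
    policy_tags: set[str] = field(default_factory=set)

inventory = [
    NodeInfo("gpu-node-01", "A100-80GB", 80, 4, "island-a", 0, {"imagenet"}),
    NodeInfo("gpu-node-02", "L4-24GB", 24, 8, "island-b", 1, set(), {"inference-only"}),
]
```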
The control plane owns admission, priority, quotas, and policy; the data plane executes, reports health, and exports fine-grained metrics. Avoid hidden node-local scripts that silently override decisions made upstream—make scheduling behavior explicit in queues and policies.
Full-device allocation remains best for large training that saturates memory bandwidth. Fractional modes—MIG-like hard partitioning or MPS-style context sharing—shine for inference and small finetunes. Time-slicing works for bursty evaluation tasks. In ZStack Cloud, administrators can present GPUs as physical devices (pGPU) or virtual devices (vGPU) and switch sharing scopes per device, which maps naturally to class-based offerings for different tenants and workloads. On the GPU Device page, you’ll see pGPU and vGPU lists, status, attachments, and per-device details such as utilization, memory, temperature, and power—useful signals for intelligent placement.
Define a small set of GPU classes (for example, A100-full, A100-MIG-1g, L4-fractional) with clear QoS and quotas. Projects request classes, not raw device IDs, simplifying fairness and showback. Where data isolation is paramount, keep tenants on separate VM or node boundaries and use shared modes only within those guardrails.
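A minimal sketch of such a class catalog plus an admission check, with hypothetical class names, QoS labels, and quotas:

```python
# Hypothetical class catalog: projects request a class name, never a device ID.
GPU_CLASSES = {
    "A100-full":     {"device": "A100-80GB", "fraction": 1.0,  "qos": "guaranteed"},
    "A100-MIG-1g":   {"device": "A100-80GB", "fraction": 1/7,  "qos": "burstable"},
    "L4-fractional": {"device": "L4-24GB",   "fraction": 0.25, "qos": "best-effort"},
}

# Per-project quotas in class units keep fairness and showback independent
# of the underlying hardware layout.
QUOTAS = {
    "team-nlp":    {"A100-full": 16, "L4-fractional": 8},
    "team-vision": {"A100-full": 8,  "A100-MIG-1g": 14},
}

def admit(project: str, gpu_class: str, requested: int, in_use: dict) -> bool:
    """Admission check: allow the request only while quota headroom remains."""
    limit = QUOTAS.get(project, {}).get(gpu_class, 0)
    return in_use.get((project, gpu_class), 0) + requested <= limit

print(admit("team-nlp", "A100-full", 4, {("team-nlp", "A100-full"): 14}))  # -> False
```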
Kubernetes is a strong control fabric when its primitives are tuned for AI.
Jobs declare GPU, CPU, memory, and ephemeral storage requests; affinities and taints steer pods to compatible nodes; topology spread constraints balance hotspots. Admission controllers can auto-inject defaults so users focus on model logic rather than YAML minutiae.
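As a sketch, here is such a job request expressed as a Python dict in the shape of a pod spec that an admission controller might complete with defaults. The `nvidia.com/gpu` resource name comes from the NVIDIA device plugin; the `gpu.class` and `job` label keys, the taint, and the image are illustrative assumptions:

```python
# Sketch of a 4-GPU training pod spec as a plain dict; an admission controller
# could inject most of this so users only write the model logic.
pod_spec = {
    "containers": [{
        "name": "trainer",
        "image": "registry.example.com/train:latest",  # hypothetical image
        "resources": {
            "requests": {"cpu": "16", "memory": "128Gi", "ephemeral-storage": "200Gi"},
            "limits": {"nvidia.com/gpu": "4"},          # extended resource from the device plugin
        },
    }],
    # Taints keep general workloads off GPU nodes; this toleration lets the job in.
    "tolerations": [
        {"key": "gpu-pool", "operator": "Equal", "value": "training", "effect": "NoSchedule"},
    ],
    # Require nodes that advertise the assumed "gpu.class" label.
    "affinity": {"nodeAffinity": {"requiredDuringSchedulingIgnoredDuringExecution": {
        "nodeSelectorTerms": [{"matchExpressions": [
            {"key": "gpu.class", "operator": "In", "values": ["A100-full"]},
        ]}],
    }}},
    # Spread replicas across hosts to avoid hotspots.
    "topologySpreadConstraints": [{
        "maxSkew": 1,
        "topologyKey": "kubernetes.io/hostname",
        "whenUnsatisfiable": "ScheduleAnyway",
        "labelSelector": {"matchLabels": {"job": "llm-finetune"}},
    }],
}
```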
Not all links are equal. A placement that keeps tensor-parallel ranks on the same NVLink island and aligns CPU pages to the right NUMA node frequently yields double-digit throughput gains. Surface NVLink/PCIe groupings via node labels and let the scheduler bias toward co-located ranks.
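A toy selection pass along these lines, assuming the NVLink groupings have already been read from node labels into (node, island, free GPUs) tuples:

```python
# Toy placement bias: keep all tensor-parallel ranks on one NVLink island.
from collections import defaultdict

def pick_island(nodes: list[tuple[str, str, int]], ranks_needed: int):
    """nodes holds (node_name, nvlink_island, free_gpus). Return an island
    that can host every rank, preferring the tightest fit so larger islands
    stay free for bigger gangs; None means the gang must wait."""
    by_island = defaultdict(int)
    for _name, island, free in nodes:
        by_island[island] += free
    candidates = [i for i, free in by_island.items() if free >= ranks_needed]
    return min(candidates, key=lambda i: by_island[i], default=None)

# An 8-rank gang lands on island-b instead of being split across islands.
print(pick_island([("n1", "island-a", 4), ("n2", "island-b", 8)], 8))  # -> island-b
```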
Distributed training fails if only part of a job starts. Gang scheduling holds resources until all ranks can launch. Backfilling opportunistically runs smaller jobs while a gang waits. Priority and preemption ensure release-blocking retrains don’t starve, provided you checkpoint long-running jobs.
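A simplified sketch of gang semantics with opportunistic backfill; a production scheduler would also use runtime estimates and reservations, omitted here:

```python
# All-or-nothing starts with naive backfill behind a waiting gang.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    gpus: int   # every rank must start together (gang semantics)

def schedule(queue: list[Job], free_gpus: int) -> list[str]:
    """Walk the queue in priority order. A job starts only if all requested
    GPUs are free right now; jobs that do not fit keep waiting, and smaller
    jobs behind them may backfill the idle GPUs."""
    started = []
    for job in queue:
        if job.gpus <= free_gpus:
            started.append(job.name)
            free_gpus -= job.gpus
    return started

# The 6-GPU gang at the head waits; the 2-GPU evaluation job backfills.
print(schedule([Job("retrain", 6), Job("eval", 2)], free_gpus=4))  # -> ['eval']
```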
Data parallelism favors large batches; tensor and pipeline parallelism split models that don’t fit in memory; expert (MoE) architectures benefit from shard-aware placement. Provide templates for each so users select strategies rather than assembling them from scratch.
Inference needs low latency and stability; training tolerates queues but needs high utilization. Separate queues and policies accordingly, protecting each from the other’s traffic patterns.
Capacity is a living plan. Autoscaling must anticipate, not merely react; reliability should minimize disruption without hoarding spare GPUs.
Design node groups as bins that mirror job profiles: single-GPU for online inference, 4/8-GPU bins for training gangs, fractional pools for finetunes. Autoscalers that add the right shapes reduce fragmentation and improve “first-try” starts.
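For example, a naive shape matcher under assumed node-group names:

```python
# Hypothetical node-group "bins"; the autoscaler adds whole bins whose shape
# mirrors the job profile instead of arbitrary nodes.
NODE_GROUPS = {
    "inference-1gpu": 1,   # single-GPU nodes for online inference
    "train-4gpu": 4,       # small training gangs
    "train-8gpu": 8,       # large training gangs
    "finetune-frac": 1,    # fractional pool for finetunes
}

def pick_bin(job_gpus: int, fractional: bool) -> str:
    """First-try placement: the smallest bin whose shape fits the job."""
    if fractional:
        return "finetune-frac"
    if job_gpus <= 1:
        return "inference-1gpu"
    return "train-4gpu" if job_gpus <= 4 else "train-8gpu"

print(pick_bin(4, fractional=False))  # -> train-4gpu
```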
Scale stateless inference horizontally; use VPA for steady pre- and post-processing. Preserve a small framebuffer margin on GPU workloads to avoid OOM flaps during micro-bursts.
Queue depth, job arrival rates, and average runtime enable predictive scaling that brings capacity online just before the high-water mark. Pair with cool-down timers and batch windows that encourage bin-packing.
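A back-of-the-envelope version of that prediction, using Little's law to turn arrival rate and average runtime into GPU demand; the 15% headroom factor is an assumption:

```python
# Predictive sizing: expected concurrent jobs = arrival rate x average runtime
# (Little's law), converted into whole nodes of a given shape.
import math

def nodes_needed(arrival_rate_per_h: float, avg_runtime_h: float,
                 gpus_per_job: int, gpus_per_node: int,
                 headroom: float = 0.15) -> int:
    concurrent_jobs = arrival_rate_per_h * avg_runtime_h
    gpu_demand = concurrent_jobs * gpus_per_job * (1 + headroom)
    return math.ceil(gpu_demand / gpus_per_node)

# 6 jobs/hour averaging 2 hours on 4 GPUs each: 48 GPUs plus headroom.
print(nodes_needed(6, 2.0, 4, 8))  # -> 7 nodes of 8 GPUs
```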
Use Pod Disruption Budgets for multi-rank jobs, cordon/drain workflows for GPU hosts, and device health probes so flaky accelerators are quarantined quickly. With checkpointing, preemptions become bumps rather than outages.
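A toy quarantine policy in that spirit; the threshold is arbitrary and `record_probe` stands in for real checks such as ECC errors, thermals, or driver state:

```python
# Quarantine a device after consecutive failed health probes so the scheduler
# stops placing new ranks on it; a passing probe resets the counter.
from collections import defaultdict

FAILURE_THRESHOLD = 3
failures: dict[str, int] = defaultdict(int)
quarantined: set[str] = set()

def record_probe(gpu_id: str, healthy: bool) -> None:
    if healthy:
        failures[gpu_id] = 0
        return
    failures[gpu_id] += 1
    if failures[gpu_id] >= FAILURE_THRESHOLD:
        quarantined.add(gpu_id)   # placement filters exclude this device

for result in (False, False, False):
    record_probe("gpu-node-01/0", result)
print(quarantined)  # -> {'gpu-node-01/0'}
```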
GPUs are fast; keep them fed.
Stage hot datasets near the accelerators. Shard large datasets and prefetch asynchronously. Use content-addressable artifact stores so caches remain reliable across namespaces and teams.
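A minimal prefetch sketch with content-addressable cache keys; `fetch_shard` is a stand-in for your object-store client, not a real API:

```python
# Overlap I/O with compute: fetch the next shard in the background while the
# current one is consumed, and key the local cache by content hash.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def cache_key(payload: bytes) -> str:
    # Same bytes, same key: caches stay valid across namespaces and teams.
    return hashlib.sha256(payload).hexdigest()

def fetch_shard(name: str) -> bytes:
    return f"data-for-{name}".encode()   # placeholder for a real download

def iterate_shards(shard_names: list[str]):
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_shard, shard_names[0])
        for upcoming in shard_names[1:] + [None]:
            data = pending.result()                           # already warm
            if upcoming is not None:
                pending = pool.submit(fetch_shard, upcoming)  # prefetch next
            yield cache_key(data), data

for key, shard in iterate_shards(["shard-000", "shard-001", "shard-002"]):
    pass   # feed the GPU here; key indexes the local content-addressable cache
```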
When collectives dominate, low-latency fabrics matter as much as GPU count. Combine locality-aware placement with tolerations that keep sensitive jobs off slow links. Validate with microbenchmarks across the message sizes you actually use.
Slim images, warmed registry mirrors, and lazy weight loading keep cold-start penalties down. For canary inference, run small warm pools that accept traffic instantly while the rest scale behind.
ZStack focuses on making private and hybrid clouds practical for teams that want performance, data control, and cost predictability on their own terms. In GPU estates, a virtualization-first foundation lets you expose accelerators cleanly to container orchestration while retaining isolation at the VM boundary.
ZStack describes an AI scheduling layer that supports bare metal, VMs, and containers, with fine-grained GPU slicing and binpack/spread strategies—designed to unify heterogeneous accelerators under one control plane and drive higher utilization. It highlights real-time resource monitoring and dynamic scheduling across clusters to reduce cost and improve resiliency.
ZStack Cloud separates pGPU and vGPU in its GPU Device view and exposes per-device metrics—utilization, memory, temperature, power—to drive intelligent GPU scheduling across your AI Infrastructure Stack. Operators define class-based GPU offerings, scope sharing globally or by project, and select passthrough for bandwidth-hungry training or virtualize into multiple vGPUs for dense, multi-tenant inference. With ZSphere as the virtualization base and Zaku handling Kubernetes orchestration and autoscaling, topology-aware placement and QoS policies translate into higher utilization and shorter queue times.
ZStack Zaku is a Kubernetes platform that exposes pGPU and vGPU as classes, unifies multicluster operations, and cuts cold starts.
Many teams start with VM-centric isolation for data-sensitive training, then phase in containers for elasticity. ZStack supports both paths: it documents virtualization of NVIDIA and AMD GPUs into vGPU pools and recommends driver versions so monitoring and vGPU features work properly, and it exposes mediated-device (MDEV) specifications via CLI—handy when building fractional offerings that align to job shapes. Real deployments also show GPU passthrough powering latency-sensitive video analytics, reinforced by backup and DR modules for long-running inference services.
Q: What is an AI Infrastructure Stack, and how does GPU Scheduling improve it?
A: It’s the combination of accelerators and hosts, virtualization and containers, data services, serving, and governance assembled into one platform for training and inference. GPU Scheduling improves it by matching job shapes to the right bins, honoring interconnect topology, and feeding monitoring data into placement. On ZStack, operators can view pGPU/vGPU utilization, memory, temperature, and power—signals that help policies reduce fragmentation and idle time.
Q: How does cluster autoscaling affect queue time and utilization?
A: The autoscaler turns queue pressure into the right capacity at the right time. If your node groups are shaped for common patterns—single-GPU bins for inference, 4/8-GPU bins for training, fractional pools for finetunes—new nodes arrive ready to pack jobs tightly, trimming queue time and raising effective utilization.
Q: When does GPU Sharing make sense, and when should a job get a full device?
A: GPU Sharing exposes part of a device to multiple jobs. Hard partitions (MIG-like) and MPS contexts yield high density and steady QoS for inference and small finetunes. Time-slicing works for bursty evaluations. Memory-hungry training prefers full devices for bandwidth and determinism. In ZStack, admins can virtualize pGPUs into multiple vGPUs and set sharing modes per device for clean multi-tenant boundaries.
Q: How do different parallelism strategies change placement requirements?
A: Data parallelism is lenient on topology but wants steady storage throughput; tensor and pipeline parallelism prefer GPUs on the same NVLink island; expert parallelism benefits from shard-aware placement. Encode these hints as labels and anti-affinity rules, and use gang scheduling so all ranks start together.
Q: Which metrics matter most for evaluating GPU scheduling?
A: Watch SM utilization, queue time (median and P95), and cost per GPU-hour per completed run. Add GPU monitoring metrics like memory utilization, power, and PCIe I/O where available; these confirm whether jobs are compute-bound or I/O-bound and guide bin-shape and topology decisions.