Capabilities & Limitations
GoServe is designed to be a high-performance, developer-friendly alternative to Python-based inference servers. Below is a detailed breakdown of what is currently supported and what is on our roadmap.
Current Capabilities
🚀 High-Performance Core
- Native Inference: Leverages the ONNX Runtime C library directly via CGO, bypassing Python's Global Interpreter Lock (GIL).
- Concurrency: Uses Go's native goroutines to handle multiple concurrent inference requests with minimal overhead (sketched below).
- Resource Efficiency: Extremely low memory footprint (~50MB idle) and sub-second cold starts.
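The concurrency model is plain Go: the standard `net/http` server already runs every request in its own goroutine, so concurrent inference calls need no worker-pool tuning. Below is a minimal sketch of that idea, assuming a hypothetical `/v1/predict` route and a placeholder `runInference` function rather than GoServe's actual API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// runInference is a stand-in for the CGO call into ONNX Runtime.
// In a real server this would execute the loaded ONNX session.
func runInference(input []float32) []float32 {
	out := make([]float32, len(input))
	copy(out, input)
	return out
}

func main() {
	// net/http serves each request on its own goroutine, so many
	// inference requests can be in flight at once with no extra code.
	http.HandleFunc("/v1/predict", func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Data []float32 `json:"data"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(map[string][]float32{"output": runInference(req.Data)})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```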
🧠 Generic Model Engine
- Automated Introspection: No need to manually specify input/output node names. GoServe discovers them automatically upon model loading.
- Dynamic Input Support: Supports multi-dimensional tensors of any rank (e.g., 1D classification features, 4D image tensors).
- Multi-Model Registry: Load and serve multiple different models simultaneously from a single GoServe instance.
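One way to picture the multi-model registry is a thread-safe map from model name to a loaded session. The sketch below is illustrative only; the type and method names (and the `iris_classifier` example) are assumptions, not GoServe's internal API:

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a placeholder for a loaded ONNX Runtime session, including
// the input/output node names discovered automatically at load time.
type Session struct {
	InputNames  []string
	OutputNames []string
}

// Registry maps model names to loaded sessions and is safe for
// concurrent lookups from many request goroutines.
type Registry struct {
	mu     sync.RWMutex
	models map[string]*Session
}

func NewRegistry() *Registry {
	return &Registry{models: make(map[string]*Session)}
}

func (r *Registry) Register(name string, s *Session) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.models[name] = s
}

func (r *Registry) Get(name string) (*Session, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	s, ok := r.models[name]
	if !ok {
		return nil, fmt.Errorf("model %q is not loaded", name)
	}
	return s, nil
}

func main() {
	reg := NewRegistry()
	reg.Register("iris_classifier", &Session{
		InputNames:  []string{"float_input"},
		OutputNames: []string{"label", "probabilities"},
	})
	if s, err := reg.Get("iris_classifier"); err == nil {
		fmt.Println("inputs:", s.InputNames, "outputs:", s.OutputNames)
	}
}
```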
📊 Production Readiness
- Observability: Built-in Prometheus metrics for tracking request volume, latency, and inference performance.
- Resilience: Automatic panic recovery ensures the server stays up even if an unexpected error occurs during inference (see the middleware sketch below).
- Dockerized: Multi-stage Docker build using Google Distroless for a secure, minimal production environment.
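The resilience point comes down to a standard Go pattern: wrap handlers in middleware that calls `recover()`, log the panic, and return a 500 instead of letting the process die. A minimal sketch, with an illustrative `/panic` route standing in for a failing inference call:

```go
package main

import (
	"log"
	"net/http"
)

// recoverMiddleware converts a panic in any downstream handler into a
// 500 response, so a single bad request cannot take down the server.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("recovered from panic: %v", err)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
		panic("simulated inference failure") // stands in for an unexpected runtime error
	})
	log.Fatal(http.ListenAndServe(":8080", recoverMiddleware(mux)))
}
```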
Supported Data Types
Currently, GoServe supports the following ONNX tensor types:
- FLOAT32: Standard for most neural networks and regression models.
- INT64: Common for classification labels and embedding IDs.
Current Limitations (WIP)
While GoServe is ready for many production use cases, it has the following limitations:
1. Data Type Support
- Missing Types: Support for DOUBLE, INT32, STRING, and BOOL tensors is currently being implemented.
- Type Conversion: GoServe expects the JSON input to match the model's required precision (though it handles standard JSON-to-Float64 conversion automatically).
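The conversion note reflects how Go's `encoding/json` behaves: JSON numbers decode into float64, so values bound for a FLOAT32 (or INT64) tensor must be narrowed after decoding. A small sketch of that step, with an illustrative payload shape:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// encoding/json decodes JSON numbers into float64 by default.
	payload := []byte(`{"data": [1.5, 2.0, 3.25]}`)

	var req struct {
		Data []float64 `json:"data"`
	}
	if err := json.Unmarshal(payload, &req); err != nil {
		panic(err)
	}

	// Narrow to float32 before building a FLOAT32 tensor; an INT64
	// tensor would instead cast each value to int64.
	floats := make([]float32, len(req.Data))
	for i, v := range req.Data {
		floats[i] = float32(v)
	}
	fmt.Println(floats)
}
```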
2. REST/JSON Overhead
- Serialization: For very large inputs (like 4K images or massive batches), the time spent parsing JSON strings can exceed the time spent on actual inference.
- Recommendation: Use batches of moderate size for optimal performance.
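One way to follow that recommendation on the client side is to split a large batch into fixed-size chunks and send each chunk as its own request, keeping every JSON payload small relative to inference time. The chunk size below is an arbitrary example, not a tuned value:

```go
package main

import "fmt"

// chunkRows splits a batch of rows into smaller batches of at most size rows.
func chunkRows(rows [][]float32, size int) [][][]float32 {
	var batches [][][]float32
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		batches = append(batches, rows[start:end])
	}
	return batches
}

func main() {
	// 10 rows split into batches of at most 4 rows each.
	rows := make([][]float32, 10)
	for i := range rows {
		rows[i] = []float32{float32(i), float32(i) + 0.5}
	}
	for _, batch := range chunkRows(rows, 4) {
		fmt.Println("batch of", len(batch), "rows") // each batch would be one POST request
	}
}
```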
3. Hardware Support
- CPU Only: Current builds are optimized for CPU inference (x86_64).
- GPU: CUDA/TensorRT execution providers are not yet enabled in the default build.
4. Advanced ONNX Features
- Custom Ops: Models requiring custom C++ operators not included in the standard ONNX Runtime distribution are not yet supported.
Roadmap
Check our GitHub Issues for progress on:
- [ ] gRPC Support: To eliminate JSON overhead.
- [ ] GPU/CUDA Builds: For deep learning acceleration.
- [ ] Full Type Coverage: Supporting all 16+ ONNX data types.
- [ ] Zero-Copy Inference: Direct memory access for ultra-low latency.