Capabilities & Limitations
GoServe is designed to be a high-performance, developer-friendly alternative to Python-based inference servers. Below is a detailed breakdown of what is currently supported and what is on our roadmap.
Current Capabilities
🚀 High-Performance Core
- Native Inference: Leverages the ONNX Runtime C library directly via CGO, bypassing Python's Global Interpreter Lock (GIL).
- Concurrency: Uses Go's native goroutines to handle multiple concurrent inference requests with minimal overhead (sketched below).
- Resource Efficiency: Extremely low memory footprint (~50MB idle) and sub-second cold starts.
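The concurrency model is plain Go: the standard `net/http` server already runs every request in its own goroutine, so concurrent inference calls need no worker-pool tuning. Below is a minimal sketch of that idea, assuming a hypothetical `/v1/predict` route and a placeholder `runInference` function rather than GoServe's actual API:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// runInference is a stand-in for the CGO call into ONNX Runtime.
// In a real server this would execute the loaded ONNX session.
func runInference(input []float32) []float32 {
	out := make([]float32, len(input))
	copy(out, input)
	return out
}

func main() {
	// net/http serves each request on its own goroutine, so many
	// inference requests can be in flight at once with no extra code.
	http.HandleFunc("/v1/predict", func(w http.ResponseWriter, r *http.Request) {
		var req struct {
			Data []float32 `json:"data"`
		}
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		json.NewEncoder(w).Encode(map[string][]float32{"output": runInference(req.Data)})
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```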
🧠 Generic Model Engine
- Automated Introspection: No need to manually specify input/output node names. GoServe discovers them automatically upon model loading.
- Dynamic Input Support: Supports multi-dimensional tensors of any rank (e.g., 1D classification features, 4D image tensors).
- Multi-Model Registry: Load and serve multiple different models simultaneously from a single GoServe instance.
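One way to picture the multi-model registry is a thread-safe map from model name to a loaded session. The sketch below is illustrative only; the type and method names (and the `iris_classifier` example) are assumptions, not GoServe's internal API:

```go
package main

import (
	"fmt"
	"sync"
)

// Session is a placeholder for a loaded ONNX Runtime session, including
// the input/output node names discovered automatically at load time.
type Session struct {
	InputNames  []string
	OutputNames []string
}

// Registry maps model names to loaded sessions and is safe for
// concurrent lookups from many request goroutines.
type Registry struct {
	mu     sync.RWMutex
	models map[string]*Session
}

func NewRegistry() *Registry {
	return &Registry{models: make(map[string]*Session)}
}

func (r *Registry) Register(name string, s *Session) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.models[name] = s
}

func (r *Registry) Get(name string) (*Session, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	s, ok := r.models[name]
	if !ok {
		return nil, fmt.Errorf("model %q is not loaded", name)
	}
	return s, nil
}

func main() {
	reg := NewRegistry()
	reg.Register("iris_classifier", &Session{
		InputNames:  []string{"float_input"},
		OutputNames: []string{"label", "probabilities"},
	})
	if s, err := reg.Get("iris_classifier"); err == nil {
		fmt.Println("inputs:", s.InputNames, "outputs:", s.OutputNames)
	}
}
```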
📊 Production Readiness
- Observability: Built-in Prometheus metrics for tracking request volume, latency, and inference performance.
- Resilience: Automatic panic recovery ensures the server stays up even if an unexpected error occurs during inference (see the middleware sketch below).
- Dockerized: Multi-stage Docker build using Google Distroless for a secure, minimal production environment.
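The resilience point comes down to a standard Go pattern: wrap handlers in middleware that calls `recover()`, log the panic, and return a 500 instead of letting the process die. A minimal sketch, with an illustrative `/panic` route standing in for a failing inference call:

```go
package main

import (
	"log"
	"net/http"
)

// recoverMiddleware converts a panic in any downstream handler into a
// 500 response, so a single bad request cannot take down the server.
func recoverMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		defer func() {
			if err := recover(); err != nil {
				log.Printf("recovered from panic: %v", err)
				http.Error(w, "internal server error", http.StatusInternalServerError)
			}
		}()
		next.ServeHTTP(w, r)
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
		panic("simulated inference failure") // stands in for an unexpected runtime error
	})
	log.Fatal(http.ListenAndServe(":8080", recoverMiddleware(mux)))
}
```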
Supported Data Types
Currently, GoServe supports the following ONNX tensor types:
- FLOAT32: Standard for most neural networks and regression models.
- INT64: Common for classification labels and embedding IDs.
Current Limitations (WIP)
While GoServe is ready for many production use cases, it has the following limitations:
1. Data Type Support
- Missing Types: Support for DOUBLE, INT32, STRING, and BOOL tensors is currently being implemented.
- Type Conversion: GoServe expects the JSON input to match the model's required precision (though it handles standard JSON-to-Float64 conversion automatically).
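The conversion note reflects how Go's `encoding/json` behaves: JSON numbers decode into float64, so values bound for a FLOAT32 (or INT64) tensor must be narrowed after decoding. A small sketch of that step, with an illustrative payload shape:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// encoding/json decodes JSON numbers into float64 by default.
	payload := []byte(`{"data": [1.5, 2.0, 3.25]}`)

	var req struct {
		Data []float64 `json:"data"`
	}
	if err := json.Unmarshal(payload, &req); err != nil {
		panic(err)
	}

	// Narrow to float32 before building a FLOAT32 tensor; an INT64
	// tensor would instead cast each value to int64.
	floats := make([]float32, len(req.Data))
	for i, v := range req.Data {
		floats[i] = float32(v)
	}
	fmt.Println(floats)
}
```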
2. REST/JSON Overhead
- Serialization: For very large inputs (like 4K images or massive batches), the time spent parsing JSON strings can exceed the time spent on actual inference.
- Recommendation: Use batches of moderate size for optimal performance.
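One way to follow that recommendation on the client side is to split a large batch into fixed-size chunks and send each chunk as its own request, keeping every JSON payload small relative to inference time. The chunk size below is an arbitrary example, not a tuned value:

```go
package main

import "fmt"

// chunkRows splits a batch of rows into smaller batches of at most size rows.
func chunkRows(rows [][]float32, size int) [][][]float32 {
	var batches [][][]float32
	for start := 0; start < len(rows); start += size {
		end := start + size
		if end > len(rows) {
			end = len(rows)
		}
		batches = append(batches, rows[start:end])
	}
	return batches
}

func main() {
	// 10 rows split into batches of at most 4 rows each.
	rows := make([][]float32, 10)
	for i := range rows {
		rows[i] = []float32{float32(i), float32(i) + 0.5}
	}
	for _, batch := range chunkRows(rows, 4) {
		fmt.Println("batch of", len(batch), "rows") // each batch would be one POST request
	}
}
```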
3. Hardware Support
- CPU Only: Current builds are optimized for CPU inference (x86_64).
- GPU: CUDA/TensorRT execution providers are not yet enabled in the default build.
4. Advanced ONNX Features
- Custom Ops: Models requiring custom C++ operators not included in the standard ONNX Runtime distribution are not yet supported.
Roadmap
Check our GitHub Issues for progress on:
- [ ] gRPC Support: To eliminate JSON overhead.
- [ ] GPU/CUDA Builds: For deep learning acceleration.
- [ ] Full Type Coverage: Supporting all 16+ ONNX data types.
- [ ] Zero-Copy Inference: Direct memory access for ultra-low latency.