An LLM enthusiast shared their solution for monitoring a home LLM server, focusing on performance visibility and crash diagnostics.

System Architecture

The architecture is based on Docker containers, including:

  • Grafana: for data visualization.
  • Prometheus: for metrics collection.
  • dcgm-exporter: for exposing NVIDIA's DCGM (Data Center GPU Manager) metrics.
  • llama-server: the LLM server.
  • go-tapo-exporter: for power consumption monitoring.
  • A custom Docker image: for exposing model load state and per-process statistics scraped from nvidia-smi.
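A stack like this is typically wired together with Docker Compose. The sketch below is an assumption, not the author's published configuration: the service names, image tags, ports, and the GPU device reservation are illustrative, and the custom exporter image is a placeholder.

```yaml
# Hypothetical docker-compose.yml sketch of the described stack.
# Image tags, ports, and service names are assumptions.
services:
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]

  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]

  dcgm-exporter:
    image: nvcr.io/nvidia/k8s/dcgm-exporter
    # DCGM needs direct GPU access to report metrics.
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  llama-server:
    image: ghcr.io/ggerganov/llama.cpp:server
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

  go-tapo-exporter:
    image: go-tapo-exporter   # placeholder image name

  custom-exporter:
    build: ./custom-exporter  # the post's custom image for model state
                              # and nvidia-smi process statistics
```

Prometheus would then scrape each exporter's metrics endpoint, and Grafana would use Prometheus as its data source.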

Dashboard Functionality

The Grafana dashboard gives a comprehensive overview of the LLM server's performance, tracking the following metrics:

  • Prompt and token processing rates.
  • GPU utilization and memory paging.
  • Power consumption.
  • VRAM and RAM usage per compute process.
  • Network and disk throughput.
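The per-process VRAM figures most plausibly come from nvidia-smi's compute-apps query, which the post's custom image scrapes. The parser below is a minimal sketch, assuming output from `nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader,nounits` (real nvidia-smi flags); the function name and record layout are my own.

```python
import csv
import io


def parse_compute_apps(output: str) -> list[dict]:
    """Parse CSV output of
    `nvidia-smi --query-compute-apps=pid,process_name,used_memory
     --format=csv,noheader,nounits`
    into per-process records. With `nounits`, used_memory is in MiB.
    """
    rows = []
    for fields in csv.reader(io.StringIO(output), skipinitialspace=True):
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        pid, name, mem = fields
        rows.append({"pid": int(pid), "process": name, "vram_mib": int(mem)})
    return rows


# Example with a captured line of nvidia-smi output:
sample = "12345, /usr/local/bin/llama-server, 20480\n"
print(parse_compute_apps(sample))
# → [{'pid': 12345, 'process': '/usr/local/bin/llama-server', 'vram_mib': 20480}]
```

An exporter would run this on a timer and republish the records as Prometheus gauges labeled by PID and process name.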

Beyond monitoring, the dashboard also supports loading and unloading LLM models directly through an interactive graphical interface.
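The post does not describe how the load/unload buttons are wired up. One simple approach, sketched below purely as an assumption, is a small webhook that translates a dashboard action into starting or stopping the llama-server container with the standard docker CLI; the function names and the container name are hypothetical.

```python
import subprocess


def build_docker_command(action: str, container: str = "llama-server") -> list[str]:
    """Translate a hypothetical dashboard action into a docker CLI call."""
    if action not in ("load", "unload"):
        raise ValueError(f"unknown action: {action}")
    verb = "start" if action == "load" else "stop"
    return ["docker", verb, container]


def apply(action: str) -> None:
    # Shelling out keeps the sketch dependency-free; a production setup
    # would more likely talk to the Docker Engine API directly.
    subprocess.run(build_docker_command(action), check=True)


print(build_docker_command("load"))   # → ['docker', 'start', 'llama-server']
print(build_docker_command("unload")) # → ['docker', 'stop', 'llama-server']
```

A Grafana button or data link would then hit the webhook's endpoint, which calls `apply()`.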