Pre-Deployment Checklists
Scenario-specific verification checklists for on-premise LLM deployments. These are verification gates, not recommendations or deployment guides: they identify what must be confirmed before going live in each scenario.
Usage: Review all items in your scenario's checklist. Unresolved items = deployment risk. Document exceptions explicitly.
Context: Factory floor, production line, or plant deployment. Focus on uptime, determinism, edge constraints, and audit requirements.
HARDWARE & INFRASTRUCTURE
- Hardware validated for industrial environment (temperature, dust, vibration)
- Power supply redundancy confirmed (UPS, backup generator if critical)
- Network isolation verified (air-gap or VLAN segmentation documented)
- GPU/CPU cooling adequate for 24/7 operation
- Physical access controls in place (locked cabinets, badge access)
- Disaster recovery hardware identified (spare parts, replacement timeline)
- Model size fits available VRAM with 20% headroom
- Storage capacity sufficient for logs + model versions (6-month retention minimum)
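The VRAM headroom item can be checked with simple arithmetic. The sketch below assumes "20% headroom" means 20% of total VRAM stays free, and that the required figure is a measured peak footprint (weights plus KV cache and activations at the expected context length and batch size). The function name and the example numbers are illustrative, not taken from any specific tool.

```python
def fits_with_headroom(required_gib: float, available_gib: float,
                       headroom: float = 0.20) -> bool:
    """Return True if the requirement fits while `headroom` of total VRAM stays free.

    Assumption: headroom is a fraction of total VRAM, not of the model size.
    """
    usable_gib = available_gib * (1.0 - headroom)
    return required_gib <= usable_gib


# Illustrative numbers: a 24 GiB card leaves about 19.2 GiB usable at 20% headroom.
print(fits_with_headroom(18.0, 24.0))  # an 18 GiB peak footprint fits
print(fits_with_headroom(20.0, 24.0))  # a 20 GiB peak footprint does not
```

If your site defines headroom relative to model size instead, adjust the formula and record the definition alongside the checklist result.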
SOFTWARE & CONFIGURATION
- Model version pinned and hash-verified (no auto-updates)
- Inference reproducibility tested (same input → same output)
- Input sanitization implemented (prompt injection defense)
- Output filtering configured (PII, sensitive data patterns)
- Rate limiting configured per user/session
- Error handling documented (model unavailable, OOM, timeout scenarios)
- Rollback procedure tested (revert to previous model version)
- Dependencies frozen (no floating package versions)
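One way to verify the "pinned and hash-verified" item is to compare the SHA-256 digest of the weights file against a value recorded at approval time. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the expected digest is assumed to come from your own change-control record.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weights do not need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_model(path: Path, expected_sha256: str) -> bool:
    """Return True only if the file's digest matches the recorded (pinned) digest."""
    return sha256_of_file(path) == expected_sha256.strip().lower()
```

Running the check at service start, and refusing to load on a mismatch, turns the pin into an enforced gate rather than a one-time review.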
OPERATIONS & MONITORING
- Request/response logging enabled (audit trail)
- Monitoring alerts configured (GPU temp, memory, disk space, inference latency)
- Log rotation configured (prevent disk fill)
- Backup schedule established (configuration, model weights, logs)
- On-call escalation path documented
- Maintenance window defined (for updates/patches)
- Uptime SLA documented (99.9% ≈ 8.76 hours/year downtime budget)
- Performance baseline established (latency percentiles, throughput)
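The downtime budget for an uptime SLA follows directly from the availability percentage. The sketch below assumes a 365-day year (8,760 hours); the function name is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_budget_hours(availability_pct: float,
                          period_hours: float = HOURS_PER_YEAR) -> float:
    """Hours of allowed downtime in the period for a given availability target."""
    return period_hours * (1.0 - availability_pct / 100.0)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {downtime_budget_hours(target):.2f} h/year")
```

When documenting the SLA, state whether planned maintenance windows count against this budget, since that changes what the figure means in practice.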
COMPLIANCE & DOCUMENTATION
- Change control process documented (who can update, approval required)
- Incident response plan documented (model failure, security breach)
- User training completed (operators, maintenance staff)
- System documentation complete (architecture diagram, runbook, troubleshooting guide)
- Data retention policy defined (input logs, output logs, model versions)
- Access control matrix documented (who has admin, read-only, etc.)
- Vendor contact information documented (hardware, software, support contracts)
Context: FDA 21 CFR Part 11 or equivalent GxP environment. Focus on validation, audit trails, data integrity, and regulatory compliance.
VALIDATION & QUALIFICATION
- Validation Master Plan (VMP) approved
- User Requirements Specification (URS) documented and approved
- Design Qualification (DQ) completed
- Installation Qualification (IQ) completed and documented
- Operational Qualification (OQ) completed and documented
- Performance Qualification (PQ) completed and documented
- Validation report generated and approved by QA
- Re-validation schedule established (model updates, infrastructure changes)
- Traceability matrix: requirements → test cases → results
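A traceability matrix can be checked mechanically for gaps before QA review. The following is a rough sketch, assuming requirements, test cases, and results are available as simple mappings; the identifiers (URS-001, OQ-010, and so on) are hypothetical placeholders, not a prescribed numbering scheme.

```python
from typing import Dict, Iterable, List, Tuple


def traceability_gaps(
    requirements: Iterable[str],
    tests_by_requirement: Dict[str, List[str]],
    results_by_test: Dict[str, str],
) -> Tuple[List[str], List[str]]:
    """Return (requirements with no test case, test cases with no recorded result)."""
    untested = sorted(r for r in requirements if not tests_by_requirement.get(r))
    unexecuted = sorted({
        t
        for tests in tests_by_requirement.values()
        for t in tests
        if t not in results_by_test
    })
    return untested, unexecuted


# Hypothetical example data.
reqs = ["URS-001", "URS-002", "URS-003"]
tests = {"URS-001": ["OQ-010"], "URS-002": ["OQ-011", "PQ-020"]}
results = {"OQ-010": "pass", "OQ-011": "pass"}
print(traceability_gaps(reqs, tests, results))
```

An empty pair of lists only shows that the chain is complete; it does not show that the tests are adequate, which remains a QA judgment.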
DATA INTEGRITY & AUDIT (21 CFR Part 11)
- Electronic signatures implemented (user authentication for critical actions)
- Audit trail enabled (who, what, when, why for all changes)
- Audit trail immutability verified (tamper-evident or write-once logging)
- Time synchronization configured (NTP, time zone documented)
- Data backup validated (restore test completed successfully)
- Access control validated (role-based permissions tested)
- Record retention policy documented (input, output, model versions, audit logs)
- Data archival process validated (long-term storage, retrieval tested)
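One common technique for making an audit trail tamper-evident is hash chaining, where each entry's hash covers the previous entry's hash. The sketch below illustrates that idea only; it is not a Part 11 compliant implementation, and the record fields (who, what, when, why) and function names are assumptions for the example.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash: str, record: Dict[str, Any]) -> str:
    # Canonical JSON so the same record always produces the same hash.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log: List[Dict[str, Any]], record: Dict[str, Any]) -> None:
    """Append a record whose hash is chained to the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev_hash": prev_hash,
                "hash": _entry_hash(prev_hash, record)})


def verify_chain(log: List[Dict[str, Any]]) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this detects edits but does not prevent them: someone able to rewrite the whole log can recompute every hash, so the latest hash should also be stored somewhere the log writer cannot modify (for example write-once storage or a separate system).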
HARDWARE QUALIFICATION
- ECC RAM confirmed (error-correcting memory for data integrity)
- Hardware bill of materials (BOM) documented
- Hardware qualification certificates obtained (if applicable)
- Environmental conditions validated (temperature, humidity logs)
- Network isolation documented and tested (air-gap validation)
- Hardware redundancy qualified (failover tested)
CHANGE CONTROL & DOCUMENTATION
- Change Control procedure documented and approved
- Change impact assessment template defined
- QA review and approval workflow established
- Deviation management process documented
- CAPA (Corrective and Preventive Action) process defined
- Training records maintained (all operators qualified)
- Standard Operating Procedures (SOPs) written and approved
- Periodic review schedule established (annual system review)
SECURITY & COMPLIANCE
- Data privacy assessment completed (GDPR/HIPAA if applicable)
- Vendor audit completed (third-party software/hardware suppliers)
- Cybersecurity assessment completed (penetration test, vulnerability scan)
- Disaster recovery tested (full system restore from backup)
- Business continuity plan documented
- Compliance audit readiness confirmed (mock audit passed)
Context: Corporate IT deployment for general knowledge management, support, or productivity. Focus on security, scalability, integration, and operational excellence.
SECURITY & ACCESS CONTROL
- SSO integration completed (SAML, OIDC, or AD/LDAP)
- Multi-factor authentication (MFA) enforced for admin access
- Role-based access control (RBAC) implemented and tested
- Data residency requirements confirmed (geography, jurisdiction)
- Encryption at rest enabled (model weights, logs, user data)
- Encryption in transit enabled (TLS 1.3 minimum)
- Security scan passed (vulnerability assessment, no critical CVEs)
- Penetration test completed (third-party or internal red team)
- DLP (Data Loss Prevention) integration confirmed if required
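For clients written in Python, the TLS 1.3 minimum can be enforced on the connection context with the standard `ssl` module. This is a minimal client-side sketch, assuming the local OpenSSL build supports TLS 1.3; the function name is illustrative, and server-side enforcement still has to be configured in whatever proxy or inference server you run.

```python
import ssl


def make_tls13_client_context() -> ssl.SSLContext:
    """Client context that verifies certificates and refuses anything below TLS 1.3."""
    ctx = ssl.create_default_context()  # certificate and hostname checks on by default
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3
    return ctx


ctx = make_tls13_client_context()
print(ctx.minimum_version)
```

To confirm the gate from the outside, attempt a connection with a client capped at TLS 1.2 and check that the handshake is rejected.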
INTEGRATION & INTEROPERABILITY
- API endpoints documented (OpenAPI/Swagger spec published)
- API authentication tested (tokens, keys, OAuth flows)
- Integration with existing tools validated (Slack, Teams, ServiceNow, etc.)
- Webhook delivery tested (event notifications working)
- API rate limits documented and enforced
- Error responses standardized (consistent error codes, messages)
- Versioning strategy defined (backward compatibility policy)
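Enforced API rate limits are often implemented with a token bucket. The sketch below shows one possible in-process version, assuming a single process and one bucket per API key or user; the class name, parameters, and the injectable clock are choices made for this example rather than features of any particular gateway.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate_per_sec` tokens per second."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested deterministically
        self.tokens = capacity      # start full
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


bucket = TokenBucket(rate_per_sec=5.0, capacity=10.0)
print(bucket.allow())  # the first request passes because the bucket starts full
```

In a multi-instance deployment the bucket state would need to live in shared storage, otherwise each instance enforces its own separate limit.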
SCALABILITY & PERFORMANCE
- Load testing completed (peak concurrent users validated)
- Autoscaling configured (horizontal or vertical, if applicable)
- Load balancer configured and tested (if multi-instance)
- Caching strategy implemented (reduce redundant inference)
- Queue management configured (prevent overload, backpressure)
- Performance SLA defined (latency targets: p50, p95, p99)
- Capacity planning documented (growth projections, hardware runway)
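Latency targets at p50, p95, and p99 can be checked against load-test samples with a small percentile function. This sketch uses the nearest-rank definition, which is one of several percentile conventions; the sample values and SLA targets are made up for illustration.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def sla_report(samples: Sequence[float], targets_ms: Dict[float, float]) -> Dict[float, bool]:
    """Map each percentile to True if the observed latency is within its target."""
    return {p: percentile(samples, p) <= limit for p, limit in targets_ms.items()}


# Illustrative data: latencies of 1..100 ms and hypothetical targets.
latencies_ms = [float(i) for i in range(1, 101)]
print(sla_report(latencies_ms, {50: 60.0, 95: 90.0, 99: 120.0}))
```

Monitoring tools may interpolate between samples instead of using nearest rank, so record which definition the SLA refers to.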
OPERATIONS & OBSERVABILITY
- Monitoring dashboards deployed (Grafana, Datadog, or equivalent)
- Alerting configured (PagerDuty, Slack, email for critical events)
- Logging centralized (ELK, Splunk, or equivalent)
- Distributed tracing enabled (OpenTelemetry, Jaeger if microservices)
- Runbook documented (common issues, troubleshooting steps)
- On-call rotation established (24/7 coverage if required)
- Incident management process defined (severity levels, escalation)
- Post-mortem process established (root cause analysis, blameless)
COST & GOVERNANCE
- Cost tracking enabled (usage metrics, attribution by team/dept)
- Budget alerts configured (spending thresholds)
- Usage policies documented (acceptable use, prohibited use cases)
- Chargeback model defined (if internal cost allocation required)
- License compliance verified (software dependencies, model licenses)
- Privacy policy reviewed by legal (user data handling, retention)
- Terms of service published (internal users, external if applicable)