Partner Health Agent — API & Service Monitoring

Proactive monitoring: health checks → latency tracking → error detection → auto-failover → SLA management · 99.9% uptime target

Partner registry
Thunes (remittance)
Flutterwave (Africa)
Circle (USDC)
Wise (Europe)
Marqeta (cards)
Health checks
HTTP ping (10s)
API auth test (30s)
Transaction test (60s)
Webhook receiver
Status page scrape
Check frequency: 10s (fastest check)
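The staggered intervals above can be sketched as a small "which checks are due" function. This is a minimal sketch, assuming a one-second tick counter; `CHECK_INTERVALS_S` and `checks_due` are hypothetical names, and the webhook receiver and status page scrape are left out because no interval is given for them.

```python
# Intervals (seconds) taken from the health check list; names are illustrative.
CHECK_INTERVALS_S = {
    "http_ping": 10,
    "api_auth_test": 30,
    "transaction_test": 60,
}


def checks_due(elapsed_s: int) -> list[str]:
    """Return the checks that should run at `elapsed_s` seconds since start."""
    return [
        name
        for name, every in CHECK_INTERVALS_S.items()
        if elapsed_s % every == 0
    ]
```

A scheduler loop would call `checks_due` once per second and dispatch whatever it returns.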
Latency tracking
Response time (p50)
Response time (p95)
Response time (p99)
Timeout rate
Trend analysis
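The p50/p95/p99 and timeout-rate metrics can be computed from a window of latency samples. The sketch below uses the nearest-rank percentile method, which is one common choice and not necessarily the one this agent uses; `percentile`, `latency_summary`, and the `timeout_ms` parameter are assumptions for illustration.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms: list[float], timeout_ms: float) -> dict:
    """Summarise one window of response times (milliseconds)."""
    timeouts = sum(1 for x in latencies_ms if x >= timeout_ms)
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "timeout_rate": timeouts / len(latencies_ms),
    }
```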
Error tracking
Error rate (%)
Error code breakdown
4xx vs 5xx
Retry success rate
Error patterns
Health score
Healthy (90-100)
Degraded (70-89)
Unhealthy (<70)
Composite score
Historical trend
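The score bands above map directly to a classifier. The composite part is only a sketch: the source does not say which inputs feed the composite score or how they are weighted, so the three sub-scores and the 40/30/30 weights below are assumptions.

```python
def health_status(score: float) -> str:
    """Map a 0-100 health score to the bands in the list above."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "healthy"
    if score >= 70:
        return "degraded"
    return "unhealthy"


def composite_score(uptime: float, latency: float, errors: float,
                    weights: tuple[int, int, int] = (40, 30, 30)) -> float:
    """Weighted average of three 0-100 sub-scores (weights are hypothetical)."""
    w_up, w_lat, w_err = weights
    return (w_up * uptime + w_lat * latency + w_err * errors) / sum(weights)
```

For example, `health_status(composite_score(100, 50, 50))` lands in the degraded band under these assumed weights.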
Alert thresholds
Latency > 2s (warn)
Latency > 5s (critical)
Error rate > 1% (warn)
Error rate > 5% (critical)
Downtime > 60s
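The thresholds above can be expressed as small severity functions. A minimal sketch: the function names are hypothetical, and because no severity is stated for the downtime rule it is returned as a plain boolean.

```python
from typing import Optional


def latency_severity(latency_s: float) -> Optional[str]:
    """Latency > 5s is critical, > 2s is a warning, otherwise no alert."""
    if latency_s > 5:
        return "critical"
    if latency_s > 2:
        return "warn"
    return None


def error_rate_severity(error_rate_pct: float) -> Optional[str]:
    """Error rate > 5% is critical, > 1% is a warning, otherwise no alert."""
    if error_rate_pct > 5:
        return "critical"
    if error_rate_pct > 1:
        return "warn"
    return None


def downtime_alert(downtime_s: float) -> bool:
    """Downtime longer than 60 seconds raises an alert."""
    return downtime_s > 60
```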
Anomaly detection
ML baseline deviation
Sudden error spike
Latency degradation
Volume drop
Unusual patterns
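Baseline deviation can be illustrated with a z-score test. This is a simple statistical stand-in, not the ML model the list refers to; the 3-sigma threshold and the name `is_anomalous` are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(baseline: list[float], value: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag `value` if it deviates from the baseline mean by more than
    `z_threshold` sample standard deviations (needs >= 2 baseline points)."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > z_threshold
```

The same test covers both directions, so a sudden error spike and a volume drop are each flagged when they fall far enough from the baseline.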
Alert routing
Slack #ops-alerts
PagerDuty (critical)
Email (summary)
SMS (P1 only)
On-call rotation
Noise reduction
Alert grouping
Flap detection
Maintenance windows
Dependency mapping
Auto-resolve
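Flap detection can be sketched as counting state changes in a recent window of check results. The window contents and the limit of three transitions are hypothetical; the source does not define what counts as flapping.

```python
def is_flapping(states: list[str], max_transitions: int = 3) -> bool:
    """Return True when the window changes state more than `max_transitions` times."""
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return transitions > max_transitions
```

A flapping partner would typically have its individual alerts suppressed and replaced by a single grouped alert.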
Escalation
L1: On-call eng (5m)
L2: Tech lead (15m)
L3: Eng manager (30m)
L4: VP Eng (1h)
Incident commander
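The escalation ladder can be sketched as a lookup on how long an incident has gone unacknowledged. This assumes each time in parentheses is the point, measured from incident start, at which that level is engaged; the source does not say so explicitly, and `levels_engaged` is a hypothetical name.

```python
# (minutes since incident start, level) taken from the escalation list.
ESCALATION = [
    (5, "L1: On-call eng"),
    (15, "L2: Tech lead"),
    (30, "L3: Eng manager"),
    (60, "L4: VP Eng"),
]


def levels_engaged(minutes_unacknowledged: float) -> list[str]:
    """Return every level that should have been notified by now."""
    return [level for at, level in ESCALATION if minutes_unacknowledged >= at]
```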
Failover triggers
3 consecutive failures
Error rate > 10%
Latency > 10s
Manual trigger
Maintenance mode
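The triggers above combine with a logical OR: any one of them is enough to start failover. A minimal sketch, with `PartnerState` and its field names as hypothetical stand-ins for whatever the monitoring pipeline reports; which latency figure (p95, p99, or last request) is compared with the 10s limit is not stated in the source.

```python
from dataclasses import dataclass


@dataclass
class PartnerState:
    consecutive_failures: int
    error_rate_pct: float
    latency_s: float
    manual_trigger: bool = False
    maintenance_mode: bool = False


def should_fail_over(state: PartnerState) -> bool:
    """True when any failover trigger from the list above fires."""
    return (
        state.consecutive_failures >= 3
        or state.error_rate_pct > 10
        or state.latency_s > 10
        or state.manual_trigger
        or state.maintenance_mode
    )
```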
Backup selection
Corridor mapping
Backup partner rank
Backup health check
Capacity verification
Cost comparison
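Backup selection can be sketched as a filter followed by a ranking. The order of operations here is an assumption: candidates for the corridor are first filtered on health and spare capacity, then the best-ranked one wins, with cost as a tie-breaker. The `Backup` fields, the health floor of 90, and the sample values in the usage note are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backup:
    name: str
    rank: int                 # 1 = preferred backup for this corridor
    health_score: float       # 0-100
    spare_capacity_tps: float
    cost_bps: float           # cost in basis points, for tie-breaking


def select_backup(candidates: list[Backup], required_tps: float,
                  min_health: float = 90) -> Optional[Backup]:
    """Pick the best-ranked healthy backup with enough capacity, or None."""
    eligible = [
        c for c in candidates
        if c.health_score >= min_health and c.spare_capacity_tps >= required_tps
    ]
    return min(eligible, key=lambda c: (c.rank, c.cost_bps), default=None)
```

With made-up candidates `partner_a` (rank 1, health 65), `partner_b` (rank 2, health 95) and `partner_c` (rank 3, only 50 TPS spare), a request for 200 TPS selects `partner_b`.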
Traffic shift
Gradual (10% → 100%)
Immediate (critical)
Circuit breaker
Request queuing
User notification
Failover time: <30s
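The two shift modes can be sketched as a schedule of traffic percentages sent to the backup. The source gives only the endpoints (10% and 100%); the doubling steps in between are an assumption, as is the name `ramp_steps`.

```python
def ramp_steps(start_pct: int = 10, factor: int = 2,
               immediate: bool = False) -> list[int]:
    """Percentages of traffic routed to the backup at each step.

    Immediate mode (critical incidents) jumps straight to 100%.
    """
    if immediate:
        return [100]
    if start_pct <= 0 or factor <= 1:
        raise ValueError("start_pct must be > 0 and factor must be > 1")
    steps = []
    pct = start_pct
    while pct < 100:
        steps.append(pct)
        pct *= factor
    steps.append(100)
    return steps
```

Reversing the same schedule gives one possible shape for the gradual traffic return during recovery.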
Recovery
Primary health restored
Gradual traffic return
Queue drain
Reconciliation check
Post-mortem trigger
SLA definitions
Uptime: 99.9%
Latency p95: <2s
Error rate: <0.5%
Support response: 4h
Incident resolution: 24h
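An uptime target translates into a downtime budget. The sketch below does that arithmetic; the 30-day period is an assumption, since the source does not state the SLA measurement window.

```python
def downtime_budget_minutes(uptime_target_pct: float,
                            period_days: int = 30) -> float:
    """Minutes of downtime allowed in the period at the given uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_target_pct) / 100
```

At 99.9% over 30 days this comes to roughly 43.2 minutes of allowed downtime.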
Current status
Thunes: 99.92%
Flutterwave: 99.71%
Circle: 99.98%
Wise: 99.95%
Marqeta: 99.99%
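The status figures can be checked against the 99.9% target in a few lines. This sketch assumes the percentages are uptime measured on the same basis as the SLA; `partners_below_target` is a hypothetical name.

```python
UPTIME_TARGET_PCT = 99.9

# Values copied from the current status list.
CURRENT_UPTIME_PCT = {
    "Thunes": 99.92,
    "Flutterwave": 99.71,
    "Circle": 99.98,
    "Wise": 99.95,
    "Marqeta": 99.99,
}


def partners_below_target(uptime_pct: dict[str, float],
                          target: float = UPTIME_TARGET_PCT) -> list[str]:
    """Names of partners whose uptime is under the target, sorted alphabetically."""
    return sorted(name for name, pct in uptime_pct.items() if pct < target)
```

With the figures listed, only Flutterwave falls below the target.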
SLA credits
Breach detection
Credit calculation
Invoice adjustment
Claim filing
Payment tracking
Vendor management
Quarterly review
Performance scorecard
Contract renewal
Pricing negotiation
New vendor eval
Reporting
Daily health report
Weekly SLA summary
Monthly scorecard
Quarterly exec review
Board summary
Real-time view
Partner health grid
Active incidents
Traffic distribution
Error rate charts
Latency heatmap
Historical analysis
Uptime trends
Incident history
MTTR analysis
Pattern detection
Capacity planning
Incident management
Incident timeline
Impact assessment
Communication log
Post-mortem
Action items
Avg MTTR: 12 min
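MTTR is the mean of (resolved minus opened) across incidents. A minimal sketch, assuming each incident is stored as an (opened, resolved) pair of datetimes; the function name and the sample durations in the usage note are made up for illustration and are not the agent's real incident data.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(
    incidents: list[tuple[datetime, datetime]],
) -> timedelta:
    """Average resolution time over (opened, resolved) pairs."""
    if not incidents:
        raise ValueError("no incidents")
    total = sum(
        (resolved - opened for opened, resolved in incidents),
        timedelta(),
    )
    return total / len(incidents)
```

Two illustrative incidents lasting 10 and 20 minutes give an MTTR of 15 minutes.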
Automation
Auto-remediation
Runbook execution
Status page update
Customer comms
Ticket creation