Kredete - Partner Health Agent

Partner Health Agent — API & Service Monitoring

Proactive monitoring: health checks → latency tracking → error detection → auto-failover → SLA management · 99.9% uptime target

1 Real-time health monitoring

Partner registry

Thunes (remittance)

Flutterwave (Africa)

Circle (USDC)

Wise (Europe)

Marqeta (cards)

Health checks

HTTP ping (10s)

API auth test (30s)

Transaction test (60s)

Webhook receiver

Status page scrape

Base check frequency: 10s
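
The intervals listed above suggest a single 10-second loop in which the slower checks run on every third or sixth tick. A minimal sketch of that scheduling idea follows; the names `CHECK_INTERVALS` and `checks_due` are hypothetical, and the webhook receiver and status page scrape are left out because no interval is given for them.

```python
# Intervals in seconds, taken from the health check list above.
CHECK_INTERVALS = {
    "http_ping": 10,
    "api_auth_test": 30,
    "transaction_test": 60,
}


def checks_due(elapsed_seconds: int) -> list:
    """Return the checks that should run at this tick of a 10s base loop."""
    return [
        name
        for name, interval in CHECK_INTERVALS.items()
        if elapsed_seconds % interval == 0
    ]


if __name__ == "__main__":
    for tick in (10, 20, 30, 60):
        print(tick, checks_due(tick))
```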

Latency tracking

Response time (p50)

Response time (p95)

Response time (p99)

Timeout rate

Trend analysis
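
As an illustration of the latency metrics above, the sketch below computes p50, p95, p99 and a timeout rate from a batch of response times. It uses the nearest-rank percentile definition, which is one common choice and not necessarily the one used in production; `latency_summary` and the `timeout_ms` parameter are assumptions made for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def latency_summary(latencies_ms, timeout_ms):
    """Summarise a batch of latencies; a sample at or above timeout_ms counts as a timeout."""
    timeouts = sum(1 for value in latencies_ms if value >= timeout_ms)
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
        "timeout_rate": timeouts / len(latencies_ms),
    }


if __name__ == "__main__":
    print(latency_summary(list(range(1, 101)), timeout_ms=100))
```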

Error tracking

Error rate (%)

Error code breakdown

4xx vs 5xx

Retry success rate

Error patterns
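
The error metrics above can be derived from a list of HTTP status codes. The following is a rough sketch under that assumption; `error_summary` is a hypothetical helper, and retry success rate is omitted because it needs per-request retry data.

```python
from collections import Counter


def error_summary(status_codes):
    """Compute error rate, the 4xx vs 5xx split, and a per-code breakdown."""
    total = len(status_codes)
    errors = [code for code in status_codes if code >= 400]
    return {
        "error_rate_pct": 100 * len(errors) / total if total else 0.0,
        "client_4xx": sum(1 for code in errors if 400 <= code < 500),
        "server_5xx": sum(1 for code in errors if 500 <= code < 600),
        "by_code": dict(Counter(errors)),
    }


if __name__ == "__main__":
    codes = [200] * 95 + [404, 429, 500, 503, 503]
    print(error_summary(codes))
```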

Health score

Healthy (90-100)

Degraded (70-89)

Unhealthy (<70)

Composite score

Historical trend
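
The score bands above map directly onto a small classifier. How the composite score is weighted is not stated, so the `WEIGHTS` below are purely hypothetical placeholders; only the 90 and 70 cut-offs come from the list.

```python
# Hypothetical weights for illustration only; the real weighting is not given.
WEIGHTS = {"availability": 0.5, "latency": 0.3, "errors": 0.2}


def composite_score(subscores):
    """Weighted average of per-dimension scores, each on a 0-100 scale."""
    return sum(weight * subscores[name] for name, weight in WEIGHTS.items())


def classify(score):
    """Map a 0-100 score to the bands listed above."""
    if score >= 90:
        return "healthy"
    if score >= 70:
        return "degraded"
    return "unhealthy"


if __name__ == "__main__":
    score = composite_score({"availability": 100, "latency": 80, "errors": 90})
    print(round(score, 1), classify(score))
```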

2 Intelligent alerting

Alert thresholds

Latency > 2s (warn)

Latency > 5s (critical)

Error rate > 1% (warn)

Error rate > 5% (critical)

Downtime > 60s
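
One way to express the thresholds above is a rule table checked from most to least severe. In this sketch the metric names and the `evaluate` function are hypothetical, and the severity of the downtime rule is an assumption, since the list does not state one.

```python
# (metric, limit, severity); critical rules come first so they win per metric.
RULES = [
    ("latency_s", 5.0, "critical"),
    ("latency_s", 2.0, "warn"),
    ("error_rate_pct", 5.0, "critical"),
    ("error_rate_pct", 1.0, "warn"),
    ("downtime_s", 60.0, "critical"),  # severity assumed, not given in the list
]


def evaluate(metrics):
    """Return {metric: severity}, keeping the highest severity breached per metric."""
    alerts = {}
    for metric, limit, severity in RULES:
        if metric not in alerts and metrics.get(metric, 0) > limit:
            alerts[metric] = severity
    return alerts


if __name__ == "__main__":
    print(evaluate({"latency_s": 3, "error_rate_pct": 6, "downtime_s": 0}))
```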

Anomaly detection

ML baseline deviation

Sudden error spike

Latency degradation

Volume drop

Unusual patterns
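
The ML baseline itself is not described here. As a much simpler stand-in for the same idea, the sketch below flags a value that deviates from a recent baseline by more than a z-score threshold; the threshold of 3 and the function name are assumptions.

```python
import statistics


def is_anomalous(baseline, value, z_threshold=3.0):
    """Flag value if it lies more than z_threshold standard deviations from the baseline mean."""
    mean = statistics.fmean(baseline)
    stdev = statistics.pstdev(baseline)
    if stdev == 0:
        # Flat baseline: any change at all counts as a deviation.
        return value != mean
    return abs(value - mean) / stdev > z_threshold


if __name__ == "__main__":
    recent_latency_ms = [100, 102, 98, 101, 99]
    print(is_anomalous(recent_latency_ms, 110))
    print(is_anomalous(recent_latency_ms, 101))
```

The same check applies to a volume drop or an error spike by passing request counts or error rates as the baseline.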

Alert routing

Slack #ops-alerts

PagerDuty (critical)

Email (summary)

SMS (P1 only)

On-call rotation

Noise reduction

Alert grouping

Flap detection

Maintenance windows

Dependency mapping

Auto-resolve
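
Flap detection can be approximated by counting state changes inside a recent window and suppressing alerts when the count is too high. This is a minimal sketch; the `max_transitions` default is a hypothetical value, not one taken from the list above.

```python
def is_flapping(states, max_transitions=3):
    """Return True if the state sequence changes more than max_transitions times."""
    transitions = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return transitions > max_transitions


if __name__ == "__main__":
    print(is_flapping(["up", "down", "up", "down", "up"]))
    print(is_flapping(["up", "up", "down", "down", "down"]))
```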

Escalation

L1: On-call eng (5m)

L2: Tech lead (15m)

L3: Eng manager (30m)

L4: VP Eng (1h)

Incident commander
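
The escalation ladder above can be read as a lookup from unacknowledged time to the highest level notified. That reading is an assumption (the times could equally be acknowledgement deadlines), and `current_level` is a hypothetical helper.

```python
# (minutes unacknowledged, level), taken from the escalation list above.
ESCALATION = [
    (5, "L1: On-call eng"),
    (15, "L2: Tech lead"),
    (30, "L3: Eng manager"),
    (60, "L4: VP Eng"),
]


def current_level(minutes_unacked):
    """Return the highest level reached after the given unacknowledged time, or None."""
    level = None
    for threshold, name in ESCALATION:
        if minutes_unacked >= threshold:
            level = name
    return level


if __name__ == "__main__":
    for minutes in (3, 5, 20, 90):
        print(minutes, current_level(minutes))
```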

3 Auto-failover & recovery

Failover triggers

3 consecutive failures

Error rate > 10%

Latency > 10s

Manual trigger

Maintenance mode
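
The triggers above amount to an any-of condition. A small sketch of that evaluation follows, assuming a hypothetical `PartnerState` record that carries the relevant counters and flags.

```python
from dataclasses import dataclass


@dataclass
class PartnerState:
    """Hypothetical snapshot of one partner's current condition."""
    consecutive_failures: int = 0
    error_rate_pct: float = 0.0
    latency_s: float = 0.0
    manual_trigger: bool = False
    maintenance_mode: bool = False


def failover_reasons(state):
    """List every trigger from the failover list that currently fires."""
    checks = [
        (state.consecutive_failures >= 3, "3 consecutive failures"),
        (state.error_rate_pct > 10, "error rate > 10%"),
        (state.latency_s > 10, "latency > 10s"),
        (state.manual_trigger, "manual trigger"),
        (state.maintenance_mode, "maintenance mode"),
    ]
    return [reason for fired, reason in checks if fired]


def should_fail_over(state):
    """Fail over as soon as any single trigger fires."""
    return bool(failover_reasons(state))


if __name__ == "__main__":
    print(failover_reasons(PartnerState(consecutive_failures=3, latency_s=12)))
```

Returning the reasons rather than a bare boolean keeps the cause available for the incident timeline.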

Backup selection

Corridor mapping

Backup partner rank

Backup health check

Capacity verification

Cost comparison
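
The selection steps above could be combined as a filter followed by a ranking: drop backups that are unhealthy or lack capacity, then prefer the best rank and break ties on cost. Everything in this sketch is illustrative, including the corridor key, the placeholder partner names, the numbers, and the `min_health` cut-off.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Backup:
    partner: str
    rank: int                  # lower is preferred
    health_score: float        # 0-100
    spare_capacity_tps: float  # transactions per second available
    cost_bps: float            # cost in basis points


# Hypothetical corridor map with made-up backups and figures.
CORRIDORS = {
    "corridor_x": [
        Backup("backup_a", 1, 65.0, 50, 40),
        Backup("backup_b", 2, 96.0, 30, 55),
        Backup("backup_c", 3, 98.0, 5, 35),
    ],
}


def select_backup(corridor, required_tps, min_health=90.0) -> Optional[Backup]:
    """Pick the best-ranked healthy backup with enough capacity, or None."""
    candidates = [
        b for b in CORRIDORS.get(corridor, [])
        if b.health_score >= min_health and b.spare_capacity_tps >= required_tps
    ]
    return min(candidates, key=lambda b: (b.rank, b.cost_bps), default=None)


if __name__ == "__main__":
    print(select_backup("corridor_x", required_tps=20))
```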

Traffic shift

Gradual (10% → 100%)

Immediate (critical)

Circuit breaker

Request queuing

User notification

Failover time: <30s
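
A gradual shift from 10% to 100% can be modelled as a schedule of percentages plus a deterministic router, with critical incidents jumping straight to 100%. The intermediate steps (25, 50) and both function names are assumptions made for this sketch.

```python
def shift_schedule(critical, steps=(10, 25, 50, 100)):
    """Return the backup traffic percentages to step through."""
    return [100] if critical else list(steps)


def route(request_id, backup_pct):
    """Send backup_pct percent of requests to the backup, keyed on request id."""
    return "backup" if request_id % 100 < backup_pct else "primary"


if __name__ == "__main__":
    for pct in shift_schedule(critical=False):
        sent = sum(route(i, pct) == "backup" for i in range(1000))
        print(f"{pct}% target -> {sent} of 1000 requests on backup")
```

Keying the split on the request id keeps routing stable for retries of the same request.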

Recovery

Primary health restored

Gradual traffic return

Queue drain

Reconciliation check

Post-mortem trigger

4 SLA tracking & management

SLA definitions

Uptime: 99.9%

Latency p95: <2s

Error rate: <0.5%

Support response: 4h

Incident resolution: 24h
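
To make the uptime target concrete, the sketch below converts an uptime percentage into a downtime budget. The 30-day window is an assumption, since the measurement period is not stated.

```python
def allowed_downtime_minutes(uptime_pct, days=30):
    """Minutes of downtime permitted over the period at the given uptime target."""
    total_minutes = days * 24 * 60
    return (100 - uptime_pct) / 100 * total_minutes


if __name__ == "__main__":
    print(round(allowed_downtime_minutes(99.9), 1))
```

At 99.9% over 30 days this comes to about 43.2 minutes.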

Current status

Thunes: 99.92%

Flutterwave: 99.71%

Circle: 99.98%

Wise: 99.95%

Marqeta: 99.99%
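
Comparing these figures with the 99.9% uptime target is a one-line filter. The sketch below uses the values listed above; `sla_breaches` is a hypothetical helper, and the measurement period behind the figures is not given.

```python
UPTIME_TARGET_PCT = 99.9

# Uptime figures as listed under "Current status".
CURRENT_UPTIME = {
    "Thunes": 99.92,
    "Flutterwave": 99.71,
    "Circle": 99.98,
    "Wise": 99.95,
    "Marqeta": 99.99,
}


def sla_breaches(uptime_by_partner, target_pct=UPTIME_TARGET_PCT):
    """Return the partners whose uptime is below the target."""
    return [p for p, pct in uptime_by_partner.items() if pct < target_pct]


if __name__ == "__main__":
    print(sla_breaches(CURRENT_UPTIME))
```

With these numbers only Flutterwave falls below the target.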

SLA credits

Breach detection

Credit calculation

Invoice adjustment

Claim filing

Payment tracking
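
Credit calculation usually maps measured uptime onto a tiered schedule from the contract. No schedule is given here, so the tiers in this sketch are invented for illustration only; real values would come from each partner agreement.

```python
# Hypothetical tiers: (minimum uptime %, credit as % of the monthly fee).
CREDIT_TIERS = [(99.9, 0), (99.0, 10), (95.0, 25)]
FLOOR_CREDIT_PCT = 50  # hypothetical credit when uptime is below every tier


def credit_pct(uptime_pct):
    """Return the credit percentage for the first tier the uptime satisfies."""
    for minimum, credit in CREDIT_TIERS:
        if uptime_pct >= minimum:
            return credit
    return FLOOR_CREDIT_PCT


def credit_amount(uptime_pct, monthly_fee):
    """Credit owed against the monthly fee, in the fee's currency."""
    return monthly_fee * credit_pct(uptime_pct) / 100


if __name__ == "__main__":
    print(credit_amount(99.71, monthly_fee=2000))
```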

Vendor management

Quarterly review

Performance scorecard

Contract renewal

Pricing negotiation

New vendor eval

Reporting

Daily health report

Weekly SLA summary

Monthly scorecard

Quarterly exec review

Board summary

5 Operations dashboard

Real-time view

Partner health grid

Active incidents

Traffic distribution

Error rate charts

Latency heatmap

Historical analysis

Uptime trends

Incident history

MTTR analysis

Pattern detection

Capacity planning

Incident management

Incident timeline

Impact assessment

Communication log

Post-mortem

Action items

Avg MTTR: 12 min
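
An average MTTR figure is the mean of incident durations. A minimal sketch of that calculation follows, using made-up incident timestamps and a hypothetical `mttr_minutes` helper.

```python
from datetime import datetime, timedelta


def mttr_minutes(incidents):
    """Mean time to recovery in minutes over (started, resolved) pairs."""
    if not incidents:
        raise ValueError("no incidents")
    durations = [
        (resolved - started).total_seconds() / 60
        for started, resolved in incidents
    ]
    return sum(durations) / len(durations)


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)  # made-up example data
    incidents = [
        (start, start + timedelta(minutes=10)),
        (start + timedelta(hours=5), start + timedelta(hours=5, minutes=20)),
    ]
    print(mttr_minutes(incidents))
```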

Automation

Auto-remediation

Runbook execution

Status page update

Customer comms

Ticket creation