Scalability, Availability and Failover Guide

MyOperator is architected to scale vertically per account (tested to ~9,900 users and ~4 million call‑minutes/day on a single account with a >55% capacity buffer) and horizontally across tenants (multi‑tenant scale). We operate across 9 regional data centers with replication, health checks, and automatic failover. For localized carrier issues, admins can switch inbound routes; for broader incidents, platform‑level auto‑routing and active/standby or active/active designs keep calls flowing.


Quick facts & SLOs

Platform scale

  • Per‑account vertical scale proven to ~9,900 users and ~4M call‑minutes/day.
  • Horizontal scale across tenants (multi‑tenant, sharded data tiers).
  • 55% reserved buffer capacity to absorb surges.

Availability

  • 99% uptime service commitment under SLA.
  • 9 regional data centers with load balancing and cross‑region redundancy.

Data resilience

  • Replicated + sharded databases; call engines operate as isolated nodes to reduce blast radius.
  • Offline/cached mode for call engines during transient network loss.

Scalability model

Vertical (within an account)

  • Scale users, departments, IVR levels, and call volume inside a single tenant to tested limits (see Quick facts).
  • Elastic resource allocation per tenant during peaks.

Horizontal (across accounts)

  • Stateless call engines behind load balancers distribute traffic across nodes and regions.
  • Data layers employ sharding for call records and media; storage tiers expand without downtime (see the sharding sketch below).
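
To make the sharding idea concrete, here is a minimal sketch of hash‑based shard routing for call records, assuming a fixed shard count and hypothetical shard names; it is illustrative only, not MyOperator's actual data layout.

```python
# Minimal sketch of hash-based shard routing for call records.
# Shard count and naming are illustrative, not the platform's actual layout.
import hashlib

NUM_SHARDS = 8  # hypothetical; real deployments size shards per data-tier capacity

def shard_for(account_id: str) -> str:
    """Map an account to a stable shard so all of its call records co-locate."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return f"cdr-shard-{int(digest, 16) % NUM_SHARDS}"

print(shard_for("acct-1042"))  # e.g. "cdr-shard-5"
```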

Dimensions of scalability

  • Administrative: Many clients served on shared infrastructure with per‑tenant isolation.
  • Functional: New features can be rolled out without affecting baseline capacity.
  • Geographic: Access and routing optimized across regions/circles.
  • Load: Auto‑scale pools up and down with traffic.
  • Generational: Components upgraded incrementally without large‑scale rewrites.

High availability (HA) architecture

  • Multi‑region footprint: 9 regional data centers; each region carries load and can back up others.
  • No single point of failure: Redundant instances at every layer (edge, call engine, DB, storage).
  • Health checks & self‑tests: Continuous probes and rolling windows of production metrics guide routing and scaling (see the sketch after this list).
  • Data protection: Synchronous/asynchronous replication as appropriate; integrity checks during promotion/failback.
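
As a rough illustration of how rolling‑window health checks can drive routing, the sketch below pulls a region out of rotation after consecutive probe failures and restores it after sustained successes; the thresholds, region name, and probe results are assumptions.

```python
# Minimal sketch of a rolling-window health check: a region is pulled from
# rotation after N consecutive probe failures and restored after M successes.
# Thresholds and the region name are illustrative assumptions.
from collections import deque

FAIL_THRESHOLD = 3
RECOVER_THRESHOLD = 5

class RegionHealth:
    def __init__(self, name: str):
        self.name = name
        self.window = deque(maxlen=max(FAIL_THRESHOLD, RECOVER_THRESHOLD))
        self.in_rotation = True

    def record_probe(self, ok: bool) -> None:
        self.window.append(ok)
        recent = list(self.window)
        if self.in_rotation and recent[-FAIL_THRESHOLD:] == [False] * FAIL_THRESHOLD:
            self.in_rotation = False   # stop routing new calls to this region
        elif not self.in_rotation and recent[-RECOVER_THRESHOLD:] == [True] * RECOVER_THRESHOLD:
            self.in_rotation = True    # region has recovered

region = RegionHealth("region-1")
for result in [True, False, False, False]:   # three consecutive failures
    region.record_probe(result)
print(region.in_rotation)                    # False -> traffic shifts to healthy regions
```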

Failover strategies

Cluster topologies

  • Active–Standby: Rapid promotion when the primary node/region is impaired (a promotion sketch follows this list).
  • Active–Active: Parallel paths share load; if one fails, others continue serving with minimal disruption.
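
A minimal sketch of the active–standby pattern, assuming a simple heartbeat deadline; a production controller would also check replication lag and apply fencing before promoting.

```python
# Minimal sketch of active-standby promotion: when the primary misses its
# heartbeat deadline, the standby is promoted. Names and timings are
# illustrative; real promotion also verifies replication state and fencing.
import time

HEARTBEAT_TIMEOUT_S = 10

class Node:
    def __init__(self, name: str, role: str):
        self.name, self.role = name, role
        self.last_heartbeat = time.monotonic()

def maybe_failover(primary: Node, standby: Node) -> Node:
    """Return the node that should currently serve traffic."""
    if time.monotonic() - primary.last_heartbeat > HEARTBEAT_TIMEOUT_S:
        standby.role = "primary"   # promote standby
        primary.role = "fenced"    # keep old primary out until it re-syncs
        return standby
    return primary

primary, standby = Node("node-1", "primary"), Node("node-2", "standby")
primary.last_heartbeat -= 30                   # simulate a missed heartbeat window
print(maybe_failover(primary, standby).name)   # "node-2" is promoted
```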

Call‑path resilience

  • Auto‑routing around carrier congestion at the telecom layer (see the route‑selection sketch below).
  • Manual inbound route switch available to admins for localized issues (see Runbooks).
  • Cache/queue‑backed operation to ride out transient dependencies.
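
The following sketch shows one way auto‑routing can deprioritize a congested carrier route by preferring the best recent ASR; the route names and ASR floor are illustrative assumptions, not the platform's actual policy.

```python
# Minimal sketch of auto-routing around carrier congestion: prefer the route
# with the best recent answer-seizure ratio (ASR), skipping routes below a
# floor. Route names and thresholds are illustrative assumptions.
ASR_FLOOR = 0.40   # routes answering fewer than 40% of seized calls are skipped

def pick_route(routes: dict[str, float]) -> str:
    """routes maps route name -> rolling ASR (0..1)."""
    healthy = {name: asr for name, asr in routes.items() if asr >= ASR_FLOOR}
    candidates = healthy or routes          # degrade gracefully if all routes are poor
    return max(candidates, key=candidates.get)

print(pick_route({"carrier-a": 0.35, "carrier-b": 0.62}))  # "carrier-b"
```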

Data‑tier failover

  • Shard promotion and replica election procedures.
  • Write fencing to prevent split‑brain; replay/repair jobs after recovery (see the fencing sketch below).
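
To illustrate write fencing, the sketch below uses monotonically increasing epoch tokens so a demoted primary cannot apply stale writes after a replica is elected; the store and field names are hypothetical.

```python
# Minimal sketch of write fencing with monotonically increasing epoch tokens:
# writes stamped with a stale epoch are rejected, so a demoted primary cannot
# corrupt data after a new replica is elected. Names are illustrative.
class FencedStore:
    def __init__(self):
        self.current_epoch = 1   # bumped on every promotion/election
        self.data = {}

    def promote(self) -> int:
        """Elect a new writer; hand it a fresh fencing token."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.current_epoch:
            return False         # stale writer (old primary) is fenced off
        self.data[key] = value
        return True

store = FencedStore()
old_token = store.current_epoch
new_token = store.promote()                      # replica elected as new primary
print(store.write(old_token, "cdr:1", "stale"))  # False -> split-brain prevented
print(store.write(new_token, "cdr:1", "fresh"))  # True
```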

Capacity planning playbook

Inputs to watch (weekly):

  • Peak concurrent calls (CPC), post‑dial delay (PDD), ASR (answer‑seizure ratio), AHT (average handle time), error codes by circle/provider.
  • Storage growth (CDRs, recordings), export queue times.

Thresholds & actions (a sketch of these checks follows the list):

  • CPC ≥ 70% of tenant limit for 3 consecutive days → request capacity uplift via Support.
  • PDD > 4s or ASR drop ≥ 10% (after excluding agent unavailability) → initiate route health check and consider manual route switch.
  • Recording storage ≥ 80% of plan allowance → archive/export or expand quota.
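
A minimal sketch of those threshold rules as a weekly check, with metric names of our own choosing; plug in values exported from Call Logs/Analytics.

```python
# Minimal sketch of the weekly threshold check described above. The metric
# keys are illustrative assumptions; populate them from Call Logs/Analytics.
def capacity_actions(metrics: dict) -> list[str]:
    actions = []
    if metrics["cpc_pct_of_limit"] >= 0.70 and metrics["days_over"] >= 3:
        actions.append("Request capacity uplift via Support")
    # asr_drop_pct should already exclude drops caused by agent unavailability
    if metrics["pdd_seconds"] > 4 or metrics["asr_drop_pct"] >= 10:
        actions.append("Run route health check; consider manual route switch")
    if metrics["recording_storage_pct"] >= 0.80:
        actions.append("Archive/export recordings or expand quota")
    return actions

print(capacity_actions({
    "cpc_pct_of_limit": 0.74, "days_over": 3,
    "pdd_seconds": 5.1, "asr_drop_pct": 4,
    "recording_storage_pct": 0.55,
}))
```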

Quarterly review:

  • Revalidate team sizes, routing method (serial/simultaneous/balanced), and ring‑time (keep 20–30s).
  • Confirm disaster‑recovery contacts and escalation lists.

Operations & monitoring (SRE)

Core SLIs

  • Availability: % of successful call setups; error budgets tied to the SLA (see the worked example after this list).
  • Latency: PDD, media path jitter/packet loss.
  • Quality: MOS proxies, drop rate after ring.
  • Throughput: Calls/minute, concurrent channels by region.
  • Reliability: MTTR, MTBF, incident recurrence rate.
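
As a worked example, the sketch below computes the availability SLI and the remaining error budget against the 99% SLA; the call counts are illustrative.

```python
# Minimal sketch of the availability SLI and error budget against the 99%
# SLA. The counts are illustrative; source them from call-setup logs.
SLO = 0.99

def availability(successful_setups: int, attempted_setups: int) -> float:
    return successful_setups / attempted_setups if attempted_setups else 1.0

def error_budget_remaining(successful: int, attempted: int) -> float:
    """Fraction of the allowed failure budget still unspent this period."""
    allowed_failures = (1 - SLO) * attempted
    actual_failures = attempted - successful
    return 1 - (actual_failures / allowed_failures) if allowed_failures else 1.0

print(f"availability={availability(99_450, 100_000):.4f}")           # 0.9945
print(f"budget left={error_budget_remaining(99_450, 100_000):.0%}")  # 45%
```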

Alerting

  • Symptomatic alerts (ASR dips, PDD spikes) + cause‑oriented alerts (carrier error codes, region health).
  • Auto‑mitigation hooks trigger route reweights/failovers (see the sketch below).
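
A minimal sketch of pairing a symptomatic signal (ASR dip) with a cause‑oriented one (carrier error rate) before invoking an auto‑mitigation hook; the thresholds and the hook are assumptions, not the platform's actual alert rules.

```python
# Minimal sketch of correlating a symptomatic alert (ASR dip) with a
# cause-oriented signal (carrier error rate) before auto-mitigation fires.
# Thresholds and the reweight hook are illustrative assumptions.
def evaluate(asr: float, baseline_asr: float, carrier_error_rate: float,
             reweight_route) -> str:
    symptom = asr < baseline_asr * 0.90     # ASR dipped >10% vs baseline
    cause = carrier_error_rate > 0.05       # carrier returning errors
    if symptom and cause:
        reweight_route()                    # auto-mitigation: shift traffic
        return "mitigated"
    if symptom:
        return "page-oncall"                # symptom without a clear cause
    return "ok"

print(evaluate(0.52, 0.65, 0.12, lambda: print("reweighting route")))
```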

Incident management

  • Defined SEV levels, on‑call rotation, status updates, and post‑incident reviews with corrective actions.

Disaster recovery (DR): RTO/RPO & tests

  • Modes: Hot (active–active), Warm (standby with rapid spin‑up), Cold (restore from backups).
  • Targets: Application‑level RTO of minutes to low hours; RPO near real‑time for replicated tiers, longer for cold archives (a measurement sketch follows this list).
  • Exercises: Semi‑annual region failover tests; quarterly backup restores; game‑days for carrier outages.
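
For drill reporting, here is a minimal sketch of measuring achieved RPO (replication lag at the moment of failover) and RTO (time to first successful call setup after failover); the timestamps are illustrative.

```python
# Minimal sketch of recording achieved RPO/RTO during a failover drill.
# RPO is approximated by replication lag at failover time; RTO by the time
# to the first successful call setup in the surviving region.
from datetime import datetime

def achieved_rpo_seconds(last_replicated_txn: datetime, failover_at: datetime) -> float:
    return (failover_at - last_replicated_txn).total_seconds()

def achieved_rto_seconds(failover_at: datetime, first_good_call: datetime) -> float:
    return (first_good_call - failover_at).total_seconds()

t0 = datetime(2024, 6, 1, 10, 0, 0)   # failover declared (illustrative)
print(achieved_rpo_seconds(datetime(2024, 6, 1, 9, 59, 58), t0))   # 2.0 s of data at risk
print(achieved_rto_seconds(t0, datetime(2024, 6, 1, 10, 4, 30)))   # 270.0 s to recover
```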

Customer controls & best practices

  • Manual inbound route switch during regional congestion (Manage → Design Callflow → Inbound Route Settings → Advanced → Switch Route).
  • Choose routing method per Department to balance speed vs fairness (Serial / Simultaneous / Balanced).
  • Keep Call Availability = ON for agents; maintain 20–30s ring‑time to avoid artificial misses.
  • Maintain a sample log (2–5 recent call IDs) for Support when reporting quality issues.
  • Use Call Logs/Analytics to monitor ASR, statuses, and patterns by provider/circle.

Runbooks (copy‑paste)

A) Manual inbound route switch (localized carrier issue)

  1. Sign in → Manage → Design Callflow.
  2. Open Inbound Route Settings → Advanced Settings.
  3. Click Switch Route (or Change Telecom Route).
  4. Select an alternate route → Save/Apply.
  5. Place 3–5 test calls; monitor ASR/PDD for 30–60 minutes.
  6. Revert when original route stabilizes.
Screenshot placeholder (alt text: “Inbound Route Settings showing Advanced → Switch Route.”)

B) Capacity review (monthly)

  1. Open Reports/Analytics → Call Summary / Call Logs.
  2. Extract CPC, ASR, PDD by hour and by circle/provider (see the aggregation sketch after these steps).
  3. Compare to thresholds (see Capacity playbook).
  4. If above thresholds, open a Support ticket for uplift or provider review.
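
A minimal sketch of step 2, aggregating an exported Call Logs CSV into hourly ASR and average PDD for comparison against the playbook thresholds; the column names (start_time, status, pdd_ms) are assumptions and should be adjusted to the actual export headers.

```python
# Minimal sketch: aggregate an exported Call Logs CSV into hourly ASR and
# average PDD. Column names are assumptions; adjust to your export's headers.
import csv
from collections import defaultdict

def hourly_metrics(path: str) -> dict:
    stats = defaultdict(lambda: {"attempted": 0, "answered": 0, "pdd_ms": []})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            hour = row["start_time"][:13]            # e.g. "2024-06-01T10"
            bucket = stats[hour]
            bucket["attempted"] += 1
            if row["status"].lower() == "answered":
                bucket["answered"] += 1
            if row.get("pdd_ms"):
                bucket["pdd_ms"].append(float(row["pdd_ms"]))
    return {
        hour: {
            "asr": b["answered"] / b["attempted"],
            "avg_pdd_s": (sum(b["pdd_ms"]) / len(b["pdd_ms"]) / 1000) if b["pdd_ms"] else None,
        }
        for hour, b in stats.items()
    }

# print(hourly_metrics("call_logs_export.csv"))
```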

C) Incident report to Support

Subject: Carrier/Route Degradation – Sample calls for analysis
Account ID: <id>
Issue window (IST): <start–end>
Symptoms: <ASR drop / high PDD / drops>
Samples:
1) <call_id> – <timestamp> – <last 4 digits> – <status>
2) <call_id> – <timestamp> – <last 4 digits> – <status>
Actions tried: route switch, agent checks

FAQ

What uptime does MyOperator commit to?
We commit to 99% uptime under our SLA.

How many data centers do you use?
We operate across 9 regions with redundancy and failover.

Can one account handle very large teams?
Yes—tested to ~9,900 users and ~4M call‑minutes/day per account, with elastic headroom.

What happens if a region fails?
Traffic shifts to healthy regions (active–active/standby patterns); data tiers fail over to replicas.

What can I do during a local outage?
Use the manual route switch and share 2–5 sample calls with Support for deep carrier traces.