Scalability, Availability and Failover Guide


MyOperator scales vertically within a single account (tested to ~9,900 users and ~4 million call‑minutes/day with a >55% capacity buffer) and horizontally across tenants. The platform runs in 9 regional data centers with replication, health checks, and automatic failover. For localized carrier issues, admins can switch inbound routes manually; for broader incidents, platform‑level auto‑routing and active–standby or active–active designs keep calls flowing.



Quick facts & SLOs

Platform scale

  • Per‑account vertical scale proven to ~9,900 users and ~4M call‑minutes/day.
  • Horizontal scale across tenants (multi‑tenant, sharded data tiers).
  • 55% reserved buffer capacity to absorb surges.

Availability

  • 99% uptime service commitment under SLA.
  • 9 regional data centers with load balancing and cross‑region redundancy.
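
To make the 99% figure concrete, the short sketch below converts an uptime percentage into permitted downtime. The 30-day measurement window and the function name are assumptions for illustration; the SLA document defines the actual measurement period.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted in a window of `days` at the given uptime %."""
    total_minutes = days * 24 * 60
    return total_minutes * (100.0 - sla_percent) / 100.0


# Assumed 30-day window; not a figure taken from the SLA itself.
budget = allowed_downtime_minutes(99.0)
print(f"{budget:.0f} minutes (~{budget / 60:.1f} hours)")
```

Under these assumptions, a 99% commitment leaves roughly 432 minutes (about 7.2 hours) of downtime per 30 days.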

Data resilience

  • Replicated + sharded databases; call engines operate as isolated nodes to reduce blast radius.
  • Offline/cached mode for call engines during transient network loss.

Scalability model

Vertical (within an account)

  • Scale users, departments, IVR levels, and call volume inside a single tenant to tested limits (see Quick facts).
  • Elastic resource allocation per tenant during peaks.

Horizontal (across accounts)

  • Stateless call engines behind load balancers distribute traffic across nodes and regions.
  • Data layers employ sharding for call records and media; storage tiers expand without downtime.

Dimensions of scalability

  • Administrative: Many clients on shared, isolated infrastructure.
  • Functional: New features can be rolled out without affecting baseline capacity.
  • Geographic: Access and routing optimized across regions/circles.
  • Load: Auto‑scale pools up and down with traffic.
  • Generational: Components upgraded incrementally without large‑scale rewrites.

High availability (HA) architecture

  • Multi‑region footprint: 9 regional data centers; each region carries load and can back up others.
  • No single point of failure: Redundant instances at every layer (edge, call engine, DB, storage).
  • Health checks & self‑tests: Continuous probes and rolling windows of production metrics guide routing and scaling.
  • Data protection: Synchronous/asynchronous replication as appropriate; integrity checks during promotion/failback.
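
The idea of rolling windows of probe results guiding routing can be sketched as follows. This is a minimal illustration, not MyOperator's implementation: the class name, window size, and 90% success threshold are all hypothetical.

```python
from collections import deque
from typing import Dict, List


class RegionHealth:
    """Tracks the most recent probe results for one region (hypothetical model)."""

    def __init__(self, window: int = 20, min_success_ratio: float = 0.9) -> None:
        self.results = deque(maxlen=window)  # oldest results fall off automatically
        self.min_success_ratio = min_success_ratio

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def healthy(self) -> bool:
        if not self.results:
            return True  # no data yet: treat as healthy (an assumption)
        return sum(self.results) / len(self.results) >= self.min_success_ratio


def routable_regions(health: Dict[str, RegionHealth]) -> List[str]:
    """Return the regions whose rolling success ratio meets the threshold."""
    return [name for name, h in health.items() if h.healthy()]
```

A router built this way stops sending new calls to a region once its recent probe success ratio falls below the threshold, and resumes once enough healthy probes push the failures out of the window.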

Failover strategies

Cluster topologies

  • Active–Standby: Rapid promotion when primary node/region is impaired.
  • Active–Active: Parallel paths share load; if one fails, others continue serving with minimal disruption.

Call‑path resilience

  • Auto‑routing around carrier congestion at the telecom layer.
  • Manual inbound route switch available to admins for localized issues (see Runbooks).
  • Cache/queue backed operation to ride out transient dependencies.

Data‑tier failover

  • Shard promotion and replica election procedures.
  • Write fencing to prevent split‑brain; replay/repair jobs after recovery.
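
Write fencing is commonly implemented with monotonically increasing epoch (fencing) tokens: each promoted primary receives a higher epoch, and the store rejects writes that carry an older one. The sketch below illustrates that general technique only; the class and its fields are hypothetical, and the source does not describe the mechanism MyOperator uses.

```python
from typing import Dict


class FencedStore:
    """Toy store that rejects writes from a demoted (stale) primary."""

    def __init__(self) -> None:
        self.highest_epoch = 0
        self.data: Dict[str, str] = {}

    def write(self, epoch: int, key: str, value: str) -> bool:
        # A write carrying an epoch older than the newest seen comes from a
        # primary that has since been replaced, so it is refused.
        if epoch < self.highest_epoch:
            return False
        self.highest_epoch = epoch
        self.data[key] = value
        return True


store = FencedStore()
store.write(1, "route", "carrier-a")             # original primary, epoch 1
store.write(2, "route", "carrier-b")             # promoted replica, epoch 2
accepted = store.write(1, "route", "carrier-a")  # old primary wakes up late
print(accepted, store.data["route"])
```

The late write from the old primary is refused, so the value written after promotion survives and the two nodes cannot diverge.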

Capacity planning playbook

Inputs to watch (weekly):

  • Peak concurrent calls (CPC), post‑dial delay (PDD), answer‑seizure ratio (ASR), average handle time (AHT), error codes by circle/provider.
  • Storage growth (CDRs, recordings), export queue times.

Thresholds & actions:

  • CPC ≥ 70% of tenant limit for 3 consecutive days → request capacity uplift via Support.
  • PDD > 4s or ASR drop ≥ 10% (after excluding agent unavailability) → initiate route health check and consider manual route switch.
  • Recording storage ≥ 80% of plan allowance → archive/export or expand quota.
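
The thresholds above can be checked mechanically. The sketch below is one way to do it, under stated assumptions: the field names are hypothetical, the ASR drop is measured in percentage points against a baseline you choose, and agent unavailability is assumed to be excluded before the figures are passed in.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CapacitySnapshot:
    daily_peak_cpc: List[int]      # peak concurrent calls per day, oldest first
    tenant_cpc_limit: int
    pdd_seconds: float
    asr_baseline_pct: float        # e.g. 62.0 (baseline choice is an assumption)
    asr_current_pct: float
    storage_used_gb: float
    storage_allowance_gb: float


def capacity_actions(s: CapacitySnapshot) -> List[str]:
    actions: List[str] = []

    # CPC >= 70% of the tenant limit for 3 consecutive days.
    streak = 0
    for cpc in s.daily_peak_cpc:
        streak = streak + 1 if cpc * 100 >= 70 * s.tenant_cpc_limit else 0
        if streak >= 3:
            actions.append("request capacity uplift via Support")
            break

    # PDD > 4 s, or ASR down by 10 or more percentage points.
    if s.pdd_seconds > 4.0 or (s.asr_baseline_pct - s.asr_current_pct) >= 10.0:
        actions.append("run route health check; consider manual route switch")

    # Recording storage >= 80% of plan allowance.
    if s.storage_used_gb * 100 >= 80 * s.storage_allowance_gb:
        actions.append("archive/export recordings or expand quota")

    return actions
```

Calling `capacity_actions` with a week of figures returns the list of follow-ups to raise, or an empty list when everything is within limits.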

Quarterly review:

  • Revalidate team sizes, routing method (serial/simultaneous/balanced), and ring‑time (keep 20–30s).
  • Confirm disaster‑recovery contacts and escalation lists.

Operations & monitoring (SRE)

Core SLIs

  • Availability: % successful call setups; error budgets tied to SLA.
  • Latency: PDD, media path jitter/packet loss.
  • Quality: MOS proxies, drop rate after ring.
  • Throughput: Calls/minute, concurrent channels by region.
  • Reliability: MTTR, MTBF, incident recurrence rate.
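
For readers computing these indicators from their own exports, the sketch below uses the standard definitions: ASR as answered calls over call attempts, MTTR as mean repair time per incident, and MTBF as operating time divided by the number of failures. Function names and input shapes are illustrative assumptions, not a MyOperator API.

```python
from typing import List


def asr(answered: int, attempts: int) -> float:
    """Answer-seizure ratio: answered calls divided by call attempts."""
    return answered / attempts if attempts else 0.0


def mttr(repair_minutes: List[float]) -> float:
    """Mean time to repair: average duration of the listed incidents."""
    return sum(repair_minutes) / len(repair_minutes) if repair_minutes else 0.0


def mtbf(operating_minutes: float, failures: int) -> float:
    """Mean time between failures: operating time divided by failure count."""
    return operating_minutes / failures if failures else float("inf")


# Example: three incidents in a 30-day (43,200-minute) window.
incidents = [30.0, 60.0, 90.0]
uptime = 43200 - sum(incidents)
print(asr(450, 600), mttr(incidents), mtbf(uptime, len(incidents)))
```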

Alerting

  • Symptomatic alerts (ASR dips, PDD spikes) + cause‑oriented alerts (carrier error codes, region health).
  • Auto‑mitigation hooks trigger route reweights/failovers.

Incident management

  • Defined SEV levels, on‑call rotation, status updates, and post‑incident reviews with corrective actions.

Disaster recovery (DR): RTO/RPO & tests

  • Modes: Hot (active–active), Warm (standby with rapid spin‑up), Cold (restore from backups).
  • Targets: Application‑level RTO minutes to low hours; RPO near‑real‑time for replicated tiers, longer for cold archives.
  • Exercises: Semi‑annual region failover tests; quarterly backup restores; game‑days for carrier outages.

Customer controls & best practices

  • Manual inbound route switch during regional congestion (Manage → Design Callflow → Inbound Route Settings → Advanced → Switch Route).
  • Choose routing method per Department to balance speed vs fairness (Serial / Simultaneous / Balanced).
  • Keep Call Availability = ON for agents; maintain 20–30s ring‑time to avoid artificial misses.
  • Maintain a sample log (2–5 recent call IDs) for Support when reporting quality issues.
  • Use Call Logs/Analytics to monitor ASR, statuses, and patterns by provider/circle.

Runbooks (copy‑paste)

A) Manual inbound route switch (localized carrier issue)

  1. Sign in → Manage → Design Callflow.
  2. Open Inbound Route Settings → Advanced Settings.
  3. Click Switch Route (or Change Telecom Route).
  4. Select an alternate route → Save/Apply.
  5. Place 3–5 test calls; monitor ASR/PDD for 30–60 minutes.
  6. Revert when original route stabilizes.

B) Capacity review (monthly)

  1. Open Reports/Analytics → Call Summary / Call Logs.
  2. Extract CPC, ASR, PDD by hour and by circle/provider.
  3. Compare to thresholds (see Capacity playbook).
  4. If above thresholds, open a Support ticket for uplift or provider review.

C) Incident report to Support

Subject: Carrier/Route Degradation – Sample calls for analysis
Account ID: <id>
Issue window (IST): <start–end>
Symptoms: <ASR drop / high PDD / drops>
Samples:
1) <call_id> – <timestamp> – <last 4 digits> – <status>
2) <call_id> – <timestamp> – <last 4 digits> – <status>
Actions tried: route switch, agent checks
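
Teams that file these reports often may prefer to generate the text from call data. The helper below is a hypothetical sketch that fills in the same fields as the report format above; the function name and argument shapes are assumptions, and it enforces the 2–5 sample guideline from this guide.

```python
from typing import List, Tuple

Sample = Tuple[str, str, str, str]  # (call_id, timestamp, last 4 digits, status)


def build_incident_report(account_id: str, window_ist: str, symptoms: str,
                          samples: List[Sample], actions_tried: str) -> str:
    """Render the Support incident report as plain text, one field per line."""
    if not 2 <= len(samples) <= 5:
        raise ValueError("provide 2-5 recent sample calls")
    lines = [
        "Subject: Carrier/Route Degradation – Sample calls for analysis",
        f"Account ID: {account_id}",
        f"Issue window (IST): {window_ist}",
        f"Symptoms: {symptoms}",
        "Samples:",
    ]
    for i, (call_id, ts, last4, status) in enumerate(samples, start=1):
        lines.append(f"{i}) {call_id} – {ts} – {last4} – {status}")
    lines.append(f"Actions tried: {actions_tried}")
    return "\n".join(lines)
```

The returned string can be pasted directly into a Support ticket.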

FAQ

What uptime does MyOperator commit to?
We commit to 99% uptime under our SLA.

How many data centers do you use?
We operate across 9 regions with redundancy and failover.

Can one account handle very large teams?
Yes. A single account has been tested to ~9,900 users and ~4M call‑minutes/day, with elastic headroom.

What happens if a region fails?
Traffic shifts to healthy regions (active–active/standby patterns); data tiers fail over to replicas.

What can I do during a local outage?
Use the manual route switch and share 2–5 sample calls with Support for deep carrier traces.

