Scalability, Availability and Failover Guide

MyOperator is architected to scale vertically per account (tested to ~9,900 users and ~4 million call‑minutes/day on a single account with a >55% capacity buffer) and horizontally across tenants (multi‑tenant scale). We operate across 9 regional data centers with replication, health checks, and automatic failover. For localized carrier issues, admins can switch inbound routes; for broader incidents, platform‑level auto‑routing and active/standby or active/active designs keep calls flowing.


Quick facts & SLOs

Platform scale

  • Per‑account vertical scale proven to ~9,900 users and ~4M call‑minutes/day.
  • Horizontal scale across tenants (multi‑tenant, sharded data tiers).
  • 55% reserved buffer capacity to absorb surges.

Availability

  • 99% uptime service commitment under SLA.
  • 9 regional data centers with load balancing and cross‑region redundancy.

Data resilience

  • Replicated + sharded databases; call engines operate as isolated nodes to reduce blast radius.
  • Offline/cached mode for call engines during transient network loss.

Scalability model

Vertical (within an account)

  • Scale users, departments, IVR levels, and call volume inside a single tenant to tested limits (see Quick facts).
  • Elastic resource allocation per tenant during peaks.

Horizontal (across accounts)

  • Stateless call engines behind load balancers distribute traffic across nodes and regions.
  • Data layers employ sharding for call records and media; storage tiers expand without downtime (see the sharding sketch below).
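
To make the sharding idea concrete, here is a minimal sketch of hash‑based shard routing for call records, assuming a fixed shard count and hypothetical shard names; it is illustrative only, not MyOperator's actual data layout.

```python
# Minimal sketch of hash-based shard routing for call records.
# Shard count and naming are illustrative, not the platform's actual layout.
import hashlib

NUM_SHARDS = 8  # hypothetical; real deployments size shards per data-tier capacity

def shard_for(account_id: str) -> str:
    """Map an account to a stable shard so all of its call records co-locate."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return f"cdr-shard-{int(digest, 16) % NUM_SHARDS}"

print(shard_for("acct-1042"))  # e.g. "cdr-shard-5"
```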

Dimensions of scalability

  • Administrative: Many clients served on shared infrastructure with per‑tenant isolation.
  • Functional: New features can be rolled out without affecting baseline capacity.
  • Geographic: Access and routing optimized across regions/circles.
  • Load: Auto‑scale pools up and down with traffic.
  • Generational: Components upgraded incrementally without large‑scale rewrites.

High availability (HA) architecture

  • Multi‑region footprint: 9 regional data centers; each region carries load and can back up others.
  • No single point of failure: Redundant instances at every layer (edge, call engine, DB, storage).
  • Health checks & self‑tests: Continuous probes and rolling windows of production metrics guide routing and scaling (see the sketch after this list).
  • Data protection: Synchronous/asynchronous replication as appropriate; integrity checks during promotion/failback.
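
As a rough illustration of how rolling‑window health checks can drive routing, the sketch below pulls a region out of rotation after consecutive probe failures and restores it after sustained successes; the thresholds, region name, and probe results are assumptions.

```python
# Minimal sketch of a rolling-window health check: a region is pulled from
# rotation after N consecutive probe failures and restored after M successes.
# Thresholds and the region name are illustrative assumptions.
from collections import deque

FAIL_THRESHOLD = 3
RECOVER_THRESHOLD = 5

class RegionHealth:
    def __init__(self, name: str):
        self.name = name
        self.window = deque(maxlen=max(FAIL_THRESHOLD, RECOVER_THRESHOLD))
        self.in_rotation = True

    def record_probe(self, ok: bool) -> None:
        self.window.append(ok)
        recent = list(self.window)
        if self.in_rotation and recent[-FAIL_THRESHOLD:] == [False] * FAIL_THRESHOLD:
            self.in_rotation = False   # stop routing new calls to this region
        elif not self.in_rotation and recent[-RECOVER_THRESHOLD:] == [True] * RECOVER_THRESHOLD:
            self.in_rotation = True    # region has recovered

region = RegionHealth("region-1")
for result in [True, False, False, False]:   # three consecutive failures
    region.record_probe(result)
print(region.in_rotation)                    # False -> traffic shifts to healthy regions
```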

Failover strategies

Cluster topologies

  • Active–Standby: Rapid promotion when the primary node/region is impaired (a promotion sketch follows this list).
  • Active–Active: Parallel paths share load; if one fails, others continue serving with minimal disruption.
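
A minimal sketch of the active–standby pattern, assuming a simple heartbeat deadline; a production controller would also check replication lag and apply fencing before promoting.

```python
# Minimal sketch of active-standby promotion: when the primary misses its
# heartbeat deadline, the standby is promoted. Names and timings are
# illustrative; real promotion also verifies replication state and fencing.
import time

HEARTBEAT_TIMEOUT_S = 10

class Node:
    def __init__(self, name: str, role: str):
        self.name, self.role = name, role
        self.last_heartbeat = time.monotonic()

def maybe_failover(primary: Node, standby: Node) -> Node:
    """Return the node that should currently serve traffic."""
    if time.monotonic() - primary.last_heartbeat > HEARTBEAT_TIMEOUT_S:
        standby.role = "primary"   # promote standby
        primary.role = "fenced"    # keep old primary out until it re-syncs
        return standby
    return primary

primary, standby = Node("node-1", "primary"), Node("node-2", "standby")
primary.last_heartbeat -= 30                   # simulate a missed heartbeat window
print(maybe_failover(primary, standby).name)   # "node-2" is promoted
```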

Call‑path resilience

  • Auto‑routing around carrier congestion at the telecom layer (see the route‑selection sketch below).
  • Manual inbound route switch available to admins for localized issues (see Runbooks).
  • Cache/queue‑backed operation to ride out transient dependencies.
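
The following sketch shows one way auto‑routing can deprioritize a congested carrier route by preferring the best recent ASR; the route names and ASR floor are illustrative assumptions, not the platform's actual policy.

```python
# Minimal sketch of auto-routing around carrier congestion: prefer the route
# with the best recent answer-seizure ratio (ASR), skipping routes below a
# floor. Route names and thresholds are illustrative assumptions.
ASR_FLOOR = 0.40   # routes answering fewer than 40% of seized calls are skipped

def pick_route(routes: dict[str, float]) -> str:
    """routes maps route name -> rolling ASR (0..1)."""
    healthy = {name: asr for name, asr in routes.items() if asr >= ASR_FLOOR}
    candidates = healthy or routes          # degrade gracefully if all routes are poor
    return max(candidates, key=candidates.get)

print(pick_route({"carrier-a": 0.35, "carrier-b": 0.62}))  # "carrier-b"
```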

Data‑tier failover

  • Shard promotion and replica election procedures.
  • Write fencing to prevent split‑brain; replay/repair jobs after recovery (see the fencing sketch below).
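
To illustrate write fencing, the sketch below uses monotonically increasing epoch tokens so a demoted primary cannot apply stale writes after a replica is elected; the store and field names are hypothetical.

```python
# Minimal sketch of write fencing with monotonically increasing epoch tokens:
# writes stamped with a stale epoch are rejected, so a demoted primary cannot
# corrupt data after a new replica is elected. Names are illustrative.
class FencedStore:
    def __init__(self):
        self.current_epoch = 1   # bumped on every promotion/election
        self.data = {}

    def promote(self) -> int:
        """Elect a new writer; hand it a fresh fencing token."""
        self.current_epoch += 1
        return self.current_epoch

    def write(self, epoch: int, key: str, value: str) -> bool:
        if epoch < self.current_epoch:
            return False         # stale writer (old primary) is fenced off
        self.data[key] = value
        return True

store = FencedStore()
old_token = store.current_epoch
new_token = store.promote()                      # replica elected as new primary
print(store.write(old_token, "cdr:1", "stale"))  # False -> split-brain prevented
print(store.write(new_token, "cdr:1", "fresh"))  # True
```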

Capacity planning playbook

Inputs to watch (weekly):

  • Peak concurrent calls (CPC), post‑dial delay (PDD), ASR (answer‑seizure ratio), AHT (average handle time), error codes by circle/provider.
  • Storage growth (CDRs, recordings), export queue times.

Thresholds & actions (a sketch of these checks follows the list):

  • CPC ≥ 70% of tenant limit for 3 consecutive days → request capacity uplift via Support.
  • PDD > 4s or ASR drop ≥ 10% (after excluding agent unavailability) → initiate route health check and consider manual route switch.
  • Recording storage ≥ 80% of plan allowance → archive/export or expand quota.
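
A minimal sketch of those threshold rules as a weekly check, with metric names of our own choosing; plug in values exported from Call Logs/Analytics.

```python
# Minimal sketch of the weekly threshold check described above. The metric
# keys are illustrative assumptions; populate them from Call Logs/Analytics.
def capacity_actions(metrics: dict) -> list[str]:
    actions = []
    if metrics["cpc_pct_of_limit"] >= 0.70 and metrics["days_over"] >= 3:
        actions.append("Request capacity uplift via Support")
    # asr_drop_pct should already exclude drops caused by agent unavailability
    if metrics["pdd_seconds"] > 4 or metrics["asr_drop_pct"] >= 10:
        actions.append("Run route health check; consider manual route switch")
    if metrics["recording_storage_pct"] >= 0.80:
        actions.append("Archive/export recordings or expand quota")
    return actions

print(capacity_actions({
    "cpc_pct_of_limit": 0.74, "days_over": 3,
    "pdd_seconds": 5.1, "asr_drop_pct": 4,
    "recording_storage_pct": 0.55,
}))
```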

Quarterly review:

  • Revalidate team sizes, routing method (serial/simultaneous/balanced), and ring‑time (keep 20–30s).
  • Confirm disaster‑recovery contacts and escalation lists.

Operations & monitoring (SRE)

Core SLIs

  • Availability: % of successful call setups; error budgets tied to the SLA (see the worked example after this list).
  • Latency: PDD, media path jitter/packet loss.
  • Quality: MOS proxies, drop rate after ring.
  • Throughput: Calls/minute, concurrent channels by region.
  • Reliability: MTTR, MTBF, incident recurrence rate.
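
As a worked example, the sketch below computes the availability SLI and the remaining error budget against the 99% SLA; the call counts are illustrative.

```python
# Minimal sketch of the availability SLI and error budget against the 99%
# SLA. The counts are illustrative; source them from call-setup logs.
SLO = 0.99

def availability(successful_setups: int, attempted_setups: int) -> float:
    return successful_setups / attempted_setups if attempted_setups else 1.0

def error_budget_remaining(successful: int, attempted: int) -> float:
    """Fraction of the allowed failure budget still unspent this period."""
    allowed_failures = (1 - SLO) * attempted
    actual_failures = attempted - successful
    return 1 - (actual_failures / allowed_failures) if allowed_failures else 1.0

print(f"availability={availability(99_450, 100_000):.4f}")           # 0.9945
print(f"budget left={error_budget_remaining(99_450, 100_000):.0%}")  # 45%
```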

Alerting

  • Symptomatic alerts (ASR dips, PDD spikes) + cause‑oriented alerts (carrier error codes, region health).
  • Auto‑mitigation hooks trigger route reweights/failovers (see the sketch below).
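
A minimal sketch of pairing a symptomatic signal (ASR dip) with a cause‑oriented one (carrier error rate) before invoking an auto‑mitigation hook; the thresholds and the hook are assumptions, not the platform's actual alert rules.

```python
# Minimal sketch of correlating a symptomatic alert (ASR dip) with a
# cause-oriented signal (carrier error rate) before auto-mitigation fires.
# Thresholds and the reweight hook are illustrative assumptions.
def evaluate(asr: float, baseline_asr: float, carrier_error_rate: float,
             reweight_route) -> str:
    symptom = asr < baseline_asr * 0.90     # ASR dipped >10% vs baseline
    cause = carrier_error_rate > 0.05       # carrier returning errors
    if symptom and cause:
        reweight_route()                    # auto-mitigation: shift traffic
        return "mitigated"
    if symptom:
        return "page-oncall"                # symptom without a clear cause
    return "ok"

print(evaluate(0.52, 0.65, 0.12, lambda: print("reweighting route")))
```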

Incident management

  • Defined SEV levels, on‑call rotation, status updates, and post‑incident reviews with corrective actions.

Disaster recovery (DR): RTO/RPO & tests

  • Modes: Hot (active–active), Warm (standby with rapid spin‑up), Cold (restore from backups).
  • Targets: Application‑level RTO of minutes to low hours; RPO near real‑time for replicated tiers, longer for cold archives (a measurement sketch follows this list).
  • Exercises: Semi‑annual region failover tests; quarterly backup restores; game‑days for carrier outages.
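
For drill reporting, here is a minimal sketch of measuring achieved RPO (replication lag at the moment of failover) and RTO (time to first successful call setup after failover); the timestamps are illustrative.

```python
# Minimal sketch of recording achieved RPO/RTO during a failover drill.
# RPO is approximated by replication lag at failover time; RTO by the time
# to the first successful call setup in the surviving region.
from datetime import datetime

def achieved_rpo_seconds(last_replicated_txn: datetime, failover_at: datetime) -> float:
    return (failover_at - last_replicated_txn).total_seconds()

def achieved_rto_seconds(failover_at: datetime, first_good_call: datetime) -> float:
    return (first_good_call - failover_at).total_seconds()

t0 = datetime(2024, 6, 1, 10, 0, 0)   # failover declared (illustrative)
print(achieved_rpo_seconds(datetime(2024, 6, 1, 9, 59, 58), t0))   # 2.0 s of data at risk
print(achieved_rto_seconds(t0, datetime(2024, 6, 1, 10, 4, 30)))   # 270.0 s to recover
```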

Customer controls & best practices

  • Manual inbound route switch during regional congestion (Manage → Design Callflow → Inbound Route Settings → Advanced → Switch Route).
  • Choose routing method per Department to balance speed vs fairness (Serial / Simultaneous / Balanced).
  • Keep Call Availability = ON for agents; maintain 20–30s ring‑time to avoid artificial misses.
  • Maintain a sample log (2–5 recent call IDs) for Support when reporting quality issues.
  • Use Call Logs/Analytics to monitor ASR, statuses, and patterns by provider/circle.

Runbooks (copy‑paste)

A) Manual inbound route switch (localized carrier issue)

  1. Sign in → Manage → Design Callflow.
  2. Open Inbound Route Settings → Advanced Settings.
  3. Click Switch Route (or Change Telecom Route).
  4. Select an alternate route → Save/Apply.
  5. Place 3–5 test calls; monitor ASR/PDD for 30–60 minutes.
  6. Revert when original route stabilizes.
Screenshot placeholder (alt text: “Inbound Route Settings showing Advanced → Switch Route.”)

B) Capacity review (monthly)

  1. Open Reports/Analytics → Call Summary / Call Logs.
  2. Extract CPC, ASR, PDD by hour and by circle/provider (see the aggregation sketch after these steps).
  3. Compare to thresholds (see Capacity playbook).
  4. If above thresholds, open a Support ticket for uplift or provider review.
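
A minimal sketch of step 2, aggregating an exported Call Logs CSV into hourly ASR and average PDD for comparison against the playbook thresholds; the column names (start_time, status, pdd_ms) are assumptions and should be adjusted to the actual export headers.

```python
# Minimal sketch: aggregate an exported Call Logs CSV into hourly ASR and
# average PDD. Column names are assumptions; adjust to your export's headers.
import csv
from collections import defaultdict

def hourly_metrics(path: str) -> dict:
    stats = defaultdict(lambda: {"attempted": 0, "answered": 0, "pdd_ms": []})
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            hour = row["start_time"][:13]            # e.g. "2024-06-01T10"
            bucket = stats[hour]
            bucket["attempted"] += 1
            if row["status"].lower() == "answered":
                bucket["answered"] += 1
            if row.get("pdd_ms"):
                bucket["pdd_ms"].append(float(row["pdd_ms"]))
    return {
        hour: {
            "asr": b["answered"] / b["attempted"],
            "avg_pdd_s": (sum(b["pdd_ms"]) / len(b["pdd_ms"]) / 1000) if b["pdd_ms"] else None,
        }
        for hour, b in stats.items()
    }

# print(hourly_metrics("call_logs_export.csv"))
```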

C) Incident report to Support

Subject: Carrier/Route Degradation – Sample calls for analysis
Account ID: <id>
Issue window (IST): <start–end>
Symptoms: <ASR drop / high PDD / drops>
Samples:
1) <call_id> – <timestamp> – <last 4 digits> – <status>
2) <call_id> – <timestamp> – <last 4 digits> – <status>
Actions tried: route switch, agent checks

FAQ

What uptime does MyOperator commit to?
We commit to 99% uptime under our SLA.

How many data centers do you use?
We operate across 9 regions with redundancy and failover.

Can one account handle very large teams?
Yes—tested to ~9,900 users and ~4M call‑minutes/day per account, with elastic headroom.

What happens if a region fails?
Traffic shifts to healthy regions (active–active/standby patterns); data tiers fail over to replicas.

What can I do during a local outage?
Use the manual route switch and share 2–5 sample calls with Support for deep carrier traces.