MyOperator is architected to scale vertically per account (tested to ~9,900 users and ~4 million call‑minutes/day on a single account, with a >55% capacity buffer) and horizontally across tenants. We operate across 9 regional data centers with replication, health checks, and automatic failover. For localized carrier issues, admins can switch inbound routes; for broader incidents, platform‑level auto‑routing and active–standby or active–active designs keep calls flowing.
Table of contents
- Quick facts & SLOs
- Scalability model
- High availability (HA) architecture
- Failover strategies
- Capacity planning playbook
- Operations & monitoring (SRE)
- Disaster recovery (DR): RTO/RPO & tests
- Customer controls & best practices
- Runbooks (copy‑paste)
- FAQ
Quick facts & SLOs
Platform scale
- Per‑account vertical scale proven to ~9,900 users and ~4M call‑minutes/day.
- Horizontal scale across tenants (multi‑tenant, sharded data tiers).
- >55% reserved buffer capacity to absorb surges (see the headroom sketch below).
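To make that buffer figure actionable, the minimal sketch below compares observed usage against the tested per‑account ceilings quoted above; the observed numbers are hypothetical placeholders for your own analytics data.

```python
# Rough headroom check against the tested per-account ceilings cited above.
# The observed_* figures are hypothetical; substitute your own analytics data.

TESTED_USERS = 9_900                         # tested vertical limit per account
TESTED_CALL_MINUTES_PER_DAY = 4_000_000

observed_users = 3_200                       # hypothetical current headcount
observed_call_minutes_per_day = 1_150_000    # hypothetical busy-day total

def headroom(observed: float, tested_limit: float) -> float:
    """Fraction of the tested limit still unused (1.0 = fully unused)."""
    return 1.0 - observed / tested_limit

print(f"User headroom:        {headroom(observed_users, TESTED_USERS):.0%}")
print(f"Call-minute headroom: "
      f"{headroom(observed_call_minutes_per_day, TESTED_CALL_MINUTES_PER_DAY):.0%}")
```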
Availability
- 99% uptime service commitment under SLA.
- 9 regional data centers with load balancing and cross‑region redundancy.
Data resilience
- Replicated + sharded databases; call engines operate as isolated nodes to reduce blast radius.
- Offline/cached mode for call engines during transient network loss.
Scalability model
Vertical (within an account)
- Scale users, departments, IVR levels, and call volume inside a single tenant to tested limits (see Quick facts).
- Elastic resource allocation per tenant during peaks.
Horizontal (across accounts)
- Stateless call engines behind load balancers distribute traffic across nodes and regions.
- Data layers employ sharding for call records and media (see the shard‑mapping sketch below); storage tiers expand without downtime.
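As an illustration of the sharding idea (not MyOperator's actual data‑tier implementation), the sketch below maps tenants onto call‑record shards with a simple consistent‑hash ring; the shard names and ring size are assumptions.

```python
# Illustrative tenant-to-shard mapping using a simple consistent-hash ring:
# adding a shard only remaps a fraction of tenants. Shard names are examples.
import bisect
import hashlib

class ShardRing:
    def __init__(self, shards, vnodes=64):
        self._ring = []                      # sorted list of (hash, shard)
        for shard in shards:
            for v in range(vnodes):
                self._ring.append((self._hash(f"{shard}#{v}"), shard))
        self._ring.sort()

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, tenant_id: str) -> str:
        h = self._hash(tenant_id)
        idx = bisect.bisect(self._ring, (h,)) % len(self._ring)
        return self._ring[idx][1]

ring = ShardRing(["cdr-shard-1", "cdr-shard-2", "cdr-shard-3"])
print(ring.shard_for("tenant-42"))   # the same tenant always lands on the same shard
```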
Dimensions of scalability
- Administrative: Many clients on shared infrastructure with per‑tenant isolation.
- Functional: New features can be rolled out without affecting baseline capacity.
- Geographic: Access and routing optimized across regions/circles.
- Load: Auto‑scale pools up and down with traffic.
- Generational: Components upgraded incrementally without large‑scale rewrites.
High availability (HA) architecture
- Multi‑region footprint: 9 regional data centers; each region carries load and can back up others.
- No single point of failure: Redundant instances at every layer (edge, call engine, DB, storage).
- Health checks & self‑tests: Continuous probes and rolling windows of production metrics guide routing and scaling (see the probe sketch after this list).
- Data protection: Synchronous/asynchronous replication as appropriate; integrity checks during promotion/failback.
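The health‑check approach can be sketched as a rolling window of recent probe results that gates whether a node keeps receiving traffic; the window size and success threshold below are illustrative, not the platform's actual tuning.

```python
# Minimal rolling-window health check: a node stays routable only while its
# recent probe success rate is above a threshold. Numbers are illustrative.
from collections import deque

class RollingHealth:
    def __init__(self, window: int = 20, min_success_rate: float = 0.8):
        self._results = deque(maxlen=window)   # True = probe succeeded
        self._min_rate = min_success_rate

    def record(self, success: bool) -> None:
        self._results.append(success)

    def healthy(self) -> bool:
        if not self._results:
            return True                        # no data yet: assume healthy
        return sum(self._results) / len(self._results) >= self._min_rate

probe = RollingHealth()
for ok in [True, True, False, True, False, False, False, False]:
    probe.record(ok)
print("route traffic here?", probe.healthy())
```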
Failover strategies
Cluster topologies
- Active–Standby: Rapid promotion when primary node/region is impaired.
- Active–Active: Parallel paths share load; if one fails, others continue serving with minimal disruption (illustrated in the sketch below).
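To make the active–active pattern concrete, the sketch below rotates new calls across whichever regions currently report healthy, so losing one region simply shifts load to the survivors; region names and health flags are hypothetical.

```python
# Active-active call distribution sketch: new calls rotate across the regions
# currently marked healthy; a failed region simply stops receiving new calls.
# Region names and health flags are illustrative.
import itertools

regions_health = {"region-a": True, "region-b": True, "region-c": True}
_counter = itertools.count()

def pick_region() -> str:
    healthy = [r for r, ok in regions_health.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy region available")
    return healthy[next(_counter) % len(healthy)]

print([pick_region() for _ in range(4)])     # round-robins over a, b, c
regions_health["region-b"] = False           # simulate a regional failure
print([pick_region() for _ in range(4)])     # traffic continues on a and c
```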
Call‑path resilience
- Auto‑routing around carrier congestion at the telecom layer.
- Manual inbound route switch available to admins for localized issues (see Runbooks).
- Cache/queue backed operation to ride out transient dependencies.
Data‑tier failover
- Shard promotion and replica election procedures.
- Write fencing to prevent split‑brain; replay/repair jobs after recovery (see the fencing‑token sketch below).
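Write fencing is commonly implemented with fencing tokens: each promotion issues a strictly larger token, and the data tier rejects writes that carry a stale one. The sketch below shows that generic pattern, not the platform's exact protocol.

```python
# Fencing-token sketch: every promotion issues a strictly larger token, and the
# data tier refuses writes from any node holding an older token, preventing a
# deposed primary (split-brain survivor) from corrupting data.

class FencedStore:
    def __init__(self):
        self._highest_token = 0
        self.rows = {}

    def promote(self) -> int:
        """Called when a replica is elected primary; returns its fencing token."""
        self._highest_token += 1
        return self._highest_token

    def write(self, token: int, key: str, value: str) -> bool:
        if token < self._highest_token:
            return False                      # stale primary: write rejected
        self.rows[key] = value
        return True

store = FencedStore()
old_primary = store.promote()                 # token 1
new_primary = store.promote()                 # failover: token 2
print(store.write(new_primary, "call-123", "completed"))  # True
print(store.write(old_primary, "call-123", "ringing"))    # False: fenced off
```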
Capacity planning playbook
Inputs to watch (weekly):
- Peak concurrent calls (CPC), post‑dial delay (PDD), ASR (answer‑seizure ratio), AHT (average handle time), error codes by circle/provider.
- Storage growth (CDRs, recordings), export queue times.
Thresholds & actions (a programmatic check is sketched after this list):
- CPC ≥ 70% of tenant limit for 3 consecutive days → request capacity uplift via Support.
- PDD > 4s or ASR drop ≥ 10% (after excluding agent unavailability) → initiate route health check and consider manual route switch.
- Recording storage ≥ 80% of plan allowance → archive/export or expand quota.
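These thresholds can be encoded directly in a weekly check, as in the sketch below; the sample metric values are hypothetical and should come from your Reports/Analytics exports.

```python
# Weekly capacity-threshold check encoding the rules above.
# The sample metrics are hypothetical; pull real values from Reports/Analytics.

metrics = {
    "cpc_pct_of_limit_last_3_days": [0.72, 0.75, 0.71],  # peak concurrent calls vs tenant limit
    "pdd_seconds": 3.1,                                   # post-dial delay
    "asr_drop_pct": 0.04,                                 # ASR drop vs baseline, agent unavailability excluded
    "recording_storage_pct": 0.83,                        # of plan allowance
}

actions = []
if all(day >= 0.70 for day in metrics["cpc_pct_of_limit_last_3_days"]):
    actions.append("Request capacity uplift via Support (CPC >= 70% for 3 days).")
if metrics["pdd_seconds"] > 4 or metrics["asr_drop_pct"] >= 0.10:
    actions.append("Run route health check; consider manual route switch.")
if metrics["recording_storage_pct"] >= 0.80:
    actions.append("Archive/export recordings or expand storage quota.")

print("\n".join(actions) or "All capacity indicators within thresholds.")
```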
Quarterly review:
- Revalidate team sizes, routing method (serial/simultaneous/balanced), and ring‑time (keep 20–30s).
- Confirm disaster‑recovery contacts and escalation lists.
Operations & monitoring (SRE)
Core SLIs
- Availability: % of successful call setups; error budgets tied to the SLA (see the error‑budget arithmetic after this list).
- Latency: PDD, media path jitter/packet loss.
- Quality: MOS (mean opinion score) proxies, drop rate after ring.
- Throughput: Calls/minute, concurrent channels by region.
- Reliability: MTTR, MTBF, incident recurrence rate.
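One way to reason about the error budget implied by a 99% availability commitment: roughly 1% of monthly call setups may fail before the budget is spent. The sketch below does that arithmetic; the monthly call volume and failure count are hypothetical inputs.

```python
# Error-budget arithmetic for a 99% availability target on call setups.
# Monthly volume and failure count are hypothetical example figures.

SLO = 0.99
monthly_call_attempts = 2_500_000             # hypothetical tenant volume

error_budget_calls = monthly_call_attempts * (1 - SLO)
failed_so_far = 11_000                        # hypothetical failed setups this month

budget_remaining = error_budget_calls - failed_so_far
print(f"Error budget: {error_budget_calls:,.0f} failed setups/month")
print(f"Remaining:    {budget_remaining:,.0f} "
      f"({budget_remaining / error_budget_calls:.0%} of budget left)")
```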
Alerting
- Symptomatic alerts (ASR dips, PDD spikes) + cause‑oriented alerts (carrier error codes, region health).
- Auto‑mitigation hooks trigger route reweights/failovers (see the correlation sketch below).
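A minimal sketch of pairing a symptomatic signal (ASR dip) with a cause‑oriented one (carrier error spike) before firing an auto‑mitigation hook; the signal names, thresholds, and reweight function are illustrative assumptions, not the platform's real hooks.

```python
# Pairing a symptomatic alert (ASR dip) with a cause-oriented one (carrier
# error spike) before triggering an auto-mitigation hook. All names and
# thresholds here are illustrative.

def reweight_route(provider: str, weight: float) -> None:
    # Placeholder for the real mitigation hook (route reweight / failover).
    print(f"mitigation: set {provider} route weight to {weight:.0%}")

def evaluate(asr: float, asr_baseline: float, carrier_error_rate: float,
             provider: str) -> None:
    symptomatic = asr < asr_baseline * 0.90           # >=10% ASR dip
    cause = carrier_error_rate > 0.05                 # elevated carrier errors
    if symptomatic and cause:
        reweight_route(provider, 0.2)                 # drain most traffic away
    elif symptomatic:
        print("ASR dip without a carrier signal: page on-call for triage")

evaluate(asr=0.38, asr_baseline=0.47, carrier_error_rate=0.08, provider="carrier-x")
```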
Incident management
- Defined SEV levels, on‑call rotation, status updates, and post‑incident reviews with corrective actions.
Disaster recovery (DR): RTO/RPO & tests
- Modes: Hot (active–active), Warm (standby with rapid spin‑up), Cold (restore from backups).
- Targets: Application‑level RTO minutes to low hours; RPO near‑real‑time for replicated tiers, longer for cold archives.
- Exercises: Semi‑annual region failover tests; quarterly backup restores; game‑days for carrier outages (a drill scorecard is sketched below).
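A DR drill can record whether it met the stated targets; the sketch below compares a measured failover duration and replication lag against example RTO/RPO values, which are illustrative rather than contractual.

```python
# DR drill scorecard sketch: compare measured failover time and replication
# lag against example RTO/RPO targets. Targets and measurements are illustrative.
from datetime import timedelta

rto_target = timedelta(minutes=30)            # example application-level RTO
rpo_target = timedelta(seconds=5)             # example RPO for replicated tiers

measured_failover = timedelta(minutes=18)     # hypothetical drill result
measured_replication_lag = timedelta(seconds=2)

print("RTO met:", measured_failover <= rto_target)
print("RPO met:", measured_replication_lag <= rpo_target)
```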
Customer controls & best practices
- Manual inbound route switch during regional congestion (Manage → Design Callflow → Inbound Route Settings → Advanced → Switch Route).
- Choose routing method per Department to balance speed vs fairness (Serial / Simultaneous / Balanced).
- Keep Call Availability = ON for agents; maintain 20–30s ring‑time to avoid artificial misses.
- Maintain a sample log (2–5 recent call IDs) for Support when reporting quality issues.
- Use Call Logs/Analytics to monitor ASR, statuses, and patterns by provider/circle.
Runbooks (copy‑paste)
A) Manual inbound route switch (localized carrier issue)
- Sign in → Manage → Design Callflow.
- Open Inbound Route Settings → Advanced Settings.
- Click Switch Route (or Change Telecom Route).
- Select an alternate route → Save/Apply.
- Place 3–5 test calls; monitor ASR/PDD for 30–60 minutes.
- Revert when original route stabilizes.
Screenshot placeholder
Alt text: “Inbound Route Settings showing Advanced → Switch Route.”
B) Capacity review (monthly)
- Open Reports/Analytics → Call Summary / Call Logs.
- Extract CPC, ASR, PDD by hour and by circle/provider (a scripted roll‑up is sketched after these steps).
- Compare to thresholds (see Capacity playbook).
- If above thresholds, open a Support ticket for uplift or provider review.
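If the Call Logs export is available as CSV, the hourly roll‑up can be scripted as below; the column names (call_start, status, pdd_ms) are assumptions about the export format, so adjust them to match the actual file.

```python
# Hourly ASR / PDD roll-up from a hypothetical Call Logs CSV export.
# Column names (call_start, status, pdd_ms) are assumed; adjust to match the
# real export. Repeat the grouping per provider/circle as needed.
import csv
from collections import defaultdict
from datetime import datetime

rows_by_hour = defaultdict(list)
with open("call_logs_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        hour = datetime.fromisoformat(row["call_start"]).strftime("%Y-%m-%d %H:00")
        rows_by_hour[hour].append(row)

for hour, rows in sorted(rows_by_hour.items()):
    answered = sum(1 for r in rows if r["status"] == "answered")
    asr = answered / len(rows)
    avg_pdd = sum(float(r["pdd_ms"]) for r in rows) / len(rows)
    print(f"{hour}  calls={len(rows):4d}  ASR={asr:.0%}  avg PDD={avg_pdd:.0f} ms")
```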
C) Incident report to Support
Subject: Carrier/Route Degradation – Sample calls for analysis
Account ID: <id>
Issue window (IST): <start–end>
Symptoms: <ASR drop / high PDD / drops>
Samples:
1) <call_id> – <timestamp> – <last 4 digits> – <status>
2) <call_id> – <timestamp> – <last 4 digits> – <status>
Actions tried: route switch, agent checks
FAQ
What uptime does MyOperator commit to?
We commit to 99% uptime under our SLA.
How many data centers do you use?
We operate across 9 regions with redundancy and failover.
Can one account handle very large teams?
Yes—tested to ~9,900 users and ~4M call‑minutes/day per account, with elastic headroom.
What happens if a region fails?
Traffic shifts to healthy regions (active–active or active–standby patterns); data tiers fail over to replicas.
What can I do during a local outage?
Use the manual route switch and share 2–5 sample calls with Support for deep carrier traces.