Cloud Migration Planning: A Step-by-Step Guide
A practical guide to planning cloud migrations for B2B SaaS teams. Covers assessment, strategy selection, risk management, and cutover planning.
Most cloud migrations that go wrong do not fail because the technology was hard. They fail because the planning was shallow. Teams underestimate what they are moving, skip the rollback conversation until it is too late, and treat “go live” as the finish line instead of a checkpoint.
This guide walks through a structured approach to cloud migration planning for B2B SaaS teams: from initial assessment through post-migration optimization. The goal is a migration that is boring: predictable, reversible, and free of late-night emergency calls.
If you want an outside review of your current infrastructure before you start, an Infrastructure Audit is often the fastest way to surface risks that are not visible from inside the team.
Why migrations fail
Before covering what to do, it is worth being specific about the failure modes, because the same ones appear repeatedly.
No rollback plan. Teams invest heavily in forward planning and almost nothing in reverse planning. When something breaks at 2am, “roll back” cannot be the first time anyone has thought about what that means. Rollback needs to be defined, tested, and agreed on before the first workload moves.
Big-bang cutovers. Moving everything at once maximizes risk and minimizes your ability to diagnose problems. A single weekend migration for a platform with 12 services and 3 databases is not a migration plan: it is a hope.
Underestimating data migration. Application code is usually the easy part. Data has weight: volume, consistency requirements, transformation complexity, and validation overhead. Teams that budget two days for data migration regularly spend two weeks.
Undocumented dependencies. Monoliths and long-running services accumulate implicit dependencies: direct database connections between services, hardcoded IPs, shared file mounts, vendor integrations that assume a specific egress IP. These only surface when you move something and something else breaks.
Treating migration as an infrastructure project. Migration affects every team that ships code. If product engineering, QA, and customer success are not in the loop, you will hit coordination failures at the worst possible time.
Assessment phase
You cannot plan a migration you do not understand. The assessment phase produces a complete inventory of what you have, what depends on what, and what each workload actually requires.
1. Infrastructure inventory. List every compute resource, database, storage bucket, load balancer, queue, and third-party integration. Include resources that are managed by different teams or were set up manually years ago. Terraform state, AWS Config, or a tool like CloudMapper can help: but do not rely entirely on automated discovery. Walk the runbooks and ask the engineers who have been there longest.
2. Dependency mapping. For each service, document: what it calls, what calls it, and what shared state it touches. A service dependency graph at this stage will save weeks of debugging later. Pay attention to synchronous call chains: anything that requires a sub-100ms round trip needs to be co-located or replaced before you move it.
3. Workload classification. Not every workload should be migrated the same way. Classify each one along two axes: migration complexity (how hard is it to move?) and business criticality (what is the cost of downtime or degradation?). High complexity, high criticality workloads deserve the most planning time and the most conservative migration strategy.
4. Performance baselines. Before you touch anything, instrument and record your current performance: p50/p95/p99 latencies, error rates, resource utilization, and costs. These become your acceptance criteria after migration. Without a baseline, you will have no way to know if the new environment is actually better.
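The classification step above can be captured as data rather than a spreadsheet nobody updates. A minimal sketch, assuming a 1–5 score on each axis (the workload names and scores are illustrative, not from a real inventory):

```python
# Sketch: classify workloads on the two assessment axes described above.
# Scores and workload names are illustrative placeholders.

def classify(complexity: int, criticality: int) -> str:
    """Map 1-5 scores on each axis to a migration-planning quadrant."""
    hard = complexity >= 4
    critical = criticality >= 4
    if hard and critical:
        return "plan-first"      # most planning time, most conservative strategy
    if hard:
        return "schedule-late"   # complex, but tolerant of a rough cutover
    if critical:
        return "dual-run"        # easy to move, expensive to break: validate in parallel
    return "move-early"          # good candidates for the first migration wave

workloads = {
    "billing-db":   (5, 5),
    "auth-service": (2, 5),
    "report-batch": (4, 2),
    "status-page":  (1, 1),
}

for name, (cx, cr) in sorted(workloads.items()):
    print(f"{name}: {classify(cx, cr)}")
```

The thresholds (4 and up counts as "high") are an assumption; what matters is that the whole team scores workloads against the same rubric.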
Strategy selection: the 6 Rs
The “6 Rs” framework (originally from Gartner, widely adopted by AWS) gives you a structured vocabulary for migration decisions. Each workload gets assigned one.
Rehost (lift-and-shift). Move the workload as-is to equivalent cloud infrastructure. Fast and low-risk in isolation, but you carry your existing inefficiencies into the cloud. Best for: legacy applications where refactoring is not funded, workloads with hard deadlines, or anything you plan to retire within 12 months.
Replatform (lift-and-reshape). Move with targeted optimizations: swap your self-managed MySQL for RDS, replace a self-hosted message broker with SQS, move from EC2-based jobs to a managed container service. You get cloud benefits without a full rewrite. Best for: workloads where 2–3 infrastructure swaps will meaningfully reduce operational burden.
Refactor (re-architect). Redesign the application to take full advantage of cloud-native capabilities. Higher upfront cost, best long-term outcome. Best for: core product services where architectural constraints are actively limiting delivery speed.
Repurchase. Replace the workload with a SaaS alternative. If you are running your own SMTP relay, SFTP server, or data warehouse, a managed service is almost always cheaper and safer. Best for: non-differentiating infrastructure that exists because nobody replaced it.
Retire. Shut it down. Every inventory turns up services that nobody is sure are still in use. Check logs, check call graphs, confirm with the owning team, and remove if unused. Best for: anything with no traffic and no clear owner.
Retain. Leave it in place, at least for now. Some workloads have regulatory constraints, latency requirements, or vendor dependencies that make cloud migration genuinely not worth it yet. Best for: workloads with hard data residency requirements, or anything tightly coupled to on-prem hardware.
A realistic migration portfolio for a mid-size SaaS might split roughly 40% rehost, 35% replatform, 15% retire, and 10% refactor. The exact mix depends on your architecture, but the point is that not everything needs the same treatment.
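Once each workload has a strategy assigned, the portfolio mix falls out mechanically. A small sketch (the assignments below are made up for illustration, not a recommendation):

```python
from collections import Counter

# Sketch: summarize a migration portfolio by 6-R strategy.
# Workload names and assignments are illustrative placeholders.
assignments = {
    "web-frontend": "rehost",
    "legacy-admin": "rehost",
    "pdf-renderer": "rehost",
    "billing-db": "replatform",
    "job-queue": "replatform",
    "auth-service": "replatform",
    "old-metrics-cron": "retire",
    "core-api": "refactor",
}

counts = Counter(assignments.values())
total = len(assignments)
for strategy, n in counts.most_common():
    print(f"{strategy:<10} {n} workloads ({100 * n / total:.0f}%)")
```

Keeping the mapping in version control makes strategy changes reviewable, the same way code changes are.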
Risk management
Risk management in a migration is not a checklist item: it is a set of operational practices that you put in place before you move anything and keep running until migration is complete.
Dual-running. For critical services, run old and new environments in parallel for a defined period before cutover. Traffic is still going to the old environment; the new environment is being validated against real load. This catches configuration drift, environment-specific bugs, and performance regressions before they affect customers.
Feature flags. If your application supports feature flags, use them to gate traffic to migrated services at the application layer. This gives you a fast, tested kill switch that does not require a DNS change or infrastructure rollback.
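At its simplest, application-layer gating looks like the sketch below. The flag store and endpoint URLs are hypothetical; in practice the flag would come from your feature-flag provider:

```python
# Sketch: routing a service call through a feature flag so the migrated
# backend can be disabled instantly. Endpoints and flag names are made up.

flags = {"use-migrated-search": True}  # stand-in for a real flag provider

LEGACY_URL = "https://search.legacy.internal"    # hypothetical endpoint
MIGRATED_URL = "https://search.cloud.internal"   # hypothetical endpoint

def search_endpoint() -> str:
    # Flipping the flag off is the kill switch: no DNS change, no redeploy.
    return MIGRATED_URL if flags.get("use-migrated-search") else LEGACY_URL
```

The point is that the rollback path is exercised in normal operation (the flag is read on every request), so it is known to work before you need it.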
Data sync strategies. For databases with active writes, you need continuous replication from source to target during the migration window. AWS DMS, native database replication, or a custom CDC pipeline: the right choice depends on your database engine and acceptable lag. Test your replication setup early and measure lag under production-like load, not just in idle tests.
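One common way to measure lag independently of the replication tooling is a heartbeat row: write a timestamp on the source, read it back on the replica, and the difference is the lag. A minimal sketch, with a plain dict standing in for the two database connections:

```python
import time

# Sketch: heartbeat-based replication lag measurement. `store` stands in
# for real source/replica database connections; it is a dict for illustration.

def write_heartbeat(store: dict, now: float) -> None:
    store["source_ts"] = now  # e.g. UPDATE heartbeat SET ts = :now on the source

def read_heartbeat(store: dict) -> float:
    return store["replica_ts"]  # e.g. SELECT ts FROM heartbeat on the replica

def measure_lag(store: dict) -> float:
    write_heartbeat(store, time.time())
    return time.time() - read_heartbeat(store)

# Simulate a replica that is 12 seconds behind the source.
store = {"replica_ts": time.time() - 12.0}
lag = measure_lag(store)
print(f"replication lag: {lag:.0f}s")
```

Run this continuously under production-like write volume, not once in an idle environment, since lag is load-dependent.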
Rollback triggers. Define the conditions that will cause you to abort and roll back. These should be specific: error rate above 1% for 5 consecutive minutes, p99 latency above 800ms, replication lag above 30 seconds. Vague triggers (“if it seems bad”) lead to arguments during an incident. Document the triggers, agree on them before the migration window, and assign a named person who has authority to call it.
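Encoding the triggers as data makes "abort" a mechanical check rather than a judgment call during an incident. A sketch using the thresholds above (the 5-minute hold window for p99 latency is an assumption, not stated in the text):

```python
from dataclasses import dataclass

# Sketch: rollback triggers as data. Thresholds mirror the examples above;
# the hold window on p99 latency is an assumed value.

@dataclass
class Trigger:
    metric: str
    threshold: float
    hold_s: int  # how long the breach must persist before the trigger fires

TRIGGERS = [
    Trigger("error_rate_pct", 1.0, 300),
    Trigger("p99_latency_ms", 800.0, 300),
    Trigger("replication_lag_s", 30.0, 0),
]

def fired_triggers(current: dict, breach_duration_s: dict) -> list:
    """current: latest metric values; breach_duration_s: seconds each metric
    has been continuously above its threshold (0 if not currently breached)."""
    return [
        t.metric for t in TRIGGERS
        if current.get(t.metric, 0.0) > t.threshold
        and breach_duration_s.get(t.metric, 0) >= t.hold_s
    ]

print(fired_triggers(
    {"error_rate_pct": 1.4, "p99_latency_ms": 620.0, "replication_lag_s": 45.0},
    {"error_rate_pct": 360, "p99_latency_ms": 0, "replication_lag_s": 10},
))
```

The named rollback owner still makes the call; the trigger list just removes the argument about whether the condition has been met.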
Cutover planning
The cutover is the highest-risk period of any migration. Most of the work happens before the window opens.
Runbook template. A good cutover runbook includes: a step-by-step sequence with estimated durations, the name of the person responsible for each step, verification commands to confirm each step completed successfully, and the rollback procedure for each step. The runbook should be reviewed and rehearsed before the live window: ideally in a staging cutover that exercises the real procedure.
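A runbook kept as structured data can be printed, reviewed, and rehearsed the same way every time. A sketch of the shape described above (the steps, owners, and durations are illustrative):

```python
# Sketch: a cutover runbook captured as data. Step contents, owners,
# and estimates are illustrative placeholders.

RUNBOOK = [
    {"step": "Freeze writes on legacy DB", "owner": "dana", "eta_min": 10,
     "verify": "count of in-flight transactions is 0",
     "rollback": "re-enable writes; no data has moved yet"},
    {"step": "Confirm replication lag within bounds", "owner": "sam", "eta_min": 5,
     "verify": "lag gauge on the migration dashboard",
     "rollback": "abort the window; legacy is still serving"},
    {"step": "Switch traffic via feature flag", "owner": "dana", "eta_min": 5,
     "verify": "error rate and p99 within baseline",
     "rollback": "flip the flag back to legacy"},
]

total = sum(s["eta_min"] for s in RUNBOOK)
for i, s in enumerate(RUNBOOK, 1):
    print(f"{i}. {s['step']} ({s['owner']}, ~{s['eta_min']} min)")
print(f"Estimated total: {total} min; budget the window at {2 * total} min")
```

Because every step carries its own rollback, "abort at step 4" has a defined meaning instead of an improvised one.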
Communication plan. Define who gets notified and when. Internal stakeholders (engineering, support, sales) need a timeline and a status update channel. Customer-facing communication should be drafted in advance for the scenarios where you need it: planned maintenance notice, degraded service notice, incident notification.
Go/no-go criteria. Before you open the migration window, hold a go/no-go call. Check: replication lag is within acceptable bounds, all participating engineers are on the call, monitoring is active, rollback procedure has been reviewed in the last 24 hours, and there are no open SEV-1 or SEV-2 incidents. If any item is red, delay. A one-week delay costs far less than a failed migration.
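The go/no-go call works best when the checklist is explicit and binary. A sketch of the criteria above as data (the check results are placeholders):

```python
# Sketch: go/no-go checklist evaluated as data. Results are placeholders;
# in practice each check would be confirmed live on the call.
checks = {
    "replication_lag_within_bounds": True,
    "all_engineers_on_call": True,
    "monitoring_active": True,
    "rollback_reviewed_last_24h": True,
    "no_open_sev1_or_sev2": False,
}

blockers = [name for name, ok in checks.items() if not ok]
decision = "GO" if not blockers else f"NO-GO: {', '.join(blockers)}"
print(decision)
```

A single red item produces NO-GO; there is no "mostly green" state, which is exactly the point.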
Window sizing. For each service, estimate the migration steps conservatively, double that estimate, and set the window end time at that doubled estimate. If you finish early, great. If something takes longer than expected, you have buffer before you have to decide between completing an unsafe migration and rolling back.
For teams that need execution support during a complex migration, see Migration Delivery for how we structure and run these windows.
Post-migration
Getting workloads into the cloud is not the end of the project. The first 30–60 days after cutover are where you determine whether the migration actually delivered what it was supposed to.
Optimization. The right-sizing you did during assessment is a starting point, not a conclusion. Under real production load, you will find resources that are over-provisioned (costs are higher than expected) and under-provisioned (latency or error rates are elevated). Run a rightsizing review at 30 days with real utilization data.
Cost monitoring. Set up billing alerts before you move a single workload. Day-one cloud bills regularly surprise teams that did not account for data transfer costs, NAT gateway charges, or the difference between reserved and on-demand pricing. A cost dashboard with per-service attribution should be running from day one.
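Per-service attribution can start as something very simple: compare daily cost per service against a budgeted figure and surface the overruns. A sketch with made-up numbers (real figures would come from your billing export):

```python
# Sketch: day-one cost check with per-service attribution.
# All dollar amounts below are illustrative, not real billing data.
daily_costs = {
    "compute": 412.0,
    "rds": 180.0,
    "nat-gateway": 95.0,
    "data-transfer": 310.0,
}
budget = {
    "compute": 450.0,
    "rds": 200.0,
    "nat-gateway": 40.0,
    "data-transfer": 120.0,
}

overruns = {svc: cost - budget[svc]
            for svc, cost in daily_costs.items() if cost > budget[svc]}
for svc, over in sorted(overruns.items(), key=lambda kv: -kv[1]):
    print(f"ALERT {svc}: ${over:.0f}/day over budget")
```

Note which services blow the budget in this example: data transfer and NAT gateway, the two line items the text calls out as the usual surprises.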
Decommissioning legacy. Old infrastructure does not go away on its own. Assign a deadline for each decommissioned workload and track it explicitly. Until the old environment is turned off, you are paying for two environments and your team is maintaining two mental models.
Incident response updates. Your runbooks, alerting, and on-call procedures all reference infrastructure that no longer exists in the same form. Update them in the first week, not six months later when someone is debugging an incident at midnight.
If you want to see what a well-executed migration looks like end to end, the Cloud Migration Without Downtime case study covers the approach we used for a production migration with zero customer impact.
What a well-planned migration looks like
A good migration plan is conservative about timelines, specific about rollback, and honest about complexity. It has one owner who is accountable for the outcome, a written runbook that has been rehearsed, and a set of acceptance criteria that were agreed on before any work started.
The teams that do this well treat migration as a product, with a backlog, owners, and defined done criteria, rather than as an infrastructure task that engineering handles on the side.
If you want a structured review of your current environment before you start, or need execution support for a complex migration, get in touch and we can scope what the right starting point looks like.