Breaking the Monolith: How We Split a Node.js Backend into Go Microservices on AWS ECS, Without Stopping the World

This content originally appeared on DEV Community and was authored by Voskan Voskanyan

We didn’t "rewrite everything from scratch". We couldn’t. We have a product to ship and customers to support. What we did at SolarGenix.ai was more boring and more durable: we peeled a Node.js (TypeScript) monolith apart one seam at a time, stood up small Go services on AWS ECS, and changed the way services talk to each other-from synchronous REST fan-out to events on EventBridge.

This is the playbook we actually used. It’s not a template for everyone, but if you're staring at a similar migration, it should save you a few traps.

Why we changed

Our monolith wasn’t "bad". It was successful enough to accumulate real traffic and real teams. But success came with some sharp edges:

  • Tight coupling inside request/response. A single HTTP request often fanned out to three or more internal calls. When one downstream was slow, p95 went up for the entire user flow.
  • Cascading latency. Spiky load in one area caused queueing in unrelated areas because everything was tied to the same hot path.
  • Deploy risk. Shipping a "small" change could nudge a shared code path and cause unwanted side-effects. Rollbacks existed in theory; in practice, the blast radius was too big.

Our goals were mundane: safer deploys, clearer ownership, and predictable delivery. We wanted a system where one domain could evolve without touching six others, and where we could choose consistency vs. latency explicitly instead of inheriting it accidentally.

The plan (gradual, not big-bang)

We kept users on REST at the edge. Internals moved from sync calls to events.

  • REST at the boundary. Edge handlers stayed in the monolith for a while because the API surface was stable and familiar to clients.
  • Turn side-effects into events. Instead of calling N services synchronously, the handler published an event (proposal.published is the canonical example) and returned fast.
  • EventBridge for bus + routing. We picked Amazon EventBridge as the managed bus and routing layer. It let us route with simple patterns and avoid building our own "bus that only we understand."
  • Go for small services. New domain services were written in Go. Teams liked the simplicity, the small memory footprint, and the standard library. No framework rabbit hole.
  • Mirror prod names in staging. Staging mirrored production naming (with a -stg suffix), so cutovers were predictable. Versioned detail-type events (proposal.published:v2) gave us room to evolve.

We ran this migration as a rolling set of small moves, not a calendar-driven "big switch."

Contracts and compatibility

OpenAPI at the edge

Clients didn’t need to care that our internals changed. We published OpenAPI contracts for the edge endpoints and kept them stable. Where we knew we would break something, we versioned the endpoint explicitly or added a feature flag to control new behavior.

Event envelope

Every event followed the same envelope. It sounds nitpicky; it saved us a lot of confusion.

{
  "id": "01J9Q9W0R3W3S3DAXYVZ8R3M7V",
  "source": "app.solargenix",
  "type": "proposal.published",
  "version": "v2",
  "occurredAt": "2025-09-21T14:33:12Z",
  "data": {
    "proposalId": "pr_0f3b1e",
    "accountId": "acc_7a21",
    "publishedBy": "u_193",
    "currency": "USD"
  }
}
  • id is a stable ULID (more readable, roughly time-sortable).
  • type is a domain noun: proposal.published, account.updated, etc.
  • version is a major version only. Minor changes must be additive.
  • occurredAt is UTC. The producers set it once. Consumers don’t "fix" it.
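
To make those rules concrete, here's a minimal producer-side constructor sketch. It assumes the github.com/oklog/ulid/v2 package for IDs and the Event struct shown in the publisher section below; the NewEvent name is ours for illustration, not our exact code.

import (
    "time"

    "github.com/oklog/ulid/v2"
)

// NewEvent fills the envelope once, at the producer; consumers treat every field as read-only.
// Event is the envelope struct defined in the publisher section below.
func NewEvent(eventType, version string, data interface{}) Event {
    return Event{
        ID:         ulid.Make().String(), // stable, roughly time-sortable
        Source:     "app.solargenix",
        Type:       eventType,            // domain noun, e.g. "proposal.published"
        Version:    version,              // major version only, e.g. "v2"
        OccurredAt: time.Now().UTC(),     // set once, in UTC; consumers don't "fix" it
        Data:       data,
    }
}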

Dual-publish for majors

For major changes, we published both v1 and v2 for a sprint. Consumers opted into v2 when ready. We measured v1 usage and removed it once it hit zero. No hidden toggles, no guessing.
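
In code, dual-publishing is deliberately dumb: emit the same change under both detail-types until the old one goes quiet. A sketch, reusing the NewEvent constructor above and the Publish helper shown later in this post; ProposalPublishedV2 and its AsV1 conversion are hypothetical names for this example.

// Dual-publish window during a major version bump (sketch).
// The v1 call is deleted once measured v1 consumption hits zero.
func publishProposalPublished(ctx context.Context, bus, region string, data ProposalPublishedV2) error {
    if err := Publish(ctx, bus, region, NewEvent("proposal.published", "v1", data.AsV1())); err != nil {
        return err
    }
    return Publish(ctx, bus, region, NewEvent("proposal.published", "v2", data))
}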

Idempotency, retries, DLQ

You don’t get exactly-once delivery. Assume duplicates; you’ll sleep better.

  • Idempotency table in DynamoDB. Each consumer writes a row keyed by eventId with a TTL at least equal to the retry window (we use 24h). If the key exists, the side-effect already happened; drop it.
  • Per-target DLQs and explicit retry policies. Every rule target has its own DLQ and RetryPolicy. When something breaks, it breaks locally. We can triage by DLQ name and owner.
  • At-least-once thinking. We stopped treating duplicates as "bugs" and started treating lack of idempotency as the bug.

Minimal consumer shape (SQS buffer -> Lambda):

// Skeleton; real code has tracing, structured logs, and metrics.
// Event is our envelope struct; idem and apply are the service's own helpers.
package consumer

import (
    "context"
    "encoding/json"
    "time"

    "github.com/aws/aws-lambda-go/events"
)

func Handle(ctx context.Context, e events.SQSEvent) error {
    for _, r := range e.Records {
        var ev Event
        if err := json.Unmarshal([]byte(r.Body), &ev); err != nil {
            return err // retried by SQS/Lambda, ends up in DLQ if persistent
        }

        // TTL ≥ retry window; we used 24h.
        seen, err := idem.Seen(ctx, ev.ID, 24*time.Hour)
        if err != nil { return err }
        if seen { continue } // duplicate delivery: side-effect already applied

        if err := apply(ev); err != nil {
            return err // retry; isolated to this target
        }
    }
    return nil
}
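
The idem.Seen call above is the whole trick. One way to implement it (a sketch with the aws-sdk-go-v2 DynamoDB client; the table and attribute names are illustrative) is a conditional write that fails if the eventId already exists, so "check and mark" is a single atomic operation.

package idem

import (
    "context"
    "errors"
    "strconv"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb"
    "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
)

var client *dynamodb.Client // initialized at startup from the shared AWS config

// Seen records eventID with a TTL and reports whether it was already recorded.
func Seen(ctx context.Context, eventID string, ttl time.Duration) (bool, error) {
    expires := strconv.FormatInt(time.Now().Add(ttl).Unix(), 10)
    _, err := client.PutItem(ctx, &dynamodb.PutItemInput{
        TableName: aws.String("idempotency"), // illustrative table name
        Item: map[string]types.AttributeValue{
            "eventId":   &types.AttributeValueMemberS{Value: eventID},
            "expiresAt": &types.AttributeValueMemberN{Value: expires}, // DynamoDB TTL attribute
        },
        ConditionExpression: aws.String("attribute_not_exists(eventId)"),
    })
    if err != nil {
        var ccf *types.ConditionalCheckFailedException
        if errors.As(err, &ccf) {
            return true, nil // key exists: the side-effect already happened
        }
        return false, err
    }
    return false, nil
}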

Data projections & reads

We kept records of truth where they fit best and projected what the UI needed.

  • DynamoDB Streams for projections. Where a table is authoritative (e.g., counters, idempotency, some hot entities), Streams emit change events that update search indexes, analytics, or a read-optimized view.
  • Hot read path. UI reads follow: cache -> key lookup (DynamoDB) -> fallback to source of truth (often Aurora or the monolith during transition). Cache TTLs match how stale each endpoint is allowed to be.
  • Cache busting. Any state transition that changes what the UI renders triggers cache invalidation. We use small, explicit helpers, not "magic auto-busting."

The result: p95 fast-path reads dropped from 48 ms to 11 ms. Most pages now hit a projection or cache instead of stitching five joins under pressure.
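
Sketched below is that lookup order in a read service. The Cache, Projection, and SourceOfTruth interfaces are hypothetical names for this example, not our real packages; the point is the order of lookups and that the TTL is owned by the endpoint.

package reads

import (
    "context"
    "time"
)

// Hypothetical interfaces for the sketch.
type Cache interface {
    Get(ctx context.Context, key string) ([]byte, bool)
    Set(ctx context.Context, key string, val []byte, ttl time.Duration)
}

type Projection interface { // read-optimized view, typically a DynamoDB key lookup
    GetView(ctx context.Context, id string) ([]byte, error)
}

type SourceOfTruth interface { // Aurora, or the monolith during the transition
    LoadProposalView(ctx context.Context, id string) ([]byte, error)
}

type Reads struct {
    cache      Cache
    projection Projection
    source     SourceOfTruth
    ttl        time.Duration // matches how stale this endpoint is allowed to be
}

func (s *Reads) ProposalView(ctx context.Context, id string) ([]byte, error) {
    key := "proposal:view:" + id
    if v, ok := s.cache.Get(ctx, key); ok {
        return v, nil // fast path
    }
    if v, err := s.projection.GetView(ctx, id); err == nil && v != nil {
        s.cache.Set(ctx, key, v, s.ttl)
        return v, nil
    }
    v, err := s.source.LoadProposalView(ctx, id) // fallback to the record of truth
    if err != nil {
        return nil, err
    }
    s.cache.Set(ctx, key, v, s.ttl)
    return v, nil
}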

Release strategy on ECS

We didn’t forklift deploy. We carved the system into boundaries and rolled them out one at a time.

  • Service-by-service rollout. Each new Go service ran in ECS with its own task definition, autoscaling policy, and health checks. We didn’t cram everything into one cluster service.
  • Shadow consumers. Early consumers ran in shadow for a sprint. They processed the real event stream and wrote results next to the old path. We diffed until we trusted them.
  • Feature flags. When a consumer replaced an old synchronous call, we shipped the flag first, validated in staging, then flipped for a subset of tenants in production (sketched after this list).
  • Rollback plan. We kept the synchronous path alive for a while. If the consumer misbehaved, we flipped the flag off and investigated with a hot DLQ replay.
  • Blast-radius limits. We routed by detail-type, source, and sometimes accountId or region to control who got the new behavior.
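
The flag-gated cutover looked roughly like the sketch below. The flag client, publisher, and legacy notifier are shown as small hypothetical interfaces; the real flag name and services differ.

package proposal

import "context"

// Hypothetical interfaces for the sketch.
type Flags interface {
    Enabled(ctx context.Context, flag, accountID string) bool
}

type Publisher interface {
    Publish(ctx context.Context, ev Event) error
}

type LegacyNotifier interface {
    NotifyProposalPublished(ctx context.Context, ev Event) error
}

type Handler struct {
    flags  Flags
    events Publisher
    legacy LegacyNotifier
}

// publishSideEffects: the flag decides whether the side-effect runs as an event
// (new path) or synchronously against the old service (rollback path).
func (h *Handler) publishSideEffects(ctx context.Context, accountID string, ev Event) error {
    if h.flags.Enabled(ctx, "proposal-published-events", accountID) {
        return h.events.Publish(ctx, ev) // new path: fire the event, return fast
    }
    return h.legacy.NotifyProposalPublished(ctx, ev) // old path, kept alive until trust is earned
}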

A sketch of the EventBridge rule we used repeatedly:

{
  "Name": "proposal-published-v2",
  "EventPattern": {
    "source": ["app.solargenix"],
    "detail-type": ["proposal.published:v2"]
  },
  "Targets": [{
    "Arn": "arn:aws:lambda:...:function:proposal-emailer",
    "RetryPolicy": { "MaximumRetryAttempts": 185, "MaximumEventAgeInSeconds": 86400 },
    "DeadLetterConfig": { "Arn": "arn:aws:sqs:...:dlq-proposal-emailer" }
  }]
}

Each target had its own DLQ (e.g., dlq-proposal-emailer). Ownership was never ambiguous.
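
For reference, the same rule expressed through the aws-sdk-go-v2 EventBridge API looks like the sketch below. In practice this kind of definition usually lives in CloudFormation or Terraform; the ARNs are placeholders.

package infra

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge/types"
)

// ensureRule creates the rule and its single target, with a per-target DLQ and retry policy.
func ensureRule(ctx context.Context, cli *eventbridge.Client, bus string) error {
    _, err := cli.PutRule(ctx, &eventbridge.PutRuleInput{
        Name:         aws.String("proposal-published-v2"),
        EventBusName: aws.String(bus),
        EventPattern: aws.String(`{"source":["app.solargenix"],"detail-type":["proposal.published:v2"]}`),
    })
    if err != nil {
        return err
    }

    _, err = cli.PutTargets(ctx, &eventbridge.PutTargetsInput{
        Rule:         aws.String("proposal-published-v2"),
        EventBusName: aws.String(bus),
        Targets: []types.Target{{
            Id:  aws.String("proposal-emailer"),
            Arn: aws.String("arn:aws:lambda:...:function:proposal-emailer"), // placeholder
            RetryPolicy: &types.RetryPolicy{
                MaximumRetryAttempts:     aws.Int32(185),
                MaximumEventAgeInSeconds: aws.Int32(86400),
            },
            DeadLetterConfig: &types.DeadLetterConfig{
                Arn: aws.String("arn:aws:sqs:...:dlq-proposal-emailer"), // placeholder
            },
        }},
    })
    return err
}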

The small Go publisher

Producers do one thing: put a clean event on the bus. No tight loops. No cleverness.

// Producer-side publisher. For brevity the AWS config is loaded per call;
// in production the EventBridge client is constructed once at startup.
package event

import (
    "context"
    "encoding/json"
    "fmt"
    "time"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/config"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge/types"
)

type Event struct {
    ID         string      `json:"id"`
    Source     string      `json:"source"`
    Type       string      `json:"type"`    // e.g., "proposal.published"
    Version    string      `json:"version"` // e.g., "v2"
    OccurredAt time.Time   `json:"occurredAt"`
    Data       interface{} `json:"data"`
}

func Publish(ctx context.Context, bus, region string, ev Event) error {
    cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion(region))
    if err != nil { return err }
    cli := eventbridge.NewFromConfig(cfg)

    // "type:version" as the detail-type lets rules route on the major version
    // without parsing the payload.
    detailType := fmt.Sprintf("%s:%s", ev.Type, ev.Version)
    payload, err := json.Marshal(ev)
    if err != nil { return err }

    _, err = cli.PutEvents(ctx, &eventbridge.PutEventsInput{
        Entries: []types.PutEventsRequestEntry{{
            EventBusName: &bus,
            Source:       aws.String(ev.Source),
            DetailType:   aws.String(detailType),
            Time:         aws.Time(ev.OccurredAt),
            Detail:       aws.String(string(payload)),
        }},
    })
    return err
}

That’s it. The repository or handler constructs the event with the right version and calls Publish.
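
For context, a typical call site looks something like this (a sketch; the handler, repository, and Proposal fields are illustrative, and NewEvent is the constructor sketched earlier):

// Usage sketch: record the state change, publish the event, return fast.
// No synchronous fan-out to downstream services from the request path.
func (h *ProposalHandler) PublishProposal(ctx context.Context, p Proposal) error {
    if err := h.repo.MarkPublished(ctx, p.ID); err != nil {
        return err
    }
    ev := NewEvent("proposal.published", "v2", map[string]string{
        "proposalId":  p.ID,
        "accountId":   p.AccountID,
        "publishedBy": p.PublishedBy,
        "currency":    p.Currency,
    })
    return Publish(ctx, h.bus, h.region, ev)
}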

Observability that catches real problems

We avoided dashboards that look great in demos and tell you nothing on-call. The few signals that consistently matter:

  • DLQ depth and age per target. Alerts when DLQ depth ≥ 5 for 3 minutes, and when max age > 10 minutes. That split catches both bursts and stuck poison messages.
  • EventBridge target failure rate. Metric math on Invocations vs. FailedInvocations per rule target. We page on a non-zero failure rate sustained for 5 minutes.
  • Read p95 for hot endpoints. Because that’s what users feel. We annotate deploys so we can correlate regressions with changes.
  • Projection lag. If the read model is stale beyond our acceptable TTL, we want to know before users do.

We also added one "boring but lifesaving" alarm on idempotency table write failures. If the table throttles or permissions drift, duplicates slip through.
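
As one concrete example, the "DLQ depth ≥ 5 for 3 minutes" alert above is a single CloudWatch alarm on the queue's ApproximateNumberOfMessagesVisible metric. A sketch with aws-sdk-go-v2 follows; the alarm naming and SNS topic are illustrative, and the age alert uses ApproximateAgeOfOldestMessage the same way.

package alarms

import (
    "context"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch"
    "github.com/aws/aws-sdk-go-v2/service/cloudwatch/types"
)

// putDLQDepthAlarm: alert when the DLQ holds >= 5 visible messages for 3 minutes.
func putDLQDepthAlarm(ctx context.Context, cw *cloudwatch.Client, queueName, snsTopicArn string) error {
    _, err := cw.PutMetricAlarm(ctx, &cloudwatch.PutMetricAlarmInput{
        AlarmName:  aws.String("dlq-depth-" + queueName), // illustrative naming
        Namespace:  aws.String("AWS/SQS"),
        MetricName: aws.String("ApproximateNumberOfMessagesVisible"),
        Dimensions: []types.Dimension{
            {Name: aws.String("QueueName"), Value: aws.String(queueName)},
        },
        Statistic:          types.StatisticMaximum,
        Period:             aws.Int32(60), // 1-minute periods...
        EvaluationPeriods:  aws.Int32(3),  // ...breaching for 3 of them
        Threshold:          aws.Float64(5),
        ComparisonOperator: types.ComparisonOperatorGreaterThanOrEqualToThreshold,
        AlarmActions:       []string{snsTopicArn},
    })
    return err
}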

Results and one scar-tissue story

Numbers first:

  • p95 fast-path reads: 48 ms -> 11 ms after shifting hot reads to a key lookup or cache.
  • DLQ rate: < 0.1% over the last 30 days. Replays are scripted and dull (which is the point).
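
"Scripted and dull" means roughly the sketch below: receive from the DLQ, re-publish the original envelope to the bus, delete the message. This assumes, like the consumer skeleton earlier, that the message body is our envelope JSON; the real script unwraps whatever layer a particular DLQ adds, paginates, and rate-limits.

package replay

import (
    "context"
    "encoding/json"
    "fmt"

    "github.com/aws/aws-sdk-go-v2/aws"
    "github.com/aws/aws-sdk-go-v2/service/eventbridge"
    ebtypes "github.com/aws/aws-sdk-go-v2/service/eventbridge/types"
    "github.com/aws/aws-sdk-go-v2/service/sqs"
)

// replayBatch drains one batch from a DLQ and re-publishes each event to the bus.
func replayBatch(ctx context.Context, q *sqs.Client, eb *eventbridge.Client, queueURL, bus string) error {
    out, err := q.ReceiveMessage(ctx, &sqs.ReceiveMessageInput{
        QueueUrl:            aws.String(queueURL),
        MaxNumberOfMessages: 10,
    })
    if err != nil {
        return err
    }
    for _, m := range out.Messages {
        var ev Event // the shared envelope struct
        if err := json.Unmarshal([]byte(*m.Body), &ev); err != nil {
            return err // leave the message in place for manual inspection
        }
        detailType := fmt.Sprintf("%s:%s", ev.Type, ev.Version)
        if _, err := eb.PutEvents(ctx, &eventbridge.PutEventsInput{
            Entries: []ebtypes.PutEventsRequestEntry{{
                EventBusName: aws.String(bus),
                Source:       aws.String(ev.Source),
                DetailType:   aws.String(detailType),
                Detail:       aws.String(*m.Body),
            }},
        }); err != nil {
            return err
        }
        // Delete only after the re-publish succeeded; idempotent consumers tolerate repeats.
        if _, err := q.DeleteMessage(ctx, &sqs.DeleteMessageInput{
            QueueUrl:      aws.String(queueURL),
            ReceiptHandle: m.ReceiptHandle,
        }); err != nil {
            return err
        }
    }
    return nil
}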

Now the scar tissue.

During one cutover we assumed events in a certain flow arrived in global order. They didn’t. A burst produced duplicates in a consumer that sent emails, and some customers received two messages. The fix was straightforward once we accepted the delivery model: we keyed idempotency by eventId + recipient and set a 24h TTL. The consumer became "boring." On-call became boring with it.

What we’d do again

  • Clean envelopes. That little JSON contract carried more weight than any library choice.
  • type:version in detail-type. Routing by major version without parsing payloads is a gift to operations.
  • Per-target DLQs. Ownership and blame become obvious. It shortens incidents.
  • A few boring alarms. DLQ depth/age, failure rate per target, p95 reads, projection lag. No noise.

What we’d change next time

  • Earlier "projection first" thinking. We could have moved hot reads to projections sooner and earned the p95 win earlier.
  • Stricter contracts for optional fields. We allowed too many "maybe present" fields that crept into business logic. I’d lock those down sooner.
  • Cost/ops trade-off, acknowledged. Managed EventBridge beats self-hosted when you’re small or moving fast. At higher scale, some teams roll their own bus for cost and control. For us, the ops time saved was worth the bill. Your curve may differ.

Notes on Clean Architecture (brief, practical)

We didn’t treat Clean Architecture as a religion. We kept it to two rules:

  1. Domain code doesn’t depend on the transport. Handlers call a use-case; use-cases call repositories; repositories hide storage and messaging.
  2. No cross-domain imports. If two domains need to coordinate, they publish/consume events. They don’t import each other’s packages and reach in.

It kept the Go services small and readable, and it made moving logic between processes almost trivial.
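
A compressed illustration of both rules, with hypothetical package and interface names:

// package proposal (domain/use-case layer): no HTTP, no SQS, no imports of other domains.
package proposal

import "context"

type Repository interface {
    MarkPublished(ctx context.Context, proposalID string) error
}

type EventPublisher interface {
    ProposalPublished(ctx context.Context, proposalID, accountID string) error
}

type PublishProposal struct {
    Repo   Repository
    Events EventPublisher
}

// Execute is what the HTTP handler (or any other transport) calls.
// Coordination with other domains happens through the published event,
// never by importing their packages and reaching in.
func (uc PublishProposal) Execute(ctx context.Context, proposalID, accountID string) error {
    if err := uc.Repo.MarkPublished(ctx, proposalID); err != nil {
        return err
    }
    return uc.Events.ProposalPublished(ctx, proposalID, accountID)
}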

We didn’t stop the world; we changed how it moved. The monolith is smaller now, the teams are less entangled, and the hot paths are faster. The nice part is that none of this requires heroics, just consistent patterns and the discipline to apply them.

If there’s interest, I’ll follow up with the exact fan-out patterns we use and how we handle backfill/replay safely. If this was useful, follow my profile and the SolarGenix.ai page to catch the next deep dives and benchmarks as they land.

