When people think of data infrastructure, they often picture Python for data pipelines, Kafka for message queues, Snowflake or Databricks for the data warehouse/data lake, and tons of other emerging data tools. However, in the context of a fast-growing startup, those advanced technologies can be overkill, given the steep learning curve, the added complexity, and the cumulative costs.
My startup is an EnergyTech company that collects data from energy IoT devices (solar inverters, smart meters, etc.). Data is powerful and can be used in many different ways, but for us the first and most important use case is billing, where not only does every decimal digit have to be exactly right, but duplicates and missing data are also unacceptable. That meant I not only had to move fast, I had to move right, laying reliable groundwork so that future engineers would keep things tidy and follow the conventions I set. That pushed me to favor a monorepo codebase with a strongly typed language and a relational database over maximum flexibility. Out of all the options, I chose Golang over Python, and PostgreSQL with the TimescaleDB extension over other databases.
I'm not claiming that this stack is "right" or "optimal". I'm explaining the rationale behind my choices, what I gained and what I gave up, and how I improved things over time. If you're an engineer under the pressure and constraints of a startup, I hope this helps you make better calls.
For some business context, we had around 30–40 installation sites at the time, each equipped with 1–2 IoT devices. Each device produced one reading every 5 or 30 minutes. Some devices had proper (or almost proper) APIs; others only exposed data on old-fashioned dashboards, so scraping was my only option.
My initial design was deliberately simple: an ETL pipeline in Go plus one database. In the pipeline, each device gets its own runner that ingests and processes data and pushes records onto a shared Go channel, and a single database writer drains that channel and persists everything to Postgres. The Go channel, along with goroutines, was my lightweight substitute for a full-blown asynchronous queue like Kafka at this stage.
device A         device B         device C
   │                │                │
   ├────────────────┼────────────────┤
   ▼                ▼                ▼
          [ shared Go channel ]
                    │
                    ▼
              [ DB writer ]
                    │
                    ▼
               PostgreSQL

Why I chose Go
The startup was growing 10–20x, so I needed an ETL pipeline that could accommodate that scale. Given the requirements of decimal accuracy and strictly no duplicates for billing, I wanted a strongly typed language. Go stood out with its static typing, simple build and run commands, excellent performance, and a simple concurrency model.
- Type safety at compile time: As the only engineer who built (and rebuilt, many times) the entire codebase, I didn't want it to break because of silly mistakes like wrong field names, wrong types, or missing commas that would force me to scan the whole codebase to track down. This proved particularly useful in the transform layer, where each data source required different validations, normalizations, and calculations.
- Simple deployment: Since I had up to 9 different device types, I gave each source its own job, which meant multiple main programs. The good thing about Go is that its build is simple and lightweight enough that my Dockerfile stayed clean and adding new jobs was easy. I first download all dependencies with go mod tidy && go mod download, then build the programs with go build, and finally copy only the binaries into the final image to keep it minimal and lightweight:
FROM golang:1.24-alpine AS builder
WORKDIR /app
COPY go.mod go.sum ./
COPY cmd/pipeline ./cmd/pipeline
COPY internal/pipeline ./internal/pipeline
COPY internal/storage ./internal/storage
RUN go mod tidy && go mod download
RUN go build -tags=job1 -o pipeline-job1 ./cmd/pipeline
RUN go build -tags=job2 -o pipeline-job2 ./cmd/pipeline
RUN go build -tags=job3 -o pipeline-job3 ./cmd/pipeline
FROM alpine:3.19
WORKDIR /app
RUN apk add --no-cache tzdata
COPY --from=builder /app/pipeline-job1 /app/pipeline-job1
COPY --from=builder /app/pipeline-job2 /app/pipeline-job2
COPY --from=builder /app/pipeline-job3 /app/pipeline-job3

- Simple concurrency: After ingesting from multiple data sources, I loaded them all into one central database, so I needed parallel fetch/transform without much hassle. The combo of goroutines + channels let me fan out quickly while keeping the code readable.
var (
    // two shared channels are set globally for the jobs to access
    meterWriterChannel    chan schema.MeterData
    inverterWriterChannel chan schema.InverterData
)

func main() {
    ctx, cancel := setup()
    defer cancel()
    createMeterRunners()
    createInverterRunners()
    // Start producers; the jobs push their data into the shared Go channels
    for _, r := range meterRunners { go r.RunJob(ctx) }
    for _, r := range inverterRunners { go r.RunJob(ctx) }
    // Block until shutdown
    waitForShutdown(ctx)
}

func setup() (context.Context, context.CancelFunc) {
    ctx, cancel := context.WithCancel(context.Background())
    db, err := openDB()
    if err != nil { log.Fatalf("open db: %v", err) }
    meterWriterChannel = make(chan schema.MeterData, 3000)
    inverterWriterChannel = make(chan schema.InverterData, 3000)
    // a single DB writer drains both channels and writes into the DB
    go func() {
        for {
            select {
            case <-ctx.Done():
                return
            case rec, ok := <-meterWriterChannel:
                if !ok { return }
                _ = db.MeterTable.Insert(rec)
            case rec, ok := <-inverterWriterChannel:
                if !ok { return }
                _ = db.InverterTable.Insert(rec)
            }
        }
    }()
    return ctx, cancel
}

Things I didn't like about Go
1. Null/error handling was more verbose than expected
Go functions typically return a value and an error, which compels the caller to handle the error. The intent, I believe, is good: it reminds you to handle every error and edge case. Along the way, though, it added noise, since I spent three extra lines handling errors for every line of logic.
data, err := ingest()
if err != nil {
// handle error here
}

2. Channels are not a durability boundary.
Channels were fine at first; however, as we scaled, I started seeing gaps in the data. They were hard to catch because nothing ever surfaced as an error. After all, a channel is just an in-memory queue inside one process: if the process crashes or gets interrupted for any reason, the items sitting in the channel at that moment are simply gone.
3. Too many database connections
We kept adding more data sources and new schemas, all feeding into one central database. The volume kept growing, the queues started to backlog, and inserts stayed slow and expensive. Go's standard database layer (database/sql) opens more connections to relieve that backpressure unless you explicitly cap the pool. The pool fanned out, the database struggled to take on more concurrent work, and I began seeing latency spikes and transient errors like "too many clients."
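For reference, capping the pool is only a few lines on *sql.DB. The sketch below is a minimal example, not my production setup; the driver, DSN, and limits are illustrative placeholders:

package main

import (
    "database/sql"
    "log"
    "time"

    _ "github.com/jackc/pgx/v5/stdlib" // assumed Postgres driver; any database/sql driver works
)

func openDB(dsn string) (*sql.DB, error) {
    db, err := sql.Open("pgx", dsn)
    if err != nil {
        return nil, err
    }
    // Cap the pool so a write backlog can't fan out into hundreds of connections.
    db.SetMaxOpenConns(10)                  // hard cap on concurrent connections
    db.SetMaxIdleConns(5)                   // keep a few warm connections around
    db.SetConnMaxLifetime(30 * time.Minute) // recycle connections periodically
    if err := db.Ping(); err != nil {
        return nil, err
    }
    return db, nil
}

func main() {
    db, err := openDB("postgres://user:pass@localhost:5432/energy")
    if err != nil {
        log.Fatalf("open db: %v", err)
    }
    defer db.Close()
}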
4. Circular dependencies
Go is (in)famous for disallowing circular dependencies, a constraint that many people, even experienced devs, still run into. It means you have to be careful with how you lay out packages. Initially, my layout was simple: a classic ETL with ingest → transform → load. However, as the logic got more complex, for example comparing incoming data with already-loaded data to detect missing/malformed/wrong records, I started running into circular dependencies.
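A simplified illustration of how that happens (the module path and package layout here are stand-ins, not my exact code; by definition this doesn't compile):

// pipeline/transform/transform.go
package transform

import "example.com/repo/storage" // transform wants already-loaded data for comparison

// storage/models.go
package storage

import "example.com/repo/pipeline/transform" // storage wants transform's normalized types

// go build: import cycle not allowed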
How I improved on the cons
I didn't treat these problems as pure disadvantages; instead, I took the friction as a push toward better structure and more reliable patterns. Here's how I improved:
1. Only handling err checks where necessary
I stopped sprinkling checks everywhere and focused only on the places where handling the error adds value; everything else just gets wrapped with context and propagated upward. I also used the Go formatter so the code would look cleaner.
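In practice that mostly means wrapping with %w at each layer and deciding what to do (log, retry, skip) in exactly one place. A rough sketch; ingestMeter, fetch, normalize, and the runner here are simplified stand-ins for my actual jobs:

func ingestMeter(ctx context.Context, deviceID string) (schema.MeterData, error) {
    raw, err := fetch(ctx, deviceID)
    if err != nil {
        // no logging here: just add context and pass the error up
        return schema.MeterData{}, fmt.Errorf("fetch meter %s: %w", deviceID, err)
    }
    data, err := normalize(raw)
    if err != nil {
        return schema.MeterData{}, fmt.Errorf("normalize meter %s: %w", deviceID, err)
    }
    return data, nil
}

// The runner is the one place that decides what an error means:
// log it, retry it, or skip this reading.
func (r *Runner) RunJob(ctx context.Context) {
    data, err := ingestMeter(ctx, r.deviceID)
    if err != nil {
        logger.Errorf("job %s: %v", r.deviceID, err)
        return
    }
    meterWriterChannel <- data
}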
2. Keep channels for simple setup, but add guardrails
I kept the channels, since we hadn't reached a scale that justified Kafka (yet). Instead, I added retries to each job and made sure the database tables had primary keys so that duplicates couldn't slip in. I also added missing-data detection in between, for example comparing expected vs. observed timestamps, as sketched below.
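A minimal sketch of that check, assuming a fixed reporting interval per device (the function name and the truncation rule are my illustration, not the exact production code):

// findGaps returns the expected timestamps in [start, end] that have no matching reading.
// interval is the device's reporting cadence, e.g. 5 or 30 minutes.
func findGaps(start, end time.Time, interval time.Duration, observed []time.Time) []time.Time {
    seen := make(map[time.Time]bool, len(observed))
    for _, ts := range observed {
        // truncate so a slightly-late reading still counts for its slot
        seen[ts.Truncate(interval)] = true
    }
    var gaps []time.Time
    for ts := start.Truncate(interval); !ts.After(end); ts = ts.Add(interval) {
        if !seen[ts] {
            gaps = append(gaps, ts)
        }
    }
    return gaps
}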
3. Utilize batch inserts and cap the connections
Single-row inserts were slow and costly, so I switched to batch inserts, grouped either by time or by record count, and capped the connection pool so the writer couldn't overwhelm the database. Batching also strengthened the guardrails above: by wrapping each batch in a transaction, all writes either succeed together or fail together, with no half-written batches (see the InsertBatch sketch after the flush loop below).
batch := make([]schema.MeterData, 0, batchSize)
var batchStart time.Time
flush := func(reason string) {
if len(batch) == 0 { return }
if err := db.MeterTable.InsertBatch(ctx, batch); err != nil {
logger.Errorf("flush failed reason=%s count=%d err=%v", reason, len(batch), err)
}
batch = batch[:0]
batchStart = time.Time{}
}
for {
    select {
    case <-ctx.Done():
        flush("shutdown"); return
    case data, ok := <-meterWriterChannel:
        if !ok { flush("channel_closed"); return }
        if len(batch) == 0 { batchStart = time.Now() }
        batch = append(batch, data)
        if len(batch) >= batchSize { flush("size") }
        if time.Since(batchStart) >= maxWait { flush("time") }
    }
}
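InsertBatch itself is just a transaction around the inserts. Here's a simplified sketch using database/sql; the table, columns, and struct fields are stand-ins, and the ON CONFLICT DO NOTHING clause leans on the primary key to silently drop duplicates:

func (t *MeterTable) InsertBatch(ctx context.Context, batch []schema.MeterData) error {
    tx, err := t.db.BeginTx(ctx, nil)
    if err != nil {
        return fmt.Errorf("begin tx: %w", err)
    }
    defer tx.Rollback() // no-op once the transaction has been committed

    stmt, err := tx.PrepareContext(ctx,
        `INSERT INTO meter_data (device_id, ts, energy_kwh)
         VALUES ($1, $2, $3)
         ON CONFLICT (device_id, ts) DO NOTHING`)
    if err != nil {
        return fmt.Errorf("prepare insert: %w", err)
    }
    defer stmt.Close()

    for _, rec := range batch {
        if _, err := stmt.ExecContext(ctx, rec.DeviceID, rec.Timestamp, rec.EnergyKWh); err != nil {
            return fmt.Errorf("insert row: %w", err) // deferred Rollback undoes the whole batch
        }
    }
    return tx.Commit()
}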
4. Embrace "no circular dependencies" and layer the code
I gradually came to see the no-cycles constraint as a good thing. Once I started thinking of packages as layers and sublayers, things made more sense. When I needed to compare incoming data with what's already stored, I stopped treating storage as the final step of the ETL flow and started treating it as a foundational layer that downstream apps can read from. The ETL pipeline is just one consumer of that layer, the same as any other app.
repo/
├─ cmd/ # main programs (no business logic)
│ ├─ job1/main.go
│ ├─ job2/main.go
│ └─ job3/main.go
├─ pipeline/ # ETL layers
│ ├─ ingest/ # fetch/scrape/readers
│ ├─ transform/ # validate/normalize/calc
│ └─ runner/ # orchestration, batching, retries
├─ storage/ # DB boundary (SQL, models, upserts)
└─ downstream_apps/ # read models/APIs from storage

In the next part, I'll cover the other half of the stack: PostgreSQL as both the ingestion sink and the central source of truth. It's definitely not the usual "data warehouse," but I chose it under the same constraints that led me to Go. PostgreSQL handles time-series surprisingly well with the Timescale extension. I'll walk through where the trade-offs showed up and how I made that work.