GoRead2 - Chapter 3: Wherein This All Costs Real Money

This is the third in a series of posts wherein I attempt to recount the history of GoRead2 as it approaches a state in which I might actually try to share it more broadly. Previous installments can be found here and here.

I bet there is a moment in the lifecycle of most cloud-hosted apps when a quick glance at the billing dashboard permanently changes your relationship with the project. The moment reframes all those architectural decisions made in the abstract, and all those database calls that seemed fine locally, by giving them a cost in actual dollars per month. For GoRead2, that moment occurred in October.

Google Cloud Datastore is a NoSQL document store, and it bills primarily by operation—specifically, by the number of entity reads. The free tier is generous enough that a hobby project with light traffic stays well within it. GoRead2 was not going to stay within it, because GoRead2 had a design decision baked in from the very first commit: GetAllArticles().

GetAllArticles() did exactly what it said: it fetched every article in the database. No pagination, no limit, no filter—just all of them, in one query. It was the kind of method you write when you’re building a single-user app and your test data is twenty articles you added yourself. But when you are a multi-user app on Datastore and some of your users subscribe to active news feeds, “all articles” becomes a number that arrives with an invoice attached.

The October cost-reduction push hit on several fronts at once. Session data was being fetched from Datastore on every authenticated request. A commit titled “Add session caching to reduce Firestore read costs by 40-60%” laid out the math with the specificity of someone who had just seen a billing dashboard: “With 100 requests/day per user, this caused ~43M reads/year.” An in-memory session cache with a 10-minute TTL fixed this. A few hours later: “Add unread count caching to reduce Firestore reads by 50-70%.” The commit message tracked the improvement down to the individual page load: “First page load: 121 reads (cache miss). Subsequent navigation: 0 reads (cache hits).”

This is the kind of precision you develop when your shit starts costing real money.

Also in October: min_instances in app.yaml was set to 1, so App Engine kept an instance running at all times, even when nobody was using the app. This costs money continuously whether anyone is logged in or not. Changing it to 0 and letting the app scale down when idle was a one-line fix: min_instances: 0. But there’s a tradeoff: the first request after the app has been idle triggers a cold start, adding a second or two of latency. The solution to this was a loading screen that appears during cold starts, which is either elegant product thinking or the infrastructure equivalent of putting a rug over a hole in the floor, depending on your mood.

Then, November 2nd was a focused demolition of everything wasteful, conducted with the energy of someone who had just opened four billing alerts in a row. First up: “Remove unbounded GetAllArticles() method to reduce costs.” The commit message estimated the expected savings at “$300-500/month in Datastore costs.” For context, this was a method that, upon investigation, was not actually called anywhere in production, only in a test. A test method that nobody had ever deleted was, in theory, costing hundreds of dollars a month merely by existing and being occasionally invoked during test runs against a staging environment. Out it went.

“Optimize feed refresh strategy to reduce Google Cloud costs” halved the cron job frequency, disabled a background scheduler loop that had been keeping instances warm all day, and added smart feed prioritization, so feeds that post frequently get checked more often and feeds that haven’t posted in months get checked less often. Estimated savings: “$30-60/month.” Some other winners: “Remove wasteful +50 article fetching from pagination query,” “Add caching for GetAllUserFeeds(),” and “Implement cursor-based pagination to reduce Datastore costs.” Cursor-based pagination is the correct solution for Datastore, which charges per entity read whether you use the results or not, meaning “fetch 70 and discard 20” costs the same as “fetch 70 and keep all of them,” which is a fun billing model.
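To make the billing arithmetic concrete, here is a toy comparison of over-fetching versus cursor-based paging. It is a sketch under loud assumptions: fakeStore is an in-memory stand-in that simply counts one read per entity returned, it is not the Datastore client, and the integer cursor is a simplification (real Datastore cursors are opaque tokens). The page size of 20 is arbitrary; the extra 50 echoes the “+50” in the commit title.

```go
package main

import "fmt"

// fakeStore is an in-memory stand-in for Datastore that counts one
// billable read for every entity a query returns.
type fakeStore struct {
	articles []string // assumed to be sorted newest-first already
	reads    int
}

func newStore(n int) *fakeStore {
	s := &fakeStore{}
	for i := 0; i < n; i++ {
		s.articles = append(s.articles, fmt.Sprintf("article-%d", i))
	}
	return s
}

// fetch returns up to limit articles starting at cursor, plus the cursor
// for the following page. Each returned entity is counted as a read.
func (s *fakeStore) fetch(cursor, limit int) (page []string, next int) {
	if cursor < 0 {
		cursor = 0
	}
	end := cursor + limit
	if end > len(s.articles) {
		end = len(s.articles)
	}
	if cursor > end {
		cursor = end
	}
	page = s.articles[cursor:end]
	s.reads += len(page)
	return page, end
}

// overFetchPage mimics the wasteful approach: read pageSize+extra entities,
// then throw the extras away. The extras are still billed.
func overFetchPage(s *fakeStore, cursor, pageSize, extra int) []string {
	page, _ := s.fetch(cursor, pageSize+extra)
	if len(page) > pageSize {
		page = page[:pageSize]
	}
	return page
}

// cursorPage reads exactly one page and hands back the cursor to resume from.
func cursorPage(s *fakeStore, cursor, pageSize int) ([]string, int) {
	return s.fetch(cursor, pageSize)
}

func main() {
	wasteful := newStore(200)
	overFetchPage(wasteful, 0, 20, 50)

	frugal := newStore(200)
	_, next := cursorPage(frugal, 0, 20)

	fmt.Println("over-fetch reads:", wasteful.reads) // prints "over-fetch reads: 70"
	fmt.Println("cursor reads:", frugal.reads)       // prints "cursor reads: 20"
	fmt.Println("next cursor:", next)                // prints "next cursor: 20"
}
```

Both approaches show the user the same 20 articles. Only one of them pays for 70.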

Three days later: “Implement smart feed prioritization to reduce networking costs by 20-40%.” This one used a hash of the feed ID to spread updates evenly across a time window, so that not every feed was being checked simultaneously at the top of each hour. Let’s try not to annoy the servers of every blog and news site to which the app subscribes. You’re welcome, internet.

By mid-November, the app had a caching layer, bounded queries, intelligent feed scheduling, and a billing dashboard full of graphs trending in a reassuring direction.

And then, in February, it happened again.

The February incident is a masterpiece of cascading unintended consequences. The commit that uncovered it was titled “Fix catastrophic cost explosion by eliminating background goroutines,” and the root cause analysis in the message identified three separate changes that had each, independently, made the same mistake in different ways.

First, our migration to App Engine Standard (from Flexible), undertaken with the stated goal of cost reduction, had accidentally removed a CPU utilization target from app.yaml, causing App Engine’s auto-scaler to spin up way more instances than necessary at the slightest sign of traffic.

Second, an earlier fix had changed session cleanup from synchronous to asynchronous (go sm.DeleteSession() instead of sm.DeleteSession()) to avoid blocking requests. Sensible in principle. What it actually did was spawn a goroutine on every logout, and goroutines keep their host instance alive. App Engine Standard scales to zero when there’s nothing running. “Nothing running” does not include “a goroutine that’s almost done cleaning up a session.”

Third, the SessionManager had three background cleanup loops running on tickers—one for expired database sessions, one for the session cache, one for OAuth state tokens. All three fired up when the app started. All three kept the instance alive indefinitely, preventing any scale-to-zero. This had always been true; it just hadn’t mattered until the Flexible-to-Standard migration, because Flexible doesn’t scale to zero anyway.

The result: costs for a single user went from $6.23/month to $20.53/month. The fix was removing all three background goroutines and moving cleanup to cron jobs instead, reversing the async session deletion, and adding the CPU utilization target back. One line in app.yaml, a handful of goroutine removals, and a new cron entry.

The app was back to $6/month per user, which is still probably too high for commercial success but at least under control. The obvious lesson here is that keeping an App Engine Standard instance “just a little bit busy” costs the same as keeping it fully busy, and that goroutines are the easiest way to accidentally do this. This lesson is now logged in a commit message that will outlast the billing receipts.

Cloud billing is a good teacher. It charges per lesson, but you don’t forget what you learned.

Next: we learn all about security.