Or should it be called "How OpenX and Revive Adserver suck hard"?

TL;DR

Don't put things you don't know well into production without making sure everyone on the team understands them, or at least that the risks are properly mapped.

I spent the whole past week trying to solve a major problem on a Revive Adserver -formerly known as OpenX- backed mobile portal, and things were just awful: from not being able to find enough resources online or offline, to failing to understand the architectural basics of the product without reading the whole GitHub repo.

It was our team's first major deploy on AWS, so we were both anxious and excited, but we weren't expecting major problems since everything had been tested at least a week before.

Auto-scaling was in place and everything went smoothly. We noticed that the adserver domain wasn't on Route53, so we registered a new one and moved to it. Revive Adserver is very finicky about this: after changing the domain the ads are served from, you have to change two config files that point to domain-specific settings.
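For reference, here is roughly how that check and change could be scripted. This is just a sketch: the install path and the old/new hostnames (a.domain.com, b.domain.com) are placeholders, and the exact names of Revive's per-host config files depend on your setup.

# List every config file under the adserver install that still
# references the old delivery domain (placeholder: a.domain.com).
grep -rl 'a.domain.com' /var/www/adserver/var/

# Rewrite those references to the new domain (placeholder: b.domain.com),
# keeping a .bak copy of every file touched.
grep -rl 'a.domain.com' /var/www/adserver/var/ | \
    xargs sed -i.bak 's/a\.domain\.com/b.domain.com/g'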

Over the following couple of days we upgraded every user to the new portal and everything went smoothly, err... kind of.

For unknown reasons, whenever we restarted or auto-scaled the backend servers (the ones serving the ads), major crap happened: broken ads, ad zones not loading, and the whole portal becoming unusable while we kept getting more and more traffic.

First suspect: Cache locking

Since we had so much concurrent access (as many as 200 users on each backend), we suspected the cache rewrite was hitting a race condition and loading garbled content. So we added fastcgi cache directives to the nginx configuration to deal with it.

location ~* foo(html|wml)\.(?:php)$ {
    ...
    # Skip the cache when $no_cache is set (the variable is defined elsewhere in the config).
    fastcgi_no_cache $no_cache;
    fastcgi_cache_bypass $no_cache;
    # Let only one request at a time populate a given cache entry.
    fastcgi_cache_lock on;
    fastcgi_cache_lock_timeout 3s;
    fastcgi_cache microcache;
    fastcgi_cache_key "$server_name|$request_uri|$query_string";
    fastcgi_cache_valid 200 5m;
    fastcgi_max_temp_file_size 1M;
    # Serve a stale entry while a fresh one is being regenerated.
    fastcgi_cache_use_stale updating;
    ...
}

Restarted and nothing changed.

Access and Error logging

We went through the nginx logs looking for the cause of the mysterious malfunction: a huge number of 500s, and 200s with a sinister response length of 31 bytes. The problem was that we hadn't added access_log off or log_not_found off to most of the locations, so the logs were full of thousands of lines of garbage.
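This is more or less what we ended up adding; a minimal sketch assuming the usual static-asset locations, so adjust the pattern to whatever your portal actually serves:

location ~* \.(?:ico|css|js|gif|jpe?g|png|woff2?)$ {
    # Don't log hits or "file not found" errors for static assets;
    # they were drowning out the delivery errors we actually cared about.
    access_log off;
    log_not_found off;
}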

After fixing that, we were still unable to make sense of the 50x responses, or the 200s with the puzzling, magical 31-byte response length.

Possible causes

So we asked ourselves what could possibly break the whole server. After a couple of tests we were able to reproduce the 31-byte response and answer the first question: those bytes were the response for "ad not found / not properly configured".
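If you want to reproduce that kind of check, something along these lines does the job. The host, delivery path and zoneid are placeholders for whatever your Revive install actually exposes:

# Request one ad zone and print only the size of the response body in bytes.
curl -s -o /dev/null -w '%{size_download}\n' \
    'http://ads.example.com/www/delivery/afr.php?zoneid=1'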

After checking the ad/zone/advertiser inventory one more time, I was skeptical that I would be able to fix the issue.

Being creative/desperate

After redrawing the architecture and re-explaining it to a coworker (Rubber Ducking, or the Teddy Bear debugging technique: explain your code to someone else), I remembered what we had done the previous Friday: moving the domain from a to b. So I quickly ran grep -rHl 'a.domain.com' . from the root of the adserver www folder and voilà, there were thousands of cache files referencing the old domain. One quick rm -rf * proved the point: the whole Revive Adserver came back from the dead just with that.
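In hindsight, a less nuclear version of the same fix would have been to remove only the cache files that still referenced the old domain. A sketch, assuming the stale entries live under Revive's var/cache directory and that a.domain.com stands in for the old hostname:

# From the adserver root: list the delivery-cache files that still
# reference the old domain, then delete only those.
grep -rlZ 'a.domain.com' var/cache/ | xargs -0 rm -f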

I ended up filing a bug for Revive Adserver on GitHub about it.

Have you ever found yourself in a situation like this? How did you overcome it?