It was 2:00 AM on a Tuesday. I was on call, nursing a cold brew and watching the dashboards for Stratus Finance, a global payment processor. Our web cluster was pristine: six origin servers humming behind three Web Application Proxy (WAP) servers. The WAPs handled SSL offloading, pre-authentication, and acted as a reverse proxy for our customer-facing APIs.
A cluster is only as strong as its weakest node. Redundancy isn't about keeping every machine breathing; it's about keeping the right machines healthy. Sometimes, removing a server isn't a loss of capacity—it's an amputation of a chronic disease.
That 0.5% of failed payments? It wasn't random packet loss. It was the cluster waiting for a zombie to vote.
"Yes. Also, we have a rogue monitoring script you should know about."
The business didn't see 0.5%. They saw "99.95% uptime." But I saw the angry tweets. I saw the support tickets: "Card declined. Please try again." Those weren't bank declines. Those were wap-03 swallowing the requests whole.
As I prepared to shut down the virtual machine, I decided to tail the legacy logs one last time: tail -f /var/log/wap/traffic.log on wap-03.
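If you wanted to do the same hunt programmatically, the idea is to pair requests with responses and surface the ones that never got answered. This is a minimal sketch; the log format below (REQ/RESP lines keyed by an id field) is an assumption for illustration, since the story only shows the tail command, not the file's contents.

```python
def swallowed_requests(lines):
    """Return ids of requests that never received a matching response.

    Assumes a hypothetical log format like:
        "02:14:01 REQ id=abc123"
        "02:14:02 RESP id=abc123 200"
    """
    pending = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 3 and parts[1] == "REQ":
            pending[parts[2]] = line   # remember the request until answered
        elif len(parts) >= 3 and parts[1] == "RESP":
            pending.pop(parts[2], None)  # response seen; clear it
    return list(pending)
```

Feeding it a window of traffic would leave only the requests the proxy swallowed, which is exactly the pattern behind those "Card declined" tickets.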
At 2:17 AM, I drained the traffic. The F5 showed wap-03's connection count dropping from 1,200 to 0. Beautiful.
I ran the stop command: Stop-WebApplicationProxy -Node wap-03
Tonight was the night. I had a change ticket: CHG-0421 – Remove wap-03 from cluster and decommission.
Or rather, two of the WAPs did the heavy lifting. The third one, wap-03.internal.stratus.com, was the problem child.
The remaining two WAPs (wap-01 and wap-02) recalculated their session tables. CPU usage on wap-01 jumped from 18% to 32%. Well within limits. Memory stable. Error rate on the payment API… held steady at 0.01% (baseline noise).
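That last check, comparing the observed error rate against baseline noise, is worth automating so the verdict isn't a judgment call at 2 AM. A small sketch under stated assumptions: the 0.01% baseline comes from the story, and the 2x tolerance multiplier is an arbitrary choice for illustration.

```python
def error_rate(failed, total):
    """Fraction of failed requests; 0.0 if there was no traffic."""
    return failed / total if total else 0.0

def within_baseline(rate, baseline=0.0001, tolerance=2.0):
    """True if the observed rate is no worse than `tolerance` times baseline.

    baseline=0.0001 is the 0.01% noise floor from the dashboards;
    tolerance=2.0 is an assumed alerting threshold.
    """
    return rate <= baseline * tolerance
```

With wap-03 gone, 1 failure in 10,000 requests passes; the old 0.5% (fifty times baseline) would have tripped it immediately.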