• dan@upvote.au
    9 months ago

    I broke the home page of a big tech (FAANG) company.

    I added a call to an API created by another team. I did an initial test with 2% of production traffic + 50% of employee traffic, and it worked fine. After a day or two, I rolled out to 100% of users, and it broke the home page. It was broken for around 3 minutes until the deployment oncall found the killswitch I put in the code and turned it off. They noticed the issue quicker than I did.

    What I didn’t realise was that only some of the methods of this class had Memcache caching. The method I was calling did not. It turns out it was running a database query on a DB with a single shard and only 4 replicas that wasn’t designed for production traffic. As soon as my code rolled out to 100% of users, the DBs immediately fell over from tens of thousands of simultaneous connections.
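
    To illustrate the difference between a cached and an uncached method, here is a minimal read-through cache sketch. It is not the code from the incident: `TTLCache` is a hypothetical in-process stand-in for Memcache, and `cached_lookup` and `load_from_db` are made-up names for illustration.

```python
import time


class TTLCache:
    """Hypothetical in-process stand-in for Memcache (illustration only)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Entry is stale: drop it and report a miss.
            del self._entries[key]
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (value, self._clock() + self._ttl)


def cached_lookup(cache, key, load_from_db):
    """Read-through lookup: only a cache miss reaches the database.

    Note: a None result from load_from_db is treated as a miss next time,
    so it is never served from the cache.
    """
    value = cache.get(key)
    if value is None:
        value = load_from_db(key)  # the expensive call we want to avoid repeating
        cache.set(key, value)
    return value
```

    With something like this in front of the query, repeated requests for the same key within the TTL hit the cache rather than opening a new database connection each time.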

    Always use feature flags for risky work! It would have been broken for a lot longer if I hadn’t added one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.
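
    A minimal sketch of a feature flag with a percentage rollout and a killswitch, assuming an in-memory flag object. `FeatureFlag`, `render_home_page`, and `fetch_extra` are hypothetical names; a real system would typically read the flag state from a config service so it can be flipped without a deploy.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag (illustration only)."""

    def __init__(self, name, rollout_percent=0, enabled=True):
        self.name = name
        self.rollout_percent = rollout_percent  # 0..100
        self.enabled = enabled  # killswitch: False turns the feature off for everyone

    def is_on(self, user_id):
        if not self.enabled:
            return False
        # Hash flag name + user id so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent


def render_home_page(user_id, flag, fetch_extra):
    """Build the page; the risky call only runs when the flag is on."""
    sections = ["header", "feed"]
    if flag.is_on(user_id):
        sections.append(fetch_extra(user_id))
    return sections
```

    Setting `flag.enabled = False` skips the risky call on the next request, which is the kind of switch an oncall can flip in minutes instead of waiting on a rebuild.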

    • jjjalljs@ttrpg.network
      9 months ago

      Always use feature flags for risky work! It would have been broken for a lot longer if I hadn’t added one and they had to re-deploy the site. The site was continuously pushed all day, but building and deploying could take 45+ mins.

      This reminds me of the old saying: everyone has a test environment. Some people are lucky enough to have a separate production environment, too.