I broke production. What now?


Let me start by telling you how I broke a production database.

A long time ago, I worked for a hosting company. We were building our own CRM, a complex tool that centralized all communication: emails, chats, texts—everything. I was working on a new feature and needed to test it with a large dataset, like the one we had in production. So I decided to copy the production messages table into my local database.

I was using this new MySQL client for Windows XP. It had an Export Table feature that looked perfect for the job. I clicked around, selected the messages table, confirmed the export, and watched the SQL file drop into my downloads folder. "Alright, time to head out for my dentist appointment," I said, packed my things and walked out of the office.

And there I was—reclined in that awkward dentist's chair, with my mouth wide open, my eyes scorched by the blinding lamp, and my body frozen under that stern "don't move" warning—when I knew something was wrong. My phone rang. Over and over and over again. Finally, I signaled the dentist to stop so I could pick up.

It was the sysadmin. "Nico, you broke the CRM," he said. "I checked the logs. You dropped the messages table."

I got chills. "Wait... WHAT?" I asked, still confused.

"The app's blank, man. Error 500. Everyone's freaking out. I checked the logs: twenty minutes ago, your user ran DROP TABLE messages. Was that you?"

Cold sweat. Facepalm. "Hey... can't you restore a backup?" I mumbled.

"I tried," he said, "but it's throwing inconsistency errors. Looks like the backup is corrupt."

"I'm on my way," I said, leaving the dentist's office even faster than usual.

Oh my god. It was that Export Table thing. It didn't just export the table, it somehow deleted it from the production database. I felt dizzy. What was the boss going to say? Just when I was trying to get that raise. Years and years of customer messages, wiped out. Gone.

Wait, not quite gone. I had a copy on my laptop, in the downloads folder, in a stupidly large SQL file.

Could I make it back to the office safe and sound? Could I restore it? Could this story actually have a happy ending?

Well... stick around to find out!

We All Make Mistakes

It's true, we all make mistakes. All the time. So many, every day, that we don't even keep count. Forgot a semicolon in your PHP code? Just add it and refresh your browser. Typo in a function name? Your IDE underlines it with a squiggly line. Heck, maybe it even fixes it for you automatically!

We make mistakes constantly because we're human and because web development is hard. It's easy to forget that, especially in this age where social media is full of people humblebragging about making thousands from an app they vibe coded while waiting in the supermarket checkout line.

But it is hard! There are a million little things to pay attention to and, spoiler alert, someday you'll miss one. You're going to...

  • Delete a production database
  • Send a test email to thousands of real customers
  • Leak an API key in a public repo
  • Wipe out someone's work with a force push
  • Forget to renew a critical domain
  • Push a change that breaks the signup flow
  • Point the production website to the staging API
  • Introduce a caching issue that takes days to trace

... or make any other big mistake, with big consequences.

For more examples, check out the Halloween episodes of the Syntax.fm podcast, where listeners share spooky stories of their worst web development mistakes, from typos taking down online shops and causing thousands of dollars in losses to debug messages with profanity slipping into live sites or emails. There is even the story of how buggy math led to the elimination of a fan-favorite contestant on a little reality show called Big Brother. Some stories are scarier than any Goosebumps episode!

Think about this: can you avoid making these mistakes? Can you keep adding days to that "days without incidents" sign? Can you and your whole team be flawless? You can certainly work to reduce the chances of screwing up, but nobody is bulletproof.

And that's why today I want to talk about how to react in these moments of crisis.

A Practical Guide for the One Who Messed Up

This is a simple guide with clear steps to follow when something goes wrong. The website is down, everyone is worried, and messages are coming in. What do you do?

In the middle of a crisis, you will not remember complex procedures or long checklists, so we'll keep it simple. Five concrete steps:

  • Diagnose
  • Alert
  • Kludge
  • Optimize
  • Report

This is the DAKOR system (yeah, sounds like the name of a Saturday morning cartoon villain). Let's get started.

Step 1: Diagnose — Know What You're Actually Dealing With

It is a quiet Wednesday evening when, suddenly, you receive a message from your boss: "THE SITE DOESN'T WORK." Yes, all caps and all. What is the first thing you do?

Start by replying to the message. A simple "Checking" is enough. That short reply lets the reporter know you are on it and helps lower the anxiety level immediately.

Next, your immediate goal is to understand how widespread the problem really is.

  • Even if you are greeted by a big, scary 500 Internal Server Error when you reload the page, the issue might not affect everyone.
  • Even if you receive a report that "the app is broken," it might not mean the entire system is down, but rather that a specific action is not working as expected.

Thus, your goal is to determine the exact scope of the issue. Focus on replicating the specific scenario in which it occurs.

Don't go down the rabbit hole of finding the buggy line of code that causes it. You'll do that later. For now, just focus on finding a clear set of steps that anyone could follow to encounter the failure.

Open a fresh browser window, log in with a real production account, try to reproduce the bug, and check the logs. Get a clear view of what is broken and, more importantly, what still works.

  • Maybe a server is down while the others are responding as expected.
  • Perhaps only specific pages are affected while the rest of the site has no issues.
  • Or only certain users are impacted: those who logged in recently, those in a specific region, or those with an admin role (which explains why your boss is panicking).
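Checking the logs can be as simple as watching them while you reproduce the bug. A minimal sketch, assuming a typical PHP app behind Nginx (all paths here are illustrative; adjust them to your stack):

```shell
# Watch the app log and the web-server error log live
# while you reproduce the bug in the browser:
tail -f storage/logs/laravel.log /var/log/nginx/error.log

# Scope check: how many requests have failed with a 500
# in the current access log?
grep -c '" 500 ' /var/log/nginx/access.log
```

Remember: at this stage you are measuring how broken things are, not yet hunting the root cause.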

Once you understand the situation, note it down. Use concrete and precise language:

Guests can't register for the event: CAPTCHA on the registration page always returns failure

Now ask yourself a simple question: how bad is this, really? To answer that, context matters a lot.

Let's take that example, "guests can't register", in two different scenarios:

  1. The event is free, the deadline is months away, and page traffic is negligible.
  2. This is a major, year-long promoted conference on its final registration day. Paid ads redirect to the broken form, frustrated users leave the page and most likely never come back.

Same bug... completely different severity! Your job is to understand the impact of the issue to determine the best course of action.

Summary of Step 1 - Diagnose:

  • Send a quick "Checking" reply when someone reports the issue
  • Determine the scope and severity of the problem
  • Describe it in short, simple terms

“It works on my machine”
Please never respond to a report with "it works on my machine" or a similar phrase, even if that is true. Statements like that can come across as "your problem, not mine" and can be frustrating for the person experiencing the issue.

If the functionality fails for a given user but not in your tests, describe the steps you followed (a video recording can help) and ask them for more context to better understand what is going on.

For example, “I visited the site in Chrome, logged in with a seller profile, and was able to create a discount coupon. I could not reproduce the error you mentioned. Could you tell me which user you tried with? What coupon code did you attempt to create?”

Step 2: Alert — Tell People What's Happening

Once you understand what's broken, let people know. Depending on the situation, that might be your boss, your manager, your clients, or other relevant people. The important part is to do it early.

Granted, this is usually an uncomfortable moment. A few thoughts might immediately pop into your head:

  • What if I can fix it quickly before anyone notices?
  • I made a change recently… maybe I can just roll it back and hope for the best!
  • Oh no, the boss cannot find out. Last time a dev messed up like this, it did not go well...

Still, silence almost always makes things worse.

Clear communication builds trust, keeps everyone aligned, and gives other people a chance to help in very practical ways:

  • Marketing can pause ads before more money is wasted.
  • Support can prepare for a wave of confused or angry users.
  • Social media can decide whether to acknowledge the issue.
  • Leadership hears it from you, not through rumors or screenshots.
  • And someone on your team might recognize the problem immediately.

You do not need to have the solution yet. Just share what you know so far. Take the notes from Step 1 and share what is broken, who is affected, when it started, and the severity level. Tag the people who need to see it, and tell them when you will check back in.

A good report is short, factual, and calm. For example:

CRITICAL ISSUE: mybigevent.com

Guest users unable to register for the event.

CAPTCHA on registration page is consistently failing.

Last successful registration: 3 hours ago (10:11 AM)

John and I are investigating. Update in ~5 minutes.

That's it! Just the facts and a clear next step.

Summary of Step 2 - Alert:

  • Communicate early, even if you do not have the fix yet
  • Share what you know clearly and honestly
  • Set expectations by saying when you will update people again

Step 3: Kludge — Make It Usable

Now you can get to work and focus on stabilizing the system. Notice I said stabilizing, not fixing. Your first goal is not to reach the ideal state, but a reliable one in a short amount of time.

How? You kludge it. A kludge is a solution that technically works, but in the same way duct tape technically fixes a broken car mirror.

Let's go back to the CAPTCHA example: your first instinct might be to disable it. And honestly, that can be the right move! Yeah, it is not perfect, because you are opening the door to bots registering and taking spots meant for real people. But if you leave it as is, nobody can register at all. Not bots, not humans.

In most cases, applying a quick workaround is better than chasing the root cause right away. Focus on the fastest path to restore critical functionality, even if the solution is imperfect.

That might mean:

  • Rolling back a recent deploy
  • Toggling a feature flag to disable the broken feature
  • Hardcoding a value to bypass a failing API call
  • Running a raw SQL query in production to patch bad data
  • Editing a Blade template directly on the prod server to hide the broken section

The goal is to get the system into a workable state by isolating or disabling what is broken.
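The first option, rolling back a recent deploy, is often the fastest kludge of all. Here is a tiny self-contained demo of the pattern with `git revert` (the repo path, file, and commit messages are made up for illustration; a real rollback happens in your actual deploy repo):

```shell
set -e
# Toy repo standing in for your app: one good release, then a bad deploy.
repo="/tmp/revert-demo"
rm -rf "$repo" && mkdir -p "$repo" && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "stable" > app.txt && git add app.txt && git commit -qm "good release"
echo "broken" > app.txt && git commit -qam "bad deploy"

# Undo the bad deploy with a *new* commit instead of rewriting history,
# so the record of what happened stays in the log:
git revert --no-edit HEAD
cat app.txt   # prints "stable" again
```

Reverting instead of force-pushing over the bad commit keeps the breadcrumb trail intact, which you will appreciate when writing the report in Step 5.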

Sometimes the issue is so severe that even your best efforts to make the site usable will not be enough. As a last resort, you can:

  • Put the application into read-only mode. Not ideal, but it might work for sites with more reads than writes.
  • Or flip the switch into maintenance mode. The site will not work at all, but it might be the only option when, for example, some strange bug allows users to purchase any product with a 100% discount.
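If the app happens to run on Laravel (the Blade templates mentioned above suggest that stack), maintenance mode is a single artisan command. A sketch:

```shell
# Put the site into maintenance mode: visitors see a 503 page, and the
# Retry-After header tells clients to check back in 60 seconds.
php artisan down --retry=60

# ...ship the fix, then bring the site back up:
php artisan up
```

Other frameworks and hosts have an equivalent switch; the point is to know where yours is before the crisis, not during it.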

Whatever you attempt, keep your technical teammates in the loop. Tweaking an Nginx rule, flushing a cache, changing a driver? Let them know before you do it. This helps everyone stay on the same page, avoids people undoing each other's fixes, and leaves a clear breadcrumb trail in the message log when it is time to roll things back.

Once you have stabilized the situation, follow up on your initial report:

Team, we pushed a hotfix to prod to disable the CAPTCHA.

The "I'm not a robot" box will NOT show up.

Impact: Guests will be able to register. Expect some bot registrations as well.

We'll try to restore it ASAP. Next update in 10 minutes.

Summary of Step 3 - Kludge:

  • Do whatever you can to get the app into a stable, usable state.
  • Do not chase the ideal fix yet. You will have time later to dig into the root cause and bring the system back to 100%.
  • Keep people in the loop to avoid duplicate work and keep track of everything you do.

Step 4: Optimize — Restore Full Functionality

You did it. The site is stable. Things are (mostly) working. Take a deep breath, grab some water, refill your coffee. Now the real work begins: actually fixing the issue.

You've bought yourself time with that stabilization work. Use it: look for the faulty logic, the bad line of code, the unfortunate config change, the missing file, the stray process that ate all your RAM. Whatever knocked things over, take your time and find it.

When the fix is complex, consider restoring functionality in layers. Take short steps instead of making one big leap. It's slower, but it's safer: you can test each change and catch problems before they compound.

However you approach it, take time to test your solution. On your local environment, on staging, on whatever's closest to production. The last thing anyone needs during a crisis is a deploy that makes things worse. You've got time now. Don't rush it. Don't spam commits to main. Stay calm and make it right.

Once the fix is deployed and the site is back to full capacity, let people know:

The issue has been fixed. CAPTCHA is now working as expected.

Full incident report coming tomorrow.

Now, finally, take a break. You've earned it!

Summary of Step 4 - Optimize:

  • Take your time. You've bought breathing room, so use it.
  • Test before deploying: the last thing you need is a fix that breaks things further.
  • Let your team know when things are back to normal.

Step 5: Report — Understand to Prevent

The app is online, customers are happy, and you can finally relax knowing everything is back in shape.

Now comes a crucial part: learning from what happened. To do that, you need to analyze the incident and write a clear report. This is sometimes called a post-mortem, but I find that term a bit creepy. Incident report is fine!

Your report should include a detailed description of the problem, what caused it, how long it lasted, the steps you took to contain it, how stable the system was during that containment period, and finally, the exact solution that fully restored functionality.

You can finish with recommendations for preventing similar issues in the future:

  • Maybe the cause was a deploy with insufficient testing, so next time you tighten your review process.
  • Perhaps a manual step went wrong, so you add automation or confirmation prompts to reduce human error.
  • Or maybe an outdated technology is holding the system together with tape and hope, and the incident gives you a reason to propose replacing it.

You need to be honest, even if the mistake was yours or shared across the team. Taking responsibility is part of the job. Nobody is perfect, and pretending the error came from elsewhere only makes things worse. That said, do not be rude toward others, and avoid placing all the blame on one person.

Also, don't write this at 2 AM when you are exhausted and frustrated. Do it the next day, no rush. Just do not skip it. It is way too easy to grab the win, relax for the weekend, and forget everything. If six months later something similar happens, you will think, "How did we not learn from this?"

These lessons only matter if people absorb them and use them to prevent similar problems down the road, so encourage people to read the report and share feedback.

Here is an example of an excellent outage report from Resend. Note that, although the incident was directly caused by a Cloudflare outage, they do not deflect the blame. They acknowledge that relying exclusively on a single provider is a risk and outline a path to change that.

And... done! You made it through. You learned. And that is no small feat.

Summary of Step 5 - Report:

  • Write a detailed incident report and share it
  • Focus on what, not who: the report should improve systems, not assign blame
  • Make sure people actually read it and learn from it

Conclusion

I hope you enjoyed this article. See you next ti... Oh, of course. You want to know what happened with the table I erased. Alright, let's finish the story!

I rushed to the office with the backup on my laptop. The sysadmin made a copy and tried to restore it to production, but it didn't work. The download was incomplete or something like that; I can't remember exactly. So I couldn't fix my mistake. We just created a new, empty messages table to at least get the CRM back online. New communications worked fine, but the history was lost.

What followed was a week of work to restore the messages from email copies and chat logs. We built a script to read that raw data and re-create the messages in the table with their corresponding IDs. If you think that sounds like a nightmare, well... you're correct.

But I learned many lessons that still serve me well. When I need to connect to production databases, I always create a dedicated read-only user. I write my SQL by hand instead of relying on the GUIs that generate it for me. And beyond taking backups or snapshots, I always check that they actually work by regularly simulating a restoration.
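That read-only user takes a few lines of SQL. A sketch for MySQL, with illustrative names throughout (`crm` database, `readonly` user; adjust the host and use a real password manager for the credential):

```shell
# Run as an admin user. Database, user name, and host are illustrative.
mysql -u root -p <<'SQL'
CREATE USER 'readonly'@'%' IDENTIFIED BY 'choose-a-strong-password';
GRANT SELECT ON crm.* TO 'readonly'@'%';
SQL
# This user can SELECT but not INSERT, UPDATE, DELETE, or DROP,
# so a misbehaving "Export Table" button cannot drop anything.
```

It is a five-minute setup that turns a catastrophic click into a harmless permission error.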

We all screw up. The next time you do, stay calm and remember the tips we shared today.

You can fix it!
