Let me start by telling you how I broke a production database.
A long time ago, I worked for a hosting company. We were building our own CRM, a complex tool that centralized all communication: emails, chats, texts—everything. I was working on a new feature and needed to test it with a large dataset, like the one we had in production. So I decided to copy the production messages table into my local database.
I was using this new MySQL client for Windows XP. It had an Export Table feature that looked perfect for the job. I clicked around, selected the messages table, confirmed the export, and watched the SQL file drop into my downloads folder. "Alright, time to head out for my dentist appointment," I said, packed my things and walked out of the office.
And there I was, reclined in that awkward dentist's chair, mouth wide open, eyes scorched by the blinding lamp, body frozen under that serious "don't move" warning, when I knew something was wrong. My phone rang. Over and over and over again. Finally, I signaled the dentist to stop so I could pick up.
It was the sysadmin. "Nico, you broke the CRM," he said. "I checked the logs. You dropped the messages table."
I got chills. "Wait... WHAT?" I asked, still confused.
"The app's blank, man. Error 500. Everyone's freaking out. I checked the logs: twenty minutes ago, your user ran DROP TABLE messages. Was that you?"
Cold sweat. Facepalm. "Hey... can't you restore a backup?" I mumbled.
"I tried," he said, "but it's throwing inconsistency errors. Looks like the backup is corrupt."
"I'm on my way," I said, leaving the dentist's office even faster than usual.
Oh my god. It was that Export Table thing. It didn't just export the table; it somehow deleted it from the production database. I felt dizzy. What was the boss going to say? Just when I was trying to get that raise. Years and years of customer messages, wiped out. Gone.
Wait, not quite gone. I had a copy on my laptop, in the downloads folder, in a stupidly large SQL file.
Could I make it back to the office safe and sound? Could I restore it? Could this story actually have a happy ending?
Well... stick around to find out!
It's true, we all make mistakes. All the time. So many, every day, that we don't even keep count. Forgot a semicolon in your PHP code? Just add it and refresh your browser. Typo in a function name? Your IDE underlines it with a squiggly line. Heck, maybe it even fixes it for you automatically!
We make mistakes constantly because we're human and because web development is hard. It's easy to forget that, especially in this age where social media is full of people humblebragging about making thousands from an app they vibe coded while waiting in the supermarket checkout line.
But it is hard! There are a million little things to pay attention to and, spoiler alert, someday you'll miss one. You're going to...
... or make any other big mistake, with big consequences.
For more examples, check out the Halloween episodes of the Syntax.fm podcast, where listeners share spooky stories of their worst web development mistakes: typos that took down online shops and caused thousands of dollars in losses, debug messages with profanity slipping into live sites and emails, even buggy math that led to the elimination of a fan-favorite contestant on a little reality show called Big Brother. Some of those stories are scarier than any Goosebumps episode!
Think about this: can you avoid making these mistakes? Can you keep adding days to that "days without incidents" sign? Can you and your whole team be flawless? You can certainly work to reduce the chances of screwing up, but nobody is bulletproof.
And that's why today I want to talk about how to react in these moments of crisis.
This is a simple guide with clear steps to follow when something goes wrong. The website is down, everyone is worried, and messages are coming in. What do you do?
In the middle of a crisis, you will not remember complex procedures or long checklists, so we'll keep it simple. Five concrete steps:

1. Diagnose
2. Alert
3. Kludge
4. Optimize
5. Report

This is the DAKOR system (yeah, it sounds like the name of a Saturday morning cartoon villain). Let's get started.
It is a quiet Wednesday evening when, suddenly, you receive a message from your boss: "THE SITE DOESN'T WORK." Yes, all caps and all. What is the first thing you do?
Start by replying to the message. A simple "Checking" is enough. That short reply lets the reporter know you are on it and helps lower the anxiety level immediately.
Next, your immediate goal is to understand how widespread the problem really is.
Even if you see a 500 Internal Server Error when you reload the page, the issue might not affect everyone. Thus, your goal is to determine the exact scope of the problem. Focus on replicating the specific scenario in which it occurs.
Don't go down the rabbit hole of finding the buggy line of code that causes it. You'll do that later. For now, just focus on finding a clear set of steps that anyone could follow to encounter the failure.
Open a fresh browser window, log in with a real production account, try to reproduce the bug, and check the logs. Get a clear view of what is broken and, more importantly, what still works.
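A quick pass over the web server's access log can tell you how widespread the failures are. This is a minimal sketch: the log path and format are assumptions (an Nginx-style access log), and a sample file is created inline so the commands run as-is; adapt the paths and patterns to your stack.

```shell
# Minimal sketch: log format is an assumption (Nginx-style access log).
# A sample file is created here so the commands are runnable as-is.
LOG=access.log
cat > "$LOG" <<'EOF'
10.0.0.1 - - [12/May/2024:10:11:02] "POST /register HTTP/1.1" 200 512
10.0.0.2 - - [12/May/2024:10:14:40] "POST /register HTTP/1.1" 500 312
10.0.0.3 - - [12/May/2024:10:15:01] "GET / HTTP/1.1" 200 1024
10.0.0.4 - - [12/May/2024:10:15:22] "POST /register HTTP/1.1" 500 312
EOF

# Which requests are failing? Show the most recent server errors.
grep ' 500 ' "$LOG" | tail -n 20

# How many in total? Re-run this: a rising count means it is still happening.
grep -c ' 500 ' "$LOG"
```

In this sample, both failures are POSTs to /register, which already narrows the scope: registration is broken, plain browsing is not.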
Once you understand the situation, note it down. Use concrete and precise language:
Guests can't register for the event: CAPTCHA on the registration page always returns failure
Now ask yourself a simple question: how bad is this, really? To answer that, context matters a lot.
Let's take that example, "guests can't register", in two different scenarios: if registrations open next month, you have breathing room to investigate; if they close tonight, every minute of downtime means real attendees locked out. Same bug... completely different severity! Your job is to understand the impact of the issue to determine the best course of action.
Summary of Step 1 - Diagnose:

- Reply right away so the reporter knows you are on it.
- Reproduce the issue and determine its exact scope.
- Write down what is broken in concrete, precise language.
- Weigh the severity in context.
If the functionality fails for a given user but not in your tests, describe the steps you followed (a video recording can help) and ask them for more context to better understand what is going on.
For example, “I visited the site in Chrome, logged in with a seller profile, and was able to create a discount coupon. I could not reproduce the error you mentioned. Could you tell me which user you tried with? What coupon code did you attempt to create?”
Once you understand what's broken, let people know. Depending on the situation, that might be your boss, your manager, your clients, or other relevant people. The important part is to do it early.
Granted, this is usually an uncomfortable moment. A few thoughts might immediately pop into your head: "Was this my fault?", "Will they be mad?", "Maybe I can fix it quietly before anyone notices."
Still, silence almost always makes things worse.
Clear communication builds trust, keeps everyone aligned, and gives other people a chance to help in very practical ways: handling communication with customers, pointing out a recent change you didn't know about, or pairing with you on the fix.
You do not need to have the solution yet. Just share what you know so far. Take the notes from Step 1 and share what is broken, who is affected, when it started, and the severity level. Tag the people who need to see it, and tell them when you will check back in.
A good report is short, factual, and calm. For example:
CRITICAL ISSUE: mybigevent.com
Guest users unable to register for the event.
CAPTCHA on registration page is consistently failing.
Last successful registration: 3 hours ago (10:11 AM)
John and I are investigating. Update in ~5 minutes.
That's it! Just the facts and a clear next step.
Summary of Step 2 - Alert:

- Communicate early, even before you have a solution.
- Share what is broken, who is affected, when it started, and how severe it is.
- Tag the people who need to see it and say when you will check back in.
Now you can get to work and focus on stabilizing the system. Notice I said stabilizing, not fixing. Your first goal is not to reach the ideal state, but a reliable one in a short amount of time.
How? You kludge it. A kludge is a solution that technically works, but in the same way duct tape technically fixes a broken car mirror.
Let's go back to the CAPTCHA example: your first instinct might be to disable it. And honestly, that can be the right move! Yeah, it is not perfect, because you are opening the door to bots registering and taking spots meant for real people. But if you leave it as is, nobody can register at all. Not bots, not humans.
In most cases, applying a quick workaround is better than chasing the root cause right away. Focus on the fastest path to restore critical functionality, even if the solution is imperfect.
That might mean disabling the broken feature behind a flag, rolling back to the last known good deploy, or redirecting traffic away from the failing component.
The goal is to get the system into a workable state by isolating or disabling what is broken.
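One common shape for a kludge is a kill switch: a flag you can flip without a full deploy. Everything below is hypothetical, a sketch of the idea rather than a real API; your app would need to consult the same flag before rendering the CAPTCHA.

```shell
# Hypothetical kill switch: the flag name is made up for illustration.
# Flipping an environment variable (or a flag file, or a config row)
# is much faster than shipping a code change under pressure.
export CAPTCHA_ENABLED=false

# The application side would check the flag before using the feature:
if [ "$CAPTCHA_ENABLED" = "false" ]; then
  echo "kludge active: skipping CAPTCHA verification"
else
  echo "CAPTCHA verification enabled"
fi
```

The point of the design is reversibility: once the real fix ships, re-enabling the feature is one flag flip, not another risky deploy.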
Sometimes the issue is so severe that even your best efforts to make the site usable will not be enough. As a last resort, you can put the whole site into maintenance mode or take the affected section offline until the fix is ready.
Whatever you attempt, keep your technical teammates in the loop. Tweaking an Nginx rule, flushing a cache, changing a driver? Let them know before you do it. This helps everyone stay on the same page, avoids people undoing each other's fixes, and leaves a clear breadcrumb trail in the message log when it is time to roll things back.
Once you have stabilized the situation, follow up on your initial report:
Team, we pushed a hotfix to prod to disable the CAPTCHA.
The "I'm not a robot" box will NOT show up.
Impact: Guests will be able to register. Expect some bot registrations as well.
We'll try to restore it ASAP. Next update in 10 minutes.
Summary of Step 3 - Kludge:

- Stabilize first; fix later.
- Take the fastest path to restore critical functionality, even if it is imperfect.
- Keep your technical teammates in the loop before you touch anything.
- Follow up on your initial report.
You did it. The site is stable. Things are (mostly) working. Take a deep breath, grab some water, refill your coffee. Now the real work begins: actually fixing the issue.
You've bought yourself time with that stabilization work. Use it: look for the faulty logic, the bad line of code, the unfortunate config change, the missing file, the stray process that ate all your RAM. Whatever knocked things over, take your time and find it.
When the fix is complex, consider restoring functionality in layers. Take short steps instead of making one big leap. It's slower, but it's safer: you can test each change and catch problems before they compound.
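As a sketch of what "layers" can look like: re-enable the repaired feature for a small, deterministic slice of users first, watch the logs, then widen the slice. The percentage flag and user IDs here are hypothetical.

```shell
# Hypothetical gradual rollout: numbers and names are illustrative.
ROLLOUT_PERCENT=10   # start with 10% of users, widen as confidence grows

for USER_ID in 7 42 105; do
  # Deterministic bucketing: the same user always lands in the same slice,
  # so their experience doesn't flicker between requests.
  if [ $((USER_ID % 100)) -lt "$ROLLOUT_PERCENT" ]; then
    echo "user $USER_ID gets the fixed CAPTCHA"
  else
    echo "user $USER_ID still skips CAPTCHA (kludge)"
  fi
done
```

If the 10% slice looks healthy for a while, bump the percentage; if errors reappear, drop it back to zero and you are no worse off than before.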
However you approach it, take time to test your solution. On your local environment, on staging, on whatever's closest to production. The last thing anyone needs during a crisis is a deploy that makes things worse. You've got time now. Don't rush it. Don't spam commits to main. Stay calm and make it right.
Once the fix is deployed and the site is back to full capacity, let people know:
The issue has been fixed. CAPTCHA is now working as expected.
Full incident report coming tomorrow.
Now, finally, take a break. You've earned it!
Summary of Step 4 - Optimize:

- Use the time bought by the kludge to find the root cause.
- Restore functionality in layers when the fix is complex.
- Test on the environment closest to production before deploying.
- Announce when the fix is live, then take a break.
The app is online, customers are happy, and you can finally relax knowing everything is back in shape.
Now comes a crucial part: learning from what happened. To do that, you need to analyze the incident and write a clear report. This is sometimes called a post-mortem, but I find that term a bit creepy. Incident report is fine!
Your report should include a detailed description of the problem, what caused it, how long it lasted, the steps you took to contain it, how stable the system was during that containment period, and finally, the exact solution that fully restored functionality.
You can finish with recommendations for preventing similar issues in the future: new tests or monitoring alerts, changes to the deploy process, better backup and restore drills.
You need to be honest, even if the mistake was yours or shared across the team. Taking responsibility is part of the job. Nobody is perfect, and pretending the error came from elsewhere only makes things worse. That said, do not be rude toward others, and avoid placing all the blame on one person.
Also, don't write this at 2 AM when you are exhausted and frustrated. Do it the next day, no rush. Just do not skip it. It is way too easy to grab the win, relax for the weekend, and forget everything. If six months later something similar happens, you will think, "How did we not learn from this?"
These lessons only matter if people absorb them and use them to prevent similar problems down the road, so encourage people to read the report and share feedback.
Here is an example of an excellent outage report from Resend. Note that, although the incident was directly caused by a Cloudflare outage, they do not deflect the blame. They acknowledge that relying exclusively on a single provider is a risk and outline a path to change that.
And... done! You made it through. You learned. And that is no small feat.
Summary of Step 5 - Report:

- Write an honest incident report, the next day, with a clear head.
- Cover the cause, the duration, the containment steps, and the final fix.
- Add recommendations to prevent a repeat.
- Share it and encourage feedback.
I hope you enjoyed this article. See you next ti... Oh, of course. You want to know what happened with the table I erased. Alright, let's finish the story!
I rushed to the office with the backup on my laptop. The sysadmin made a copy and tried to restore it to production, but it didn't work. The download was incomplete or something like that; I can't remember exactly. So I couldn't fix my mistake. We just created a new, empty messages table to at least get the CRM back online. New communications worked fine, but the history was lost.
What followed was a week of work to restore the messages from email copies and chat logs. We built a script to read that raw data and re-create the messages in the table with their corresponding IDs. If you think that sounds like a nightmare, well... you're correct.
But I learned many lessons that still serve me well. When I need to connect to production databases, I always create a dedicated read-only user. I write my SQL by hand instead of relying on the GUIs that generate it for me. And beyond taking backups or snapshots, I always check that they actually work by regularly simulating a restoration.
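On that last point, one cheap sanity check I could have used that day: a complete mysqldump file (unless comments are disabled) ends with a "-- Dump completed" footer, so a truncated download like mine fails this test. It is no substitute for an actual test restore, but it catches the most embarrassing failure mode. The sample file below stands in for a real dump so the check is runnable as-is.

```shell
# Sketch: a stand-in dump file with sample content and a sample date.
# A real mysqldump (without --skip-comments) ends with this footer line.
printf '%s\n' \
  'CREATE TABLE messages (id INT PRIMARY KEY);' \
  '-- Dump completed on 2008-03-12 14:02:11' > backup.sql

# A truncated download loses the footer, so this flags incomplete files.
if tail -n 1 backup.sql | grep -q '^-- Dump completed'; then
  echo "backup looks complete"
else
  echo "backup looks truncated - do not trust it" >&2
fi
```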
We all screw up. The next time you do, stay calm and remember the tips we shared today.
You can fix it!