Whoops! I deleted the data! My DevOps disaster and how you can avoid your own


When you migrate your code or data from one environment to another, things can go wrong, with potentially disastrous consequences. In this post I share my disaster story and three things I learned to help reduce the chance of future migration disasters!

I remember it clearly. We were migrating a new application to production. The users had just finished entering all the data to support a new location. Part of my job was to migrate the new data to production. Easy, right? Just delete the local test data, extract the new data to a file, copy it to the new server, and import the data into the local database. Hmmmm, why am I not seeing any data? That’s odd, the data file is empty. I had better go extract the data again. That’s odd, the data file is still empty. Let me go check the database. 0 records? How could it possibly be 0 records? Whoops, I had accidentally issued the delete in the wrong database and blown away all the data the users had just entered (which, by the way, had not been backed up yet).

Suffice it to say, I was not very popular that day! The users had to re-enter all the data and happily paged me at 5 AM (yes, a pager, I know I just gave away my age) to come and re-do the migration. You can guess the first thing my manager said to the team in the post-mortem on Monday.

“How are we going to make sure this doesn’t happen again?”

I’d like to share three things I learned we could have done to avoid this migration disaster. There is nothing groundbreaking here; a lot of this is common sense, but unfortunately common sense is not always common practice.

Follow the principle of least permissions!

Some of you may have immediately wondered when you read that story: “Why did your account have the permissions to delete data in the central database?”

Good question.

My personal account only had read permissions for the databases. We often changed roles on the team and it was hard to keep up with who needed what permissions. So, I just logged in with the admin account, just like the person who did the role before me. Yeah, in retrospect, that was NOT a good idea! But I bet someone out there reading this right now has logged in with the admin account to do a task that didn’t require admin permissions.

Sometimes we need to execute a task that requires elevated permissions. Maybe you want to create a few folders or database tables. Maybe you have access to the administrator account because you sometimes genuinely need it, and you just get into the habit of logging into the system with that elevated account. Logging in as administrator to run tasks that do not require administrator permissions is an accident waiting to happen. That’s why we have server and database roles for databases. That’s why we have tools like Active Directory to help us manage permissions, so you don’t end up using an account with more permissions than you need, an account that can do far more damage when you do make a mistake.

As a developer, I admit, I sometimes find myself arguing with the system and database administrators, asking for more and more permissions so I can create and delete folders, or create tables in the database to hold quick backups of test data. But I also understand that when I have fewer permissions, there is a limit to the damage I can do when I accidentally type the wrong command on the wrong server. So, if you have an administrator who locks everything down, you should in fact thank them, because their prudence has probably prevented a number of incidents!

You will not be surprised to hear that after my incident the database administrator changed the admin password, and my account was updated to have the minimum permissions required for the tasks I needed to do.

Automate your migrations

My data migration had only four simple steps:

1. Delete the local test data.
2. Extract the data from the central database to a file.
3. Copy the file to the local server.
4. Load the data into the local database.

You might think, why bother automating or scripting such a simple procedure? Well, if we had automated the procedure, I would never have issued the delete command on the wrong server. The automation would have known which server to use for the extraction, and chances are I would have been prompted for the local server name, or maybe edited a configuration file to specify it. We executed these steps at 11 different sites. The only thing that changed from one site to the next was the name of the local server.

“Why bother automating or scripting such a simple procedure?”
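To make the idea concrete, here is a minimal sketch of what automating those four steps might look like, in Python for illustration. The server names, SQL statements, and helper functions are all hypothetical; this is not the tooling we actually used, just the shape of the fix.

```python
# Hypothetical sketch of automating the four migration steps.
# The central server name is fixed in one place, so a human never
# decides at 2 AM which server receives the DELETE.

CENTRAL_SERVER = "central-db"  # assumption: the extract always comes from here


def migrate_site(local_server, run_sql, copy_file):
    """Run the four steps against the right servers, in order.

    run_sql(server, statement) and copy_file(filename, dest_server)
    are injected helpers, so the steps can be exercised without
    real servers.
    """
    # 1. Delete local test data -- only ever on the LOCAL server.
    run_sql(local_server, "DELETE FROM site_data")
    # 2. Extract the new data from the central database to a file.
    run_sql(CENTRAL_SERVER, "EXPORT site_data TO 'site_data.dat'")
    # 3. Copy the file to the local server.
    copy_file("site_data.dat", local_server)
    # 4. Load the data into the local database.
    run_sql(local_server, "IMPORT site_data FROM 'site_data.dat'")
```

Because the script, not a sleepy human, decides which server each command targets, the only per-site input is the local server name, which is exactly what varied across our 11 sites.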

The other thing to consider is that migrating the data was just one step in a larger migration process. We had just completed a round of on-site user testing and received sign-off that we could go live. That sign-off triggered the migration of the code, logins, data, and a myriad of other small tasks that all needed to be completed to go live. I screwed up the data migration, but I’ve seen other things go wrong as well. Sometimes the client code gets updated but the server code doesn’t. Sometimes the logins aren’t updated to reflect the different users. There are a lot of things to keep track of when you migrate between environments, whether it’s from development to testing or from testing to production. If you automate the steps, then once you have that automated process working, migration becomes faster and more foolproof (I don’t think you can make anything 100% foolproof, but automation helps you strive for that goal).

Having faster, more reliable migrations between environments is even more important now, because more and more companies are moving to Agile methodologies and more frequent releases. Long gone are the days when we could spend two years developing an application and then take two months to roll it out to the users after all the coding was complete. In this day and age, code is being released monthly, weekly, or even daily! A manual process becomes a bottleneck in your release cycle.

The Visual Studio Release Management feature would have been a huge help for the migrations we were doing. It allows you to specify multiple staging environments, approvals for each stage, actions to take when you deploy from one stage to another, and more!

Test your migration automation

Now I get to share another horror story. I was working on a project where we did use automation. We used a script to check code in and out, because we found developers were making mistakes when they used the drag-and-drop user interface. Two months into the project, a coder checked out one of his modules and noticed some recent bug fixes were missing. He checked out another module; bug fixes were missing there as well. Convinced he must have done something wrong, he asked another developer to check out anything he had recently modified. The other coder’s changes were missing too! It turned out the directory where the code was stored by the version control tool had create permissions but not modify permissions. So, when a new module was checked in, it worked fine; when code was checked out, modified, and checked back in, an error message was returned, which our automated script ignored (on a side note: NEVER use On Error Resume Next in VBScript without checking the Err object after each step). So for two months, all the modifications to code modules we thought were completed had not been saved. You can imagine the ensuing chaos.

When you use any automation tool, such as SQL Server Integration Services (SSIS) or Release Management, it is important to learn how to build a robust process. Learn how to define the actions to take when an error occurs. Who will be notified? What logging options can you enable so that if you walk into the office at 8 AM and the migration failed, you can figure out what went wrong and how to fix it? Do you need to define rollback steps so you don’t end up with a half-completed migration? This is one of the advantages of using a more complete tool such as SSIS or Release Management instead of just writing up a few PowerShell scripts. I am not saying you shouldn’t use PowerShell. PowerShell is great; in fact, many tools allow you to specify that at certain steps you should execute a PowerShell script. But PowerShell can’t do it all! (Or if it can, the resulting script would be so confusing that only one person on the team would be able to understand it.)
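Tools like SSIS and Release Management give you logging and rollback for free, but the underlying pattern is simple enough to sketch. Here is one hedged illustration in Python of the logging-plus-rollback idea, with hypothetical step names; a real deployment pipeline would get this from the tool rather than hand-rolling it:

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("migration")


def run_migration(steps):
    """Execute (name, action, rollback) steps; on failure, undo what finished.

    Each action and rollback is a zero-argument callable. Every step is
    logged, so the 8 AM investigation starts from a record of exactly
    which step failed.
    """
    completed = []
    for name, action, rollback in steps:
        try:
            log.info("starting step: %s", name)
            action()
            completed.append((name, rollback))
        except Exception:
            log.exception("step failed: %s -- rolling back", name)
            # Undo the finished steps in reverse order so we never
            # leave a half-completed migration behind.
            for done_name, undo in reversed(completed):
                log.info("rolling back: %s", done_name)
                undo()
            raise
```

The point is not this particular code; it is that notification, logging, and rollback are decisions you make up front, not improvised after the migration has already half-run.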

I can’t be the only one!

Share your own disaster stories (or narrowly averted disaster stories)! You can comment here or join the conversation in the Canadian Developer Connection group on LinkedIn. If you want to learn more about the features in Visual Studio that help with DevOps, check out DevOps – Visual Studio Release Management Jump Start on Microsoft Virtual Academy.