If you haven't read The Pragmatic Programmer, by Andrew Hunt and David Thomas, I'd highly recommend it. It puts a fresh coat of paint on many of the concepts discussed in Steve Maguire's classic Writing Solid Code and Steve McConnell's CodeComplete (both of which are must reads for any software developer). Anyway, one of the things I really like about Hunt and Thomas' book is that they distill software development into three fundamental laws. A recent code review got me thinking about them again, so I decided that this might be a good time to revisit them in my blog.
Law #1: Do Not Repeat Yourself (DRY)
This law is very simple to understand: do not duplicate code, ever. Duplicating code causes two problems. The first is that if there is a bug in the duplicated code, it must be fixed in every copy of the code--which is time consuming. The second problem is that in order to fix the bug in every copy of the code, the developer must be aware that the code has been duplicated in the first place. This latter issue is a fairly serious one. Although good developers will try to evaluate any bug in the context of the system and consider if there are other manifestations of the bug that could occur, there is no way to account for duplicate code. Typically, when a bug is found in a piece of code that has been duplicated, the developer will fix the bug where it was found, but the fix will not be made to the other areas of duplication. This is often the case even when the developer who caused the duplication in the first place is fixing the bug; he or she will often not remember that the duplication exists and needs to be addressed. The end result is that the same fundamental bug may turn up several times more in different contexts--in each case wasting time and effort to investigate the issue and redevelop the same fix.
The solution to this problem is to simply avoid it. However, in order to do that, we must recognize when we are duplicating in the first place. Some forms of duplication are easier to recognize than others. Hunt and Thomas mention several types of duplication:
- Imposed duplication--where duplication occurs intentionally because developers don't think they have a choice.
- Inadvertant duplication--where duplication happens without developers realizing it.
- Impatient duplication--where developers duplicate code because it is the easiest thing to do.
- Interdeveloper duplication--where duplication occurs because two different developers are writing the same code without being aware of what the other is doing.
The first three forms of duplication are what I'm interested in. The bottom line with all of them is to quit using copy and paste when you are writing code. Seriously. If you find yourself copying and pasting code, stop and ask yourself what you are doing and whether you are introducing duplication into your code. If it turns out the answer to the question is yes, rewrite your code to remove the duplication.
As it turns out, some duplication is easier to detect and remove than others. I was recently reviewing some code where the author was generating file names and directory names by appending a number to the base name. The logic was to simply create a name by appending a number and check if the name existed on disk. If it did, the number was incremented and the name tested again. The first available name was used. The problem that the developer faced was that they were dealing with both directory and file names. So they needed to use File.Exists in one case and Directory.Exists in the other. As a result, they created two different methods which differed only by the mechanism used to check the existence of a filename. Although the two methods were not strictly identical, the key problem is that the algorithms were duplicated. If a bug were to turn up in the algorithm, it would need to get fixed in multiple places.
In this case, the duplication was intentional--the developer was aware of it, but didn't see how to avoid it. As it turns out, the solution is fairly simple. The duplication could be avoided by simply passing the name checking function as a delegate to a single method that implemented the algorithm. If necessary, top level methods could be created to call the implementation method with the correct context. In any event, intentional duplication is almost never acceptable. If you find yourself in a position where you feel you cannot avoid duplication, get help. Keep pulling people in until a solution is reached that doesn't involve duplication.
The only places where duplication is (in my opinion) at least tolerable is in generated code and in test cases. In the former, duplication is acceptable because generated code is never modified directly. If there is a bug in generated code, it must be fixed in the generator itself. So as long as the generator itself contains no duplication, the bug can be fixed in one place and then the corrected code can be regenerated.
I believe duplication in test cases is acceptable for a few reasons. First, and foremost, attempting to eliminate duplication in test cases can make it difficult to understand what the test case does. To me, this somewhat trumps maintenance. More importantly, test cases generally are not (and should not be) heavy-weight. They generally do not contain much logic, so typically they will not contain bugs and will generally be easy to refactor. Even if readability weren't an issue, at the end of the day I'm still willing to risk some extra maintenance cost simply so I don't have to spend time trying to figure out how to avoid duplication. To me, rapid development of test cases also trumps maintenance. Again, though, test cases should be extremely light weight. I'm a firm believer in enforcing the law of DRY in the framework that the test cases are built on.
Law #2: Preserve Orthogonality
Two things are orthogonal if they are independent of one another. Essentially, the Hunt and Thomas' concept of orthogonality is just a new-school way of talking about the concepts of cohesion and to a lesser extent, coupling. The bottom line here is that we should always strive to limit methods, classes, and modules to a singular purpose as well as limit the dependencies between pieces. If you need more of a refresher on these topics, Wikipedia has a good overview on cohesion here: http://en.wikipedia.org/wiki/Cohesion_%28computer_science%29 and coupling here: http://en.wikipedia.org/wiki/Coupling_%28computer_science%29.
The idea behind preservation of orthogonality is to make maintenance simpler. As long as two methods or classes are orthogonal, there is no chance that a change in one could affect the other. In other words, it is less risky to make changes when orthogonality is preserved since unintended side-effects are less likely.
While Hunt and Thomas address many aspects of preservation of orthogonality, I want to touch on one aspect that they don't. Namely, that if statements are often an indication that orthogonality is breaking down. While there are certainly many valid uses for if statements, I personally tend to treat them with a great deal of skepticism because they generally have the effect of reducing cohesion. One warning sign that an if statement is not a good approach is when you see the same conditional spread throughout a class or method. For example, I was recently looking at a class that was responsible for automating the creation of project items in Visual Studio. One of the methods in the class consisted of a switch statement, which depending on whether the project item was created new or added from an existing object would do the right thing to create the project item. As written, the method would create the project items via automating the user interface. However, a change was being made to allow project items to be created via object model automation. The proposed fix was to simply add an if statement at every level of the switch statement that would use the object model if the caller required it, otherwise it would fall through to the original implementation.
There are a couple of problems with this approach. The first is just the duplication of the if statement. Remember DRY? If the conditional expression needed to change (or there was a bug in it), every instance of the expression would have to get fixed. This problem could be mitigated by abstracting the conditional into its own function; i.e. if (UseObjectModelAutomation).
The second problem is that code can be added both at the top and the bottom of the method outside of the conditional expression. This makes it easy to introduce bugs. If someone isn't aware (or thinking about) the fact that there are two code paths here, they could easily add code that either depends on one or the other, or interferes with one or the other. Its a common scenario; the developer gets a bug that says if I do A then B is broken. The developer fixes the bug (and tests the fix), but never realizes they have broken C. And their response will invariably be, "I didn't realize I had to test C".
There are some mitigations here. The first would be to ensure the conditional can't be missed. That means using else clauses and making sure all code within the method is within the if or the else. While doing this can't prevent the above scenario, it will make it a whole lot less likely. In this case though, the problem is the switch statement. The only way to have a clean if-else statement would be to duplicate the switch--which really isn't a good option (that DRY thing again).
As it turns out, there is a better option: use an object-oriented solution. The trick is to refactor the project item class. Start by extracting a class that contains all of the project item creation code from the switch statement (which may require doing an extract method first). This gives you a class that knows how to create project items using UI automation. Then create a second class that contains the implementation for creating project items via object model automation. Depending on what you are trying to accomplish, this class can either derive from the UI Automation project item creator class, or it can derive from a base class that is common to both classes. In any event, the original method implementation would simply spin up the correct project item creator as an instance of the base type. The switch statement would call the appropriate method via the interface defined by the common base class. In other words, the details of which automation mechanism will be used are no longer known to the switch statement.
The other thing I look for with if statements, is just the size of the enclosed block. Generally, if it is anything more than a function call, I'll look for an opportunity to do an extract method. This isn't really an orthogonality issue per se, but it helps keep things readable and helps cut down on duplication.
Law #3: All Decisions Must Be Reversible
Reversibility is simply the ability to easily undo any decision you make. What Hunt and Thomas are really talking about here are the larger decisions that comprise architecture. Software must be able to adapt to change. For example, the database system in use today, may be suplanted by a better faster version from a different vendor. Can your system easily handle the switch? If not, it will probably be discarded.
One of the best ways to ensure reversibillity is to build a good abstraction layer around all dependencies. Doing this ensures that not only could you change your dependencies at the drop of a hat, but it also makes it easier to handle the case when your dependencies break you. With an abstraction layer in place, all changes are local to the layer. There is no need to search your code base for all of the places where code uses the dependency. In addition, the layer defines your requirements. The layer's interface tells you exactly what you would need from any potential replacement technology in order to make a switch.
I was guilty of violating this rule recently. I was trying to overhaul our system of running tests which was extremely complicated and and involved several diferent files. I wasn't fond of the test harness we were using, but since I was told we were committed to it, I figured, "why not leverage it?" Since that test harness required a data file of its own, I decided that things could be greatly simplfiied if we just moved all of our test data into that file--which we had to use anyway. That approach certainly streamlined our system, but at the cost of reversibility. Recently, our testing requirements changed and it turned out we need to use a different harness. Unfortunately, my decision to move our test data into the original harness's file left us with no simple way to get our data into the new harness. The ironic thing about this is that I had taken great pains to ensure that our system would allow you to easily swap in whatever test harness you needed. I just got blinded by the other problem I was trying to solve and didn't think about how my solution would impact reversibility.
Well, there you have it--software development distilled into three simple laws. As complex as software development is, I'm a big believer in these laws. Think about them when you are writing and/or reviewing code and they will help make you a better developer.