Lately I’ve been spending a lot of my time introducing people to TDD. One of the best mechanisms I’ve found for this is a Coding Dojo, documented best in Emily Bache’s excellent Coding Dojo Handbook. In the Coding Dojo, we spend an hour or so with a group of engineers writing test-driven code to solve very specific, tractable problems (katas, in the lingo) that are independent of their day-to-day work.
As a result, I’ve been getting to vicariously relive the process of learning TDD for the first time. One thing that has really stood out from the sessions is how hard it is to know not when we should test something, but when we shouldn’t.
There’s a lot to say on this subject, but I’ll start out with a very simple observation: traditional test-driven development is an example-based approach to verifying system behavior.[1] In logical terms, we’re writing there-exists checks, not for-all checks. Making this simple distinction clear has proven to be a critical piece in getting over the hump of “I don’t know when to stop writing tests”.
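One way to write the distinction down, for an input domain $D$ and a specification $\mathrm{spec}(x)$ (“the output for $x$ is what the requirements say it should be”):

$$\underbrace{\mathrm{spec}(x_1) \wedge \cdots \wedge \mathrm{spec}(x_k)}_{\text{example-based: a few chosen } x_i \in D} \qquad \text{versus} \qquad \underbrace{\forall x \in D.\ \mathrm{spec}(x)}_{\text{for-all: the entire domain}}$$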
The clearest example of this I’ve seen is when TDDing the FizzBuzz problem. Quickly stated, FizzBuzz is a children’s game wherein one has to count from 1 to 100, replacing any number divisible by 3 with “Fizz”, any number divisible by 5 with “Buzz”, and any number divisible by both 3 and 5 with “FizzBuzz”. In at least 50% of the tests I see written for this problem, I’ll find some variant of code that looks like the following:
```java
// Assumes JUnit and a fizzBuzz(int) -> String method under test.
@Test
public void numbersDivisibleByThreePrintFizz() {
    for (int i = 1; i <= 100; i++) {
        if (i % 3 == 0) {
            assertEquals("FIZZ", fizzBuzz(i));
        }
    }
}
```
There are a few problems with this approach that we can quickly point out (looping in a test, duplication of conditional checks between test and production code, conditional execution of the assert, multiple assertions), but the real problem is both more fundamental and simpler: the test is attempting to establish a for-all condition over the currently-specified domain of the problem.
So, why don’t we want to do this? Let’s think about what we want out of our tests:
- They must be fast
- They should be decoupled/cohesive: a change in requirements should affect as small a number of tests as possible
- They should provide error locality: a test failure should be easy to pin down to a specific fault in production code
What happens with the looping approach is that we focus on the fact that we can achieve point 1, and forget about the importance of 2 and 3. We know our domain is 100 numbers, we know we can run that loop thousands of times a second, so we go ahead and throw it in there.
When we do that, though, we’re left in a bit of a quandary. To test every value, we need to know what each value should be, so we slip in a little conditional check to see if, for a given i, we should be printing out the word “FIZZ”. It’s a simple check, so we think nothing of it. We add the assert, run the test, it passes, and we move on to the next test:
```java
@Test
public void numbersDivisibleByThreeAndFivePrintFizzBuzz() {
    for (int i = 1; i <= 100; i++) {
        if (i % 3 == 0 && i % 5 == 0) {
            assertEquals("FIZZBUZZ", fizzBuzz(i));
        }
    }
}
```
It’s red, so we change our code, run our tests, and boom! Our earlier numbersDivisibleByThreePrintFizz test blows up. What happened? Of course, we didn’t account for the 3 and 5 case in our earlier test, so we hop back over and change it:
```java
@Test
public void numbersDivisibleByThreePrintFizz() {
    for (int i = 1; i <= 100; i++) {
        // Exclude the 3-and-5 case, which now prints FIZZBUZZ.
        if (i % 3 == 0 && i % 5 != 0) {
            assertEquals("FIZZ", fizzBuzz(i));
        }
    }
}
```
And we’re back to green. But wait a minute…now our two tests are almost identical: they both have a loop, condition on variations of the same properties, and assert a very similar result. What’s going to happen if our boss comes in with a new requirement and says that the children now have to say “WOOF” every time a number is divisible by 7? Every test is going to have to be changed! We’re also starting to notice that there’s lots of duplication between our tests and our production code, which is also doing evenly-divisible-by checks: yet more places to change!
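To make that concrete, here’s a sketch of what the patched-up Fizz test would have to become under that hypothetical “WOOF” requirement (still assuming the fizzBuzz method from above), with the same new clause rippling through every other looping test:

```java
// Hypothetical WOOF requirement: the guard in every looping test
// grows yet another divisibility clause.
@Test
public void numbersDivisibleByThreePrintFizz() {
    for (int i = 1; i <= 100; i++) {
        if (i % 3 == 0 && i % 5 != 0 && i % 7 != 0) {
            assertEquals("FIZZ", fizzBuzz(i));
        }
    }
}
```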
This is clearly not what TDD promised us. This is, in fact, the exact reason I hear from many people about why they don’t do TDD: if I have all of these tests and try to change my code, I now have twice the work to do because I have to fix the tests too! Yuck!
So, what’s the solution? This is where we go back to the argument that we’re testing by example and not against all possible inputs. We’re attempting to establish properties of the system for specific inputs that we, as developers, feel are valuable in assessing whether the behavior of the system under other inputs will be predictable and as expected.
For this to work well, we have to recognize that TDD is inherently white-box testing: we’re feeding knowledge from the tests into the code, and knowledge from the code into the tests. The two are in a symbiotic relationship, and we can use our understanding of what corners we are and will be exposing in the code to drive what tests we write next.
For example, rather than testing that all numbers divisible by 3 are going to print FIZZ, we can just test a couple of points with different characteristics that we think are interesting:
```java
@Test
public void sixPrintsFizz() {
    assertEquals("FIZZ", fizzBuzz(6));
}

@Test
public void ninetyThreePrintsFizz() {
    assertEquals("FIZZ", fizzBuzz(93));
}
```
Although we’ve not checked any of the numbers between 6 and 93, we’re pretty confident that our ability to implement an evenlyDivisibleBy3 behavior is good enough that if we’re doing the right thing for these two cases, we’ll be doing the right thing for all the cases in between. Put another way, our assessment of the risk of our code being wrong for those unverified values is low enough that we don’t feel it worthwhile to expend the effort to write additional tests for them.
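For reference, here’s a minimal sketch of the sort of production code these examples drive out; the fizzBuzz name and the String-returning shape are assumptions of this sketch rather than anything fixed by the kata itself:

```java
// Assumed shape of the code under test: maps a number to what the
// child should say for it.
public static String fizzBuzz(int i) {
    if (i % 3 == 0 && i % 5 == 0) return "FIZZBUZZ";
    if (i % 3 == 0) return "FIZZ";
    if (i % 5 == 0) return "BUZZ";
    return Integer.toString(i);
}
```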
What it really comes down to is this: good programming requires good judgment. TDD offers excellent feedback on whether your judgment is taking you in a good direction or a bad direction, and whether your assumptions hold or do not hold. What it does not do is let you stop thinking (constantly) about what decisions you’re making and why. You have to decide when you feel you’ve driven out enough examples to be confident your code is correct. Knowing when you’re at that point is a mixture of experience, educated guesswork, and diligence.
[1] For the purposes of this article I’m focusing on TDD as it is commonly practiced, and not looking at techniques like property-based or theory-based testing, which attempt to establish for-all properties over the SUT.
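For contrast, a for-all check in that style might look something like the following jqwik sketch; the framework choice and the FizzBuzz.fizzBuzz method are illustrative assumptions, not something from the dojo sessions:

```java
import net.jqwik.api.*;
import net.jqwik.api.constraints.IntRange;
import static org.junit.jupiter.api.Assertions.assertEquals;

class FizzBuzzProperties {
    // For-all style: the framework generates inputs across the whole
    // domain instead of us hand-picking example points.
    @Property
    void multiplesOfThreeButNotFivePrintFizz(
            @ForAll @IntRange(min = 1, max = 100) int i) {
        Assume.that(i % 3 == 0 && i % 5 != 0); // filter to the Fizz cases
        assertEquals("FIZZ", FizzBuzz.fizzBuzz(i)); // assumed method under test
    }
}
```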