My takeaways from Effective Software Testing by Maurício Aniche

13 minute read

Introduction

I recently finished reading Effective Software Testing by Maurício Aniche as part of a weekly book club with some teammates at work. This post summarizes my biggest takeaways from the book. I also took the liberty of mixing in my own related musings about software testing.

My Background

It might be useful to quickly mention my background, as it affects my opinion of the book:

  • I work primarily on cloud services for energy systems, using a combination of Scala, Akka, Postgres, Kafka, Kubernetes, and a long tail of other technologies.
  • The cloud services I work on exist in large part to model and interact with hardware, but I have very little firmware experience. So I can’t say much about firmware testing, other than that I know enough to appreciate it can be a different beast.
  • I enjoy writing automated unit and integration tests. I value the peace of mind it gives me about the services I build and maintain, especially as they evolve. I also enjoy the challenge of finding a way to test and verify something particularly tricky.

Overall Impressions

First, I recommend this book to any professional software engineer.

To summarize it in one statement: the book provides thorough coverage of testing techniques, with practical tips and examples for practitioners, based on a foundation of research and experience.

I found the book gave precise terms for some of the best practices I’ve learned from experience. For example, I had an intuitive understanding that I can efficiently test a complex logical expression by perturbing each parameter such that it affects the result. The book taught me that this idea has a name: modified condition/decision coverage.
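
To make that concrete, here’s a toy sketch of my own (the `eligible` predicate is invented, not an example from the book): for `a && (b || c)`, four cases are enough for each of the three conditions to independently flip the outcome, instead of all 2^3 = 8 combinations.

```scala
// Hypothetical predicate, used only to illustrate MC/DC.
def eligible(a: Boolean, b: Boolean, c: Boolean): Boolean = a && (b || c)

// Each case differs from another by exactly one condition,
// and that single flip changes the result.
assert(eligible(true, true, false))   // base case: true
assert(!eligible(false, true, false)) // flip a -> result flips
assert(!eligible(true, false, false)) // flip b -> result flips
assert(eligible(true, false, true))   // flip c (vs. previous case) -> result flips
```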

I especially recommend the book to those who dread testing. It presents a first-principles motivation for testing and a set of reliable approaches and tools for effective testing.

The book uses Java for all examples, so it’s particularly valuable for engineers already in that ecosystem, but it’s also accessible to those working in other languages. Having worked primarily in Scala and Python, I can say the techniques translate to those languages, too.

For those in a rush, I recommend reading the introduction and summary of each chapter, and then reading chapters 2, 6, and 7 (specification-based testing, test doubles and mocks, and designing for testability) as they were particularly information-rich.

Takeaways

Test effectively and systematically

In chapter 1, the author establishes that we should focus on effective and systematic testing, and presents methods for both throughout the book.

How do we know we’re testing effectively?

I find it helpful to consider the antithesis of effective testing: ad-hoc testing.

When we practice ad-hoc testing, we implement a feature and then toss in a few test cases just before submitting it for review. We test whatever comes to mind – maybe an example from the Jira or GitHub ticket, maybe something we copy-paste from another test. Over time, we end up with a hodgepodge of miscellaneous tests that people happened to think of. It’s bloated, difficult to evolve, and leaves us with either a feeling of pointlessness or a false confidence in correctness based only on the quantity of tests. I’ve found this is also what leaves engineers with a dislike for testing.

In contrast, effective testing is the process by which we arrive at a set of tests that verify the correctness of an implementation, are maintainable, and yield an efficient ratio of time spent testing to number of bugs prevented.

Another way to think about it is to consider the information or signal-to-noise ratio of each test. Effective testing reminds me of the 20 questions game I played as a kid. We get to ask our software 20 questions. Based on the answers, we choose to deploy to production. We better pick questions with a high signal-to-noise ratio!

How do we know we’re testing systematically?

One of my favorite heuristics presented in the book is that two engineers from the same team, given the same requirements, working in the same codebase, should arrive at the same test suite.

We rarely get a chance to run this experiment, but we can also consider our reaction to tests in code review. If we find it hard to follow tests, or find a teammate’s tests arbitrary or surprising, it might mean the team needs to make its testing practices more systematic.

Exhaustive testing is intractable, but that’s no excuse

In chapter 1, the author establishes that exhaustively testing all possible paths through any interesting piece of software is an intractable problem.

As a simple example, the author presents a system with N boolean configurations. This system requires 2^N test cases for exhaustive coverage. At N=266 we exceed the number of atoms in the visible universe.
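
A quick sanity check of that claim:

```scala
// 2^266 ≈ 1.19 * 10^80, which is already past the roughly 10^80 atoms
// estimated to exist in the visible universe.
println(BigInt(2).pow(266))
```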

So we should accept we can’t test exhaustively. Does this make testing pointless? No.

The author shows us how we can write fewer, but more effective tests through a combination of techniques: partitioning, boundary analysis, pruning unrealistic test cases, and the kind of creativity that results from thoroughly understanding the domain.

If we understand how inputs affect behavior, the boundaries at which inputs change behavior, and which inputs are nonsensical, we can prune the explosion of test cases down to a handful that actually matter. The author walks through an example of this type of pruning in chapter 2.
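
As a toy illustration of my own (not the book’s example), consider a discount that applies to orders of 100.00 or more. One representative per partition, plus the values on either side of the boundary, is enough:

```scala
// Hypothetical code under test: orders of 100.00 or more get a 10% discount.
def total(subtotal: BigDecimal): BigDecimal =
  if (subtotal >= 100) subtotal * BigDecimal("0.90") else subtotal

assert(total(BigDecimal("50.00")) == BigDecimal("50.00"))  // partition: below the threshold
assert(total(BigDecimal("99.99")) == BigDecimal("99.99"))  // boundary: just below
assert(total(BigDecimal("100.00")) == BigDecimal("90"))    // boundary: exactly at the threshold
assert(total(BigDecimal("200.00")) == BigDecimal("180"))   // partition: comfortably above
```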

A brief musing: I find it interesting that, in some cases, property-based testing (sometimes called randomized testing) is an antidote to the intractability of exhaustive testing. As a trivial example, imagine we’re testing a method that multiplies two positive integers. For any pair of inputs, we can verify the method’s output using addition. If we use new randomized inputs every time we run the test suite, we can, in the limit, explore and verify the entire input space. I particularly enjoyed this talk about randomized testing in Apache Lucene and Solr.
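
Here’s a minimal sketch of that idea using ScalaCheck; the `multiply` method and property name are mine, purely for illustration:

```scala
import org.scalacheck.Prop.forAll
import org.scalacheck.{Gen, Properties}

object MultiplicationSpec extends Properties("multiply") {
  // The method under test: hypothetical, standing in for
  // "a method that multiplies two positive integers".
  def multiply(a: Int, b: Int): Int = a * b

  val smallPositive: Gen[Int] = Gen.choose(1, 1000)

  // Property: multiplication agrees with repeated addition,
  // checked against freshly randomized inputs on every run.
  property("agrees with repeated addition") =
    forAll(smallPositive, smallPositive) { (a, b) =>
      multiply(a, b) == Iterator.fill(b)(a).sum
    }
}
```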

Code coverage is a tool, not a goal

The author introduces code coverage in chapter 3.

The rough idea of code coverage is that we can run a tool to identify parts of our software (classes, methods, branches, etc.) which are untested by our test suite. We can review the untested parts and adapt or add tests to cover them.

Like any simple metric, code coverage can be abused. If we consider coverage as the ultimate goal, we risk wasting time on pointless tests and arriving at a false sense of confidence about the correctness of our software.

However, as the author argues in chapter 3, we can still use code coverage to offload the cognitive burden of identifying the parts of software we might have forgotten to test.

I use a code coverage tool anytime I’m implementing or refactoring a substantial feature. I periodically run the tool and examine the outputs to catch any blind spots. When I find untested areas that are particularly risky or interesting, I revise or add test cases to cover them.

In some situations it’s fine to omit a test case. For example, I generally trust the correctness of a language’s standard library or an established open-source library.

At times, I’ve also integrated code coverage as a requirement in continuous integration. I still have reservations about this, as some tools can be flaky or misleading. At the very least, code coverage in CI is a good way to set a lower-bound “safety net” for coverage.
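
For Scala projects, here’s a sketch of what that safety net might look like with the sbt-scoverage plugin (setting names vary slightly between plugin versions; older releases call the threshold `coverageMinimum`):

```scala
// build.sbt -- assumes the sbt-scoverage plugin is enabled for the project.
coverageMinimumStmtTotal := 70  // the agreed lower bound, not a target to chase
coverageFailOnMinimum := true   // fail the build if coverage drops below the bound

// Typical CI invocation:
//   sbt clean coverage test coverageReport
```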

Mutation testing seems like a powerful complement to code coverage

The author briefly introduces mutation testing in chapter 3.

The rough idea of mutation testing is that we can automatically generate mutations of our codebase and run the test suite against each mutation. A mutation can be as simple as flipping < to > or == to !=. If our tests are effective, then at least one test should fail for each mutation. If no tests fail, then we should consider adding a test case to cover the mutation.

This seems like a particularly powerful complement to code coverage. It’s easy to get perfect code coverage with mediocre tests: just call each method and make an inconsequential assertion. It seems more difficult to get perfect code coverage and mutation coverage with mediocre tests.
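
As a toy sketch of my own (not from the book): imagine a mutation that flips `>=` to `>` in the code below. A test that only checks a comfortable input lets that mutant survive; a test at the boundary kills it.

```scala
// Hypothetical code under test.
def isAdult(age: Int): Boolean = age >= 18

// Weak: passes against the original AND against the `>` mutant, so it proves little.
assert(isAdult(30))

// Stronger: passes against the original but fails against the `>` mutant, killing it.
assert(isAdult(18))
```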

I’ve experimented with some libraries in the Stryker Mutator ecosystem for toy projects, but still need to run them on something more interesting.

The role of specification-based testing and structural testing

The author covers specification-based testing and structural testing in chapters 2 and 3.

Prior to reading the book, I had an intuitive understanding of specification-based testing but zero knowledge of structural testing as a distinct concept.

Here’s how my current understanding looks.

In specification-based testing, we start with a non-code specification and implement tests to demonstrate our system adheres to the specification. These tests usually sound something like, “some HTTP call returns a 401 when the caller’s token has expired.” With a sufficiently literate testing framework, specification-based tests can be read by a non-technical teammate and impart confidence that a feature is implemented as specified.

Structural tests, on the other hand, largely ignore the specification and are tailored to the implementation details. Maybe there’s something particularly interesting about the way we choose to return a 401 response vs. a 403 response, but it’s never stated explicitly in the specification. Structural tests are a good place to exercise and document this kind of detail.
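
A small sketch of how the two kinds of tests might read side by side; the statuses, token shape, and `authorize` function are all invented for illustration:

```scala
// Hypothetical auth check standing in for an HTTP endpoint.
sealed trait Status
case object Ok extends Status            // 200
case object Unauthorized extends Status  // 401
case object Forbidden extends Status     // 403

final case class Token(expired: Boolean, scopes: Set[String])

def authorize(token: Token): Status =
  if (token.expired) Unauthorized
  else if (!token.scopes.contains("read")) Forbidden
  else Ok

// Specification-based: reads like the requirement it came from.
// "The call returns a 401 when the caller's token has expired."
assert(authorize(Token(expired = true, scopes = Set("read"))) == Unauthorized)

// Structural: pins down an implementation detail the spec never mentions --
// expiry is checked before scopes, so an expired, under-scoped token gets 401, not 403.
assert(authorize(Token(expired = true, scopes = Set.empty)) == Unauthorized)
```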

At the risk of painting in broad strokes, maybe we can distinguish the tests by audience? Specification-based tests are for the product owners who provided the specification, and structural tests are for the other engineers who will continue to maintain and evolve the implementation.

Language design matters

The author covers contract design in chapter 4.

This gets into the details of input validation, pre-conditions, post-conditions, exceptions vs. return values, and so on.

Seeing the concrete implementations of these concepts in Java made me particularly grateful for the Scala language and how its primitives simplify contract design. Some quick examples:

  1. Scala’s case classes are far less verbose than Java POJOs. This makes it simple to quickly define and provide an informative return type or a test data type.
  2. Scala has native Try and Either constructs. These make it simple to provide known errors as return values instead of throwing exceptions.
  3. Scala has had Option since day one. This means null is virtually non-existent (though, sadly, technically still legal) in Scala code.
  4. Scala can verify a pattern-match is exhaustive at compile-time.
  5. Scala has libraries that encode invariants in a type. For example, NonEmptyList and NonEmptySet in the cats library.

So, even though I agree with the author that “Correct by Design” is a myth (chapter 1), certain languages offer primitives that get us closer to this mythical goal.
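
Here’s a small sketch of what points 1 through 5 can look like together; the domain types are invented, and `NonEmptyList` comes from the cats library:

```scala
import cats.data.NonEmptyList

// 1. A case class as a lightweight, informative return type.
final case class Reading(deviceId: String, watts: Double)

// 2. Known failure modes as values rather than thrown exceptions.
sealed trait ReadingError
case object EmptyBatch extends ReadingError
final case class NegativePower(watts: Double) extends ReadingError

// 3 & 5. Either/Option/NonEmptyList encode "may fail", "may be absent",
// and "never empty" directly in the types.
def validate(batch: List[Reading]): Either[ReadingError, NonEmptyList[Reading]] =
  NonEmptyList.fromList(batch) match {
    case None => Left(EmptyBatch)
    case Some(readings) =>
      readings.find(_.watts < 0) match {
        case Some(bad) => Left(NegativePower(bad.watts))
        case None      => Right(readings)
      }
  }

// 4. The compiler warns if a match on the sealed ReadingError misses a case.
def describe(error: ReadingError): String = error match {
  case EmptyBatch       => "batch was empty"
  case NegativePower(w) => s"negative power reading: $w"
}
```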

Stubs vs. mocks and testing state vs. interaction

The author covers dummies, fakes, stubs, mocks, spies, and fixtures in chapter 6.

I found the distinction between stubs and mocks particularly useful.

In summary, a stub is an object that adheres to some API with a simplistic implementation (e.g., methods return hard-coded data). A mock does the same thing, but also provides mechanisms to verify how it was used (e.g., the number of times a specific method was called).

An analogous distinction is that of state testing vs. interaction testing. My understanding is that state testing lets us verify the ultimate state (the result) of some method, whereas interaction testing lets us verify the specific interactions (of classes, methods, etc.) which led to this state. Stubs facilitate state testing, whereas mocks facilitate interaction testing.

In most cases we really only care about the ultimate state, so stubs are sufficient. In other cases, we might actually care that a specific method was invoked with a specific input a specific number of times, etc. In these cases, we need mocks.
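
A sketch of the distinction, with invented collaborators; the stub is hand-rolled, and the mock uses Mockito’s Java API, which works fine from Scala:

```scala
import org.mockito.Mockito.{mock, verify}

// Invented collaborators, purely for illustration.
trait UserRepository {
  def findEmail(userId: String): Option[String]
}
trait EmailGateway {
  def send(to: String, body: String): Unit
}

final class Notifier(repo: UserRepository, gateway: EmailGateway) {
  def notifyUser(userId: String, body: String): Unit =
    repo.findEmail(userId).foreach(email => gateway.send(email, body))
}

// Stub: a hand-rolled implementation with canned data; we only care about the outcome.
val stubRepo = new UserRepository {
  def findEmail(userId: String): Option[String] = Some("a@example.com")
}

// Mock: lets us verify the interaction itself -- that send() was called, and with what.
val mockGateway = mock(classOf[EmailGateway])
new Notifier(stubRepo, mockGateway).notifyUser("user-1", "hello")
verify(mockGateway).send("a@example.com", "hello")
```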

I think I’ve only ever needed to verify interactions for very performance-sensitive code. Even then, a broader benchmarking harness was more useful for catching performance regressions.

Writing testable code is important and dependency injection helps

The author covers techniques for writing testable code in chapter 7.

The fundamental argument for writing testable code is simple: if code is difficult to test, we won’t test it, and we’ll inevitably introduce bugs. I agree with this, both in principle and from experience.

One pattern seems particularly useful for writing testable code: dependency injection.

The core idea of dependency injection (DI) is pretty simple. Define an abstract interface for each area of our domain and for each external dependency. Write a concrete implementation of each interface. Critically, don’t let the concrete implementations communicate directly with one another. Instead, they can only communicate through the interfaces.

When executed well, the benefit of DI is that we can test each implementation in isolation. We do this by injecting a controlled test double (dummy, fake, stub, mock, etc.) for each of the interfaces required by a particular implementation.

There are dedicated libraries for DI in any mainstream language. In Scala, I’ve found it’s usually sufficient to just use a trait for the interface, a final class for the implementation, and constructor parameters for the actual dependency injection.
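
Roughly what that looks like in practice (all names invented):

```scala
// The interface: a trait describing what the domain needs, nothing more.
trait DeviceRepository {
  def currentFirmware(deviceId: String): Option[String]
}

// The production implementation would talk to Postgres; elided here.
// final class PostgresDeviceRepository(...) extends DeviceRepository

// The class under test depends only on the trait, injected via the constructor.
final class UpgradeService(repo: DeviceRepository) {
  def needsUpgrade(deviceId: String, latest: String): Boolean =
    repo.currentFirmware(deviceId).exists(_ != latest)
}

// In tests we inject a controlled stub instead of a real database.
val stubRepo = new DeviceRepository {
  def currentFirmware(deviceId: String): Option[String] = Some("1.2.0")
}
val service = new UpgradeService(stubRepo)

assert(service.needsUpgrade("dev-42", latest = "1.3.0"))
assert(!service.needsUpgrade("dev-42", latest = "1.2.0"))
```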

Like any other technique, DI can be taken to a counterproductive extreme. There are definitely cases where we want to test one or more concrete implementations together, covered in chapter 9.

TDD is a method for implementing, not a method for testing

The author covers test-driven development (TDD) in chapter 8.

I found the most interesting point in this chapter was the distinction of TDD as a method to guide development, not a method for testing.

In other words, we can use TDD to arrive at an implementation, but we still need to re-evaluate the tests. We might keep some or most of the tests. Or we might realize the tests were effective as a means to an end (the implementation), but are not actually effective tests.

This was a refreshing perspective. I’ve found it challenging to follow TDD as I previously understood it, simply because the tests I want to write while implementing are usually different from the tests I want to write to verify the implementation. When I first start on an implementation, I’m generally satisfied testing with some scratch code in a simple main method that I know I’ll later discard. When I verify the implementation, I want to optimize for readability and maintainability.

Only integration test the dependencies that you exclusively own

The author covers large tests, including system and integration tests, in chapter 9.

Based on my experience, and the author’s perspectives, I’ve arrived at a heuristic for integration testing: I only integration test the external dependencies that my service exclusively owns.

For example, if my service exclusively owns a Postgres database, I’ll write integration tests against a realistic, local Postgres container. If my service also depends on some other shared service, I stub or mock all interactions with that service.
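
Here’s a sketch of the owned-database case, assuming the testcontainers-scala and ScalaTest libraries plus the Postgres JDBC driver on the test classpath (package locations vary a bit across testcontainers-scala versions):

```scala
import com.dimafeng.testcontainers.PostgreSQLContainer
import com.dimafeng.testcontainers.scalatest.ForAllTestContainer
import org.scalatest.funsuite.AnyFunSuite

import java.sql.DriverManager

// Spins up a throwaway Postgres for the suite, because my service owns this database.
// Calls to services owned by other teams would be stubbed or mocked instead.
class DeviceStoreIntegrationTest extends AnyFunSuite with ForAllTestContainer {

  override val container: PostgreSQLContainer = PostgreSQLContainer()

  test("round-trips a device row") {
    val conn = DriverManager.getConnection(container.jdbcUrl, container.username, container.password)
    try {
      conn.createStatement().execute("CREATE TABLE devices (id TEXT PRIMARY KEY)")
      conn.createStatement().execute("INSERT INTO devices VALUES ('dev-42')")
      val rs = conn.createStatement().executeQuery("SELECT count(*) FROM devices")
      rs.next()
      assert(rs.getInt(1) == 1)
    } finally conn.close()
  }
}
```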

If I find that I actually need to think about the implementation details of another service (e.g., the underlying database), there’s probably a bug or a leaky abstraction in the other service. Instead of leaking the detail into my service, I work with the owner to fix the underlying issue. The alternative is that every service knows implementation details about other services, which is clearly untenable.

In other words, every service owner should write tests to ensure the service can be trusted to work as described in the API contract.

Keep stubs small and specific

The author covers some test smells and anti-patterns in chapter 10.

One that I was particularly happy to see covered was fixtures that are too general. This test smell consists of a large fixture that is shared by many tests.

Another way I’ve seen this play out is a singleton stub shared by many tests. This approach is expedient when first implementing a test suite, as it lets us share the same stub in many places (DRY). However, it quickly becomes brittle: when many tests depend on the specific behaviors of a shared stub, a trivial change or addition in the stub will break several unrelated tests.

The solution is to keep stubs small and specific, and to define them as close as possible to the tests that use them. It might result in more test code overall, but it keeps the tests decoupled and easier to maintain.
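
Concretely, I’d rather see each test build the tiny stub it needs than reach for a shared one. A sketch, with invented names:

```scala
trait PriceService {
  def priceFor(sku: String): Option[BigDecimal]
}

// Hypothetical code under test.
def orderTotal(prices: PriceService, skus: List[String]): BigDecimal =
  skus.flatMap(prices.priceFor).sum

// Instead of one shared, ever-growing stub that dozens of tests depend on,
// each test defines only the behavior it needs, right next to its assertion.
val everythingCostsTen = new PriceService {
  def priceFor(sku: String): Option[BigDecimal] = Some(BigDecimal(10))
}
assert(orderTotal(everythingCostsTen, List("a", "b")) == BigDecimal(20))

val nothingIsPriced = new PriceService {
  def priceFor(sku: String): Option[BigDecimal] = None
}
assert(orderTotal(nothingIsPriced, List("a")) == BigDecimal(0))
```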

Conclusion

Buy and read Effective Software Testing by Maurício Aniche. It’s well worth the time and cost!
