Programming Errors at Project Scale

Errors in Multi-Person Projects

Errors in multi-person projects involve typical programming errors[1] and, more annoyingly, interpersonal errors.

This article is a quick overview of both technical and interpersonal errors we encounter when trying to develop software with more than one person. No literature was consulted. All thoughts are my own delusions.

We don't cover extremely detailed programming errors like forgetting to explicitly sprunj your homomorphic monad iterators—we just cover errors reaching project-level visibility.

The Seven Errors (Y, Z, A, B, C, D, E)

Error Y

Problem: Your program doesn't compile or parse.

Error Y is the easiest[2] to discover of all error types because your compiler or interpreter gives you immediate feedback.

Error Y is also the least excusable kind of error to release into a public repository (commonly known as Breaking the Build or just Err—YSODUMB).
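As a rough illustration of how cheap Error Y is to catch, here is a minimal sketch of a check that refuses any Python source that does not parse. The function name `parses_cleanly` is made up for this example; the only real API used is the standard library's `ast.parse`.

```python
import ast


def parses_cleanly(source: str) -> bool:
    """Return True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"  # missing colon after the signature

print(parses_cleanly(good))  # prints True
print(parses_cleanly(bad))   # prints False
```

Wiring a check like this into a pre-commit hook or CI job is one way to keep a broken build out of the shared repository.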

Error Z

Problem: Your program has errors visible in warning levels or basic static analysis.

Error Z is the second easiest to discover of all error types because detecting it is either built into your compiler/interpreter or only requires running a static analysis tool over your code.

Examples of guarding against Error Z include: -Wall, use strict;, dialyzer, lint, and scan-build.

Error Z is still inexcusable for released projects. Your warning or analysis phase may elicit false positives, so you need a stable way to filter through known not-errors. The easiest way to correct for not-errors is altering the behavior of your program to remove conditions your analyzers can't reason through.
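One possible shape for a stable filter over known not-errors is a reviewed baseline that gets subtracted from every analyzer run. The sketch below assumes findings arrive as (path, rule, message) records; that shape, the rule names, the file names, and the function name `unreviewed_findings` are all invented for illustration and do not match any particular tool's output.

```python
from typing import NamedTuple


class Finding(NamedTuple):
    path: str
    rule: str
    message: str


def unreviewed_findings(findings, baseline):
    """Drop findings whose (path, rule) pair was already reviewed and accepted."""
    accepted = set(baseline)
    return [f for f in findings if (f.path, f.rule) not in accepted]


# Hypothetical analyzer output plus a hand-maintained list of accepted false positives.
findings = [
    Finding("net.c", "unused-variable", "variable 'tmp' is never read"),
    Finding("auth.c", "unreachable-code", "statement is never executed"),
]
baseline = [("net.c", "unused-variable")]

for f in unreviewed_findings(findings, baseline):
    print(f"{f.path}: {f.rule}: {f.message}")
```

Matching on path and rule instead of line number keeps the baseline from breaking every time unrelated edits shift lines around. The trade-off is that it also hides any new instance of the same rule in the same file, which is one more reason restructuring the code is usually the cleaner fix.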

Examples of Error Z include "goto fail" and that time you ignored the warning about an assignment being used as a boolean condition, which caused the condition to always evaluate to true, and lost two weeks of customer data before anybody noticed.
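To make the always-true condition concrete, here is a toy static check built on Python's standard `ast` module. It is only a sketch of the idea behind warning flags and linters, not how any real analyzer works, and the function name `constant_conditions` and the sample code are hypothetical. It flags `if` statements whose test is a literal constant, which is one simple way a condition ends up always evaluating the same way.

```python
import ast


def constant_conditions(source: str) -> list[int]:
    """Return line numbers of `if` statements whose test is a literal constant."""
    tree = ast.parse(source)
    return [
        node.lineno
        for node in ast.walk(tree)
        if isinstance(node, ast.If) and isinstance(node.test, ast.Constant)
    ]


sample = (
    "def purge(records, dry_run):\n"
    "    if 1:  # was meant to be `if not dry_run:`\n"
    "        records.clear()\n"
    "    return records\n"
)

print(constant_conditions(sample))  # prints [2]
```

Nothing here runs the sample code. The problem is visible from the source text alone, which is what makes this class of error so cheap to catch.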

Error Z+ is a special case of Error Z where problems are discoverable by live evaluation of your system instead of by static ahead-of-time analysis. Examples of Error Z+ prevention tools include: property testing using QuickCheck, extensive fuzzing, or running your full-coverage test suite under memory allocation reporting tools (valgrind/dbx/lldb) to detect leaks and invalid accesses before exploits show up in the wild.
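The property-testing idea can be sketched without any library. The snippet below is a hand-rolled stand-in for what QuickCheck-style tools automate (it uses a naive generator and does no shrinking); `find_counterexample`, `buggy_dedupe`, and the property itself are all invented for this example.

```python
import random


def buggy_dedupe(items):
    """Intended to remove duplicates while keeping first-seen order.

    Bug: it only drops duplicates that sit next to each other.
    """
    result = []
    for item in items:
        if not result or result[-1] != item:
            result.append(item)
    return result


def no_duplicates_remain(items):
    """Property: the output never contains the same value twice."""
    output = buggy_dedupe(items)
    return len(output) == len(set(output))


def find_counterexample(prop, trials=500, seed=0):
    """Try random small integer lists; return the first input that breaks prop."""
    rng = random.Random(seed)
    for _ in range(trials):
        candidate = [rng.randint(0, 3) for _ in range(rng.randint(0, 6))]
        if not prop(candidate):
            return candidate
    return None


counterexample = find_counterexample(no_duplicates_remain)
print(counterexample)
```

A hand-written test like `buggy_dedupe([1, 1, 2])` passes happily. It takes generated inputs with non-adjacent repeats to expose the bug, which is the whole point of evaluating the system live instead of only reading it.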

Error A

Problem: You incorrectly understand what your program is doing when you make a change. Result: Your change doesn't fix what you thought it fixed (or it even introduces more errors).

Error A is the saddest of all errors because the core problem is: you didn't take time to understand the section of code you were editing before changing it.

Reasons for Error A include: trusting an externally provided patch to work (without excessive scrutiny), fixing merge conflicts without reading the entire surrounding context (usually caused by code drifting over time, so textual replacement fixes may not be enough—you need to re-evaluate all your logic too), and programming when not awake enough to hold the larger program context in your head.

Programming interviews tend to entirely focus on Error A. They are attempting to weed out people who can't conceptualize a problem or understand a working system.

Also see: increasing downtime by introducing more errors in your fixes because you are stressed out by the current revenue-impacting outage.

Error A also includes goto fail and heartbleed (assuming non-malice from the authors). Error A is the typical "buggy program" error caused by evolved human brains being really bad at this whole computer multi-layered complexity thing.

Error A is a common root cause of Error Z and Error Z+ occurrences.

Error B

Problem: You think you know something, but you're objectively wrong. You're under the delusion your tools work differently than they actually do. Can be caused by trusting incomplete documentation. Can also be caused by incorrectly extrapolating local behavior to global behavior.

There is no short-term fix for Error B. You can only resolve Error B by encountering errors in the field—or—by subjecting your code to external review by people with different experience than you.

Long-term fixes for Error B include: obtaining (and remembering) experience over time, improving online quick reference documentation so others can correct their knowledge more easily, and reading underlying source code of your platforms instead of trusting documentation or observed materialized behavior.

Examples of Error B include: cross platform API issues (works in platform J, same interface is broken on platform K), filing bug reports about known errors outlined in documentation[3], and working behind a poorly specified API where your combination of inputs is objectively wrong, but you have no way of knowing without ongoing behavioral testing.

Error B is a common root cause of Error Z+ occurrences combined with insufficient multi-platform testing.
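A small, runnable illustration of extrapolating local behavior to global behavior: the same path string parsed under Windows rules and under POSIX rules. Python ships both rule sets as `ntpath` and `posixpath`, so this runs on any platform; the file path itself is made up.

```python
import ntpath
import posixpath

path = "C:\\data\\report.txt"

# Under Windows rules the backslash is a path separator.
windows_name = ntpath.basename(path)

# Under POSIX rules the backslash is an ordinary character,
# so the whole string is treated as a single file name.
posix_name = posixpath.basename(path)

print(windows_name)  # prints report.txt
print(posix_name)    # prints C:\data\report.txt
```

Code that calls `os.path.basename` gets whichever of these two behaviors matches the host OS, so a test passing on your machine says very little about the other platform.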

Error C

Problem: Your thought process is too narrow to remember The Big Picture. You're focused, you're in the zone, but you're too focused. You end up duplicating code or re-creating APIs used elsewhere because your focus on the immediate problem crowds out remembering the system as a whole.

Error C is actually a combination of Error A and Error B. Error B tells us we didn't think properly and Error A tells us we didn't think enough.

Error C is when you can't see your own problems because your brain is bouncing around in a local minimum solution space. Because of your current focus, you can't think of the Right Way to do things at the moment.

Error C is the most interesting error because it's unavoidable by the author at the time of writing, but the author may return an hour or a day later and see their own mistake. Alternatively, the author may never see their mistake alone, but they can instantly see it when pointed out by another person[4]. Error C is obvious upon observation, but may be difficult to see at solo contemplation time.

Examples of Error C include: forgetting a specific API exists in your project so you manually re-create existing library code in your application, not thinking to look for an external well-written library for your problem and re-creating available battle-tested work in a shoddy way, or fixing problems by writing an immediate local solution without considering other common conditions.
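Here is a sketch of the shoddy re-creation case: a hand-written query-string parser next to the standard library function that already exists. `naive_parse_query` is a deliberately incomplete function written for this example; `urllib.parse.parse_qs` is the real standard library call.

```python
from urllib.parse import parse_qs


def naive_parse_query(query: str) -> dict:
    """Quick local solution: split on '&' and '='. Ignores escaping and repeated keys."""
    result = {}
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        result[key] = value
    return result


query = "q=build%20errors&tag=ci&tag=lint"

print(naive_parse_query(query))  # prints {'q': 'build%20errors', 'tag': 'lint'}
print(parse_qs(query))           # prints {'q': ['build errors'], 'tag': ['ci', 'lint']}
```

The local version handles the one simple input its author had in mind, but it is less capable than what was already available: it keeps only the last `tag` and leaves `%20` undecoded.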

Error C is caused by your current mental state. You can't always think your way out of Error C because it is, by definition, transparent to the author. You just haven't thought big enough yet. You can try to minimize Error C occurrences by routinely asking yourself "is there a better solution out there so I don't have to write this?"

Examples of fixing Error C include: formal code review, pair/triple programming, or informal code review by emailing diffs around[5].

Error C typically only introduces redundant code or less-capable versions of functionality available elsewhere. Error C doesn't introduce any bugs or improper behavior.

Error D

Problem: You've been told you're wrong at review/deploy time, but you had no way of knowing before that moment. Also known as Failure to Read The Designer's Mind. Also related to Failure to Make Something People Want.

Error D is subjective, but you are not aware of the subjective criteria.

Error D is the most common error of entire organizations.

Error D is also the most common error of individuals in teams without proper communication in place. Error D is especially dangerous because it only shows up at integration time, not at any point during your own development process.

Error D is the reason fast-feedback/lean/math-based management exists: you must continually check your progress against what is expected of you.

If you build systems depth-first and reach 600 levels deep into creating an expensive project, but then find out somebody stopped wanting your project back when you were at level 7, you just wasted weeks or months of your time.

Error D is also known as the Hamming Non-Isolation Commitment[6]. Try to make sure all your work, even if you don't know people will want it, is reusable for other purposes. Re-use typically involves clean API boundaries, clean bundling by creating packages instead of adding more private source files to your project, and publishing adequate documentation with examples of proper usage and error scenarios.

Sometimes, you can forcibly push your Error D into a non-error by properly defending your position (or by marketing or evangelizing your thoughts/product), but often you are running up against either unstated guidelines or people set in their ways. Nobody ever changed their mind from the paint on a sign.

Small scale examples of Error D include: not using correct coding style for a project when there are no stated coding style guidelines (but the unstated style is still enforced by the project maintainers), creating an application around manipulating a data set when someone just wanted the raw data, or trying to automate a manual process when everybody else is happy with the current repetitious state of things.

A sub-case of Error D is "fixing" the flow of someone else's code because it doesn't make sense to you, but it makes sense to them. For example, if you remove no-op code, the original author may reject your change because they were obviously using it as a placeholder template for future expansion.

Large scale examples of Error D include: Zune, Surface, Windows Vista, {every other Microsoft product since 2000}, all of HP since 2005, and most Google web projects since 2008.

Error D is not a code problem, but rather an expectations problem.

Error E

Problem: Your changes get rejected because someone else imagines you are wrong. Your non-errors are perceived as errors. Also see Bikeshedding.

Error E is the least critical, but worst feeling, of all errors. There's no right or wrong, there is only opinion.

Error E is a special case of Error D. Error D is a late binding rejection, but for provably good reasons just unknown to you. Error E is a late binding rejection, but due to opinions of another actor over whom you have no authority.

Error E is also known as "Internet Comment Rebuttals." When anybody reads any opinion piece online, they immediately fabricate a handful of reasons why what they just read is wrong and horrible and needs to be fixed.

Normally, Internet complaints devolve into pointless comment threads. But with code, the thing to be "fixed" is right there oh so close in front of you. It's tempting to just open up your editor and change the other person's clearly inferior wording, style, or formatting to fit your own personal, clearly superior, way of writing the retention email.

The problem with Error E is the person "fixing" the error is removing independent decision capability and agency from someone else. When fixing an Error E, you must weigh the improvement you make versus the overt disempowerment[7] you broadcast back toward whoever originally created the focus of your change.

Did you let an employee draw up new office interior designs, but then you let your significant other review the plans and they re-wrote almost all of the employee's original plans? Just because they wanted their hands in everything? That sucks for the original employee who now has both wasted their time and feels less useful since their work can clearly just be thrown out on a whim.

The winner of an Error E contest is purely mandated by the power structure of your organization. This is where you see moronic things like CEOs designing logos or corporate menu decisions being made unilaterally with no outside consultation with people who actually know what they are doing[8].

If people at the highest levels always have absolute final say, then employees are perceived to have no independent valid decision making skills. Employees become essentially assistants to enhance the glory of the higher ups in the power structure.

Error E is why the manager of the backlog doesn't work on the project directly (or else they would take all fun+easy+impactful work for themself and assign the other boring work to others).

Error E includes: deciding architecture endianness, rewording things that already make sense, dictatorial color changes, and reliance on old experiences and stereotypes without re-evaluating current possibilities.

You can't directly cause Error E. You can only be accused of causing Error E by someone rejecting your changes in favor of their own changes.

If you are accused of causing Error E, the worst you are guilty of is not being able to read someone else's mind. Error E is the most frustrating of errors because the other person is clearly wrong, but you either don't have the authority to fight it or don't care enough to spark a pointless argument just to correct them.

Error E is not a code problem, but rather an ego problem. You will sometimes Error E your own code when you have multiple valid options with no strong preference one way or another.

People who claim Error E on other people tend to be repeat offenders and horrible at improv—they never Yes And, but rather always "No, my way."

Conclusion

Errors are bad. Some are worse than others, and some are entirely subjective. If you get called out on a completely subjective error, try not to feel bad and just go with the flow.

If you see someone making clearly wrong errors repeatedly, try to help them. Figure out if they are just distracted and being careless or if they actually have difficulty conceptualizing the entire system at once. Everybody is good at something, but not everybody is good at everything. Sometimes you have to delegate or move out of the way so differently talented people can step in and help explore the crevices of knowledge your brain just isn't able to see fluidly.

Takeaway

Stop breaking the build, focus a little more to avoid introducing bugs resolvable with five minutes of thought, and remember that most project-level errors are organizational problems, so organize your way out of them.

-Matt


  1. Textbook "programming errors" are: syntax errors, runtime errors, and logic errors. Those problems focus on code correctness with zero regard for the implementation as a whole. A 2,000 character minified JS function may have zero "programming errors," but if that were the only format you had for the code, it would be a giant honking project error.

  2. Unless you're doing cross-platform development. If you've got client-side JS and your resources depend on CORS or lenient browser permissions, you are in an entire other parallel dimension of error hell. Or, if you develop only on BSD then deploy to Linux, you also hit new error edge cases not discoverable at development time. This is where a good CI strategy can help you find errors as soon after commit time as possible.

  3. Though, if you have people filing errors against the design of your system, you either need to improve your documentation, improve your error messages, or allow tunable (🐟) design constraints for your system. You've designed something people can't understand, so you need to give them immediate guidance when they hit common error situations.

  4. The NP-Complete of errors.

  5. Never underestimate the power of an email post-commit hook sending every diff to your entire team.

  6. "Instead of attacking isolated problems, I made the resolution that I would never again solve an isolated problem except as characteristic of a class." "...do it in such a fashion that people can indeed build on what you've done"

  7. My spell checker thinks disempowerment should be a hyphenate, but it helpfully suggested 'disembowelment' (🔪), which also works in this context.

  8. This is also a sign of organizational dysfunction. You must never have a person in a position of authority who is a broken combination of petty with low self esteem. The broken person will constantly seek approval by trying to "disrupt" things in an attention-grabbing way, thereby secretly hoping for universal praise in response to everything they do.