One concept that Alan Turing is famous for is his test for evaluating artificial intelligence, known as the Turing Test. In the test, people attempt to identify, by means of a conversation, which "far end" conversationalists are people and which are computers.
I would propose an alternate version of the Turing Test: the artificial intelligence side runs the source code through a tool and an engineer (fairly literally) implements the tool's recommended changes. The "control group" would be one or more experienced engineers who do a peer review of the same code and implement their improvements. If an observer cannot tell which code "improvements" were the result of the machine and which were the result of good engineering judgment, the tool has passed the Turing Test Take Two.
Conversely, if a tool cannot pass the Turing Test Take Two, we must use an experienced engineer to filter the "recommendations" that the tool makes before we apply them.
Case study
In our contracts and in our work instructions, we have implicitly made our tools the "gatekeeper" and final judge of our code quality. The way we fall into this trap is that, in our contracts or Plan for Software Aspects of Certification (PSAC), we specify that we will provide the artifacts generated by our tools to prove that our code is "good." The result is that the goal of running the tool is no longer to produce good code, but rather to produce clean printouts ("no faults found") for the customer.
The way back from this madness is to make engineers responsible for the code they write and the code they review. They should use tools to help them write good code and perform quality reviews, but the artifact that we take to the customer should not be "PCLint signed off on this code written by an anonymous cog" but "Gerald Van Baren wrote this code and is proud of it" and "Joe Competent Engineer reviewed this code and agrees that it is good." In other words, our engineers must taste the sausage. (In that article, map leaders => experienced engineers (aka. gatekeepers), sausage => code, broken machines that result in overtime => broken or misapplied tools that result in overtime.)
Our C Coding Standard (an internal standard consisting mostly of MISRA-C rules) is a classic example of a tool gone wrong. We sowed the seeds of a Turing Test Take Two breakdown in Rule 4 (of the internal standard): "Source code shall be run through an acceptable static source code analyzer." When we write in our PSAC that we will follow our C Coding Standard, we have just jumped the shark without seeing it coming. While Rule 4 does not explicitly state that the engineer must implement the tool's "recommendations,"1 in practice it is easier to make the tool shut up than it is to explain and defend and defend and defend good engineering judgment that is contrary to the tool's "recommendation."
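To make that tension concrete, here is a small C sketch. The flush scenario, the function names, and the lint-comment style are illustrative assumptions on my part, not excerpts from our code base or from the PCLint documentation; it simply contrasts the "make the tool shut up" path with the "defend the engineering judgment" path for an ignored-return-value warning.

```c
/* Illustrative only: the scenario and the suppression-comment style below
 * are assumptions for this example, not actual project code or an exact
 * PCLint directive. */
#include <stdio.h>

void log_shutdown_quiet(FILE *log)
{
    /* Path of least resistance: make the tool shut up.  A blanket
     * suppression silences this "ignored return value" warning -- and
     * quietly hides the next real one, too. */
    /*lint -e{534} return value deliberately ignored */
    fflush(log);
}

void log_shutdown_defended(FILE *log)
{
    /* Engineering-judgment path: record *why* the value is discarded so a
     * reviewer can agree or push back.  We are shutting down; if the flush
     * fails there is nothing useful left to do with the error. */
    (void)fflush(log);  /* deliberate discard, documented for the reviewer */
}
```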
The case study in creating an "acceptable static source code analyzer" is the (first try internal C Coding Standard) checking tool. We spent (lots of money) contracting (elided) to implement it and then discarded it because it was hopelessly inadequate. We followed that by spending (a tenth as much money) on the (second try internal C Coding Standard) tool, which was only moderately inadequate. We are now mainly using PCLint (thousands of dollars per seat), which is almost adequate but still incapable of passing the Turing Test Take Two.
We actually (inadvertently) ran the Turing Test Take Two on the (first try internal C Coding Standard) and (second try internal C Coding Standard) tools: we assigned engineers, in a "sea of hands" fashion, to implement changes to project source code based on the results of running the static analysis tool on that code. That was a disaster. Management quickly realized from the howls of anguish from the affected internal engineers that it wasn't working and backed off that approach.
- When I discussed the (first try internal C Coding Standard) tool with an experienced, highly regarded engineer, he told me he ran the (first try internal C Coding Standard) tool on his code because the PSAC said he had to. He noted that the PSAC had a [X] checkbox for running the checking tool, but did not have a checkbox that said the results were used for anything, so he did an incredibly practical thing: he simply discarded the verbosely bogus results. He then ran PCLint on his code, using it as a tool (not a judge), to identify problem areas and applied his engineering judgment to determine which complaints were real and which were artificial (a sketch of that kind of triage follows below).
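As a hedged illustration of that triage (the functions and the warning wording are invented for this write-up, not taken from his code or from actual PCLint output), the same "possibly uninitialized variable" complaint can be real in one place and artificial in another:

```c
/* Hypothetical examples; the code and the analyzer's wording are invented. */
#include <stddef.h>

/* REAL complaint: "last may be used before it is set."  If count is 0 the
 * loop never runs and an uninitialized value is returned.  Fix the code. */
int last_value(const int *values, size_t count)
{
    int last;
    size_t i;

    for (i = 0; i < count; i++) {
        last = values[i];
    }
    return last;               /* genuine bug when count == 0 */
}

/* ARTIFICIAL complaint: some analyzers flag the same pattern here even
 * though the early return guarantees the loop writes last at least once.
 * Engineering judgment -- not the tool -- decides that the right response
 * is a recorded rationale, not a code change. */
int last_value_guarded(const int *values, size_t count)
{
    int last;
    size_t i;

    if (count == 0) {
        return -1;             /* defined error value for empty input */
    }
    for (i = 0; i < count; i++) {
        last = values[i];
    }
    return last;               /* always written: count >= 1 here */
}
```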