Robot Agents Replacing Manual Labor
Guido Outsourced by Guru
T McCabe
January 2019
A true story.
Blanco Bank is located at 5254 Lincoln Avenue, Monterrey, Mexico. Its motto of ‘rock solid service’ was shaken when it installed an ‘expert system’.
On the morning of October 14, 2016, Mr. Guido Garcia, a twenty-five-year Blanco Bank teller, sits down with the newly hired architect of an expert system. Mr. Garcia is less than excited; the expert, Dr. Guru, is viewed as the enemy, and yet he expects to learn the nuances of the teller job from the about-to-be-fired Guido Garcia. Dr. Guru generates a narrative description of Guido’s manual labor, walks Mr. Garcia to the outplacement office, whispers “yikes”, and runs off to do his ‘real work’ -- building the expert system.
Big surprise, it didn’t
work.
Not really a surprise: there has been no testing of the requirements, nor is there a plan for testing the ‘as built’ AI system. It’s easy for Guru -- he blames it on the ex-employee Guido, who is long gone and embittered. And Dr. Guru labors on, painfully discovering missing nuance after missing nuance, one by one, building multiple failing AI versions. The budget is blown by a factor of ten. Typical. But not that unexpected, from Guru: ‘these AI systems are a challenge’.
Here is a better
approach.
Let’s go back to the morning of October 14, with Mr. Garcia describing his bank job. He can describe it as a business process -- there’s a place he starts, there’s work he does, there are decisions he makes, there are iterations he goes through, and then finally, at some point, he’s done. In practice this is often captured by creating a business process model (BPM) -- several business process modeling languages are in popular use: Business Process Modeling Notation (BPMN), XML Process Definition Language (XPDL), and Business Process Execution Language (BPEL).
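As a minimal sketch (in plain Python rather than a formal BPMN tool), the teller’s job can be captured as a directed graph of steps and decisions. Every step name below is hypothetical, invented purely for illustration:

```python
# A hypothetical teller workflow as a directed graph: each node maps to
# the nodes it can flow to. This is the same structure a BPMN/XPDL/BPEL
# model would capture formally.
workflow = {
    "start":            ["greet_customer"],
    "greet_customer":   ["identify_request"],
    "identify_request": ["deposit", "withdrawal"],     # decision point
    "deposit":          ["update_ledger"],
    "withdrawal":       ["check_balance"],
    "check_balance":    ["update_ledger", "decline"],  # decision point
    "update_ledger":    ["done"],
    "decline":          ["done"],
    "done":             [],
}

# Every node with more than one outgoing edge is a decision Mr. Garcia
# makes -- each one adds paths the expert system must eventually handle.
decisions = [n for n, outs in workflow.items() if len(outs) > 1]
print(decisions)  # ['identify_request', 'check_balance']
```

Even this toy graph makes the narrative's structure explicit: a start, work steps, decisions, and an end.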
Mr. Garcia’s description, by contrast, was overly verbose and merely conversational -- without the rigor of a formal BPM. However, it is indeed an algorithm, in narrative form. And for an AI system to replace a manual worker, the expert system has to start with an algorithm.
Such a loosely described algorithm has inherent complexity. In fact, it intrinsically has the classical McCabe cyclomatic complexity, which will tell both Guru and Guido about the inherent complexity of the job -- it can be compared, in complexity terms, with other jobs that have already been automated. The complexity predicts how much work it will be to build such an AI agent. Just as importantly, the complexity determines the requirements validation tests to run on both the narrative description and on the AI system when it is complete.
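For a single connected process graph, the cyclomatic complexity is V(G) = E - N + 2, where E is the number of edges and N the number of nodes. A minimal sketch, using a hypothetical stand-in for the teller’s workflow:

```python
# A hypothetical, simplified stand-in for the teller's BPM:
# each node maps to the nodes it can flow to.
workflow = {
    "start":    ["request"],
    "request":  ["deposit", "withdraw"],   # decision: transaction type
    "deposit":  ["done"],
    "withdraw": ["check"],
    "check":    ["done", "decline"],       # decision: sufficient funds?
    "decline":  ["done"],
    "done":     [],
}

nodes = len(workflow)                           # N = 7
edges = sum(len(v) for v in workflow.values())  # E = 8
complexity = edges - nodes + 2                  # V(G) = E - N + 2P, with P = 1
print(complexity)  # 3 basis test paths for this job description
```

A V(G) of 3 here means three basis test paths -- a number that can be compared across jobs and used to size both the build and the testing effort.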
This requirements validation can be done straight from the narrative -- sloppy but effective. A better way is to use a business process modeling language to describe the teller’s job.
We can then compute the McCabe complexity of the BPM job description -- notice that here we are measuring the complexity of the requirements.
The McCabe complexity is the number of basis test paths within the bank’s BPM. It is common practice to limit the complexity of BPMs using McCabe complexity (see Reference 1); what is new here is using the complexity to generate the BPM test paths. This yields the basis test paths and their data -- both to validate the BPM and to obtain tagged learning data for the robot agent. More rigorously, the complexity-delineated test cases form an equivalence partition of the universe of robot-agent test data -- see the footnote.
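Assuming an acyclic, single-entry process model, the end-to-end paths can be enumerated directly -- one basis path per equivalence class. A sketch, again with a hypothetical teller workflow:

```python
# Enumerate every end-to-end path of an acyclic process graph.
# Each path is one basis test case -- one equivalence class of test data.

def paths(graph, node="start", trail=None):
    """Return every path from `node` to a terminal node (no outgoing edges)."""
    trail = (trail or []) + [node]
    if not graph[node]:            # terminal node: one complete path found
        return [trail]
    return [p for nxt in graph[node] for p in paths(graph, nxt, trail)]

# Hypothetical teller workflow, invented for illustration.
workflow = {
    "start":    ["request"],
    "request":  ["deposit", "withdraw"],
    "deposit":  ["done"],
    "withdraw": ["check"],
    "check":    ["done", "decline"],
    "decline":  ["done"],
    "done":     [],
}

for p in paths(workflow):
    print(" -> ".join(p))
# Three paths, matching V(G) = 3 for this graph. Garcia and Guru would
# walk through each one, attaching nuanced test data to each class.
```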
At this point, Mr. Garcia and Dr. Guru would walk through each BPM-generated basis path, thereby flushing out errors as Mr. Garcia explains the nuances of each path. Even though this looks like unit testing, Guido and Guru are in fact testing the requirements before building the forthcoming artificial intelligence system.
Requirements errors are very expensive -- on the order of 270x the cost of ‘coding errors’. Best to catch them right here.
The very same basis test paths derived from the BPM description serve as a good foundation for an acceptance test of the as-built AI system. Each equivalence class would be expanded with nuanced test data. The acceptance test team should include knowledgeable bank employees, including Mr. Garcia.
Ranking the portfolio of Blanco Bank’s manual jobs by their McCabe complexity gives order-of-magnitude estimates of both the effort of building an expert system and the inherent testing that must take place. Also, keeping track of the number of requirements errors found up front will predict the reliability of the as-built AI system.
What is not explained here is the big payoff in rigor. The current state of practice includes neither requirements testing nor upfront modeling of the business process. There are many reasons and many excuses for not validating an AI system at the requirements stage. Here is a way to test a robot agent from the requirements before building it, against the requirements after it is built, and with the participation of the very workers who had been doing the job beforehand.
Not to mention, Mr. Garcia gets
some respect.
------------------------------------
Footnote:
The use cases so derived from a BPM or job description become an equivalence partition of the test-data universe for the robot agent. This gives at least one test case per equivalence class to validate the requirements up front. What’s more, for the millions of AI data points -- called tagged data -- used to teach and test the robot agent, the equivalence classes generated up front become a classification scheme. This means that all subsequent training and test data for the agent is cleanly partitioned into those equivalence classes.
One corollary of this result is the possible machine generation of robot test data within each equivalence class. A machine could fill out the data within each class, making the data robust and comprehensive for both robot training and robot testing.
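A minimal sketch of that corollary, assuming three hypothetical equivalence classes for the teller’s transactions (the field names, value ranges, and class labels are all invented):

```python
# Machine-generate tagged training/test records within each equivalence
# class -- one class per basis path of the hypothetical teller workflow.
import random

random.seed(0)  # reproducible data for the sketch

# Hypothetical classes and value ranges, one per basis path.
classes = {
    "deposit":           {"kind": "deposit",  "amount": (1, 10_000)},
    "withdraw_ok":       {"kind": "withdraw", "amount": (1, 500)},
    "withdraw_declined": {"kind": "withdraw", "amount": (500, 10_000)},
}

def generate(n_per_class=3):
    """Fill out each equivalence class with n_per_class tagged records."""
    records = []
    for label, spec in classes.items():
        lo, hi = spec["amount"]
        for _ in range(n_per_class):
            records.append({
                "kind": spec["kind"],
                "amount": round(random.uniform(lo, hi), 2),
                "tag": label,   # tagged data for training and testing
            })
    return records

data = generate()
print(len(data))  # 9 records, evenly covering every equivalence class
```

Because the classes come from the basis paths, coverage is guaranteed by construction: no class of transactions can be silently missing from the generated data.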
Appendix 1:
Besides a narrative job description or building a BPM, another common practice for deriving learning data for a robot agent -- such as our robot teller -- is to gather, analyze, and transform log data. Since this job is being done by a human and is partly automated, the log data will be a mix of hand-created and computer-generated entries. The care and feeding of log entries is a messy, dirty job -- often given to a data scientist. It involves collecting massive amounts of data, often terabytes, from a variety of databases; the logs can be transaction logs, event logs, audit logs, or server logs, often spread across distributed databases, each with its own unique format. As you can see, this is a messy, dirty business. Our methodology of using the BPM test paths is much cleaner.
Appendix 2:
This article ignores a characteristic many AI systems share: the artificial intelligence learns as it goes. This issue gives rise to the notion of partial algorithms, and to the derivative notion of the complexity of partial algorithms, which will be discussed in a sister article.
Appendix 3:
The requirements validation is a path-by-path walk-through of the BPM -- actually a walk-through of each basis path of the BPM. This is a nontraditional but more effective way to conduct a walk-through. It’s more rigorous than walking through the BPM line by line, because we are going through the paths one by one -- in effect, testing them as the computer would execute them. We are indeed testing the requirements before writing any code.
Appendix 4:
‘Machine learning’ taught from log data is an alternative to the approach here. It is typically done with log data that has to be resurrected from within corporate databases. There is a class of errors that using transaction logs alone will miss: errors of omission.
For example, it was recently reported that a hospital AI agent was built to diagnose and triage pneumonia patients. It worked well, except that it missed a major category of pneumonia patients: those who also have asthma. Doctors and emergency-room nurses know well that having both asthma and pneumonia will send somebody straight to intensive care. The AI system missed this. The intent was to send people home with antibiotics quickly -- and save hospital time and money. It was a major flaw, and people could have died as a result of it.
This is an example of an error of omission. When you take existing log data to train machine learning, there is always the possibility that you are missing a category of data -- a whole equivalence class of test data. Log data is messy and has to be cleaned up, and it’s easy to miss an entire category.
Using a BPM up front would not make the same mistake. An explicit equivalence class of patients with both pneumonia and asthma would have been built into the test data.
Epilog:
Four months later, as you can tell by the picture above, hard times fell on the good Dr. Guru. Guru did not follow the methodology described above; he delivered his expert system three months late and claimed it had been thoroughly tested. The bank trusted his judgment and put his expert system into operation the next day.
Whereupon it failed. Not just on some boundary conditions -- it failed spectacularly every time. Benito Blanco, the founder and president of the bank, was enraged and fired Guru on the spot.
Benito went to Garcia’s home to beg him back to his old job. It took two months for Benito to locate Garcia, who had downsized and moved to the less expensive El Barrio. When the two men finally met, Benito offered to double Garcia’s pay.
It was too late. Garcia had taken another job at a competing bank. He had gotten the job during the interview by describing the horrific mistake Blanco Bank had made with that foolish expert system and that cranky Dr. Guru.
Garcia’s first
task on his new job was to brief all the executives in the new bank – imploring
them to avoid, at all costs, any expert system or anyone with a name like Guru.
Postscript:
A true story? Not in the historical sense.
But more than true in our technology lives. Hundreds of millions of dollars have been lost because of a lack of upfront testing of requirements. In this sense, the story is sadly more than true. It’s true as a modern-day allegory.
A Jewish proverb has it that ‘a story is truer than truth’; in this sense, too, the story is true.
Reference 1: See ‘Managing the Complexity of Business
Process Models’, ftp://public.dhe.ibm.com/software/solutions/soa/newsletter/2010/newsletter-apr10-article_complex_bus_processes.pdf