Lazy fall colors

my writing and musings

unpublished thoughts, poems, experiences, stories, science on their way becoming a book

Thursday, June 13, 2019

Design Refactoring


Object Oriented Design
Refactoring and Visualization

T McCabe
© March 2019
2nd Rough Draft




Introduction
The intent of this paper is to provide insight into improving the quality of object-oriented software architecture.  The focus is the coupling and binding that occurs between object-oriented classes. We will develop a methodology to refactor and un-couple an object-oriented architecture. The discussion includes the perspective of a maintenance programmer, who must incorporate his changes within a mal-structured existing architecture. We will also develop a metric that quantifies the coupling within an architecture.

This paper was fun to write because it connected several different disciplines. The focus is computer science, but to develop the subject we will lean heavily on mathematical graph theory. Also, the paper borrows from the algorithms and heuristics used in big data analytics, wherein very large social graphs or cybersecurity graphs are condensed to discern interesting behavior and patterns.

In the graphs presented here the nodes represent object-oriented classes or objects, and the edges represent dependency.


In order to develop the subject of class coupling, I thought it best to begin with the most extreme examples. For that reason, we will explore what are known in mathematical graph theory as complete graphs. The terminology is that Kn represents the complete graph with n nodes. To the left are the complete graphs K1…..K8.


Most of the focus here is assuming we have the reverse-engineered the source code to produce the class dependency graphs. However, most of the techniques discussed apply equally to the upfront design phase of an object-oriented architecture. This is not pointed out explicitly in the narration, but it is important to keep this in mind. Since the subject is good object-oriented architecture, the leverage point is obviously the upfront design phase. However, we will elaborate the things that go wrong when an architecture is compromised, and then we must deal with an imperfect world where the architecture is convoluted and complex.


In my 1976 article about Cyclomatic Complexity, I discussed the ‘essential complexity’ of an algorithm -- the extent to which an algorithm is mal-structured. I described a kernel or generating mal-structure construct of an algorithm. It is interesting to ask the analogous question here, within the framework of object-oriented architecture. The question is ‘What is the kernel or generating mal-structure within a OO class inheritance graph? ‘ 

We will use Mathematical Graph Theory to develop our computer science refactoring concepts. In Graph Theory, the graph K5 is a good place to start our discussion. The graph K5 has every class connected to every other class; it is extremely coupled. K5 also approximates a worst case that one would encounter in an existing object-oriented dependency graph. The techniques presented here will work on other graphs, but we will start with K5 to provide a concrete example.  

In this computer science discussion, the nodes of graph theory translate into classes or packages, and the edges of graph theory translate into dependency. The graphs K3 through K5 will be introduced as part of this narrative.

As we examine extreme coupling between object-oriented classes, it is interesting to look for a core, or a kernel stereotype of mal-structure.

After some reflection it appears that the complete graph K3 is this kernel mal-structure stereotype.

My colleagues and I describe this stereotype as ‘Philadelphia love’ – – a sibling inheriting from a brother. This stereotype as a design mantra is the root cause of elaborate poor structure with extreme coupling.

The bizarre situation where node 1 inherits from node 3 is known in graph theory as a ‘strongly connected component’ and it is known in computer science as a ‘circular dependency graph.’ Although all of the results in this paper apply to this case, a circular dependency graph has to be corrected as a first priority before considering the discussion below. 

As an illustration, consider the complete graph K5: 


If we look within K5, can we find a kernel set that repeats many times to form K5? Expressed another way: Is there a kernel subgraph stereotype repeated many times within K5? 

The answer is yes. The result is true in graph theory and by implication, it is also true in computer science object-oriented design. This is illustrated below where K5 is decomposed into six subgraphs of type K3, that collectively form K5. 



This is called a complete refactoring of K5 since it is refactored into its most atomic K3 kernels.

Actually, there are ten K3 subgraphs in K5; the other four are not rooted at node 1 so they are not of interest here.


If we think of this recursively, we can ask the same question of K4: 

Does K4 likewise have a kernel of K3 subgraphs? As illustrated below, the answer is yes, there are three K3 subgraphs within K4. 


Computer Science Implications
This illustrates that a kernel and fundamental OO design mal-structure is ‘siblings borrowing from each other.’ An architecture generated with this mindset or stereotype will result in elaborate mal-structure --- such as K5 or K4. So, the obvious advice is to avoid it in the first place -  avoid it in the OO design phase.

A secondary, and just as important issue is how to refactor a highly-coupled design such as K5. One strategy is to break it down into its most basic subgraphs. This result is given above where K5 is refactored into six K-3 subgraphs.

Another refracturing is to break K5 into less complex K4 subgraphs. As illustrated below, there are four such K4 subgraphs.

To complete this refactoring see the footnote at the end of this paper.

Refactoring Methodology
The graph decomposition shown above is purely from a Graph Theory point of view. A more substantive refactoring discussion embracing computer science issues follows.

To reduce coupling and to simplify the architecture, we are going to reverse the inheritance principle. We are going to disinherit.

We will use a notion from Graph Theory called a ‘bridge’ node. A bridge node connects a graph to the surrounding environment. In our K5 example, node 4 below is a bridge node.

Here is a method to refactor: 

In our K5 example, if we pick node 5 and disinherit its data and methods to its parents (nodes 2,3, and 4), we can then remove it. The resulting reduced graph is K4 -- as shown below. 


We can continue this process and reduce K4. If node 4 is not a bridge node, we would then disinherit its data and methods into node 2 and node 3, and then remove node 4. The resulting graph is K3 – our original kernel graph.

Maintenance Changes to Avoid
So-called ‘spaghetti code’ was one of the bugaboos of the 1970s and 80s. People wondered what kind of mentality would produce overly complex code that was so convoluted nobody could understand it. In most cases, such spaghetti code arose from patching into an algorithm. The original code was well structured enough, but a few patches could have a devastating effect on the algorithm and turn it into spaghetti code. 

An analogous malpractice is happening now in the world of object-oriented design. The following discussion will illustrate how a well-structured class dependency graph can be corrupted with one single change.

Consider the class hierarchy diagram on the right where the following modification is about to be implemented. Class Y wants to reference a protected method of class X. The ancestors that class Y is dependent of are shown in red.
The added red reference of class Y to class X put in as a patch creates a dependency ripple to all of the red classes on the left. The number of classes that X is dependent on went from 5 to 11, changing one line of code. The classes P, Q, and R are also affected -- the number of classes they are dependent on has doubled to include all of the red ancestors of class X.

The situation where K3 causes coupling can be extended beyond the literal K3. K3 appears when a sibling borrows from another sibling. The same phenomena of poor design happen across generations -- when, for example, grandchildren, cousins or great cousins borrow from each other. When this multi-generational borrowing occurs, you will see an extended K3 subgraph that is causing coupling. 

The blue edges and blue nodes here show an extended K3 graph. This was indeed the cause of the mal-structure that coupled the subgraph on the left to the subgraph on the right. 

Although highly coupled subgraphs are error-prone, they also present an opportunity viz a vie visualizing large and complex graphs. 

The technique is to identify complete or almost complete subgraphs and then condense them into one super-class. The net effect is a substantial reduction of complexity and noise in the overall graph, whereby visualization is enhanced. 

This is a common technique used on very large graphs – often with big data analytics.  The clustering and condensation methodology is used in sociology where the complete graphs are called cliques -- people who communicate with each other but are somewhat isolated.  Clustering and condensation are used in microbiology to cluster genes that are highly related. Clustering and condensation are also used in cyber warfare to identify sleeper cells of terrorists and to identify highly influential nodes. Lastly, Clustering is used in traffic analysis and network flow.

Complete graphs like K5 seem like an extreme situation. This discussion describes a metric that can be used to identify subgraphs where the coupling is a matter of degree -- it can be less than complete. This opens the domain for a wider class of graphs that can be analyzed using the techniques in this paper.

A natural coupling metric CM arises when one considers any graph or subgraph. That metric is the number of dependencies divided by the number of possible dependencies. 

The number of possible dependencies for a graph with n nodes is n x (n-1)/2.
Therefore we have CM= #edges/(n x (n-1))/2.

Notice that 0<CM<=1. So, CM measures the degree of coupling as a metric between 0 and 1.

For example, with the graph K3 we get CM(K3) = 3/(3x(3-1)/2) = 3/3 = 1.

With K5 we get CM(K5) = 10/(5x(5-1)/2 = 10/10 = 1.

This is the desired result since K3 and K5 are complete graphs.

Now consider a graph where one edge of K5 is missing – the edge between nodes 4 and 5 is
missing.

We get CM = 9/(5x(5-1)/2) = 9/10 = .9

Since this graph is ‘almost complete’ the metric CM is giving a number that tracks with our intuition. CM will pinpoint, within a large complex graph, the almost complete subgraphs that can be condensed. 

A single class’s coupling to its environment
In this section, we will look for single classes that are tightly bound to its neighbors, and are therefore candidates for condensation.

We will develop a metric called ‘coupling coefficient’ -- when a class has a high coupling coefficient the class with its neighbors can be considered as one it can be condensed to a reduced graph. 

This is different than the previous example where a whole subgraph had a coupling metric. In the previous example, the entire subgraph was a candidate to be condensed if the coupling metric was high. 

Here, we consider a coupling coefficient where the focus is on one class instead of the subgraph of many classes. And the action taken, if the coupling coefficient is high, is to condense the class in question together with its immediate neighbors. In social graphs, this is the fraction of friends who know each other.

Here is the definition of the coupling coefficient C with examples.

C =  # edges in C’s neighborhood/ max # edges in C’s neighborhood. 

Assume k is the number of edges in C’s neighborhood.

The maximum # edges in C’s s neighborhood is the quantity:  k x (k – 1)/2.

So, C = # edges in C’s neighborhood/ ((k x (k – 1)/2) = 2 x (# edges in C’s neighborhood)/k x (k-1)

In the top graph on the right, the blue node has 3 white node neighbors that have 3 edges connecting them. So k =3 and our formula gives:

 C= (2 x 3)/ (3 x 2) = 1.

With the middle graph on the right there is only one connection shared by the blue node’s neighbors. So k =1 and we get

C= (2 x 1)/ (3 x 2) = 2/6 = 1/3

With the bottom graph the neighbors share no edges, so k = 0 and we get

C= (2 x 0)/ (3 x 2) = 0

The coupling coefficient C gives us a hint as to what to condense on a complex class dependence graph. C is bounded by 0 and 1,

0 <= C <= 1.

When C is high, around .9, and the number of neighbors of a class is high it then is considered a good candidate to condense in order to simplify the overall graph. The graph below gives a sense of condensation. If the graph is condensed to one class per color it becomes comprehensible. 


Refactoring methodology to reduce coupling and enhance visualization
With a global refactoring project where the goal is to improve the overall quality of the architecture, it can be difficult to determine where to begin. The first thought is to begin at the lowest classes in the inheritance hierarchy because they are the easiest to modify. However, the metrics we discussed above can give a better starting orientation.

To minimize coupling one should start with the class X that has the highest coupling coefficient. This class X is binded to its environment more than any other. Rather than refactoring this class right away, it makes sense to look at its environment. Is there a subgraph Y containing class X that has high subgraph coupling -- where the metric is the subgraph  coupling metric versus the X coupling coefficient of a single class.

If there is a maximal highly coupled subgraph Y then there are two alternatives to simplify the architecture. The first is a nominal simplification where the subgraph Y is given an artificial name and condensed. This helps visualization and comprehension without changing any source code.

The second approach to simplify the architecture is to remove a highly coupled class like X from its environment Y. This is performed using the steps described above where the class X disinherits it's methods and attributes to his parents and distributes any unique methods to the most appropriate parent. Any reference made to class X are then made to the appropriate parent.

The poetic irony of this procedure is we are ‘un-designing’ the architecture. The top down design methodology that most people use is to keep specializing and creating new classes, inheriting both data and methods. This can result in too many classes and too much coupling. The surprise here is that we are reversing the design and unpacking specialized classes back into their more abstract parents. Un-designing a design.  Reverse engineering and refactoring in the literature proposes many specialized techniques, but this is the most fundamental of them all --- ‘un-designing a design ‘.

We are fortunate to be in the software business. Building architects have a more drastic procedure when a skyscraper has outlived is safety profile or utility -- they demolish the building and implode it into rubble. No reusable pieces.

In software we’re lucky, our version of architecture implosion can be achieved with selective ‘un-design’; we can salvage many reusable pieces. 


The beginning of automated re-engineering

As we view a legacy system over time, from the perspective of version control and commits, it would be fascinating to detect where coupling begins in a class hierarchy diagram. This would mean the first edge inserted in the diagram that had 'siblings borrowing from each other' and creating a closed region in the graph architecture.


To Illustrate, imagine through configuration control, we could detect the version where the inheritance from class X to class Y takes place.  As you can see, it causes a closed region in the inherence hierarchy and initiates coupling.

This first coupling is a hint at gleaning intelligence from the graph. It’s this cross-pollination or hybridization that gives a hint about the business intent of the architecture.

A more extreme example is a change that connects two hierarchies that are completely separate. It is noteworthy as a design artifact because it couples the two hierarchies. Without the coupling edge, the hierarchies of classes are only taxonomies -- they only create specialized classes from more abstract ancestries. The coupling edge also lends itself to business rule extraction.

A coupling edge connects two separate hierarchies. It gives a hint about the operational intent of the software. It's the link between two unrelated architectures, is the hybrid between two genetically distinct species.

One way to clean intelligence from this coupling edge is to simply look at the names of the classes that it connects. In addition, in some architectures the edges are given names, this also would help. One of the key driving concerns for software reengineering is to uncover what are known as ‘business rules’ within the source code. The business rules capture the intent of the software and can be understood by analysts and used by corporate executives -- the business rules are the raison d’etre of the software.

As a result of very complex software architectures, business executives often do not know the business rules of their own corporations. These business rules are embedded in highly convoluted source code that nobody understands -- yet the business operates day to to day, depending on mysterious software that no one understands. For this reason, massive software reengineering projects are initiated and attempt to fathom ‘what the hell is going on ‘.

These projects often fail and frustrate everybody. They use a buckshot approach when an elephant gun is needed to hit the ‘business rules’ bullseye. That bullseye is the coupling edges within class inheritance diagrams.

Highlighting a coupling edge with the names of the classes it connects can be a building block to extracting the business rules from the source code. This can be automated such that the two connected classes and the name of the coupling edge can be presented to a re-engineering analyst, who then inserts some ‘business speak’ narrative that is more understandable to managers and executives. This can be the first step and extracting intelligence within a refactoring project. The key is the coupling edge connecting two taxonomies.

By extracting business rules, we are initiating automated software re-engineering. What’s more, we’re not just automating what’s being manually done, we’re re-engineering with a surgical insight whereby business intelligence can be gleaned from the architecture.

This paper attempted to show the relationship between three separate disciplines -- Mathematical Graph Theory, big data clustering, and condensation techniques, and computer science object-oriented design.

The coupling notions described here, including mal-structure, are pertinent to each of the software development phases. Minimizing coupling is an important design issue; it’s crucial to observe when writing code, and coupling can have a dramatic impact on the maintenance phase.

Two refactoring strategies were presented. The first decomposes an architecture into separate components – purely from a Graph Theory point of view. The second strategy, which incorporates OO design issues, unpacks the data and methods of a non-bridge class and disinherits it to its parents.

Each of the strategies results in refactoring where an extremely coupled design is refactored into component pieces, each of which has less complexity.

Reducing the complexity of an architecture is always a concern.  Two techniques are presented to do this. Both are driven by metrics, they condense architectural complexity by removing highly coupled individual classes and highly coupled subgraphs.

The last section uses the first instance of coupling to derive intelligence from an architecture. It is a first step in extracting ‘business rules’ from source code.

The validity and usefulness of these techniques is compelling when somebody tries to modify software after the fact. As was illustrated, ignoring this issue during maintenance can cause the architecture to degrade in a nonlinear fashion --- a small change can have a large destructive effect. A negative singularity.

Footnote
To fully implement the complete refactoring of K5 the disinheriting method would have to be applied. 

For example, in the first K3 ( 2nd graph below with nodes 1,2,3) node 4 and node 5 would have to disinherit their data and methods to node 2 and node 3.

In the second K3 ( 3rd graph below with nodes 1,4,5) node 2 and node 3 would disinherit their data and methods to node 1.

The other four K3 subgraphs would follow the same ‘disinheriting’ principle.

Operationally the following nuance occurs. Inherited class data and methods are returned to
the sender parent. Unique class data and methods are distributed to the appropriate parent.

Notice that any of the ‘repacked’ K3 subgraphs would fully implement the original K5.