Refactoring and Visualization
T McCabe
© March 2019
2nd Rough Draft
Introduction
The intent of this paper is to provide insight into improving the quality
of object-oriented software architecture.
The focus is the coupling and binding that occurs between object-oriented
classes. We will develop a methodology to refactor and un-couple an object-oriented
architecture. The discussion includes the perspective of a maintenance
programmer, who must incorporate his changes within a mal-structured existing
architecture. We will also develop a metric that quantifies the coupling within
an architecture.
This paper was fun to
write because it connected several different disciplines. The focus is computer
science, but to develop the subject we will lean heavily on mathematical graph
theory. Also, the paper borrows from the algorithms and heuristics used in big
data analytics, wherein very large social graphs or cybersecurity graphs are
condensed to discern interesting behavior and patterns.
In the graphs
presented here the nodes represent object-oriented classes or objects, and the
edges represent dependency.
In order to develop
the subject of class coupling, I thought it best to begin with the most extreme
examples. For that reason, we will explore what are known in mathematical graph
theory as complete graphs. The terminology is that Kn represents the
complete graph with n nodes. To the left are the complete graphs K1…..K8.
Most of the focus here is assuming we have the reverse-engineered the source code to produce the class dependency graphs. However, most of the
techniques discussed apply equally to the upfront design phase of an
object-oriented architecture. This is not pointed out explicitly in the
narration, but it is important to keep this in mind. Since the subject is good object-oriented
architecture, the leverage point is obviously the upfront design phase. However,
we will elaborate the things that go wrong when an architecture is compromised,
and then we must deal with an imperfect world where the architecture is
convoluted and complex.
In my 1976 article about Cyclomatic Complexity, I discussed
the ‘essential complexity’ of an algorithm -- the extent to which an algorithm
is mal-structured. I described a kernel or generating mal-structure construct
of an algorithm. It is interesting to ask the analogous question here, within
the framework of object-oriented architecture. The question is ‘What is the kernel
or generating mal-structure within a OO class inheritance graph? ‘
We will use Mathematical Graph Theory to develop our computer
science refactoring concepts. In Graph Theory, the graph K5 is a
good place to start our discussion. The graph K5 has every class
connected to every other class; it is extremely coupled. K5 also
approximates a worst case that one would encounter in an existing object-oriented
dependency graph. The techniques presented here will work on other graphs, but
we will start with K5 to provide a concrete example.
In this
computer science discussion, the nodes of graph theory translate into classes
or packages, and the edges of graph theory translate into dependency. The
graphs K3 through K5 will be introduced as part of this
narrative.
As we examine extreme
coupling between object-oriented classes, it is interesting to look for a core,
or a kernel stereotype of mal-structure.
After some reflection
it appears that the complete graph K3 is this kernel mal-structure
stereotype.
My colleagues and I
describe this stereotype as ‘Philadelphia love’ – – a sibling inheriting from a
brother. This stereotype as a design mantra is the root cause of elaborate poor
structure with extreme coupling.
The bizarre situation where node 1 inherits from node 3 is
known in graph theory as a ‘strongly connected component’ and it is known in
computer science as a ‘circular dependency graph.’ Although all of the results
in this paper apply to this case, a circular dependency graph has to be
corrected as a first priority before considering the discussion below.
As an illustration, consider the complete graph K5:
If we look within K5, can we find a kernel set that repeats many times to form K5? Expressed another way: Is there a kernel subgraph stereotype repeated many times within K5?
The answer is yes. The result is true in graph theory and by implication, it is also true in computer science object-oriented design. This is illustrated below where K5 is decomposed into six subgraphs of type K3, that collectively form K5.
This is called a complete refactoring of K5 since it is refactored into its most atomic K3 kernels.
Actually, there are ten K3 subgraphs in K5; the other four are not rooted at node 1 so they are not of interest here.
Does K4 likewise have a kernel of K3 subgraphs? As illustrated below, the answer is yes, there are three K3 subgraphs within K4.
Computer Science Implications
This illustrates that a kernel and fundamental OO design mal-structure is ‘siblings borrowing from each other.’ An architecture generated with this mindset or stereotype will result in elaborate mal-structure --- such as K5 or K4. So, the obvious advice is to avoid it in the first place - avoid it in the OO design phase.
A secondary, and just as important issue is how to refactor a highly-coupled design such as K5. One strategy is to break it down into its most basic subgraphs. This result is given above where K5 is refactored into six K-3 subgraphs.
Another refracturing is to break K5 into less complex K4 subgraphs. As illustrated below, there are four such K4 subgraphs.
To complete this refactoring see the footnote at the end of this paper.
Refactoring Methodology
The graph decomposition shown above is purely from a Graph Theory point of view. A more substantive refactoring discussion embracing computer science issues follows.
To reduce coupling and to simplify the architecture, we are going to reverse the inheritance principle. We are going to disinherit.
We will use a notion from Graph Theory called a ‘bridge’ node. A bridge node connects a graph to the surrounding environment. In our K5 example, node 4 below is a bridge node.
Here is a method to refactor:
In our K5 example, if we pick node 5 and disinherit its data and methods to its parents (nodes 2,3, and 4), we can then remove it. The resulting reduced graph is K4 -- as shown below.
We can continue this process and reduce K4. If node 4 is not a bridge node, we would then disinherit its data and methods into node 2 and node 3, and then remove node 4. The resulting graph is K3 – our original kernel graph.
Maintenance Changes to Avoid
So-called ‘spaghetti code’ was one of the bugaboos of the 1970s and 80s. People wondered what kind of mentality would produce overly complex code that was so convoluted nobody could understand it. In most cases, such spaghetti code arose from patching into an algorithm. The original code was well structured enough, but a few patches could have a devastating effect on the algorithm and turn it into spaghetti code.
An analogous malpractice is happening now in the world of object-oriented design. The following discussion will illustrate how a well-structured class dependency graph can be corrupted with one single change.
Consider the class hierarchy diagram on the right where the following modification is about to be implemented. Class Y wants to reference a protected method of class X. The ancestors that class Y is dependent of are shown in red.
The added red reference of class Y to class X put in as a patch creates a dependency ripple to all of the red classes on the left. The number of classes that X is dependent on went from 5 to 11, changing one line of code. The classes P, Q, and R are also affected -- the number of classes they are dependent on has doubled to include all of the red ancestors of class X.
The situation where K3 causes coupling can be extended beyond the literal K3. K3 appears when a sibling borrows from another sibling. The same phenomena of poor design happen across generations -- when, for example, grandchildren, cousins or great cousins borrow from each other. When this multi-generational borrowing occurs, you will see an extended K3 subgraph that is causing coupling.
The blue edges and blue nodes here show an extended K3 graph. This was indeed the cause of the mal-structure that coupled the subgraph on the left to the subgraph on the right.
Although highly
coupled subgraphs are error-prone, they also present an opportunity viz a vie
visualizing large and complex graphs.
The technique is to
identify complete or almost complete subgraphs and then condense them into one
super-class. The net effect is a substantial reduction of complexity and noise
in the overall graph, whereby visualization is enhanced.
This is a common
technique used on very large graphs – often with big data analytics. The clustering and condensation methodology
is used in sociology where the complete graphs are called cliques -- people who
communicate with each other but are somewhat isolated. Clustering and condensation are used in microbiology to
cluster genes that are highly related. Clustering and condensation are also used in cyber warfare to
identify sleeper cells of terrorists and to identify highly influential nodes. Lastly,
Clustering is used in traffic analysis and network flow.
Complete graphs like
K5 seem like an extreme situation. This discussion describes a
metric that can be used to identify subgraphs where the coupling is a matter of
degree -- it can be less than complete. This opens the domain for a wider class
of graphs that can be analyzed using the techniques in this paper.
A natural coupling
metric CM arises when one considers
any graph or subgraph. That metric is the number of dependencies divided by the
number of possible dependencies.
The number of
possible dependencies for a graph with n nodes is n x (n-1)/2.
Therefore we have CM= #edges/(n x (n-1))/2.
Notice that
0<CM<=1. So, CM measures the
degree of coupling as a metric between 0 and 1.
For example, with the
graph K3 we get CM(K3) =
3/(3x(3-1)/2) = 3/3 = 1.
With K5 we get CM(K5) = 10/(5x(5-1)/2 =
10/10 = 1.
This is the desired result since K3 and
K5 are complete graphs.
Now consider a graph
where one edge of K5 is missing – the edge between nodes 4 and 5 is
missing.
We get CM = 9/(5x(5-1)/2) = 9/10 = .9
Since this graph is
‘almost complete’ the metric CM is
giving a number that tracks with our intuition. CM will pinpoint, within a large
complex graph, the almost complete subgraphs that can be condensed.
A single class’s coupling to its environment
In this section, we will look for single classes that are tightly bound to its neighbors, and are therefore candidates for condensation.
We will develop a metric called ‘coupling coefficient’ -- when a class has a high coupling coefficient the class with its neighbors can be considered as one it can be condensed to a reduced graph.
This is different than the previous example where a whole subgraph had a coupling metric. In the previous example, the entire subgraph was a candidate to be condensed if the coupling metric was high.
Here, we consider a coupling coefficient where the focus is on one class instead of the subgraph of many classes. And the action taken, if the coupling coefficient is high, is to condense the class in question together with its immediate neighbors. In social graphs, this is the fraction of friends who know each other.
Here is the definition of the coupling coefficient C with examples.
C = # edges in C’s neighborhood/ max # edges in C’s neighborhood.
Assume k is the number of edges in C’s neighborhood.
The maximum # edges in C’s s neighborhood is the quantity: k x (k – 1)/2.
So, C = # edges in C’s neighborhood/ ((k x (k – 1)/2) = 2 x (# edges in C’s neighborhood)/k x (k-1)
In the top graph on the right, the blue node has 3 white node neighbors that have 3 edges connecting them. So k =3 and our formula gives:
C= (2 x 3)/ (3 x 2) = 1.
With the middle graph on the right there is only one connection shared by the blue node’s neighbors. So k =1 and we get
C= (2 x 1)/ (3 x 2) = 2/6 = 1/3
With the bottom graph the neighbors share no edges, so k = 0 and we get
C= (2 x 0)/ (3 x 2) = 0
The coupling coefficient C gives us a hint as to what to condense on a complex class dependence graph. C is bounded by 0 and 1,
0 <= C <= 1.
When C is high, around .9, and the number of neighbors of a class is high it then is considered a good candidate to condense in order to simplify the overall graph. The graph below gives a sense of condensation. If the graph is condensed to one class per color it becomes comprehensible.
Refactoring methodology to reduce coupling and
enhance visualization
With a global
refactoring project where the goal is to improve the overall quality of the
architecture, it can be difficult to determine where to begin. The first
thought is to begin at the lowest classes in the inheritance hierarchy because
they are the easiest to modify. However, the metrics we discussed above can
give a better starting orientation.
To minimize coupling
one should start with the class X that has the highest coupling coefficient.
This class X is binded to its environment more than any other. Rather than
refactoring this class right away, it makes sense to look at its environment.
Is there a subgraph Y containing class X that has high subgraph coupling --
where the metric is the subgraph
coupling metric versus the X coupling coefficient of a single class.
If there is a maximal
highly coupled subgraph Y then there are two alternatives to simplify the
architecture. The first is a nominal simplification where the subgraph Y is
given an artificial name and condensed. This helps visualization and
comprehension without changing any source code.
The second approach
to simplify the architecture is to remove a highly coupled class like X from
its environment Y. This is performed using the steps described above where the
class X disinherits it's methods and attributes to his parents and distributes
any unique methods to the most appropriate parent. Any reference made to class
X are then made to the appropriate parent.
The poetic irony of
this procedure is we are ‘un-designing’ the architecture. The top down design
methodology that most people use is to keep specializing and creating new
classes, inheriting both data and methods. This can result in too many classes
and too much coupling. The surprise here is that we are reversing the design
and unpacking specialized classes back into their more abstract parents.
Un-designing a design. Reverse engineering and refactoring in the
literature proposes many specialized techniques, but this is the most
fundamental of them all --- ‘un-designing a design ‘.
We are fortunate to
be in the software business. Building architects have a more drastic procedure
when a skyscraper has outlived is safety profile or utility -- they demolish
the building and implode it into rubble. No reusable pieces.
In software we’re
lucky, our version of architecture implosion can be achieved with selective
‘un-design’; we can salvage many reusable pieces.
The beginning of automated re-engineering
As we view a legacy system over
time, from the perspective of version control and commits, it would be
fascinating to detect where coupling begins in a class hierarchy diagram. This
would mean the first edge inserted in the diagram that had 'siblings borrowing
from each other' and creating a closed region in the graph architecture.
To Illustrate, imagine through configuration control, we
could detect the version where the inheritance from class X to class Y takes
place. As you can see, it causes a closed
region in the inherence hierarchy and initiates coupling.
This first coupling is a hint
at gleaning intelligence from the graph. It’s this cross-pollination or
hybridization that gives a hint about the business intent of the architecture.
A more extreme example is a change that connects two
hierarchies that are completely separate. It is noteworthy as a design artifact
because it couples the two hierarchies. Without the coupling edge, the hierarchies
of classes are only taxonomies -- they only create specialized classes from
more abstract ancestries. The coupling edge also lends itself to business rule
extraction.
A coupling edge connects two
separate hierarchies. It gives a hint about the operational intent of the
software. It's the link between two unrelated architectures, is the hybrid between
two genetically distinct species.
One way to clean intelligence
from this coupling edge is to simply look at the names of the classes that it
connects. In addition, in some architectures the edges are given names, this
also would help. One of the key driving concerns for software reengineering is
to uncover what are known as ‘business rules’ within the source code.
The business rules capture the intent of the software and can be understood by
analysts and used by corporate executives -- the business rules are the raison
d’etre of the software.
As a result of very complex
software architectures, business executives often do not know the business
rules of their own corporations. These business rules are embedded
in highly convoluted source code that nobody understands -- yet the
business operates day to to day, depending on mysterious software that no one
understands. For this reason, massive software reengineering projects are
initiated and attempt to fathom ‘what the hell is going on ‘.
These projects often fail and
frustrate everybody. They use a buckshot approach when an elephant gun is
needed to hit the ‘business rules’ bullseye. That bullseye is the coupling
edges within class inheritance diagrams.
Highlighting a coupling edge
with the names of the classes it connects can be a building block to extracting
the business rules from the source code. This can be automated such that the
two connected classes and the name of the coupling edge can be presented to a
re-engineering analyst, who then inserts some ‘business speak’ narrative that
is more understandable to managers and executives. This can be the first step
and extracting intelligence within a refactoring project. The key is the
coupling edge connecting two taxonomies.
By extracting business rules,
we are initiating automated software re-engineering. What’s more, we’re not
just automating what’s being manually done, we’re re-engineering with a
surgical insight whereby business intelligence can be gleaned from the
architecture.
This paper attempted
to show the relationship between three separate disciplines -- Mathematical Graph
Theory, big data clustering, and condensation techniques, and computer science
object-oriented design.
The coupling notions
described here, including mal-structure, are pertinent to each of the software
development phases. Minimizing coupling is an important design issue; it’s crucial
to observe when writing code, and coupling can have a dramatic impact on the maintenance
phase.
Two refactoring
strategies were presented. The first decomposes an architecture into separate
components – purely from a Graph Theory point of view. The second strategy,
which incorporates OO design issues, unpacks the data and methods of a non-bridge
class and disinherits it to its parents.
Each of the
strategies results in refactoring where an extremely coupled design is
refactored into component pieces, each of which has less complexity.
Reducing the
complexity of an architecture is always a
concern. Two techniques are presented to
do this. Both are driven by metrics, they condense architectural complexity by removing
highly coupled individual classes and highly coupled subgraphs.
The last section uses
the first instance of coupling to derive intelligence from an architecture. It
is a first step in extracting ‘business rules’ from source code.
The validity and
usefulness of these techniques is compelling when somebody tries to modify
software after the fact. As was illustrated, ignoring this issue during
maintenance can cause the architecture to degrade in a nonlinear fashion --- a
small change can have a large destructive effect. A negative singularity.
Footnote
To fully implement the complete refactoring of K5 the disinheriting method would have to be applied.
For example, in the first K3 ( 2nd graph below with nodes 1,2,3) node 4 and node 5 would have to disinherit their data and methods to node 2 and node 3.
In the second K3 ( 3rd graph below with nodes 1,4,5) node 2 and node 3 would disinherit their data and methods to node 1.
The other four K3 subgraphs would follow the same ‘disinheriting’ principle.
Operationally the following nuance occurs. Inherited class data and methods are returned to
the sender parent. Unique class data and methods are distributed to the appropriate parent.
Notice that any of the ‘repacked’ K3 subgraphs would fully implement the original K5.