The objective of my work with the Xerces XML parser is to generate analyses that count the instances of certain attributes found in Java and C++ source code. Strictly speaking, this can be done without an XML parser. I could search for characters with a search tool (CTRL + F), or count them manually in each of the 100 files I need to analyze, working directly on the text files that contain the source code.
However, I’m also very interested in the structure of the code developers write, and I want to be sure I can count attributes that appear in “odd spots.” The goal of my study is to research how understandable pages of code are to developers, and the nice thing about the research material I’ve collected is that it contains many examples. To do that, I could count how many symbols appear or how many characters appear on a line. But I could also consider how deeply nested the most important symbols of a code snippet are.
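As a rough illustration of the simple counts mentioned above, here is a minimal sketch that works directly on the text of a snippet. The class name TextMetrics and its two methods are my own placeholders for this example, not part of any library or of my actual analysis code.

```java
public class TextMetrics {

    // Counts how many times a single character occurs in the source text.
    static int countSymbol(String source, char symbol) {
        int count = 0;
        for (int i = 0; i < source.length(); i++) {
            if (source.charAt(i) == symbol) {
                count++;
            }
        }
        return count;
    }

    // Returns the length of the longest line in the source text.
    static int longestLine(String source) {
        int longest = 0;
        for (String line : source.split("\n", -1)) {
            longest = Math.max(longest, line.length());
        }
        return longest;
    }

    public static void main(String[] args) {
        // A two-line sample taken from the first snippet discussed in this post.
        String snippet = "String toReturn = file.getPath();\nreturn toReturn;";
        System.out.println("semicolons: " + countSymbol(snippet, ';'));
        System.out.println("longest line: " + longestLine(snippet));
    }
}
```

Counts like these say nothing about where a symbol sits in the structure of the code, which is the part I care about most.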
Check this out. The two blocks below each have a variable named toReturn that holds string data. It may be difficult to determine what a block like this is doing.
File file = new File(rawPath);
String toReturn = file.getPath();
toReturn =
    toReturn.substring(0, toReturn.lastIndexOf(File.separatorChar) + 1);
return toReturn;
Now check out this code:
for(int i = 0; i < line.length; i++) {
    if(inside == false) {
        if(line[i] == ',')
            toReturn.add("");
        else
            toReturn.set(toReturn.size()-1, toReturn.getLast() + line[i]);
    }
    else {
        if(line[i] == '"') {
            inside = false;
        }
    }
}
Both blocks work with toReturn. What readers of the second block might find concerning is the depth at which the most important code of this method happens. We need to look inside a loop, inside an “if” block, and finally inside a method call to find the action that explains what is being done to this all-important toReturn.
Knowing that toReturn is an important part of this code block is not enough to gauge how difficult the block is to understand. It matters more how deep a reader must go to find the important information, because that depth can make it harder to fully understand the block.
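To make "depth" concrete, here is a minimal sketch that reports the deepest brace level at which a name occurs in a piece of source text. The class NestingDepth and the method deepestOccurrence are hypothetical names for this illustration only. It is deliberately naive: it counts only curly braces, so an “if” written without braces adds no depth, braces inside string or character literals would be miscounted, and the name is matched as plain text rather than as a whole identifier.

```java
public class NestingDepth {

    // Returns the deepest brace-nesting level at which `name` occurs,
    // or -1 if it never occurs. Depth 0 means outside any braces.
    static int deepestOccurrence(String source, String name) {
        if (name.isEmpty()) {
            return -1; // nothing to search for
        }
        int depth = 0;
        int deepest = -1;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
            } else if (source.startsWith(name, i)) {
                deepest = Math.max(deepest, depth);
                i += name.length() - 1; // skip past the matched name
            }
        }
        return deepest;
    }

    public static void main(String[] args) {
        // Shortened stand-ins for the two snippets discussed above.
        String flat = "String toReturn = file.getPath(); return toReturn;";
        String nested = "for (;;) { if (!inside) { toReturn.add(\"\"); } }";
        System.out.println("flat:   " + deepestOccurrence(flat, "toReturn"));
        System.out.println("nested: " + deepestOccurrence(nested, "toReturn"));
    }
}
```

The limitations listed above are exactly why plain text scanning is not enough for my purposes: the nesting that matters is not always marked by a character I can count.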
Every block in this design should be visible in our analysis. We can’t stop at the top level, and we can’t omit what’s inside each if statement or we’ll miss the important parts.
This omission is what I came across when using the Java JAXB Xerces parser. The result looked something like the example below for the first code snippet above.
File file = ???
String toReturn = ???
toReturn =
toReturn.substring(??,??);
The question marks show where data was missing. One thing you notice about this parser, as compared to the “raw” parser, is that there is one stage that builds the XML file in memory and a separate stage in which the programmer can view each element or attribute that was read in. If you only note that the code file was read in, you can completely miss these details until the point where the data needs to be used. In the example above, the question marks represent places where the read stage didn’t read content from the XML file as intended, so the “programmer view” stage turns up results in which parts of the code have disappeared in translation.
The solution is to go back to the schema part of my XML file and see what went wrong in generating the code that should achieve the proper read. There is a middle step, referred to in the post Schema, where the code for reading in these files is generated. In previous iterations of this method, the route was: generate new XML reader code, find surprises when viewing the results, return to step 1. This process takes time. Too much time.
A Spin-off Effort
It’s time to build an XML processor that lets me skip some of the hardships involved with generating new code. If only I could see exactly what was in memory immediately after the read, I could do damage on my research faster. Perhaps a processor that could,
- for each element in the XML document
- organize the returned results as pieces of the file in a predictable order
- allow me to view pieces in that order
- treat the XML element and the element attribute the same as we iterate
- give information as to which type I’m dealing with at the moment
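The wish list above can be sketched with the DOM API that ships with the JDK (javax.xml.parsers and org.w3c.dom). This is only a sketch under my own assumptions, not the processor itself: the class UniformWalk, the Piece type, and the sample XML are hypothetical. Elements and attributes go into one list, each entry says which kind it is, and attributes are sorted by name because the DOM does not promise any particular attribute order.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class UniformWalk {

    // Hypothetical record of one visited piece of the document.
    static final class Piece {
        final String kind;  // "element" or "attribute"
        final String name;
        final int depth;    // 0 for the root element

        Piece(String kind, String name, int depth) {
            this.kind = kind;
            this.name = name;
            this.depth = depth;
        }

        @Override
        public String toString() {
            return depth + " " + kind + " " + name;
        }
    }

    // Parses the XML text and returns elements and attributes as one list.
    static List<Piece> flatten(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<Piece> pieces = new ArrayList<>();
            collect(doc.getDocumentElement(), 0, pieces);
            return pieces;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    private static void collect(Node node, int depth, List<Piece> out) {
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // text and comments are skipped in this sketch
        }
        out.add(new Piece("element", node.getNodeName(), depth));

        // Sort attribute names so the order is predictable from run to run.
        List<String> attrNames = new ArrayList<>();
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            attrNames.add(attrs.item(i).getNodeName());
        }
        Collections.sort(attrNames);
        for (String attrName : attrNames) {
            out.add(new Piece("attribute", attrName, depth + 1));
        }

        // Children are visited in document order.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        // Made-up XML, only to show the shape of the output.
        for (Piece p : flatten("<method name='getDir'><return type='String'/></method>")) {
            System.out.println(p);
        }
    }
}
```

Because every entry carries its kind and depth, nothing read from the file can silently drop out between the read stage and the stage where I look at the results.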
The root of an XML document is the “outer rim” of the text file. The branches from the root are the elements and text that you see at the top of the document: the elements that are not “parented” by any other elements, meaning they are nested inside no other XML elements. The “intermediate nodes” are the parts of your XML that are “parented” by other XML nodes.
The idea is to support predictable and implicit adoption: in the code I write in Java or C++, I have the option of making deeper leaps into the file beyond the root branches, or of simply completing what I find important on one root branch and moving on to other branches I find more important. The concept of branches, roots, and trees is covered in computer science curricula I’ve taught in the past. It’s a Data Structures concept … one I understand and can relate to. I have prescribed steps that I can build into a solution of my own that makes use of roots and branches.
Using my own concept of trees, I can avoid code disappearing on me in my analyses. The idea is to use breadth-first search together with the Visitor pattern. Let’s move on to the next step: writing the code.
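Before the real code, here is a minimal sketch of how the two ideas fit together, again using the JDK's built-in DOM API. All names here (BreadthFirstVisit, XmlVisitor, walk, elementOrder) are hypothetical. It is a simplified take on the Visitor pattern: the DOM classes have no accept method, so the walker itself decides which callback to invoke for each node.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Attr;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class BreadthFirstVisit {

    // Hypothetical visitor: one callback per kind of node we care about.
    interface XmlVisitor {
        void visitElement(Element element);
        void visitAttribute(Attr attribute);
    }

    // Visits the root first, then its children, then their children, and so on.
    static void walk(Element root, XmlVisitor visitor) {
        Queue<Element> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Element current = queue.remove();
            visitor.visitElement(current);

            NamedNodeMap attrs = current.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                visitor.visitAttribute((Attr) attrs.item(i));
            }

            // Queue child elements; they are visited after the current level.
            for (Node child = current.getFirstChild(); child != null; child = child.getNextSibling()) {
                if (child instanceof Element) {
                    queue.add((Element) child);
                }
            }
        }
    }

    // Convenience helper: returns element names in the order they are visited.
    static List<String> elementOrder(String xml) {
        final List<String> names = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            walk(doc.getDocumentElement(), new XmlVisitor() {
                @Override
                public void visitElement(Element element) {
                    names.add(element.getTagName());
                }

                @Override
                public void visitAttribute(Attr attribute) {
                    // Attributes are ignored by this particular visitor.
                }
            });
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(elementOrder("<a><b><d/></b><c/></a>"));
    }
}
```

For the sample document in main, the elements are visited in the order a, b, c, d: both root branches are seen before anything nested inside them, which is the "finish one level, then decide whether to leap deeper" behavior described above.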