Structured vs Semi-structured Data

My name is Bill Carroll and I am a member of the Microsoft HDInsight support team. The majority of my working career has been spent on SQL Server, a relational database. Little did I think about it all these years, but relational databases are structured data. When we create a table we define the structure with a data type and whether we allow null data or not. When we insert or update a record for the table, checks are done to determine if the data adheres to the predefined structure. Below is a simple example where the check fails and terminates the statement. SQL Server raises the 8152 message.


 create table table1 (c1 int not null, c2 varchar(10) null , c3 decimal(18,4) null) 
 insert into table1 (c1, c2, c3) values (1, 'A very long string ', 14.91918181) 
 Msg 8152, Level 16, State 14, Line 1 
 String or binary data would be truncated. 
 The statement has been terminated. 


This last month I worked an issue with a customer on HDInsight that drove home the difference between structured data of the relational database world versus semi-structured data in the big data world. I also found a new respect for the basic WordCount example and the wisdom of those that chose it as a starting point for mapreduce. We had a java mapreduce job that appeared to hang or in reality progress very slowly in the reducer. The reduces eventually completed after 255 hours.

The mapper function parsed a tab separated line into separate java String variables. One variable was an ID, other variables were concatenated together to form a part of a JSON document. The mapper wrote out the ID as the key and the partial JSON document as the value. In the reducer function, for each key, the parts of the JSON document were concatenated to form one JSON document. The document was then written out by the reducer.

The investigation revealed that in the mapper the parsing of the ID had some "invalid" data, but because we are dealing with semi-structured data no nice error message is raised like in a relational database. One of the ID's was actually a space with over 13 million partial JSON elements. What this meant was, in the reducer for the invalid key, a single JSON document was being created with 13 million concatenations to a String. There were several invalid ID's but this was the worst offender. Using the WordCount example, in the reducer, I did counts and sum of chars of the elements for each key and wrote it out without doing the string concatenation. After this I saw the invalid keys and the numbers of the counts, the problem became obvious. The techniques learned in the WordCount example are useful!

In the next part of the investigation I used two tools that are installed with java, jstack and jmap. With jstack I was able to get the stack while the reducer was "hung". I was able to line up the line of code on line=126 as s+="<partial JSON documents>";. I also used jmap heap dump to confirm that it was processing the invalid key. Below is part of the stack output from jstack.

 Thread 1: (state = IN_JAVA)
- java.util.Arrays.copyOfRange(char[], int, int) @bci=40, line=2694 (Compiled frame; information may be imprecise)
- java.lang.String.<init>(char[], int, int) @bci=60, line=203 (Compiled frame)
- java.lang.StringBuilder.toString() @bci=13, line=405 (Compiled frame)
- BigDataSupport.MyClass$Reduce.reduce(, java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer$Context) @bci=70, line=126 (Compiled frame)
- BigDataSupport.MyClass$Reduce.reduce(java.lang.Object, java.lang.Iterable, org.apache.hadoop.mapreduce.Reducer$Context) @bci=7, line=103 (Compiled frame)


More information about jstack can be found here and jmap here. To review the output of jmap you can use jvisualvm.

Once we corrected the "invalid" key problem, I started to investigate the performance of Java string concatenation and did some simple tests using mapreduce. I was surprised. I took one of the largest keys and placed a stopwatch in the reducer to time it. I wrote out the elapsed time in MS, the length of the final string in char and the number of partial string elements that it was concatenating. I then tested using java String concatenation, StringBuilder's append() method, and StringBuffer's append() method. Java String concatenation is significantly slower. The results are below.

String s = ""; s+=;
Key = c1976429 reduce(): elapsed time = 61732 length = 1543427 ctr = 42556

StringBuilder s = new StringBuilder(); s.append(;
Key = c1976429 reduce(): elapsed time = 226 length = 1543427 ctr = 42556

StringBuilder s = new StringBuilder(16000000); s.append(;
Key = c1976429 reduce(): elapsed time = 225 length = 1543427 ctr = 42556

StringBuffer s = new StringBuffer(); s.append(;
Key = c1976429 reduce(): elapsed time = 227 length = 1543427 ctr = 42556


As I continue to learn about java, mapreduce, and HDInsight\Hadoop some of the simplest lessons are the best. Here are just a few of things I learned.

  • In dealing with semi structured data it is always a good idea to validate the keys and the size of the values you are writing out to the reducer. Modify the WordCount example to explore your data and get to know it.
  • Java has several utilities like JPS, JStack, JMap, JVisualVM and others that can help in your investigations.
  • If you are going to concatenate Strings use the append() method of StringBuilder or StringBuffer, it provides better performance. Performance matters when dealing with large amounts of data.
  • The world of big data and HDInsight\Hadoop is a world of semi-structured data. This provides us with a greater ability to explore insights in data, but we also need to be aware of its pit falls.



Comments (0)

Skip to main content