In a previous entry, I outlined how flat files with delimited records can be parsed by the flat file disassembler. Another post on microsoft.public.biztalk.general attracted my attention. Let's help Norbert disassemble his file.
The document to disassemble is presented below. I have inserted blank lines to facilitate reading the data. Keep in mind that these blank lines are not in the original data. I strongly suggest to navigate to the original post to better understand the original format:
1 112300004321003901015632011234562003120120031231 2000158EUR2003123101
200000000000100000000000000 000000022650000000000000000000001955080311Rutten/Verbeek G.W.H. 1231SZ00058 NL10 0002265 3000000000002Schakel LOOSDRECHT 01015632 40000000000030000000000 00000002265 00000000 2003111000808001000100000000000000000000000001700D
200000000000400000000000000 000000627160000000000000000000001964031811Heese M.J. van 4566BB00021 NL10 0062716 3000000000005Magdalenastraat Heikant 01000112 40000000000060000000000 00000062716 01000112 2003111000808001000100000000000000000000000001700D
200000000000700000000000000 000000627170000000000000000000001966041821Heese K.E. van 4566BB00021 NL10 0062717 3000000000008Magdalenastraat Heikant 01005227 80000000000090001Betaler20062716 1M.J. van Heese Magdalenastraat 00021 4566BBHeikant 1803196400000000000 40000000000100000000000 00000062717 01005227 2003111000808007000100000000000000000000000004765D
200000000001100000000000000 000077777770000000000000000000001929072911Jans R. 4602PC00013boveNL10 7777777 3000000000012Algemenestraat OOSTERHOUT 01015632 80000000000130001Betaler27777777 1R. Jans Castricummerallee 00035 4607ADOOSTERHOUT 2907192900000000000 40000000000140000000000 00007777777 00000000 2003111000808001000100000000000000000000000001700D
I have no knowledge of the data so there will be some guessing going on here. However, Norbert knows the format of the data so he should be able to adjust what I wrote here to his application. It looks like the file has a header (in blue), a body with two records of type A (in green) and two records of type B (in orange). Sharp readers have seen that the last two records have one more data than the first two. There is also a footer (in purple). But copying the data into a decent text editor, you can easily see that the file has positional records. This means that fields have a fixed width and they start always at the same offset from the beginning of the record. During this pre-investigation, I also noticed that every line in the file is exactly 160 characters.
In practice, for flat file with positional records, it is easier to start by building a flat file disassembling schema that takes a line of input and puts it into one element in the output XML. This approach simplifies the task of getting line terminators right. When you have got the line parsing right, you can then break each individual line down to individual fields without having to worry about line terminators (almost). Anyway, let's get started. If you wish to download the whole schema, click here
1 - A strawman schema:
Assuming that the flat file posted was representative, we can already end up with a schema that breaks up data line by line: first a Header taking the whole header field (160 characters) to a string (there is exactly one line for the header). Then there are two records and each of them is composed of three lines. Then, two other records, but there are four lines long. And finally, a header which is one line long. Since lines are separated by <CR LF>, we can start building a strawman schema as shown below:
The Root node should be a delimited record, where Child Order appears in PostFix order. This is because the data of a field appears before the line terminator. The delimiter will be set to 0x0D 0x0A and its mode will be "Hexadecimal". This way, we can get all lines of the files correctly parsed under Root.
Under the Root node, a first sequence is used to define that there will be exactly one "Header" record (minOccurs and maxOccurs on the sequence are set to 1). This element is a Positional record. By default, the BizTalk schema editor for flat file disassembler assumes delimited records. We need to switch the mode to "Postional" and this is easy: just change the "Structure" type to be "Positional" in the record's properties page. For now, we will set the "Postional length to 160 to grab the whole header's line into one attribute.
The second sequence is there only to specify that there are exactly two (minOccurs = maxOccurs = 2) records of type "ARecord". Under this structure, I have broken the record into three lines called "ARecordA", "ARecordB" and "ARecordC". Under every record, I have put one single attribute for now with a positional length of 160 to grab the whole line in one field.
The same approach can be done with the second set of records: exactly two records of type BRecord. This record has four lines so I have four elements under the sequence. The trailer as its own sequence (minOccurs = maxOccurs = 1) and one item under it to grab the whole line.
2 - Testing the strawman schema:
One can use FFDasm,.exe (see a previous post) to see how well we did:
FFDAsm input.txt -bs FF.xsd -c
While the output is too large to be reproduced here (download the schema if you want to try it), we observe that indeed, every line was parsed and put into an attribute of the previously mentioned elements.
3 - Breaking down lines into specific fields:
We were successful in taking full lines of input into one attribute (or element) of the XML output. This is good but this is not enough. What we really want to do (especially if we are going to do more processing on this business document) is to break each individual field into an attribute (or element). This is in general a tricky process but I'll try to give you a few tricks to achieve this quicker.
For the purpose of this example, I will only concentrate on "ARecordA" above. You can always download the whole schema here. The key here is to figure out exactly how long a field is. Remember that in our file, lines are 160 characters long. So if the sum of the size of all your fields under a record ends up being different from 160, then the disassembler will flag an error. So every time you change the positional length of a field, make sure that all the fields in a record (which maps to a line in our example) add up to a full line of 160 characters. I have found convenient to copy the field and paste them into word (with spaces) and use the "Count characters (with spaces)" feature to figure out the exact width of the field.
So here we go. For ARecordA, we have 7 fields. The first one is 32 characters long, the second one is 42 long, the third one is 25 long, the fourth one is 16 long, the fifth one is 15, the sixth one is 9 and the last one is 21. To check that I did not make any mistake, I can switch to the "Flat File" tab in the BizTalk 2004 schema editor and observe that the length of RecordA is indeed 160.
4 - So how about those white spaces in fields?:
Yes, this is true. There are white spaces at the end of Rutten/Verbeek in the third field or RecordA. The good news is that in this example, we do not have anything do to. By default, positional record parsing is setup to ignore the padding characters and the character to ignore is space.
Hopefully, this helped Norbert to get his issue resolved.