Equivalence class partitioning – Part 2: Character/String data decomposition


Again, I am remiss in my postings...too many irons in the fire these days. Two weeks ago, I posted a challenge to decompose a set of character data (the ANSI Latin 1 character set) into valid and invalid equivalence class subsets. The goal was to test the base filename parameter of a filename passed to COMDLG32.DLL on the Windows XP platform from the user interface, using the File Save As... dialog of Notepad.


As illustrated below, the filename on a Windows platform is composed of two separate parameters: the base filename and the extension. Although the file name parameter of the Save As... dialog will accept a base filename, a base filename with an extension, or a path with a filename with or without an extension, the purpose of the challenge was to decompose the limited set of characters into equivalence class subsets for the base filename component only (the part outlined with green). (Of course, complete testing will include testing with and without extensions, but let's first build a foundation of tests that adequately evaluates the base filename parameter; then we can expand our tests from there to include extensions.)


Windows Filename
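
To make the component boundaries concrete, here is a minimal Python sketch that splits a Save As... style input into path, base filename, and extension. It uses the standard library's ntpath module, which applies Windows path rules on any platform. The function name split_save_as_input is hypothetical, and this is only an illustration of the components, not a description of how COMDLG32.DLL parses the string.

```python
import ntpath


def split_save_as_input(text):
    """Split a Windows-style file name into (path, base filename, extension)."""
    path, filename = ntpath.split(text)
    base, extension = ntpath.splitext(filename)
    return path, base, extension


print(split_save_as_input(r"C:\docs\report.txt"))
print(split_save_as_input("report"))
```

Only the middle element of the returned tuple is the subject of the decomposition that follows.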


As suggested in the earlier post, in order to adequately decompose this set of data within the defined, real world context (and not in alternate philosophical universes) a professional tester would need to understand programming concepts, file naming conventions on a Windows platform, the Windows XP file system, the default character encoding on the Windows XP operating system (Unicode), some history of the FAT file system, and even a bit of the PC/AT architecture. The following table illustrates how I would decompose the data set into equivalence class subsets.











Input/Output Parameter: Filename

Valid Class Subsets

V1 – escape sequence literal strings
       (SOH, STX, ETX, EOT, ENQ, ACK, BEL, BS, HT, LF, VT, FF, CR, SO, SI, DLE,
       DC1, DC2, DC3, DC4, NAK, SYN, ETB, CAN, EM, SUB, ESC, FS, GS, RS, US, DEL)

V2 – space character (0x20) (but not as the only, first, or last character in the base file name)

V3 – period character (0x2E) (but not as the only character in the base file name)

V4 – ASCII characters
       punctuation (0x21, 0x23 – 0x29, 0x2B – 0x2D, 0x3B, 0x3D, 0x40, 0x5B, 0x5D – 0x60, 0x7B, 0x7D, 0x7E)
       numbers (0x30 – 0x39)
       alpha (0x41 – 0x5A, 0x61 – 0x7A)

V5 – 0x80 through 0xFF

V6 – 0x81, 0x8D, 0x8F, 0x90, 0x9D

V7 – component length between 1 – 251 characters (assuming a default 3-letter extension and a maximum path length of 260 characters)

V8 – literal string CLOCK$ (NT 4.0 code base)

V9 – a valid string with a reserved character 0x22 as the first and last character in the string

Invalid Class Subsets

I1 – control codes (Ctrl + @, Ctrl + B, Ctrl + C, Ctrl + ], Ctrl + N, etc.)

I2 – escape sequence literal string NUL

I3 – Tab character

I4 – reserved words (LPT1 – LPT4, COM1 – COM4, CON, PRN, AUX, etc.)

I5 – reserved words (LPT5 – LPT9, COM5 – COM9)

I6 – reserved characters (/ : < > |) (0x2F, 0x3A, 0x3C, 0x3E, 0x7C) by themselves or as part of a string of characters

I7 – reserved character 0x22 as the only character or > 2 characters in the string

I8 – a string composed of > 1 reserved character 0x5C

I9 – a string containing only 2 reserved characters 0x22

I10 – period character (0x2E) as only character in a string

I11 – two period characters (0x2E) as only characters in a string

I12 – > 2 period characters (0x2E) as only characters in a string

I13 – reserved character 0x5C as the only character in the string

I14 – space character (0x20) as only character in a string

I15 – space character (0x20) as first character in a string

I16 – space character (0x20) as last character in a string

I17 – reserved characters (* ?) (0x2A, 0x3F)

I18 – a string of valid characters that contains at least one reserved character (* ?) (0x2A, 0x3F)

I19 – a string of valid characters that contains at least one reserved character 0x5C but not in the first position

I20 – string > 251 characters

I21 – character case sensitivity

I22 – empty

Discussion of valid equivalence class subsets



  • Valid subset V1 is composed of the literal strings for control characters (or escape sequences) between 0x01 and 0x1F, and including 0x7F. The literal strings for control characters may cause problems under various configurations or unique situations. The book How to Break Software: A Practical Guide to Testing goes into great detail explaining various fault models for these character values. The literal strings in this subset should be tested as the base filename component and possibly, in a separate test, as an extension component. On the Windows platform, however, the probability that one string in this subset is handled differently from the others is very low, so there is little need to test every string. That said, the overhead of testing them all is minimal, and once done it is unlikely to need repeating during a project cycle.

  • Valid subset V2 provides guidance on the use of the space character in valid filenames. On the Windows operating system a space character (0x20) is allowed in a base filename, but it is not permitted as the only character in a file name. Typical behavior on the Windows platform is also to truncate a space character used as the first or last character of a base filename. However, if the extension is appended to the base filename in the Filename edit control on the Save or Save As… dialog, a space character can be the last character in the base filename. Note also that a space character by itself or as the first character in a filename is acceptable on a UNIX-based operating system. Although we can force the Windows platform to save a file whose base name is only a space character by typing “ .txt” (including the quotes) in the Filename edit control on the Save/Save As… dialog, this practice is not typical of reasonable Windows users’ expectations.

  • Valid subset V3 is the period character (0x2E), which is allowed in a base filename but is not valid when it is the only character in the base filename (see invalid subset I10).

  • Valid subset V4 is composed of the ‘printable’ ASCII characters that are valid in a Windows filename. The subset includes punctuation characters, numeric characters, and alpha characters. We could also decompose this subset further into separate subsets for valid punctuation characters, numbers, upper case characters, and lower case characters if we wanted to ensure that we test at least one element from each of those groups.

  • Valid subset V5 is the set of character code points between 0x80 and 0xFF.

  • Valid subset V6 is a subset of V5. Its members are separated out only because they are code points that do not have character glyphs assigned to them. These would be especially interesting if we needed to test filenames for backwards compatibility on Windows 9x platforms.

  • Valid subset V7 is the minimum and maximum component length assuming the filename is saved in the root directory (C:\). 

  • Valid subset V8 is probably a red herring. On the NT 4 platform the string CLOCK$ was a reserved word. On an older application first created for the Windows NT 4 platform that does not use the system Save/Save As dialog, we might want to test this string just to make sure the developer did not hard code the string in an error handling routine.

  • Valid subset V9 is an interesting case because this otherwise invalid reserved character (0x22) is handled differently when used in the first and last character positions of a base filename. When used in both the first and last positions, the characters are truncated and, if the remaining string is valid, the filename is saved. If only one 0x22 character is used, or if two or more 0x22 characters are used in positions other than the first and last, the result will be an error message.
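
The hexadecimal ranges in V4 through V6 can be turned into test data directly. The Python sketch below is one way to do that, under two assumptions made for the sketch only: that the ANSI Latin 1 set corresponds to Python's cp1252 codec, and that one randomly sampled string per subset is an adequate representative. The names V4_PUNCTUATION, RESERVED, undefined_cp1252_bytes, and sample_name are hypothetical.

```python
import random

# Code points copied from the V4 row of the table.
V4_PUNCTUATION = [0x21, *range(0x23, 0x2A), *range(0x2B, 0x2E),
                  0x3B, 0x3D, 0x40, 0x5B, *range(0x5D, 0x61),
                  0x7B, 0x7D, 0x7E]
V4_DIGITS = list(range(0x30, 0x3A))
V4_ALPHA = [*range(0x41, 0x5B), *range(0x61, 0x7B)]

# Reserved characters named in I6, I7, I8 and I17; V4 must not contain them.
RESERVED = [0x22, 0x2A, 0x2F, 0x3A, 0x3C, 0x3E, 0x3F, 0x5C, 0x7C]


def undefined_cp1252_bytes():
    """Return the byte values in 0x80-0xFF that the cp1252 codec cannot decode."""
    missing = []
    for value in range(0x80, 0x100):
        try:
            bytes([value]).decode("cp1252")
        except UnicodeDecodeError:
            missing.append(value)
    return missing


def sample_name(code_points, length, seed=0):
    """Build a base filename of the given length from one subset's code points."""
    rng = random.Random(seed)
    return "".join(chr(rng.choice(code_points)) for _ in range(length))


# Sanity check: the valid punctuation set and the reserved set do not overlap.
assert set(V4_PUNCTUATION).isdisjoint(RESERVED)
print([hex(value) for value in undefined_cp1252_bytes()])
print(sample_name(V4_ALPHA + V4_DIGITS, 8))
```

The bytes reported as undefined should be the same five values listed in V6, which is a quick way to confirm why those code points deserve a subset of their own.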


Discussion of invalid equivalence class subsets



  • Invalid subset I1 consists of the control code inputs for escape sequences in the range of 0x01 through 0x1F, and also includes 0x7F. Pressing the control key (CTRL) together with any of the control code keys will cause a system beep.

  • Invalid subset I2 is the literal string “nul”. Nul is a reserved word but could be processed differently than other reserved words on the Windows platform because it is also used in many coding languages as a character for string termination.

  • Invalid subset I3 is the tab character, which can be copied and pasted into the Filename textbox control. Pasting a tab into the control and pressing the Save button will generate an error message.

  • Invalid subset I4 includes literal strings for reserved device names on the PC/AT machine and the Windows platform. Using any string in this subset results in an error message indicating the filename is a reserved device name.

  • Invalid subset I5 also includes reserved device names, in this case LPT5 – LPT9 and COM5 – COM9. However, these must be separated into a unique subset because using these specific device names as the base filename on the Windows XP operating system results in a different error message, one indicating the filename is invalid.

  • Invalid subsets I6, I7, and I8 include reserved characters on a Windows platform. When characters in these subsets are used by themselves or in any position in a string of characters, the result is an error message indicating the above file name is invalid.

  • Invalid subsets I9, I10, and I13 also include reserved characters and the space and period characters. When these subsets are tested as defined, no error message is displayed and focus is restored to the File name control on the Save/Save As… dialog.

  • Invalid subsets I11 and I12 also use the period character (0x2E), as exactly 2 characters in the string and as more than 2 characters in the string respectively. The resulting state changes in the dialog are different for each.

  • Invalid subsets I15 and I16 define the space character when used in the first or last character position of a string. These are placed in the invalid class because normal Windows behavior is to truncate a leading or trailing space character in a file name. If a leading or trailing space character were not truncated and were saved as part of the file name on a Windows platform, that would constitute a defect.

  • Invalid subsets I17 and I18 contain two additional reserved characters: the asterisk and the question mark (0x2A and 0x3F respectively). If these characters are used by themselves or as a character in a string of other valid characters, a file will not be saved and no error message will occur. However, the state of the Save/Save As… dialog does change. If the default file type is .txt and there are text files displayed in the Folder View control on the Save As… dialog, the files with the .txt extension will be removed from the view after the Save button is pressed. If the default file type is All files, then all files will be removed from the Folder View control on the Save As… dialog after the Save button is pressed.

  • Invalid subset I19 is a string of valid characters which contains at least one backslash character, except as the lead character in the string. (Of course, this assumes the string is random and the position of the backslash character in the string is not one which would resolve to a valid path.) The backslash character is a reserved character for use as a path delimiter in the file system. An error message will appear indicating the path is invalid.

  • Invalid subset I20 tests for extremely long base file name lengths of greater than 251 characters. Note that an interesting anomaly occurs with string lengths. A base file name of 252 or 253 valid characters will cause an error message to display indicating the file name is invalid. However, a base file name of 254 or 255 characters will actually be saved as a file name, but is not associated with any file type. Any base file name longer than 255 characters again produces an error message.

  • Invalid subset I21 describes the tests for case sensitivity. The Windows platform does not consider character case of characters that have an upper case and a lower case representation. For example, a file name with a lower case Latin character ‘a’ is considered the same as a file name with the upper case Latin character ‘A’.

  • Invalid subset I22 is, of course, an empty string.
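
Several of the invalid subsets above are mechanical enough to express as a classifier, which helps when labelling generated test inputs. The Python sketch below covers only part of the table (I2 through I6, I10 through I12, I14 through I18, I20, and I22); the quote, backslash, control code, and case subsets are left out. The function classify_base_filename is hypothetical and models the table, not the actual behavior of COMDLG32.DLL. Matching device names without regard to case is an assumption of the sketch.

```python
I4_DEVICES = {"CON", "PRN", "AUX",
              *(f"LPT{n}" for n in range(1, 5)),
              *(f"COM{n}" for n in range(1, 5))}
I5_DEVICES = {f"{port}{n}" for port in ("LPT", "COM") for n in range(5, 10)}
MAX_VALID_LENGTH = 251  # upper bound of V7 in the table


def classify_base_filename(base):
    """Return the label of the first matching invalid subset, or 'valid'."""
    if base == "":
        return "I22"
    if base == " ":
        return "I14"
    if set(base) == {"."}:
        # One, two, or more than two periods as the only characters.
        return {1: "I10", 2: "I11"}.get(len(base), "I12")
    upper = base.upper()
    if upper == "NUL":
        return "I2"
    if upper in I4_DEVICES:
        return "I4"
    if upper in I5_DEVICES:
        return "I5"
    if "\t" in base:
        return "I3"
    if any(ch in "/:<>|" for ch in base):
        return "I6"
    if any(ch in "*?" for ch in base):
        # Wildcards alone are I17; wildcards mixed with other characters are I18.
        return "I17" if all(ch in "*?" for ch in base) else "I18"
    if base.startswith(" "):
        return "I15"
    if base.endswith(" "):
        return "I16"
    if len(base) > MAX_VALID_LENGTH:
        return "I20"
    return "valid"


for candidate in ["", "com5", "COM4", "...", "a*b", " a", "report"]:
    print(repr(candidate), classify_base_filename(candidate))
```

The order of the checks is a design choice: a string that belongs to more than one subset is reported under the first one that matches, so the more specific whole-string cases come before the per-character ones.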

Of course, this is only a partial list of the complete data set, since a filename on the Windows XP operating system can use any valid Unicode value, of which there are many thousands of character code points, including surrogate pair characters.
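
Surrogate pairs matter for the length-based subsets (V7 and I20) because a character outside the Basic Multilingual Plane occupies two UTF-16 code units. The Python sketch below shows the difference between counting characters and counting code units. The helper name utf16_units is hypothetical, and whether the Save As... dialog counts its limits in code units is an assumption worth verifying rather than something established above.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


bmp_name = "A"                     # U+0041, inside the Basic Multilingual Plane
supplementary_name = "\U00020000"  # U+20000, outside it

print(len(bmp_name), [hex(unit) for unit in utf16_units(bmp_name)])
print(len(supplementary_name), [hex(unit) for unit in utf16_units(supplementary_name)])
```

Both strings are one character long, but the second one needs two code units, so a boundary test built from supplementary characters may reach a length limit at half the expected character count.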


The first and by far the most complex step in applying the functional technique of equivalence class partitioning is data decomposition. This requires an incredible amount of knowledge about the system. Data decomposition is an exercise in modeling data. The less one understands the data set or the system under test, the greater the probability of missing something. Next week we will analyze the equivalence class subsets to define our baseline set of tests to evaluate the base filename component.

Comments (7)

  1. I. M. Testy says:

    In the last post we decomposed the set of characters in the ANSI character set into valid and invalid

  2. ndiamond says:

    I don’t understand your explanation about distinguishing I[5] and I[4].  Notepad gives identical error message if I try to save to either COM4 or COM5.

    V[5] includes Latin 1’s code point for a version of the yen sign, but Windows XP sometimes has trouble with that character.  Are you sure it’s supposed to be a valid character instead of invalid?

    I[19] is defective.  In Notepad, I could specify the following path in Save as:

    "C:\x.txt"

    It worked.

    Seeing some of the overlaps and omissions in your I sets, I tried an experiment.  In Notepad, I set the filter to all files, and tried to save a file to:

    "\\?\C:\txt."

    For comparison, in a command prompt window I can copy to that pathname to create the file, and can delete that file, as long as I specify the path that way.  So today, I tried it in Notepad’s "Save as" dialog box.  Oh… while I was typing this, Notepad finally retorted that the pathname is invalid.  I thought I was going to have to kill it from task manager.

    (Off-topic:  Sometimes the Windows shell says that semicolon is an invalid character, though I don’t think that happens in the basename parameter in an API call.)

  3. I.M.Testy says:

    Hi Norman,

    Actually, when one types the literal string COM4 into the File name combobox control on an English version of Windows XP, an error message from COMDLG32.DLL appears stating "com4 This file name is a reserved device name. Please choose another name." That is Invalid class subset I[4]. When one types the literal string COM5 into the File name combobox control on an English version of Windows XP, an error message from COMDLG32.DLL appears stating, "com5 The above file name is invalid."

    Honestly, I am not sure how anyone can confuse those two error messages, but that’s not really important. What is important is that I discovered this anomaly in error handling and the issue is resolved in Windows Vista.

    V[5] includes the Unicode character code point U+00A5 which is the Yen sign. I suspect that you are confusing the U+00A5 Unicode character code point with the glyph that appears on a Japanese language version when one types the backslash character on a keyboard which actually generates a U+005C Unicode character code point (but looks like the Yen sign). (BTW…on a Korean version the glyph displayed for the U+005C Unicode character code point looks like the Won sign.) But, again, at the GUI on a Japanese language version the U+005C and U+00A5 glyphs look the same, so I can understand how it might be confusing.

    Actually, I[19] is not defective. I suspect that you incorrectly misinterpreted I[19]. By entering C: you are actually entering a path component to the base file name component which is also a common mistake. It also assumes a bit of common sense.

    Experimentation is good, but you appear to have strayed from the primary purpose of systematically evaluating the basename parameter and moved to the next level of trying various combinations and permutations of invalid and valid subsets that exercise the file name components including the path, the base filename, and the extension components. This is a common occurrence in systematic testing. People sometimes become more engaged in finding wild, off-the-wall random stuff, or they get so easily distracted by a "hey y’all, let’s try this" mentality that the basic functionality of a parameter is overlooked, occasionally resulting in defects that have greater impact to the user.

    Oh…and sometimes meteors fall to earth, but the semicolon is not an invalid character in the file name basename parameter which of course is the stated purpose for this systematic decomposition of the specified data set.

  4. ndiamond says:

    "com4 This file name is a reserved device name. Please choose another name."

    "com5 The above file name is invalid."

    You are right, the common dialog box does distinguish those.  I wonder why.  I had tested com4: and com5: with colons, and the common dialog box gives identical error messages in those cases.  Also in my experience with other areas of XP, error messages for com4 and com5 are identical (except of course when com4 or com5 exists).

    "I suspect that you are confusing the U+00A5 Unicode character code point with" [with the yen sign ANSI 0xC5].  Well, the yen sign ANSI 0xC5 (which exists in ANSI code page 932 only) is indeed U+00A5, no confusion on my part.  Windows has to convert this to U+005C instead of U+00A5 because it’s more important to interpret this as a path separator instead of the actual character, no confusion on anyone’s part.  But last year, investigating a bug report from a beta tester, I had to do some experimenting with filenames in XP.  I created some pathnames that included U+00A5.  Windows Explorer indeed displayed it with the same glyph as the ordinary yen glyph for path separator.  Meanwhile some parts of Windows XP were mighty confused by those pathnames.  Some users don’t even get their Start menus displayed properly.

    "I suspect that you incorrectly misinterpreted I[19]. By entering C: you are actually entering a path component to the base file name component which is also a common mistake."

    I see your point, but I’m still confused.  By entering a \ at all I am entering a path component.  Even if \ is in the first position it is a path component, specifying the current drive’s root directory instead of the default current directory.  I still don’t see how it can be correct for I[19] to distinguish a \ in the first position from a \ in any other position.  I do understand (being reminded by your comment) that the basename begins after the last \.

    "the semicolon is not an invalid character in the file name basename parameter"

    You and I agree, but XP’s Windows Explorer sometimes disagrees.  I couldn’t get the same error from a common dialog box though.

  5. ndiamond says:

    In my previous message I typoed 0x5C as 0xC5.  I wonder how I managed to do that twice.  Anyway all three of those should be 0x5C, which Windows interprets as a path separator when it occurs as a single-byte code point in any ANSI code page.

    The strange things that happen with U+00A5 still happen with U+00A5; I typed that one correctly.

  6. ndiamond says:

    Security Considerations for Character Sets in File Names

    Windows code page and OEM character sets used on Japanese-language systems contain the Yen symbol (¥) instead of a backslash (\). Thus, the Yen character is a prohibited character for those file systems. When mapping Unicode to a Japanese-language code page, conversion functions map both backslash (U+005C) and the normal Unicode Yen symbol (U+00A5) to this same character. For security reasons, your applications should not typically allow the character U+00A5 in a Unicode string that might be converted for use as a FAT file name.

    The above is a quotation.  My observation is that security isn’t the only reason to hesitate on whether to allow the character U+00A5.  As mentioned before, even in cases where security isn’t (or shouldn’t) be a problem, Windows sometimes has problems.

    Here is where that quotation came from:

    http://msdn2.microsoft.com/en-us/library/ms776406(VS.85).aspx
