# Parsing a string as a 64-bit integer, somehow

Today's Little Program takes a string and tries to parse it as a 64-bit integer in formats that a programmer would likely encounter.

Here's a first stab:

    using System;
    using System.Globalization;

    class Program
    {
        static long ParseLongSomehow(string s)
        {
            if (s.StartsWith("0x", StringComparison.OrdinalIgnoreCase)) {
                return long.Parse(s.Substring(2), NumberStyles.HexNumber);
            } else {
                return long.Parse(s);
            }
        }

        public static void Main(string[] args)
        {
            long value = ParseLongSomehow(args[0]);
            Console.WriteLine(value);
            Console.WriteLine("0x{0:X}", value);
        }
    }

If the string begins with 0x, then we treat the rest of the argument as a hex value; otherwise, we treat it as a decimal value.

Unfortunately, this fails if the input is 9223372036854775808, which is the value of 1 << 63, a value that is representable as a 64-bit unsigned value but not as a 64-bit signed value. In that case, long.Parse throws an OverflowException.
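To see the failure concretely, here is a small sketch that feeds that value to both the signed and the unsigned parser. The class and helper names (`OverflowDemo`, `FitsInLong`, `FitsInULong`) are made up for illustration; only `long.Parse` and `ulong.Parse` are real.

```csharp
using System;

static class OverflowDemo
{
    // Hypothetical helper: true if long.Parse accepts the decimal string.
    public static bool FitsInLong(string s)
    {
        try { long.Parse(s); return true; }
        catch (OverflowException) { return false; }
    }

    // Hypothetical helper: true if ulong.Parse accepts the decimal string.
    public static bool FitsInULong(string s)
    {
        try { ulong.Parse(s); return true; }
        catch (OverflowException) { return false; }
    }

    public static void Main()
    {
        string s = "9223372036854775808"; // one past Int64.MaxValue
        Console.WriteLine(FitsInLong(s));  // False
        Console.WriteLine(FitsInULong(s)); // True
    }
}
```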

Our problem statement was pretty vague, so let's write a functional specification. It helps to know what problem you're solving before you start to solve it. Otherwise, you're just flailing around writing code before you have a plan. When I tried to solve this problem, I flailed around a bit until I realized that I didn't have a spec.

What formats would a programmer be likely to encounter as the string representation of a 64-bit integer?

• 0x1234: 64-bit number in hex format, case-insensitive. The value can range from 0 to UInt64.MaxValue.

• 12345: 64-bit unsigned number in decimal format. The value can range from 0 to UInt64.MaxValue.

• -12345: 64-bit signed number in decimal format. The value can range from Int64.MinValue to Int64.MaxValue.

• Other formats may be permitted, but you need to support at least the above.

Writing down exactly what I was doing and what I wasn't doing was the part that solved my flailing. I had been worrying about things like -0x12345 and -9223372036854775809 and 9999999999999999999, even though those numbers would not be something a programmer would be likely to encounter.

From the specification we can develop our algorithm.

• If the string begins with 0x, then parse what's left as an unsigned 64-bit hexadecimal number.

• If the string begins with a minus sign, then parse it as a 64-bit signed number in decimal format.

• If the string does not begin with a minus sign, then parse it as a 64-bit unsigned number in decimal format.

And that is pretty easy to implement.

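    static long ParseLongSomehow(string s)
    {
        if (s.StartsWith("0x", StringComparison.OrdinalIgnoreCase)) {
            // With HexNumber, the digits are read as a 64-bit pattern, so
            // values with the top bit set come back negative, not as an overflow.
            return long.Parse(s.Substring(2), NumberStyles.HexNumber);
        } else if (s[0] == '-') {
            // Signed decimal.
            return long.Parse(s);
        } else {
            // Unsigned decimal, reinterpreted as a signed 64-bit value.
            return (long)ulong.Parse(s);
        }
    }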
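One detail worth checking: the algorithm says "unsigned", yet the hex branch calls long.Parse. With NumberStyles.HexNumber, .NET treats the digits as a bit pattern, so this should give the same result as parsing as ulong and casting. The following sketch compares the two routes; the names `HexDemo`, `ViaLong`, and `ViaULong` are invented for illustration.

```csharp
using System;
using System.Globalization;

static class HexDemo
{
    // The route the article's code takes.
    public static long ViaLong(string hexDigits) =>
        long.Parse(hexDigits, NumberStyles.HexNumber);

    // The route the algorithm describes: parse unsigned, then reinterpret.
    public static long ViaULong(string hexDigits) =>
        unchecked((long)ulong.Parse(hexDigits, NumberStyles.HexNumber));

    public static void Main()
    {
        string[] samples = { "7FFFFFFFFFFFFFFF", "8000000000000000", "FFFFFFFFFFFFFFFF" };
        foreach (string h in samples)
        {
            // Both columns are expected to match on every line.
            Console.WriteLine("{0}: {1} {2}", h, ViaLong(h), ViaULong(h));
        }
    }
}
```

The explicit `unchecked` is there only so that the sketch behaves the same if the project is compiled with overflow checking turned on.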

Note that we are a little sloppy with our treatment of whitespace. We accept leading and trailing spaces on decimal values (except that a space before a minus sign sends the string down the unsigned path, where it fails), and allow trailing spaces on hex values (and even allow spaces between the 0x and the first hex digit). That's okay, because the spec allows us to accept formats beyond the ones listed.
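The whitespace behavior can be probed with a few inputs. This sketch copies the function above into a class and wraps it in a hypothetical helper, `TryIt`, which returns null when parsing throws a FormatException; the class name `WhitespaceDemo` is likewise made up.

```csharp
using System;
using System.Globalization;

static class WhitespaceDemo
{
    // Copy of the article's second version of ParseLongSomehow.
    public static long ParseLongSomehow(string s)
    {
        if (s.StartsWith("0x", StringComparison.OrdinalIgnoreCase)) {
            return long.Parse(s.Substring(2), NumberStyles.HexNumber);
        } else if (s[0] == '-') {
            return long.Parse(s);
        } else {
            return (long)ulong.Parse(s);
        }
    }

    // Hypothetical helper: the parsed value, or null on FormatException.
    public static long? TryIt(string s)
    {
        try { return ParseLongSomehow(s); }
        catch (FormatException) { return null; }
    }

    static string Show(long? v) => v.HasValue ? v.Value.ToString() : "rejected";

    public static void Main()
    {
        Console.WriteLine(Show(TryIt(" 12345 "))); // 12345
        Console.WriteLine(Show(TryIt("0x 1F ")));  // 31
        Console.WriteLine(Show(TryIt(" 0x1F")));   // rejected
    }
}
```

The last input is rejected because the leading space hides the 0x prefix from StartsWith, so the string falls through to the decimal branch.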

Now, for bonus points, let's revise the functional specification a little bit, specifically by adding another case:

• 0x12`3456789A: 64-bit number in hex format, case-insensitive, with backtick separating the upper 32 bits from the lower 32 bits.

This is the format used by the Windows debugger engine.

    static long ParseLongSomehow(string s)
    {
        if (s.StartsWith("0x", StringComparison.OrdinalIgnoreCase)) {
            return long.Parse(s.Substring(2).Replace("`", ""), NumberStyles.HexNumber);
        } else if (s[0] == '-') {
            return long.Parse(s);
        } else {
            return (long)ulong.Parse(s);
        }
    }
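Removing the backtick blindly also accepts strings where it sits in the wrong place, such as 0x1`2. A stricter variant is sketched below. It rests on an assumption the spec above does not state: that a backtick, if present, must be followed by exactly eight hex digits and preceded by one to eight. The names `StrictBacktick` and `ParseDebuggerHex` are hypothetical, and the helper takes the digits after the 0x prefix.

```csharp
using System;
using System.Globalization;

static class StrictBacktick
{
    // Hypothetical helper: parses the text after "0x", e.g. "12`3456789A".
    public static long ParseDebuggerHex(string digits)
    {
        int tick = digits.IndexOf('`');
        if (tick >= 0)
        {
            string high = digits.Substring(0, tick);
            string low = digits.Substring(tick + 1);
            // Assumed rule: 1 to 8 digits before the backtick, exactly 8
            // after it, and no second backtick.
            if (high.Length < 1 || high.Length > 8 ||
                low.Length != 8 || low.IndexOf('`') >= 0)
            {
                throw new FormatException("Backtick is not in the expected position.");
            }
            digits = high + low;
        }
        // AllowHexSpecifier alone (no whitespace flags) keeps this variant strict.
        return long.Parse(digits, NumberStyles.AllowHexSpecifier);
    }

    public static void Main()
    {
        Console.WriteLine("0x{0:X}", ParseDebuggerHex("12`3456789A"));
        try { ParseDebuggerHex("1`2"); }
        catch (FormatException e) { Console.WriteLine(e.Message); }
    }
}
```

Unlike the version above, this sketch also rejects whitespace around the digits, which is a deliberate tightening rather than something the spec requires.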

We'll leave it here for now. Next time, we'll start putting some blocks together.


1. anonymouscommenter says:

Possible issues:

– In the hex case you use long.Parse instead of ulong.Parse, which will throw an exception for numbers larger than Int64.MaxValue

– The bonus points spec is sloppy and doesn't specify that the lower 32 bits must have leading zeroes. If they don't have leading zeroes then just removing the backtick would be the wrong thing to do.

2. Brian_EE says:

I would add that support is missing for 64-bit SIGNED integers presented in hexadecimal format. Perhaps you don't run across or deal with them regularly. If you use the sign to determine how to parse decimal input, you could use a MSB check to do the same for hexadecimal input.

3. anonymouscommenter says:

@Brian EE

I think that omission was deliberate.  And how is checking the MSB going to tell you whether a number is a large positive number or a negative number?

4. anonymouscommenter says:

One must consider WHERE the data is coming from.  If the number is coming from IBM Mainframe or AS/400, the data will be encoded in EBCDIC.  Also, hex numbers are prefaced with &h or &H in that world as opposed to using 0x.  I know because I had to do that code before.  The routine had to take into account both ASCII & EBCDIC and had to handle both literal as well as hex values.

5. anonymouscommenter says:

@12BitSlab

> &h or &H

Look what you have done. Now I’m having flashbacks of Vilnius BASIC.

And no, it had a (mostly) ASCII-superset encoding.

6. anonymouscommenter says:

Doesn't the cast to long negate the point of supporting "64-bit unsigned number in decimal format"? ParseLongSomehow("9223372036854775808") returns -9223372036854775808L without a way to tell if the string was "-9223372036854775808" or "9223372036854775808".

7. anonymouscommenter says:

I'm glad you didn't add octal support.  The only time I've ever seen octal used intentionally is when dealing with chmod permissions; more often than not it just causes buggy code when people add leading zeros to their decimal constants.

8. anonymouscommenter says:

@Wear:

Distinguishing "-9223372036854775808" and "9223372036854775808" isn't really necessary, and you can't even do it given the API because the function just returns a long. Treating those numbers as the same is useful if you know you have a 64-bit number that you want to treat as signed, but maybe it was printed as unsigned and maybe not.

The same comment basically applies to @Brian EE and @David T's discussion. Recognizing "-0x1234" might be occasionally useful, but you don't have to do anything to determine the sign based on the MSB (e.g. that 0xF…F will be -1) because that will fall out naturally.

9. anonymouscommenter says:

@Evan So less "64-bit unsigned number in decimal format" and more "64-bit signed number in unsigned decimal format"? I guess I could buy that.

10. Brian_EE says:

@David T: If the most significant bit is a logic '1', then the number is considered negative as that's where the sign bit is stored.

I did defer that perhaps Raymond doesn't deal with those types of numbers in the normal course (and little programs tend to grow out of things to fit *his* needs). Just pointing out that others may have use for that if they are doing things more arithmetic based and less pointer based.

11. anonymouscommenter says:

@Brian EE:

Raymond's code will already work correctly in almost all instances of the situation you're talking about. If you give it the hex representation of a 2's complement negative 64-bit number, you'll get that negative number out of his function. That just falls out naturally from parsing as an ulong and casting to long.

The only time you'd need to explicitly check the MSB and do something different is if you want to parse the hex representation of a negative number in 1's complement or sign-and-magnitude format.

12. For bonus points, throw a parsing error if the backtick is not in the expected position (e.g., because the developer didn't copy/paste the whole symbol from the debugger.)

13. Ben Voigt says:

@Evan: Except that the code given *doesn't* parse hexadecimal numbers as ulong.  The only use of ulong is for decimal numbers without leading minus sign.

14. anonymouscommenter says:

@Ben Voigt:

Um, "return (long)ulong.Parse(s);".

Granted, it's long.Parse in the first version, but that was just the "first stab" before figuring out what the function should do.

15. anonymouscommenter says:

Oh, whoops, I'm the one who can't read. Sorry. You're right.

16. Brian_EE says:

@Evan: You're right about it handling negative hex numbers. I focused on the written requirements and failed to observe the implementation.

@Zarat: I believe that falls under the "Garbage In, Garbage Out" principle.

17. anonymouscommenter says:

@Ben Voigt @Evan It's long.Parse that's doing it for hex numbers. The cast to long is creating negative numbers for the large decimal case,

i.e.

long.Parse("8000000000000000", NumberStyles.HexNumber) -> -9223372036854775808

long.Parse("800000000000000", NumberStyles.HexNumber) -> 576460752303423488

(long)ulong.Parse("9223372036854775808") -> -9223372036854775808