My original version of the histogram-generating code that I whipped up for the previous episode of FAIC contained a subtle bug. Can you spot it **without going back and reading the corrected code?**

private static int[] CreateHistogram(IEnumerable<double> data, int buckets, double min, double max)

{

int[] results = new int[buckets];

double multiplier = buckets / (max – min);

foreach (double datum in data)

{

int index = (int) ((datum – min) * multiplier);

if (0 <= index && index < buckets)

results[index] += 1;

}

return results;

}

Note that of course if this were production code, instead of demo code I whipped up in five minutes, it would be a lot more robust in its error detection; the bug that I am looking for is a bona fide error in the logic of the method, rather than things like “the method does not verify that min is smaller than max”, and so on.

A hint: the first time I ran this code and displayed the results, the generated histogram looked fine. Then I made a small change to the arguments and the resulting histogram image was obviously wrong. Can you spot the defect?

It seems a bit odd that you are casting an int to an int inside the array index. It looks like you may have mistakenly included a hint.

You are correct; that was an editing error. I've removed it. Thanks! — Eric

unchecked

{

var value = (Int32)4294967296.0;

Console.WriteLine(value == 0);

// And so on…

}

I'm not following you. How does this relate to the bug in my code? — Eric

CreateHistogram discards all data outside the [min, max) range – specifically, any datum equal to max is discarded. That may be intentional?

That is intentional; the intention is to display the portion of the histogram between the ranges. However, you're on the right track. Does it in fact discard all data outside the desired range? — Eric

Ah tricky. I admit that it took me a few more minutes to see the issue even after I looked up the correct code.

(Spoiler: I knew it would somehow be about negative numbers but the code was correct even when min is negative, at that point I lost hope and cheated)

Hmph. And I thought I knew C#.

Eric, please explain why that doesn't behave the way I expected it to.

Is it related to the relationship of buckets to max-min? If there were only one bucket (which is permitted even if silly) it looks like you could get an index that matches no bucket.

I need to read these kind of puzzles on something other than my iPhone.

The issue I have in mind will come up if there is only one bucket, yes, but the issue I have in mind affects histograms with any number of buckets. — Eric

[Addition to my first comment]

Double is a 64bit type and Int32 is 32bit. The cast will erase a part of the index before we ensure that the initial operation is in range.

Indeed, this is a problem; good catch. However, in this particular case all the data are reasonably sized and the program still has a bug. — Eric

Agreeing with Lars… any datum equal to the maximum incorrectly gets discarded.

And what's the problem with that? It's a histogram of

tens of thousandsofrandomdata; the odds of one of them being exactly the maximum are tiny, and if it is, who cares if one datum in a hundred thousand is accidentally not displayed? We're already not displaying any data outside of the given range. — EricIsn't this skewed by 0.5 in the cast to int?

You tell me. Is it, or isn't it? — Eric

I'm stuck on the cast to int… it looks like you should be performing some actual rounding logic there to make sure index is the appropriate value when dealing with a min and max that are negative.

You're getting close. — Eric

Data that would be in the first bucket before the histogram will be included in the first bucket of the histogram. (Because "(int)x" will evaluate to 0 for x in ]-1,1[)

You got it! — Eric

Short and sweet:

any time (datum-min)*multiplier gives you a result between -1 and 0, not inclusive, the cast to int will always result in a zero.

Thus not all data is excluded, the bucket at index 0 gets one buckets worth of extra data, however much that may be. If you do the cast after checking the range, you will exclude those values.

You got it! Beaten by mere seconds by jhominal. — Eric

Not sure about the C# semantics on the (int) cast from double. My first assumption was that it would truncate, so that eventually, with large enough (or small enough) values of datum, index would wrap around back into the allowed range.

Also, I'm agreeing with the others that if datum == max, it's discarded.

When I tested casting with gcc, I got this behavior

(int)4294967297 -> -2147483648

In fact, casting any large double to int gave me -2147483648.

If I cast it to a long, then cast the long to an int, I got 1, as I would expect with truncation.

Eric, If I were to write this code, I would automatically write it as:

foreach (var datum in data)

{

if(datum>=min && datum<max)

{

bla bla

}

}

and there can be no bug even if I'm not careful. Is there any particular reason you did not choose to write that way?

The only time it goes into the last bucket is when the value is equal to max. Did you mean to use a rounding function when going from double to int instead?

I'm not following your train of thought here. Suppose min is 0, max is 5 and buckets is 10. Multiplier is 2. If one of the data is, say, 4.7 then we multiply by 2 to get 9.4 and then round off to 9, which is the index of the last bucket. But 4.7 is not equal to 5. So I've provided a counterexample to your claim; would you care to clarify the claim? — Eric

I just watched a presentation by Bret Victor last night called Inventing on Principle, and he talks about an interface that gives algorithm developers immediate insight into the results of their functions. If you haven't seen it, you should watch this section of it: http://www.youtube.com/watch

Watching the video and then reading this post made me think: Man, if Eric had this tool, he may not have had that bug.

@Ali I don't think that has anything to do with the bug, does it? Mathematicians would normally write something like (min<=value<max) to more clearly express how value falls inside a range with a lower and an upper bound. Indeed some languages such as Python actually allow that syntax. In languages like C# that don't allow that syntax, (min<=datum && datum<max) is the closest you can get.

This all sounds very familiar. This bears some relation to one of Eric's previous posts, "What's the difference? Remainder vs modulus":

blogs.msdn.com/…/what-s-the-difference-remainder-vs-modulus.aspx

In fact, I think in the comments I linked to an instance of a very similar bug.

@weeble0 I think you misunderstand me, my comment was "first check if the current point is in range, and if it is, only then calculate the index" seems a more natural way to code the algorithm. I wondered if there is some disadvantage I cannot see. (Except that, then there would be no interesting bug/puzzle to talk about)

@Ali Ah, sorry, I did misunderstand you.

private static int[] CreateHistogram(IEnumerable<double> data, int buckets, double min, double max)

{

int[] results = new int[buckets];

double multiplier = buckets / (max – min);

foreach (double datum in data)

{

int index = (int) ((datum – min) * multiplier);

if (0 <= index && index < buckets)

results[index] += 1;

min = datum;

}

return results;

}

The moral of all this: static typing does not replace testing ðŸ˜‰

@Ali

The disadvantage of that method is, that you now need to be *very* careful about rounding errors. I would not be surprised if there are values min<=x<max for which the calculated bucket is outside of the array range. Floating point rounding is a tricky beast.

So I prefer Eric's code (with a fix for his bug of course).

—

Since IMO the build in Random class in .net sucks, I'm currently creating my own RNG. A similar issue crept up:

One feature is generating a uniform double that satisfies min<=x<max from a uniform double that satisfies 0<=x<1.

The naive implementation if this is simply: `return min+(max-min)*Uniform()`. Unfortunately if `Uniform()` returns `1-Epsilon`, the result might get rounded to `max`, violating the contract.

So I've added a do-while loop ensured the result is smaller than max.

(max – min) / buckets ?