Yes, this is possible. Occasionally it is practical, when you need not the best, but the worst solution to a problem.
And I think there are more uses for it. It seems to help with maintaining genetic pool diversity and with avoiding getting stuck in a local maximum – and therefore with finding better solutions to problems.
Some variations of this idea are apparently known, but not widely. And there is a certain bridge between this method and simulated annealing. So I figured it is worth sharing my observations -- and the code accompanying them.
***
Genetic (or evolutionary) algorithms are not first-class citizens in the world of optimization methods. The primary reason is their ability to make tough problems look deceptively simple. A basic implementation of a genetic algorithm needs less than a screen of code; the idea behind it is obvious and intuition-friendly. With that, it is very tempting to ignore the nature and inherent complexity of the problem being solved and just let the CPU cycle, hoping it will eventually crank out a solution.
Only when it fails, after millions of CPU seconds, does one realize that subtle details rule everything here. How is a solution represented as a vector? What are the crossbreeding and mutation strategies? How do you choose the winner, and how (or whether) do you run a tournament? What do you do to avoid convergence to a sub-optimal maximum? These hyper-parameters are difficult to quantify, which makes a systematic search within their space hard. Yet wrong choices slow things down by orders of magnitude. That’s why genetic algorithms often remain an art (if not magic) to a large degree -- larger than many alternatives.
Yet they aren't useless. They tolerate poorly defined or implicit fitness functions. They are somewhat less inhibited by the curse of dimensionality. People do solve real problems with them, sometimes quite productively.
They do have their uses, and they are fun to play with.
And one fun aspect is that you can run them backwards.
That is, you can replace “good” with “bad” – and run not for the best, but for the worst solution. Besides interesting philosophical implications, this also offers a strategy for dealing with local optima convergence.
***
This issue – converging not to a global, but to a local, sub-optimal solution – is inherent to many optimization methods. When a problem has multiple solutions, an algorithm can arrive at *a* solution and then sit there forever, even though a better solution may exist a couple of valleys away. Steven Skiena described that situation very vividly:
“Suppose you wake up in a ski lodge, eager to reach the top of the neighboring peak. Your first transition to gain altitude might be to go upstairs to the top of the building. And then you are trapped. To reach the top of the mountain, you must go downstairs and walk outside, but this violates the requirement that each step has to increase your score.”
Genetic algorithms are not immune to this. A large body of research addresses the problem; to name a few examples:
Sometimes the strategy is to run the algorithm multiple times from new random seeds, hoping to converge to a different solution each time. Sometimes people introduce artificial "species pressure", whereby solutions too similar to each other can't cross-breed and converge to the same point. Or they artificially separate the search space into isolated zones. Or penalize staying too close to an already learned local maximum.
***
So how can running the algorithm in “reverse mode” help with avoiding local maxima?
To illustrate, let’s pretend we are looking for the highest point on the US elevation map:
Assume, for a moment, that somehow (maybe because of an unlucky initial placement, maybe because the map is more complex than this) all solutions in the genetic pool congregated around a wrong local maximum in the Appalachian Mountains:
[Important: the arrows do not imply continuous movement. A genetic algorithm does not work like gradient descent. The arrows serve purely to illustrate (most likely discontinuous) state transitions.]
What happens if you switch the evolution sign at this moment? When the least fit wins, and the worst solution is rewarded, the genetic pool starts running away from the local maximum, initially in all directions:
That diversifies it. Soon, the pool may have very little in common with the solution it initially described. After a while, it may even converge to a local minimum, which I would guess is near the Mississippi delta (though there is not necessarily a point in keeping the evolution “in reverse” for that long):
Now let’s flip the sign of the evolution back to normal. Again, the solutions will start scattering away in all directions from the local minimum:
And, if some of them arrive at a point elevated enough, eventually the whole pool re-converges there:
In a two-dimensional case, this is almost trivial. But in higher dimensionality, where mountains are more difficult to discover, these forth-and-back cycles may need to continue until you are happy with the solution. Or at least happier than in the beginning -- I'm not promising miracles here :))
Why would this work better than restarting each time from a random seed?
It may not always do, though my testing suggests that sometimes it does. The main benefit here is having control over the degree of divergence. If another maximum is expected far away or nearby, the backwards evolution can run for a longer or shorter period, as needed. In that respect, this approach is similar to simulated annealing, which adjusts the annealing schedule to throw just the right degree of chaos into the computation.
But all the above are just my expectations. Would they work in practice?
Let's find out by playing it against a toy problem that is easy to visualize.
Welcome the Scary Parking Lot Problem.
***
Imagine that this is a map of a parking lot at night. There are a few lights producing illuminated patches here and there, and the rest of it is mostly dark:
The darker the area, the scarier it is. Our goal is to find the least scary path from the top left to the bottom right. To be precise, we want to minimize the integral of Fear = 1/(1 + Brightness) along the path. If asked, I would’ve probably charted it like this:
This problem is well suited for a genetic algorithm. Encode a path as a sequence of <x, y> points. A genome is then represented as a sequence <x_{0}, y_{0}, x_{1}, y_{1}, ..., x_{N}, y_{N}> (I used N = 20). Initialize the pool randomly with K = 16 instances colored in green (blue shows the best solution found so far), and kick off the algorithm.
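To make the setup concrete, here is a minimal sketch of the encoding and the fear cost in Python. The names and the sampling scheme are mine, not taken from the original code; the brightness map is assumed to be a 2-D array indexed as [y, x]:

```python
import numpy as np

def path_cost(genome, brightness, samples=10):
    """Approximate the integral of 1/(1 + brightness) along the path.

    `genome` is a flat array [x0, y0, x1, y1, ..., xN, yN].
    """
    pts = genome.reshape(-1, 2)
    cost = 0.0
    for a, b in zip(pts[:-1], pts[1:]):
        seg_len = float(np.linalg.norm(b - a))
        fear = 0.0
        for t in np.linspace(0.0, 1.0, samples, endpoint=False):
            x, y = a + t * (b - a)
            ix = min(max(int(x), 0), brightness.shape[1] - 1)
            iy = min(max(int(y), 0), brightness.shape[0] - 1)
            fear += 1.0 / (1.0 + brightness[iy, ix])
        cost += fear * seg_len / samples   # weight samples by arc length
    return cost

def init_pool(K=16, N=20, W=100, H=100, seed=0):
    """K random paths of N+1 points each, pinned to opposite corners."""
    rng = np.random.default_rng(seed)
    pool = rng.uniform([0, 0], [W, H], size=(K, N + 1, 2))
    pool[:, 0] = [0, 0]            # top-left start
    pool[:, -1] = [W - 1, H - 1]   # bottom-right end
    return pool.reshape(K, -1)
```

On a uniformly dark map (brightness 0 everywhere), the cost of a path degenerates to its length, which makes for an easy sanity check.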
Iteration 0:
In just 500 iterations, the algorithm already hints at a decent approximation:
1000 iterations:
4000:
By 9000, it is mostly straightening out small knots:
18000 generations – final or nearly final:
[The red figure at the bottom left is the "cost", or the total fear factor, of the solution -- the smaller, the better].
***
I tried it on a few dozen randomly generated “parking lots” and discovered that a genetic algorithm solves this problem so well that it is actually difficult to find a field that produces a grossly imperfect result. It took quite a few experiments until I discovered this “foam bubbles” setup, with intentionally harsh contrast and narrow illuminated zones. Constantly forced to make difficult right-or-left choices, and having no gradient feedback, the algorithm responded with some mistakes here.
Start:
500 iterations:
2000:
7000:
18000 (final):
The solution is stuck. Wrapped around a bubble, it won't change to anything better no matter how long you run it. Not coincidentally, it has also lost its genetic diversity.
***
Now let’s try to untangle it by enabling the reversals. Every 6000 iterations we will put the evolution in the “backwards mode” for 210 steps.
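The reversal itself amounts to a sign flip in the selection rule. Here is a simplified steady-state GA sketch of the idea (my own naming and operators, not the original implementation), where in normal mode a better child replaces the worst genome, and during a reversal a worse child replaces the best one:

```python
import numpy as np

def evolve(pool, fitness, iters=18000, reverse_every=6000, reverse_len=210,
           mut_sigma=2.0, seed=0):
    """Steady-state GA with periodic selection reversals.

    Every `reverse_every` iterations, for `reverse_len` steps, selection
    runs backwards: the *worst* solutions win and the best get replaced.
    """
    rng = np.random.default_rng(seed)
    for i in range(iters):
        reversed_mode = i >= reverse_every and i % reverse_every < reverse_len
        costs = np.array([fitness(g) for g in pool])
        # crossbreed two random parents with a uniform mask, then mutate
        a, b = rng.choice(len(pool), size=2, replace=False)
        mask = rng.random(pool.shape[1]) < 0.5
        child = np.where(mask, pool[a], pool[b])
        child = child + rng.normal(0.0, mut_sigma, size=child.shape)
        # normal mode targets the worst genome; reversed mode, the best
        victim = np.argmin(costs) if reversed_mode else np.argmax(costs)
        child_better = fitness(child) < costs[victim]
        if child_better != reversed_mode:  # accept "worse" children in reverse
            pool[victim] = child
    return pool
```

Note that in normal mode the best genome is never replaced, so the best cost in the pool is monotone non-increasing between reversals.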
For the first 6000 generations it is the same game, converging to the already familiar solution:
But then, at 6000, the reversal begins. “Bad” solutions are in favor now. Let’s watch how they develop, spreading away from the original path.
6100:
6200:
At 6210 the sign changes back to normal and we start re-converging – to a different maximum this time!
At 12000:
After 12000, another reversal cycle follows, ending up with the 3rd solution at 18K:
Of all three, the 2nd is the best:
Its cost is 63279.6. It is still not perfect, but the control (without reversals) was 79896.5. The new solution is 20.8% cheaper!
***
But maybe this works just for this one parking lot?
Let’s repeat the experiment 40 times with different “bubble sets”, each randomly generated:
| | Control (18K iterations) | Experiment (18K iterations, with 2 reversals of 210 steps each) |
| --- | --- | --- |
| Average cost of the best solution, relative to a random path | 0.13448 | 0.11605 |
| Standard deviation | 0.034179 | 0.024659 |
| Student t-score of the improvement | N/A | 2.766 |
| How many times the solution found was better than control? | N/A | 30 (out of 40) |
A one-sided t-score of 2.77 with 40 degrees of freedom corresponds (https://en.wikipedia.org/wiki/Student%27s_t-distribution#Table_of_selected_values) to roughly a 99.5% chance that the difference is not accidental -- in other words, that the experiment with reversals indeed performed better than the same 18K iterations run straight. And for 30 parking lots out of 40, the solution found was better than the control.
How does it compare to simply doing three runs of 6K steps each, starting from a new random seed every time, and then selecting the best solution? It is about the same, with reversals perhaps very slightly ahead. Restarts from scratch produced an improvement t-score of 2.554, with 24 runs out of 40 doing better than control.
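For reference, the t-score can be reproduced from the summary statistics alone. A quick sketch using the standard two-sample formula (Welch's form; with equal sample sizes it coincides with the pooled version):

```python
import math

def t_score(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample t statistic; with equal n this equals the pooled form."""
    return (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

# Summary statistics of the 40-run experiment from the table above
t = t_score(0.13448, 0.034179, 40, 0.11605, 0.024659, 40)
print(round(t, 3))  # prints 2.766
```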
***
Finally, why 210 reversal steps? Why not more or less?
This is where the power of the idea materializes. Reversals let you control their duration – and adjust it as needed for the task being solved. By playing with those durations, I produced the following plot of the improvement t-score (relative to the control):
The blue curve shows the results with reversals; the orange one, the results of random resets (with fixed RNG seeding). Obviously, there is fluctuation here, so I can’t strictly claim that reversals are better than random resets. But that does not take away the fact that reversals produced some solutions better than anything else tried.
Finally, here is another plot, showing the number of “parking lots” where solutions with reversals or random reset were better than control:
***
So, is this method a magic bullet? Probably not. It is just a tool, one among many invented to deal with practical problems. But maybe it will be helpful for some.
To summarize:
And how is that question related to Data Science?
For sure, physical properties of Pluto do not change if we call it a "planet", a "dwarf planet", a "candelabrum", or a "sea cow". Pluto stays the same Pluto regardless of all that. For physics or astronomy, naming does not really matter.
Yet we don't expect to find publications on Pluto in the Journal of Marine Biology. They belong to astronomy or planetary science. So, for information storage and retrieval, naming does matter, and matters a lot.
From that standpoint, the distinction is material. When we study "real" planets like Mars or Venus we often mention features that only "real" planets tend to have -- such as atmospheres, tectonic activity, or differentiated internal structure. Small asteroids and "boulders" rarely if ever have any of that. Reflecting those physical differences, the vocabularies would likely be different as well.
Thus, we may classify something as a "planet" if language used with respect to that object follows the same patterns as language used with "real" planets. Simply because it would be easier to store, search and use the information when it's organized that way.
But is that difference detectable? And if yes, is it consistent?
That calls for an experiment:
I ran that experiment. The results suggest that Pluto indeed is a planet with respect to how we talk about it. The details are provided below.
Algorithm Outline
Assuming the documents are represented as a collection of .txt files, for each planetary body:
When these collections (A[a] and idf[a]) are produced for each topic (i.e., one pair for Mars, another for Pluto), compute corpus pairwise cosine similarities:
[Again, several other variations were tested, with this one producing the best classes discrimination in tests]
The resulting metric, J(A,B), is the vocabulary similarity between the sets of documents A and B, ranging from 0 (totally different) to 1 (exactly the same).
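The exact weighting is in the linked code; as a rough sketch of the general shape -- idf-weighted vocabularies compared by cosine similarity -- consider the following (stemming and the exact normalization are omitted; names are mine):

```python
import math
from collections import Counter

def idf_weights(docs):
    """word -> idf weight over one topic's documents (lists of tokens)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))                  # document frequency
    return {w: math.log(n / df[w]) + 1.0 for w in df}

def similarity(wa, wb):
    """Cosine similarity between two idf-weighted vocabularies."""
    common = wa.keys() & wb.keys()
    dot = sum(wa[w] * wb[w] for w in common)
    na = math.sqrt(sum(v * v for v in wa.values()))
    nb = math.sqrt(sum(v * v for v in wb.values()))
    return dot / (na * nb) if na and nb else 0.0

gases = [["helium", "noble", "gas"], ["argon", "noble", "gas", "inert"]]
music = [["sonata", "piano", "concerto"], ["symphony", "piano"]]
print(similarity(idf_weights(gases), idf_weights(music)))  # prints 0.0
```

With no shared vocabulary the similarity is exactly 0; a corpus compared to itself yields 1, matching the stated range of J(A,B).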
The code is here if you want to examine it, except for the Porter stemming algorithm which, as I mentioned, was adopted verbatim from a C# implementation by Brad Patton.
OK. The code is written. How do we know it really works? That's what testing is for.
Test 1. Classical Music vs. Chemistry
For this test, two bodies of text were chosen.
The first consisted of the Wikipedia articles on the inert gases Helium, Neon, Argon, Krypton and Xenon -- the sections starting after the table of contents and continuing up to (but not including) "References" or "See Also". The second corpus consisted of the articles on the classical composers Bach, Beethoven, Mozart and Rachmaninoff, similarly pre-filtered.
The two test subjects were the Wikipedia article about Oxygen (the gas), and the Wikipedia article about Oxygene, the famous musical album by Jean-Michel Jarre, the composer. The question was: can this code properly attribute those test articles as belonging to gases or to music?
After fixing a couple of bugs, some fine-tuning and experimentation, the code kicked in and produced the following output:
| | Gases | Composers | Oxygen (gas) | Oxygene (album) |
| --- | --- | --- | --- | --- |
| Gases | 100% (0.237) | 2.33% (0.092) | 23.3% (0.197) | 1.38% (0.244) |
| Composers | 2.33% (0.092) | 100% (0.225) | 3.61% (0.152) | 5.74% (0.348) |
| Oxygen (gas) | 23.3% (0.197) | 3.61% (0.152) | 100% (0.347) | 2.25% (0.189) |
| Oxygene (album) | 1.38% (0.244) | 5.74% (0.348) | 2.25% (0.189) | 100% (0.587) |
The percentage is the degree of similarity J(A, B). The number in parenthesis is the support metric of how many unique words entered the intersection set, relative to word count of the smaller set. Highlighted is the largest similarity (except to self) in each row. As you can see, the method correctly attributed Oxygen (gas) to gases, and Oxygen (musical album) to Music/Composers.
And yes, I'm explicitly computing the distance matrix twice -- from A to B and from B to A. That is done on purpose as an additional built-in test for implementation correctness.
OK, this was pretty simple. Music and chemistry are very different. Can we try it with subjects somewhat closer related?
Test 2. Different Types of Celestial Bodies, from Wikipedia again
Now we will use seven articles, with all the text up to "References" or "See also" sections:
The resulting similarity matrix:
| | 67P | Betelgeuse | Halley Comet | Mars | Pluto | Sirius | Venus |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 67P | 100% (0.481) | 0.76% (0.259) | 2.8% (0.23) | 1.78% (0.286) | 0.696% (0.234) | 0.297% (0.202) | 1.13% (0.252) |
| Betelgeuse | 0.76% (0.259) | 100% (0.307) | 1.27% (0.183) | 2.66% (0.133) | 1.66% (0.159) | 6.9% (0.204) | 2.86% (0.167) |
| Halley Comet | 2.8% (0.23) | 1.27% (0.183) | 100% (0.393) | 2.05% (0.195) | 2.12% (0.159) | 1.48% (0.15) | 1.97% (0.164) |
| Mars | 1.78% (0.286) | 2.66% (0.133) | 2.05% (0.195) | 100% (0.289) | 3.87% (0.175) | 1.76% (0.19) | 9.11% (0.191) |
| Pluto | 0.696% (0.234) | 1.66% (0.159) | 2.12% (0.159) | 3.87% (0.175) | 100% (0.326) | 1.07% (0.155) | 2.68% (0.15) |
| Sirius | 0.297% (0.202) | 6.9% (0.204) | 1.48% (0.15) | 1.76% (0.19) | 1.07% (0.155) | 100% (0.448) | 2.01% (0.171) |
| Venus | 1.13% (0.252) | 2.86% (0.167) | 1.97% (0.164) | 9.11% (0.191) | 2.68% (0.15) | 2.01% (0.171) | 100% (0.334) |
What do we see? Stars are correctly matched against stars. Comets -- against comets (although not very strongly). "True" planets are matched to "true" planets with much greater confidence. And Pluto? Seems like the closest match among the given choices is... Mars. With the next being... Venus!
So at least Wikipedia, when discussing Pluto, uses more "planet-like" language rather than "comet-like" or "star-like".
To be scientifically honest -- if I add Ceres to that set, the method correctly groups it with Pluto, sensing the "dwarfness" in both. But the next classification choice for each is still strongly a "real" planet rather than anything else. So at least from the standpoint of this classifier, a "dwarf planet" is a tight subset of a "planet".
Now let's put it to real life.
Test 3. Scientific Articles
The 47th Lunar and Planetary Science Conference held in March 2016 featured over two thousand great talks and poster presentations on the most recent discoveries about many bodies of the Solar System, including Mars, Pluto, and 67P/Churyumov–Gerasimenko comet. The program and abstracts of the conference are available here. What if we use them for a test?
This was more difficult than it seemed. For ethical reasons, I did not want to scrape the whole site, choosing instead a small number (12 per subject) of randomly chosen PDF abstracts. Since the document count was low, I decided not to bother with a PDF-parsing IFilter and copied the relevant texts manually. That turned out to be a painful exercise requiring great attention to detail: I needed to exclude author lists (to avoid accidentally matching on people or organization names), exclude references, and manually fix stray line-breaking hyphens in some documents. For any large-scale text retrieval system, this process would definitely need serious automation to work well.
But finally, the results were produced:
| | 67P | Mars | Pluto |
| --- | --- | --- | --- |
| 67P | 100% (0.252) | 17.9% (0.126) | 19.7% (0.132) |
| Mars | 17.9% (0.126) | 100% (0.224) | 21% (0.111) |
| Pluto | 19.7% (0.132) | 21% (0.111) | 100% (0.21) |
The differences are far less pronounced, probably because the texts have very similar origins and format restrictions, and use the same highly specialized scientific vocabulary. Still, within this data set, Pluto was slightly more of a planet than of a comet!
Closing Remarks
How seriously should these results be taken?
First, they are obtained on a small data set, with a home-grown classifier, and the differences detected are relatively modest. If anybody wants to take a serious Data Science driven stance on this issue, they should use a much, much larger text corpus.
Second, people are still free to call Pluto a "planet", a "dwarf planet", or a "candelabrum" if that suits them -- regardless of these results.
But there are important points that I wanted to make. First, language use may offer another definition of what is more practical to call "a planet". Second, the language differences behind that definition are objective and measurable. And finally, those differences are representative of the physical world.
Thank you for reading,
Eugene V. Bobukh
See Part 1 here.
No.
There seems to be a fundamental limit on fuzzing curve extrapolation.
To see that, consider bug distribution function of the following form:
where p_{0} >> p_{1} but a_{0} ≈ a_{1} and δ(x) is a Dirac delta-function.
For such a distribution U(N) is:
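The two formulas referenced here were images in the original post. Reconstructed from the surrounding text and the notation of (140), they plausibly read:

```latex
G(p) = a_0\,\delta(p - p_0) + a_1\,\delta(p - p_1), \qquad p_0 \gg p_1,\quad a_0 \approx a_1
```

and, substituting into (100):

```latex
U(N) = a_0\left(1 - e^{-p_0 N}\right) + a_1\left(1 - e^{-p_1 N}\right) \tag{150}
```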
At early fuzzing stages, the 2nd term is so much smaller than the first one that it remains undetectable. Under that condition, only the first term would be considered, with the expected U(N) value at infinity of a_{0}. However, the true bug yield for that curve is a_{0} + a_{1}! But we will not notice that until we've done enough fuzzing to "see" the 2nd term. It is possible to show that it will become detectable only after more than the following number of iterations is performed:
Prior to that, the extrapolation of the above fuzzing curve will produce grossly incorrect results.
Since (in general) any fuzzing curve can potentially carry terms similar to the last one in (150), no matter how long we've fuzzed, predicting what the curve looks like beyond a certain horizon is impossible. If that reasoning is correct, only short- or mid-term fuzzing extrapolation is possible.
What if we obtain G(p) not from the observations but use some theory to construct it in the most generic form?
Unfortunately, it seems that G(p) is more a function of a fuzzer than of a product.
Often, you run a sophisticated fuzzer for countless millions of iterations and beat seemingly all the dust out of your product. The next day, someone shows up with a new fuzzer made out of 5 lines of Python code and finds new good bugs. Many. That's hard to explain if bug discovery probability were mostly a property of a product, not of a fuzzer.
[Also, fuzzing stack traces often tend to be very deep, with many dozens of frames, suggesting that the volume of functionality space being fuzzed is so large that most fuzzers probably only scratch the surface of it rather than completely exhaust it]
If that is correct, fuzzer diversity should work better than fuzzing duration. Twelve different fuzzers running for a month should be able to produce more good bugs than one fuzzer running for a year.
That said, constructing G(p) from pure theory, and therefore finding a fuzzing curve approximation that works better than the generic form (140), might be possible – but for a specific fuzzer only, using the knowledge of that fuzzer's intrinsic properties and modes of operation.
The only right answer from the security standpoint is "indefinitely"!
However, from the business perspective, the following return-on-investment (ROI) analysis may make sense and lead to a different answer.
Fuzzing has its dollar cost. That’s because infrastructure, machines, electric power, triaging, filing and fixing bugs cost money, not to mention paying the people who do all that.
Not fuzzing has its cost as well. The most prominent part of it is the cost of patching vulnerabilities that could’ve been prevented by fuzzing.
Let’s introduce the cost of fuzzing per iteration, X(N). In addition, there is a bug management cost Y per bug, and a one-time cost C of enabling the fuzzing infrastructure. Thus, the cost of running N fuzzing iterations per parser is:
Not running fuzzing introduces security bugs. Let’s denote the average cost of responding to a security vulnerability by T, and the probability that the response will be required by q (assuming, for simplicity, that q is constant). If there are B total bugs in the product, and the fuzzer in question has run for N iterations, it would’ve found U(N) bugs. Then B - U(N) bugs would remain in the product, and some of them would be found and exploited. Therefore, the cost of not running the fuzzer after the initial N iterations is q*T*(B-U(N)).
Combined together, these two figures give the cost of a decision to run fuzzing for N iterations and then stopping, as a function of N:
From purely the ROI standpoint, fuzzing needs to continue while the increase of iterations decreases the total cost:
which translates to:
Let’s introduce ΔN(N) – the number of fuzzing iterations required to find one more bug. By definition,
Multiplying (220) by ΔN and slightly rearranging, we arrive at:
This is a generic stopping condition for static fuzzing.
It says: “fuzz until the cost of finding the next bug plus managing it exceeds the cost of addressing a vulnerability multiplied by the chance such vulnerability would be found or exploited”.
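The equations referenced above were images in the original post. Reconstructed from the definitions of X, Y, C, T, q and U(N) (the numbering is inferred from the in-text references), the chain plausibly reads:

```latex
\mathrm{Cost}(N) = C + N X + Y\,U(N) + q T \bigl(B - U(N)\bigr) \tag{200}
\frac{d\,\mathrm{Cost}}{dN} < 0 \tag{210}
X + (Y - q T)\,U'(N) < 0 \tag{220}
X\,\Delta N + Y < q T \tag{230}
```

where the last line uses ΔN(N) = 1/U'(N), and (230) is exactly the plain-language stopping condition stated above.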
Since fuzzing with no bugs found (yet) costs no less than fuzzing until the first bug is hit, the condition above can be generalized to cover situations with or without bugs found yet:
Similar logic could be applied to the attacker’s side and similar conclusion would be derived, but with "vulnerability cost" replaced by "the benefits of exploiting one vulnerability". Comparing the attacker's and the defenders' ROIs then leads to some conclusions:
Some fuzzers learn from what they find to better aim their efforts. Their bug search is not completely random. However, a slight modification to the same theory permits describing their output with the same equation (140) where individual bugs are replaced by small (3-15 in size) bug clusters.
1. In many practical cases, a fuzzing curve can be modeled as
where a_{i} and p_{i} are some positive fixed parameters and D is small, on the order of 0 to 3.
2. Two fuzzers working for a month are better than one fuzzer working for 2 months.
3. There is a limit on how far into the future a static fuzzing curve could be extrapolated, and it seems to be fundamental (not depending on the method of extrapolation).
4. State-sponsored, non-profit-driven attackers are likely to fuzz more than ROI-driven software manufacturers. That said, the defenders have no choice but to continue fuzzing and to invest in fuzzing diversity and sophistication (rather than pure duration).
To Tim Burrell, Graham Calladine, Patrice Godefroid, Matt Miller, Lars Opstad, Andy Renk, Tony Rice, Gavin Thomas for valuable feedback, support, and discussions. To Charlie Miller for his permission to reference data in his work.
While fuzzing, you may need to extrapolate or describe analytically a "fuzzing curve", which is the dependency between the number of bugs found and the count of fuzzing inputs. Here I will share my approach to deriving an analytical expression for that curve. The results could be applied for bug flow forecasting, return-on-investment (ROI) analysis, and general theoretical understanding of some aspects of fuzzing.
If you are wondering about things like "what is fuzzing?" here is a little FAQ:
Q: So, what is fuzzing?
A: (Wikipedia's definition): "Fuzz testing or fuzzing is a software testing technique <…> that involves providing invalid, unexpected, or random data to the inputs of a computer program. The program is then monitored for exceptions such as crashes, or failing built-in code assertions or for finding potential memory leaks. Fuzzing is commonly used to test for security problems in software or computer systems."
Basically, every time you throw intentionally corrupted or random input at the parser and hope it will crash, you do fuzzing.
Q: What is a fuzzing curve?
A: That is simply the dependency between the number of unique bugs (e.g., crashes) found and the number of inputs provided to the program, also called “iterations”. A fuzzing curve, being a special case of a bug discovery curve, may look like this:
[For the record, the data is simulated and no real products have suffered while making this chart.]
Q: Why do we need to know fuzzing curve equation?
A: If you don't know the true equation of the curve, you can't justifiably extrapolate it. If the choice of extrapolation function is random, so is the extrapolation result, as the picture below illustrates:
Imagine you tell your VP that you need to continue fuzzing and budget $2M for that. They may reasonably ask "why?" At that point, you need a better answer than "because we assumed it was a power law", right?
Q: Has there been any past research of this area?
A: Plenty, although I haven't yet seen a fuzzing curve equation published. To mention a couple of examples:
Q: Can this work replace any of the following:
A: No.
Q: Can it provide any of the following:
A: No.
If you are still interested – let's continue after the Standard Disclaimer:
<Disclaimer>
Information and opinions expressed in this post do not necessarily reflect those of my employer (Microsoft). My recommendations in no way imply that Microsoft officially recommends doing the same. Any phrases where it may look as if I don't care about customer's security or recommend practices that increase the risk should be treated as purely accidental failures of expression. No figures or statements given here imply any conclusions about the actual state of security of Microsoft's or any other products. The intent of this text is to share knowledge that aims to improve security of software products in the face of limited time and resources allocated for development. The emotional content represents my personal ways of writing rather than any part of the emotional atmosphere within the company. While I did my best to verify the correctness of my calculations, they may still contain errors, so I assume no responsibility for using (or not using) them.
</Disclaimer>
The following conditions, further referred to as "static fuzzing", are assumed:
Let's start with a hypothetical product with B bugs in it where each bug has exactly the same fixed probability p > 0 of being discovered upon one fuzzing iteration.
Actually, the "bug" does not have to be a crash, and "fuzzing iteration" does not have to apply exactly against a parser. A random URLs mutator looking for XSS bugs would be described by the same math.
Consider one of these bugs. If the probability to discover it in one iteration is p, then the probability of not discovering it is 1 – p.
What are the chances that this bug will not be discovered after N iterations, then? In static fuzzing, each non-discovery is an independent event for that bug, and therefore the answer is:
Then, the chance that it will be eventually discovered after N iterations is:
Provided there are B bugs in the product, each one having the same probability P(N) of being discovered by iteration N, the mathematical expectation of the number of unique bugs U found by iteration N is:
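The formulas of steps (10)-(30) were images in the original; from the text they read P(not found in N) = (1-p)^N, P(found) = 1 - (1-p)^N, and U(N) = B*(1 - (1-p)^N). This is easy to sanity-check with a quick Monte Carlo sketch (the parameters are arbitrary, chosen only for illustration):

```python
import random

def simulate_unique_bugs(B, p, N, trials=2000, seed=1):
    """Average count of bugs found at least once in N iterations, when
    each of B bugs independently has per-iteration discovery chance p."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        for _ in range(B):
            # P(bug never found in N tries) = (1 - p)^N
            if rng.random() > (1.0 - p) ** N:
                total += 1
    return total / trials

B, p, N = 50, 0.01, 200
print(B * (1 - (1 - p) ** N), simulate_unique_bugs(B, p, N))
# both numbers come out near 43.3
```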
If all bugs in the product had the same discovery probability that would’ve been the answer. However, real bugs differ by their “reachability” and therefore by the chance of being found. Let’s account for that.
Consider a product with two bug populations, B_{1} and B_{2}, with discovery probabilities of p_{1} and p_{2} respectively. For each population, the same reasoning as in (10)-(30) applies, and (30) changes to:
Extending that logic onto an arbitrary number Q of populations {B_{q}} one arrives at:
(q enumerates bug populations here)
Let’s introduce the bug distribution function over discovery probability, G(p), defined by:
B_{q}(p) = G(p_{q})*dp
Then (50) is replaced by integration:
Let’s recall that (1 - p)^{N} = e^{N*Ln(1 - p)}. In the absence of super-crashers, G(p) is significant only at very small probabilities p << 1. Thus, the logarithm could be expanded into a first-order series as Ln(1 - p) ≈ -p, and then:
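Rendered as formulas (the original showed images; the numbering is inferred from the in-text references), this step plausibly reads:

```latex
U(N) = \int_0^1 G(p)\,\bigl(1 - (1-p)^N\bigr)\,dp \tag{60}
U(N) \approx \int_0^1 G(p)\,\bigl(1 - e^{-pN}\bigr)\,dp \tag{100}
```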
This is the generic analytical expression for a static fuzzing curve.
Can it be simplified further? I believe the answer is "no", and here is why.
Let's look at the first derivative of U(N) which is easy to obtain from (100):
Since G(p) is a distribution function, it is non-negative, as well as the exponent and p. Therefore,
for any N.
Similarly, the 2nd derivative could be analyzed to discover that it is strictly non-positive:
Repeating the differentiation, it is easy to show that
for any n>=0.
That is a very prominent property. It is called total (or complete) monotonicity, and only a few functions possess it. Sergei Bernstein showed in 1928 [see Wikipedia summary] that any function that is totally monotonic must be a mixture of exponential functions, and its most general representation is:
where m(x) is non-negative.
In our case, the first derivative of U(N) is totally monotonic. Therefore, U(N), with the possible addition of linear terms like a + b*N, belongs to the same class of functions. Therefore, (100) is already a natural representation for U(N), and there is no simpler one. Any attempt to represent U(N) in an alternative, non-identical way (e.g., via a finite polynomial) would destroy the complete monotonicity, and that would inevitably surface as errors or peculiarities in further operations with U(N).
So let’s settle upon (100) as the equation of a static fuzzing curve.
But wait! Expression (100) depends on the unknown function G(p). So it's not really the answer. It just defines one unknown thing through another. How is that supposed to help?
In a sense, that's true. The long and official response is this: you start fuzzing, observe the initial part of your fuzzing curve, solve the inverse problem to restore G(p) from it, plug it back into (100), and arrive at the analytical expression for your specific curve.
But here is a little shortcut that can make things easier.
Every function can be represented (at least for integration purposes) as a collection of Dirac delta-functions, so let's do that with G(p):
G(p) = Σ_{i}a_{i}*δ(p - p_{i})
where a_{i} > 0 and p_{i} > 0 are some fixed parameters.
Most of the time, only small ranges of probabilities contribute significantly to fuzzing results. That's because "easy" bugs are already weeded out, and "tough" ones are mostly unreachable. To reflect that, let's truncate the infinite sum at a fixed small degree D corresponding to the most impactful probability ranges:
G(p) ≈ Σ_{i=1}^{D}a_{i}*δ(p - p_{i})
Substituting that into (100) results in the following fuzzing curve equation:
U(N) = Σ_{i=1}^{D}a_{i}*(1 - e^{-p_{i}*N}) (140)
After some experiments I found that D = 1 already fits and extrapolates real fuzzing curves reasonably well, while D = 3 produces a virtually perfect fit -- at least on the data available to me.
Finally, at the very early fuzzing stages, when p_{i}*N << 1 for all i, (140) could be reduced even further to a simple straight line:
U(N) ≈ (Σ_{i=1}^{D}a_{i}*p_{i})*N
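To illustrate, here is a small simulation sketch (the bug probabilities are synthetic, not data from any real fuzzer): each of 50 hypothetical bugs triggers independently with probability p_i per iteration, and the count of distinct bugs found after N iterations tracks the analytic curve Σ(1 - (1 - p_{i})^{N}), which the exponential form above approximates for small p.

```python
import random

random.seed(7)
# 50 synthetic bugs with log-uniform trigger probabilities in [1e-4, 1e-2]
probs = [10 ** random.uniform(-4, -2) for _ in range(50)]

def simulate_unique_bugs(probs, iterations, rng):
    """Run a toy fuzzer: count distinct bugs hit at least once."""
    found = set()
    for _ in range(iterations):
        for i, p in enumerate(probs):
            if rng.random() < p:
                found.add(i)
    return len(found)

def analytic_unique_bugs(probs, iterations):
    """Expected distinct bugs: sum_i (1 - (1 - p_i)^N)."""
    return sum(1 - (1 - p) ** iterations for p in probs)

rng = random.Random(42)
N = 2000
empirical = simulate_unique_bugs(probs, N, rng)
expected = analytic_unique_bugs(probs, N)
# Empirical curve stays within a few standard deviations of the analytic one
assert abs(empirical - expected) < 15
```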
***
This is just a summary of the previous chapters as a flow chart (click here for the derivation of the method):
Here variable meanings are:
That’s it – I hope you find it useful for assessing the hidden part of your product's security "iceberg".
Obviously, the method has its limitations, and some of them should be named:
That's it. Thank you for reading and have a great day.
***
That simple logic is nice, but practice makes it questionable for at least two reasons:
The good news is that in “classic” software with a clear notion of “version” (such as an operating system or a browser) both effects could be accounted for.
The 2^{nd} one is easier. By shipping version 1.0 of the product to the customer you effectively create a nearly frozen snapshot of it, affected only by bug fixes. External Researchers will be testing that version. Of course, Internal Engineering will be working with version 2.0, which is different. But it is usually quite easy to filter their discoveries (for analytical purposes) down to the codebase shared with 1.0. So if we apply the method to the codebase common to 1.0 and 2.0, we can work around issue #2.
#1 is trickier. Intuitively it's clear that shared bugs can still appear in a changing product if both parties hit the same bug while it's active – i.e., between the moments it was found and fixed. But knowing the quantity of such discoveries requires more precise tracking of bug counts at each intermediate moment of time.
And one of the ways to achieve that is to describe bug populations with a system of differential equations.
The next few pages treat that subject. If you are not interested, just jump directly to the Step by Step Guide where the results are applied.
Let’s introduce the following variables:
Each of them is a function of observation time t passed since version 1.0 of the product was released.
We will assume that bugs are found and fixed with rates linearly proportional to their amount. While that is obviously a simplification, it is probably somewhat close to the reality, since when bug counts increase more developer resources are poured in to fix them.
That will bring along a few more variables:
Strictly speaking, none of these variables are constants. They change over time and as bug counts change. But we need to start somewhere, so let's use that model as a first order approximation.
We will also assume that f is the same for all bugs regardless of whether they are internal or external. The case of significant difference could be reduced to the solution of uniform f, as will be shown later.
In these variables, the dynamics of each bug category is described by the following system of differential equations:
The first line says that active External bugs E are found at a rate of x*B per year (proportional to the total bug count B in the system), are fixed at a rate of f*E (proportional to their own count E), and sometimes move from the External to the Shared category when Internals independently find them, which happens at a rate of y*E.
The 2^{nd} equation is the same but with respect to the Internal active bugs count I.
The 3^{rd} says that active Shared bugs are created when one party hits another's existing active bugs, and destroyed by fixing with a common fix rate f. We assume here that internal sources don't have knowledge of external bugs to check for duplicates before filing. That's not exactly the case but is likely close to reality; if needed, a modification to account for that effect is relatively easy.
Finally, the 4^{th} equation describes the overall bug population, and states simply that the total bug count in the product is reduced when active bugs are fixed.
There is no rocket science behind solving that system, but it takes a few pages if done by hand on paper, so I'll spare you that and will just write down the answer that shows how each bug category behaves over time:
where
And B_{0} is the total initial bug count in the system.
Hmm?…
This is ugly.
Extracting B_{0} from that system is technically possible. You "just" fit those curves to the observed bug rates… in theory… and obtain miserable nonsense in practice because of the data noise, model imperfections, and unmanageable complexity.
Fortunately, we don’t need full precision here. In real life we often deal only with special cases – such as very short or very long observation times, or very high fix frequency, and so on. And it turns out that in this case it is possible to effectively “tile” most practically useful situations with those simpler special cases.
So let’s derive them from the general form.
At very short observation durations, such that t << 1/f and t << 1/(x+y), functions [30] – [60] can be expanded into low-order Taylor series in t, and then it is possible to show that:
Since nobody prevents us from declaring t = 0 at any arbitrary moment of the product's lifetime, this means a simple and very powerful thing: as long as there are statistically enough active bugs of the Internal, External, and Shared categories opened far less than 1/f time ago, the total bug count in the system can be estimated at the current moment of time as simply as:
B ≈ I*E/S [80]
This works regardless of the frequency of product fixes and (as will be shown later) even if there is an asymmetry between fix rates for internal and external bugs.
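For the curious, here is a tiny numeric sketch of that claim. It integrates my reading of system [20] (reconstructed from the verbal description above; all rate constants and the initial bug count are made-up illustration values) with a forward-Euler loop. The dB/dt term encodes the assumption that fixing any unique active bug (there are E + I - S of them) removes it from the product.

```python
# Bug-flow system, as described in words above (a sketch, not the
# author's exact equations):
#   dE/dt = x*B - (f + y)*E
#   dI/dt = y*B - (f + x)*I
#   dS/dt = x*I + y*E - f*S
#   dB/dt = -f*(E + I - S)   # unique active bugs being fixed (assumption)
x, y, f, B0 = 0.2, 0.4, 2.0, 10000.0   # illustrative per-year rates
E = I = S = 0.0
B = B0
t, dt = 0.0, 1e-4
while t < 0.05:                        # t << 1/f and t << 1/(x+y)
    dE = x * B - (f + y) * E
    dI = y * B - (f + x) * I
    dS = x * I + y * E - f * S
    dB = -f * (E + I - S)
    E += dE * dt; I += dI * dt; S += dS * dt; B += dB * dt
    t += dt

estimate = I * E / S                   # short-time estimate [80]
assert abs(estimate - B0) / B0 < 0.1   # within ~10% of the true B0
```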
At very long observation times such that t >> 1/f and t >> 1/(x+y) most of the bugs found are fixed, so there are very few (if any at all) of them in the active state suitable for use in expression [80].
However, fixed bugs of each category will probably be numerous. Can we use their counts instead?
Let’s call them E_{f}, I_{f}, and S_{f} -- similarly to the corresponding active categories (index “f” stands for “fixed”). Each of those is obtained by integrating the corresponding active-bug fix rate over time, e.g.:
E_{f}(t) = ∫_{0}^{t}f*E(τ)dτ
Using these variables and simplifying the full solution for the case of t -> ∞, one can show that
where
[And estimates for x and y could be used to verify if indeed t >> 1/(x+y)]
The relation between the current bug count B(t) and the initial one B_{0} is:
When f >> (x+y) another useful simplification arises for times such that t >> 1/f:
And then
with x and y provided by [110] and [120].
This set of simplifications actually covers (mostly) all of the problem space. Indeed:
a) If there is statistically significant number of active bugs in the product, expression [80] approximately solves the problem, since we can declare t = 0 at any moment of time.
b) If not, but there is statistically significant number of fixed/closed bugs in the product that means the observation time is far greater than typical fix time. In other words, we are in the t >> 1/f domain. Since most bugs are in the fixed rather than active state that also means f >> (x+y) and expressions [170] and [180] provide the answer.
c) In other cases there are either not enough bugs to make a call, or we are in the "narrow" intermediate time window of t ≈ 1/f. We should either wait, or use the full solution [30]-[60].
Situations when fix rates for internal and external bugs are significantly different are conceivable. How would our logic work for them?
Let f be the fix rate of internal bugs, αf – fix rate of external bugs, and βf – of shared bugs. System [20] becomes then:
Technically, it could be solved from scratch, but we can avoid that by making variable substitution:
and introducing an intermediate variable
That would result in the solutions of the same form as [30] – [60] with the following input variable replacements:
For E(t): t -> αt, x-> x/α, y->y/α [210]
For S(t): t -> βt, x-> x/β, y->y/β [220]
The short-term estimate then is exactly the same as [80] (i.e., not sensitive to the asymmetry):
Long-term is more difficult because sufficiently high fix rate discrepancy may cause a situation where ft << 1 while α*ft >> 1 at the same time. Technically, the answer would need to be derived from full case [30]-[60], but there is one potential alternative here.
That is to consider the internal product version as the primary target, with f being the frequency of internal fixes, and E meaning not all externally reported bugs but only those that affect the internal product version currently in development. If the fix rate discrepancy is smaller for the internal version (which is probably more reasonable to expect), long-term estimates [170]-[180] may still be applicable within that interpretation, provided the following changes in variable interpretations:
Also the expressions for x and y in this case change to
***
Probably every piece of software has some defects in it. Known defects (also called bugs) are found by manufacturers and users and fixed. Unknown ones remain there, waiting to be discovered some day.
The question is: how big is that unknown set?
One approach to that problem is presented here. It is not the most precise and definitely has some limitations. But it is relatively simple and straightforward to derive, with no secret know-how, so I decided to share it.
Every time I mention "counting unknown bugs" questions arise that are worth a little FAQ, so here it is:
<Little FAQ>
Q: Why would you even need to know the count of unknown bugs?
A: First, it sets the expectations. Knowing how many bugs are still there puts an upper cap on your support/response story. Second, it provides a quantitative assessment of engineering effectiveness, by comparing what’s been found to what still remains in there.
Q: Is it even possible to count all unknown bugs?
A: Strictly speaking, no. Some "bugs" are infinite by nature (e.g., new feature suggestions). Some bugs may belong to classes yet unknown -- for example, there was no way to count CSRF bugs before CSRF was invented. So we are talking here about counting unknown bugs of a known, specific class/nature, defined at today's level of technology.
Q: Any examples of those classes?
A: It’s got to be something of a common nature, with similar lifetime, discovery methods, and statistical properties. For instance:
Q: The idea of "total bug count" looks like an abstract theory rather than something with a practical meaning. Is it anything material?
A: To translate that into the practical world, consider these measurable examples:
Q: Fine, the theory may look good, but can it be validated? Does it have a predictive power?
A: Here are some ways to validate it that I see.
The first is side by side products comparison. If products A and B of similar functionality and similar user base have vastly different hidden bug counts, it is reasonable to expect that their post-release defects discovery rates would be similarly different.
Second, this method derives an analytical description of some bug curves over time. That amounts to a long-term bug discovery rate forecast. Unfortunately, I have had a chance to do only very rudimentary work on such forecasting, so this still needs more research. Yet this approach could probably be used for validation purposes as well.
</Little FAQ>
With that, let's go to the theory.
<Disclaimer>
While I did my best to verify the correctness of calculations, they may still contain errors, so I assume no responsibility for using (or not using) them.
The method presented here is a research work and carries no guarantees.
Opinions expressed in the post do not necessarily reflect those of my employer (Microsoft). My recommendations in no way imply that Microsoft officially supports them or recommends doing the same. Any phrases where it may look as if I don't care about customer's security, or recommend practices that seemingly increase the risk should be treated as purely accidental failures of expression. No figures or statements given here imply any conclusions about the actual state of security of any Microsoft products. All numeric figures are purely fictitious examples.
The intent of this text is to share the knowledge and improve security of software products across the industry. The emotional content represents my personal ways of writing rather than any part of the emotional atmosphere within the company.
</Disclaimer>
The approach is based on the Capture & Recapture method that has been used in biology and criminalistics since at least the 1930s. Conceptually, it is very simple.
When two entities independently search for something in a shared space (e.g., for birds in a forest), they occasionally stumble upon each other's past discoveries. The tighter the search space, the more often they will bump into each other's finds. By measuring the frequency of such encounters it is possible to reason about the total size of the space being searched.
To formalize it, let's use a fictitious product example. In that product, no bug fixes or code changes ever happen at all. Two groups of people randomly and independently of each other look for bugs in it. The first group is called Internal Engineering, the 2nd -- External Researchers. Think of them as of two projectors randomly sweeping light rays through the product and sometimes highlighting the same spots:
Let’s put that into basic math:
<The Math>
Let B be the total count of bugs of a particular class in the product, both known and unknown.
Detection effectiveness of the Internal Engineering process is y per year. That means, they find I = y*B bugs in one year.
Detection & reporting effectiveness of External Researchers is x. They find and report E = x*B bugs through the same time.
If these processes are independent, then it does not matter whether External Researchers are looking for already known or yet unknown bugs. In either case their chances of success are x. Therefore they will find x of those bugs that have already been found internally, which would be x*I bugs. These are the shared bugs, known to both Internal and External parties. Let’s denote their count through S = x*I.
Now let's substitute into that the already known value of I = y*B, so S = x*I = x*y*B.
Then let's construct an expression of the form I*E/S. What does it evaluate to?
It’s easy: I*E/S = (y*B)*(x*B)/(x*y*B) = B*B/B = B.
In other words, the total number of bugs (both known and unknown) in such an ideal product is estimated simply as:
B = I*E/S [10]
</The Math>
For example, let's say that in one year Internal Engineering has found 40 bugs in a product. External Researchers have found 20 bugs in the same time, and out of those twenty 4 bugs were the same as internally known bugs. That means, Externals were capable of detecting 4/20 = 20% of known bugs. If unknown bugs are not fundamentally different, that means Externals' detection rate is also 20% for unknown (and all) bugs. And that 20% efficiency is represented by 20 external discoveries. Therefore, the total bug count in the system is 20/0.20 = 100.
Although very simple, formula [10] already enables reasonable estimates in many cases. If a product in question is not changing much through the duration of observation, just multiplying the number of internal finds by the number of external ones, and dividing that by the number of shared bugs is sufficient to assess the remaining defects count.
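Here is a minimal simulation sketch of that logic (all counts and detection rates are fictitious, in the spirit of the example above): two parties independently sample the same hidden pool of B bugs, and formula [10] recovers the pool size from their overlap.

```python
import random

random.seed(1)
B_true = 500                       # hidden total bug count (illustrative)
p_internal, p_external = 0.30, 0.15  # illustrative detection rates

# Each party independently finds each bug with its own probability
internal = {b for b in range(B_true) if random.random() < p_internal}
external = {b for b in range(B_true) if random.random() < p_external}
shared = internal & external       # bugs found by both parties

# Capture-recapture estimate [10]: B = I*E/S
B_hat = len(internal) * len(external) / len(shared)
# E[shared] = B_true * p_internal * p_external ~ 22, so the (noisy)
# estimate should land reasonably close to the true 500
assert 0.5 * B_true < B_hat < 2.0 * B_true
```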
And how do you count shared bugs? Make a list of bugs that affected version 1.0. Cross out those that also affected 2.0. The remaining ones hit 1.0 exclusively, so most likely we did something to 2.0 that prevented them. Look at the 2.0 code where those bugs originally lived in version 1.0 and see what's changed there. Often, that would be a fix to an internally found bug -- one made in complete oblivion of the externally reported problem :)) Yet this is a shared find. Count it.
Another source is your Duplicate bugs. See if any of them are resolved to the externally reported cases.
***
Problem introduction and disclaimer.
Security Review Heuristics Zoo.
Or rather a few closing notes...
Can you quantify "product security"?
Usually when people start talking about "X being 23% more secure than Y" I just snort. However, with the notion of features interaction complexity, we can at least try to approach that problem from a quantitative side.
The idea is to assign a numeric security measure L to a product if all features interactions up to the complexity level L for most industry-recognized representations of the product have been analyzed and checked for security. Yes, "most representations" is not very precise, but as in practice there are not so many of them (about 10?), this still may offer a workable yardstick for a quantitative assessment. The scale would look then something like this:
Level 0. No systematic security work has been done even with individual features. Well-recognized security industry standards are likely violated (e.g., credit card numbers are sent over WiFi in clear). A product almost certainly contains glaring security flaws easily discovered by non-professionals.
Level 1. Security coverage has been achieved for isolated features of a product. It is probably free from the most common security failures associated with its domain of operation. No pairwise component interactions have been seriously considered though, so security weaknesses still arise in corner customer scenarios; those would require some basic level of security expertise to recognize. (E.g., a customer is authenticated at the product's Web site, but critical operations do not verify that the authentication has not expired yet, so a CSRF is conceivable.)
Level 2. All elements and their pairwise interactions are covered for most of the industry-recognized ways to re-factor a product into elements. Theoretically, a well-done Threat Modeling can achieve that, at least from the design standpoint. Finding a security hole would require pushing a product very far into an unexpected state. Most pentesting vendors would still be able to discover scattered holes in such a product with enterprise-level funding (e.g., via fuzzing).
Level 3. Exhaustive verification of all three-way interactions is conducted for multiple re-projections. A product is solid against most flavors of nearly all industry-recognized attacks. Finding a new security hole in it costs O(N^{4}) effort with respect to the number of elements logically comprising a product. In practice this would require at least billions of attempts. State-sponsored agencies can probably still run effective security offence against products at that security level, with costs expressed in 8-9 digit figures.
Level 4. All 4-way interactions have been verified as safe. I would say this is what a military-grade security *should* look like. Compromising such a product would take computational resources of leading world superpowers and cost significant chunks of their yearly GDPs. Security holes at this level would literally be solitary, unique in nature, and kept under high secrecy as important military assets.
And so on.
Security Career Roles
Every company needs "security people", but whom exactly? While most security specialists can wear multiple hats at the same time, they also have their natural preferences. And those preferences could be classified within the definitions of this article as follows:
Note: standard Disclaimer expressed in Part I applies here as well.
Heuristic 5: "Area Expertise" and "Penetration Testing"
These two seemingly different techniques share a lot in how they approach managing the complexity of security reviews, so I will consider them together.
"Area Expertise" is simply learning. Studying a technology long enough can make one a Subject Matter Expert (an "SME") who knows all subtle interactions within the area and can recognize possible security issues there quickly. This approach is orthogonal to an "audit". While "audit" spans a horizontal area across the product, an expertise would tend to occupy a deep "vertical" cut of a limited width. E.g., an expert may know how to write interfaces with static methods on them in Intermediate Language -- but would have no idea about network security:
Penetration testing is a technique that seeks (to some degree) to re-create SME effect on a faster timescale. It starts with carefully choosing a sub-set of a product "surface" that appears to be softer. Then, with a help of tools and deep design dives it seeks to pass through the L1-L2 complexity levels and gain a control over a set of more complex interactions, typically at Level 3 and deeper. As many security heuristics would have patchy coverage at complexities above L2, chances are good that executing or examining that functionality would reveal new security holes.
Importantly, pentesting cannot be used to ensure the security of large products. As it focuses on higher-order interactions, it would need at least something like O(N^{3}) time for that. Time and budget limitations would prevent such an algorithm from scaling across the entire product. Pentesting resolves this contradiction by reducing N – in other words, by precisely aiming at a smaller part of a product to break through.
Therefore, the value of pentesting is akin to that of geological drilling. It measures product "security" by assessing the difficulty of breaking through its L1-L2 security "crust", and delivers samples not easily found on the surface.
Heuristic 6: "Learning from Attackers"
If you are an enterprise, you may have something like 10^{4}-10^{16} automatic security checks at your disposal to run against your product before giving it to the user. On top of that, there are 10^{2}-10^{5} human verification questions available to apply, too. Seems like a lot? But a product with just 100 elements can have up to 10^{30} possible element interaction states! Even if 99.999% of them are void, the remainder still represents an overwhelmingly huge space, beyond any technical or human capability. Within that vast space, where should we place our "tiny" 10^{16} security checks?
Well, the good news is that we actually don’t need to find all security holes. We need to find only those that attackers can and will (eventually) find. With that leverage, the effectiveness of the security assurance can be boosted by orders of magnitude, enabling practical and effective security defense.
What specifically could be done?
First, you can study past attacks against your product or other similar ones on the market. How are they typically attacked? What are the top 5 classes of security vulnerabilities they had faced? http://cve.mitre.org/ could be a good starting point for such an assessment.
Second, you can attend security conferences and talk to security researchers. Or, if travel budget is an issue, just read the abstracts of their presentations and summaries of their talks. Even in such form, the knowledge gained can help you predict how your product will be attacked in 1-3 years. By smartly prioritizing security assurance work, it can then be shrunk in volume (compared to "blind" planning) by a factor of 2-10.
Finally, if you are an online service, you can learn security attacking patterns from your logs! That way, attackers actually work for you. Any new creative idea they come up with becomes part of your tools set. For free. Isn't it cool? Of course, this is easier to say than to do. Putting together a proper monitoring & detection solution is a daunting task. But the benefits are so great that I would definitely strive for that.
Mind the privacy though – you don't want any sensitive or private users' data to be exposed internally or externally in the process.
By the way, not only services can benefit from that approach. Traditional "box" products can collect crash reports and mine attacks in them as well, thus improving their security posture.
Heuristic 7: "Scenario Fuzzing"
A JPG file may miss up to 90% of information contained in the original bitmap image. Yet for a human viewer it represents all the essentials of the picture. Conceptually, this is achieved by abandoning a “dot” as a principal element of image construction and choosing a “wave” for that purpose.
Can we apply the same idea? Can a very large product be effectively represented with a number of elements substantially smaller than its feature count?
Personally, I'm quite positive that is possible, and possible in many ways. Here I will discuss one of them that is based on the concept of an end-user scenario.
What is a "scenario"? A scenario is a description of some meaningful user's interaction sequence. An example would be: "launch the browser, login into your email account, read new messages, log-off and close the browser." A product could be represented then as a set of scenarios it supports.
Why would that work?
First, scenario-based description is complete. The proof is obvious: if there is a functionality that is not activated in any legitimate user interaction, it is a dead code. Throw it away. No customer would notice that, ever.
Second, scenario representation is compact. Typically, scenarios drive product requirements, which drive the specifications, which produce the API, then the code, and thus the product itself. The single use case quoted above would already hit thousands of APIs and dozens of features in a contemporary OS. A few hundred scenarios often completely describe even a large product.
In fact, for a product where scenarios "choose" features to participate randomly and independently of each other, the number of scenarios needed to execute all features grows logarithmically with respect to feature count: S = O(Log(N)). While that is probably not exactly how real products are structured, it still demonstrates the power of scenario-based product description at least for some cases.
Effectively, scenarios play a role similar to that of a “coverage graph”, providing access to a vicinity of each and every possible functionality combination in a product.
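The logarithmic claim is easy to probe with a toy model (assuming, as above, that each scenario activates each of the N features independently -- here with probability 1/2): count how many random scenarios are needed before every feature has been exercised at least once.

```python
import random

def scenarios_to_cover(n_features, rng, p=0.5):
    """Draw random scenarios (each activating every feature with
    probability p) until all features are covered; return the count."""
    uncovered = set(range(n_features))
    count = 0
    while uncovered:
        count += 1
        # A feature stays uncovered only if this scenario skips it
        uncovered = {f for f in uncovered if rng.random() >= p}
    return count

rng = random.Random(123)
small = scenarios_to_cover(64, rng)     # roughly log2(64) plus a few
large = scenarios_to_cover(4096, rng)   # roughly log2(4096) plus a few
# 64x more features, yet only a handful of extra scenarios are needed:
assert large - small < 15
assert large < 40
```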
How do we use that to discover security weaknesses?
If you look carefully at security breaches you'll see that most have valid user scenarios closely associated with them. When an attacker breaks into a system, he/she mostly interacts with it in ways that are fairly close to what a "legitimate" user would do. Those are only 1-3 unexpected "twists" that throw the system into a state of abnormal that may further lead to exploitability.
In other words, an attack is often a scenario that differs from legitimate use by a few changes. If so, can it be derived from a legit scenario by some morphing, using techniques such as random tweaks, grammar-based generators, genetic algorithms, or N-wise combination building?
After all, if we do fuzz binary inputs, why not fuzz scenarios?
Many security reviewers do exactly that. They take a valid workflow and morph it. “Next step: FTP log-in. Well, what if I send my log-in packet twice after seeing a 100 response but right before a 200 is received? Or what if I start sending log-in packet, but will never finish it?”
Such variations may trigger completely new sequences of calls within the system and cause interactions between components that have never been in touch before… and sometimes result in an unexpected behavior.
Personally, that’s why I tend to start security reviews with a simple question of “please give me the user’s perspective”. That’s my minimal "fuzzing seed” – a valid use case that I can tweak to produce new variations.
Scenario fuzzing is intuitively easy. It scales well with respect to the scenarios count involved. It naturally considers multi-component interactions and can, in theory, discover completely unknown classes of security bugs.
But of course it has its limitations as well.
First, today it is primarily a human-driven process. I’d be very happy to see a tool that can do that job but I think our scientific understanding of natural languages is just not there yet. This tool would need to be able to take a valid English description of a scenario, tweak it, and produce a different (yet still meaningful) description as an output.
Of course, this does not have to be done in English. There are artificial formal languages for scenario description in software industry. They may offer a better starting point for this approach.
Second, it takes years of security training and a good bit of human creativity to come up with good scenario tweaks.
Third (and most important), its practical application is limited by the knowledge limits of your data sources.
It is easy to ask crazy questions like “what if I do A before doing B, without doing C, while continuing to do D at the same time?” However, in many cases neither the specification, nor the product team would be confident in the answer! In fact, hitting a blank look and “we don’t know” is an indicator that the review is within the space that nobody has deeply thought through before.
But does asking alone help? If most of these questions eventually lead to boring answers, people will quickly learn to ignore your requests, because obtaining the answers might be difficult for them. It can take hours of examining the source code, getting a response from SMEs, or painful debugging. Plus it takes great tenacity to make sure that email threads do not die, and that people keep working on your questions and deliver consistent answers.
So it requires building great trust with people. While operating in this mode, you need to make sure you are asking "good" questions frequently enough so that people would not learn to dismiss them as "mostly useless" based on past observations.
What is the complexity of this algorithm? It's hard to say. It really depends on the specific mechanism chosen to generate new scenarios. And it does not even have a defined stopping point, so formally speaking this is not even an "algorithm".
Just for the sake of illustration, we can assess the complexity for one special case when:
For that case, the complexity of the review could be shown to be approximately O(t*Log(N)) – in other words, it's less than linear! How's that possible? The short answer is because in scenario fuzzing, one question executes many features, and the reviewer does not need to explicitly know them all. So he/she doesn't have to spend O(N) time enumerating them – that has already been done by the engineering team when they designed and built the product.
Item 8: What about Secure Design and Layers of Abstraction?
I'll say it upfront: neither of them is a review heuristic. They are design techniques. But as it's simply impossible to avoid this topic while discussing software security, I'll share my view on it.
So, in a system of N elements where each can (generally) interact with every other one, there are potentially up to 2^{N} interaction combinations. That makes security review exponentially complex and generally unmanageable in practice.
But what if we avoid that "any-to-any" interaction pattern and bring some structure into the system?
One way of doing that could be a logical arrangement of product elements into a two-story hierarchy where:
Here is how it may look like:
What would be the total number of checks needed to ensure the secure design of such a system? Obviously, it will require 2^{k} checks to cover the upper level, plus k times 2^{N/k} for the lower:
T = 2^{k} + k*2^{N/k} [3]
What choice of k minimizes that number? Setting ∂T/∂k = 0 and solving the resulting equation answers that. While the precise solution cannot be expressed in elementary functions, a good approximation for N >> 1 is easy to obtain:
k ≈ N^{1/2} + ¼ Ln(N) [4]
With the corresponding number of needed security checks being then:
While that is still a huge number, it is tremendously smaller than 2^{N} checks needed for a "disorganized" system where just everything interacts with everything. So arranging a system as a 2-layered hierarchy brings in a great reduction in complexity of the security assurance.
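Approximation [4] is easy to sanity-check by brute-forcing the integer k that minimizes [3]. A quick sketch (N = 400 is an arbitrary illustration value, small enough to avoid float overflow in 2^{N/k}):

```python
import math

def total_checks(N: int, k: int) -> float:
    """Formula [3]: 2^k checks for the upper level plus k groups of 2^(N/k)."""
    return 2.0 ** k + k * 2.0 ** (N / k)

def best_k(N: int) -> int:
    """Brute-force the integer k that minimizes [3]."""
    return min(range(1, N + 1), key=lambda k: total_checks(N, k))

def approx_k(N: float) -> float:
    """Approximation [4]: k ≈ sqrt(N) + (1/4)*ln(N)."""
    return math.sqrt(N) + 0.25 * math.log(N)

# For N = 400 the exact optimum and approximation [4] land within
# about one unit of each other.
exact, approx = best_k(400), approx_k(400)
```

The brute-force search is O(N) evaluations, which is fine for a sanity check; for real N one would use the closed-form approximation directly.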
Can we capitalize on that and introduce more layers of abstraction, somewhat similarly to the OSI model?
The answer is positive. For a system with L such layers the total number of security checks needed (very approximately and assuming very large N) is even further reduced to:
Why not continue adding layers indefinitely? Because there is a design and implementation cost behind each next layer of abstraction. So there need to be just enough of them to keep the system manageable – but not more. How many, exactly? If your budget of security checks is T, and there are N >> 1 atomic elements expected in the system, the count is obtained by inverting [6], and it is, roughly:
L ≈ Ln(N)/Ln(Ln(T)) [7]
For a product 10 Gb in size with a budget of 1 million security checks, that evaluates to L ≈ 10. Not surprisingly, it is quite close to the number of abstraction layers in the OSI model, because the dependency on both input parameters is rather weak.
In fact, a logarithm of a logarithm of pretty much anything in our Universe could be considered a constant for most practical uses :))
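As a quick sanity check of [7], here is the arithmetic for that example, assuming (purely for illustration) that a 10 Gb product counts as roughly 8*10^{10} atomic bits:

```python
import math

N = 8e10   # ~10 Gb expressed as atomic bits (an assumption for illustration)
T = 1e6    # budget of security checks

# Formula [7]: L ≈ Ln(N) / Ln(Ln(T)); evaluates to roughly 9.6, i.e. L ≈ 10
L = math.log(N) / math.log(math.log(T))
```

Note how insensitive the result is: multiplying N by a thousand only raises L by about 2.6, which is the "logarithm of a logarithm" effect in action.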
Now that we are done with that, let's make a few sobering notes.
First, as a security reviewer, you rarely face a large product that you have a chance to design "properly" from scratch. More often you face systems already mostly designed, with a long history of past versions, people joining and leaving, patches, changes, and even acquisitions. Your responsibility is similar to that of a doctor, who needs to diagnose a patient and give actionable advice while fully accepting their background, past life events and known health issues.
Second, even if you get a chance to design something from the beginning, I seriously doubt that such a thing as "Secure Design" exists in practice. Sure, it would be nice to live in a world where no unexpected higher order interactions between components are possible, and all lower-order ones are documented. But I doubt that's possible. In my opinion, there is no such thing as "completely non-interacting features". There are only features with very low interaction probability. So every time you think you've eliminated a class of unwanted interference, a completely new one surfaces right behind it…
Functions A and B may never call each other. But if each one allocates one byte of memory dynamically, A may (potentially) eat the last memory byte precisely before B would need it. So suddenly B's call becomes dependent on A's execution state. Is that an effective control mechanism? Not really. But is that a dependency? Yes. And it may need to be factored into the analysis if the cost of a potential failure is represented by a nine digit figure.
Certainly, not keeping any dynamically allocated state in those functions cures this problem. But even if software is (somehow?) completely removed from the risk picture, what about hardware failures? Believe it or not, some researchers have learned how to exploit the results of memory corruptions caused by (among other things) random protons from interstellar space hitting our memory chips -- see http://dinaburg.org/bitsquatting.html for details.
My personal take on that is that there is always some probability of an unexpected interaction. Eliminating one class of it brings up the next one, far less probable but way more elusive and difficult to eliminate. So some surprises will always keep hiding somewhere within the realms of higher complexity interferences.
That being said, a "Secure Design" as something that tries to minimize side interactions is valuable. While it may not be perfectly achievable, it may still bring the security of a product a couple of levels up in terms of the complexity analysis needed to successfully attack it.
Introduction (or Part I) is here.
<Disclaimer>By no means is this list "complete". I think every security person on the planet could add a couple of extra good tricks learned through their experience.</Disclaimer>
Heuristic 1: "Audit"
I already considered this method in the previous chapter, so I'll be brief here.
To run a security "audit", you logically split the system into a set of elements and ask "security" questions about each one. Something like: “Do you use MD5?”, “Do you transfer user passwords over HTTP in clear?”, “Do you have Admin/root accounts with default password?”, "Do you parse JPGs with a home-grown untested parser?" and so on. Answering "Yes" to a question typically indicates an actionable security risk.
This approach may look primitive, but it does have its place. It usually prevents at least the most outrageous security holes. It provides *some* basic assurance when nothing else is available under the time crunch. The complexity of this algorithm is linear O(N) so it scales well. And when you are done with it, you have a bonus list of high risk targets for later detailed inspection.
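A minimal sketch of this heuristic, just to make the O(N) cost structure explicit (the question list, field names, and component data are hypothetical):

```python
# Hypothetical audit questions; a "yes" answer flags an actionable risk.
AUDIT_QUESTIONS = [
    ("uses_md5", "Do you use MD5?"),
    ("cleartext_passwords", "Do you transfer user passwords over HTTP in clear?"),
    ("default_admin_password", "Do you have Admin/root accounts with default password?"),
]

def audit(components: list[dict]) -> list[tuple[str, str]]:
    """Linear-time audit: a fixed question list asked of every component.
    The question list does not grow with the product, so the total cost
    is O(N) in the number of components."""
    findings = []
    for comp in components:
        for key, question in AUDIT_QUESTIONS:
            if comp.get(key):
                findings.append((comp["name"], question))
    return findings

# The flagged components double as the high-risk list for later inspection.
risks = audit([
    {"name": "auth-service", "uses_md5": True},
    {"name": "frontend", "cleartext_passwords": False},
])
```

Note that each component is examined in isolation -- which is precisely the limitation discussed next.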
On the complexity diagram, audit occupies "the first floor":
The primary shortcoming of this method is that, without special attention, it does not consider complex interactions and so can miss contract violations like those I already presented:
Component A: “We don’t check any user’s input, we only archive and pass it through”
Component B: “We are the backend system and expect only good filtered inputs”
Despite that limitation, audit may still be helpful at least as a tool for the first quick assessment or when there is no time for anything else.
Some readers may point out that even audit may be tough to apply across really large products that can easily have 100,000 work items logged against them. Just reading the description of every item would consume weeks of time. It's impossible to imagine asking 20 security questions about each one on top of that.
There are approaches to deal with that problem, both technical and organizational. The former focus on automatic items triaging and may utilize machine learning for that. The latter try to distribute this work across feature owners in some reasonable manner. Unfortunately, as this is a large topic, I'll have to defer the discussion on it to (hopefully) some better times.
Heuristic 2: "Threat Modeling" (and other polynomial algorithms)
Let's study this method in detail to understand some of the inherent limitations behind it.
Threat Modeling (at least in its "classical" form) primarily considers pairwise interactions between product elements. It can detect contract violations like the one just mentioned. It is a well-recognized and fairly efficient technique known broadly across the industry, it is easy to learn and there are tools that support it (e.g., http://www.microsoft.com/security/sdl/adopt/threatmodeling.aspx).
Downsides? Its complexity is O(N^{2}). No, the problem is not the volume of needed computations per se. It is rather how humans tend to respond to it, causing what I call "Threat Modeling fragmentation".
If the cost to Threat Model one 10-element feature is 10*10 = 100 questions/checks, splitting it into two 5-element sets and running two independent Threat Modeling sessions reduces that cost to 5*5 + 5*5 = 50 checks. Unfortunately, the benefit comes purely from ignoring the connections between the separated subcomponents, and thus from missing any security holes associated with them. But under tight time pressure, it visibly saves time! So people across the organization may start slicing large features into smaller and smaller components, eventually producing a huge set of tiny isolated Threat Models, each concerned "with my feature only" and ignoring most inter-component connections.
That is the mechanism that makes Threat Modeling difficult across large products.
That issue would affect any security review algorithm with a complexity significantly greater than linear O(N). Why? Because for any function f(N) with a growth faster than O(N) it is true that:
f(N1) + f(N2) < f(N1 + N2) [1]
[This says, effectively, that two smaller reviews cost less than one review of the combined features.]
So unless a carefully planned consolidation effort is applied during the release cycle, all human-performed "complex" non-linear security algorithms would naturally tend to produce fragmented results, up to a point (in the worst case) of impossibility to merge them back into any coherent single picture.
To be free from that problem, a security review heuristic must run in time not much worse than O(N). Otherwise, review fragmentation will kick in, and it may consume significant resources to deal with it.
An algorithm that is naturally feature-inclusive and facilitates holistic security is one with the property opposite to that of [1]:
f(N1) + f(N2) >= f(N1 + N2) [2]
…which is satisfied if it runs in O(N) time, or faster.
But can it really run faster? The answer is "no", because simply enumerating all features takes O(N) effort, and skipping any of them would mean some features were missed entirely, with possible associated vulnerabilities!
Therefore, a security review algorithm that scales well across a large org on large systems must run in precisely O(N) time.
In practice, small deviations from that are permissible. E.g., if doubling the scope of the review increases the cost per feature only by a small fraction, that may still be acceptable. What constitutes a "small fraction"? A review process should scale well from an individual feature (N = 1) to a large product (N = 1 million), and the cost per feature should not change by more than an order of magnitude upon such an expansion. That dictates that a practically acceptable review algorithm should run in time no worse than something like O(N*Log(N)) or O(N^{1.16}) – for N = 10^{6}, the extra per-feature factor N^{0.16} ≈ 9, just within one order of magnitude.
Finally, it's worth mentioning that since Threat Modeling primarily focuses on pairwise interactions, it tends to pay less attention to vulnerabilities arising from interactions of higher orders (e.g., triplet interactions). As a result, some classes of security holes are not easily found with Threat Modeling. For example, people with field experience know that it is very difficult to represent (let alone detect) an XSS or a Clickjacking attack with a Threat Model. In practice, those attacks are usually found only when a reviewer is aware of their possibility and specifically hunts for them.
Heuristic 3: "Continued Refactoring"
So a product was represented as a set of logical elements, you ran your favorite security review algorithm over it, found some issues, and got them fixed.
A question is: if you re-represent the product as a different set of elements and run a review again, will you find new security bugs?
The answer is "almost certainly yes".
A semi-formal "proof" is easy. Let's say a product has a total of M truly "atomic", indivisible elements (such as bits of code). There are up to 2^{M} potential interactions between them. For review purposes, the product is represented as a set of N_{1} < M groups of those bits. Within each group, the review is perfectly efficient – it finds all weaknesses. Outside the group, connections are ignored.
Assuming all groups are equal in size (just to show the point), how many connections does this grouping analyze? Each group has M/N_{1} elements, so 2^{(M/N1)} possible interactions within it will be considered. Since there are N_{1} groups, the total coverage is N_{1}*2^{(M/N1)}, which is vastly smaller than 2^{M} for any 1 < N_{1} < M.
In other words, by choosing a split of a product we greatly shrink the volume of our review, but make it manageable. It also means that by abandoning one representation and switching to an essentially different one, a fresh look is almost guaranteed – and with it, new security bugs!
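The gap between full and grouped coverage is easy to check numerically. A toy sketch (M = 64 atomic elements and 8 groups are arbitrary illustration values):

```python
def coverage(M: int, n_groups: int) -> int:
    """Interactions examined when M atomic elements are reviewed as
    n_groups equal-size groups: n_groups * 2^(M / n_groups)."""
    assert M % n_groups == 0, "toy model assumes equal-size groups"
    return n_groups * 2 ** (M // n_groups)

M = 64                       # atomic elements (a toy size)
full = 2 ** M                # all potential interactions, ~1.8e19
grouped = coverage(M, 8)     # 8 groups of 8 elements: 8 * 2^8 = 2048

# 'grouped' covers only a 2^-53 sliver of 'full' -- which is exactly why
# switching to a genuinely different grouping almost guarantees fresh findings.
```

Each new independent grouping illuminates a different sliver, so re-representations compound rather than repeat.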
Of course, the catch is to choose "good" representations. Neither splitting a product into just two components, nor slicing it down to individual bits is really helpful:
Choosing the right representation still remains a bit of an art.
So keep trying. The more meaningful ways to represent a product as "a set-of-somethings" you can creatively come up with, the more successful your review will be. Some of the possible approaches are:
Generally, if you have k independent ways to re-represent your product, your security improves roughly k-fold. So it makes sense to approach security from multiple angles. You split by feature and do Threat Modeling. Then you split by source file and do code review or scan for unwanted APIs. Then you split by API entry points and do fuzzing. Then by binary, and scan for recognized insecure structures. And so on.
That's one of the reasons why the "many eyes" argument works. And that is, essentially, how SDL-like heuristics arise. By the way, they still run in linear time – in O(k*N), precisely.
Heuristic 4: "Pattern Elimination"
Some security bugs are not completely random. Studying their past history may lead to recognition of a pattern. When a pattern is known, its detection could be automated. And then a whole class of security issues could be eliminated at one push of a button. At O(1) human time cost.
But patterns do not necessarily follow simple lines on the product complexity diagram – they tend to have quite complex shapes and may be difficult to spot:
So pattern elimination is a two-step approach. First, it takes a good human guess. Then cheap computational power replicates the results of that guess across the whole product, detecting and/or eliminating security issues.
The effectiveness of this approach is limited by four factors: