Once again, I'm proven right about false positive lag time
I hate to brag (no, wait, I love to brag), but once again I have been proven right.
One of the problems with getting accurate statistics about false positives is that users quite regularly submit them late. So, assume that for the week of Dec 3 - Dec 10 we report 100 false positives. One week later, we report that the same week of Dec 3 - Dec 10 had 188 false positives. This is a net change of 88 FPs! What happened? Users kept submitting FPs from that week after we had already published the first count.
For the longest time, I intuitively knew this. When I was processing FPs, I kept seeing submissions for problems that I knew I had already fixed. I began to get a good feel for how late people submit them and found that once we reach the 3-week mark, there is little chance that an FP that occurred 3 weeks ago will still be submitted. Said another way, there is little chance that an FP that occurred on Nov 27 will be submitted to us on Dec 18.
I also found that while after 1 week we could get a good feel for how many FPs would be submitted, it was not enough time for them all to come in. After 2 weeks we could get a pretty good representation of what the numbers would eventually look like. For example, suppose it's now March 10, 2008. The final FP numbers for Dec 3 - Dec 10 are 197 FPs. Well, back on Dec 17, the numbers for Dec 3 - Dec 10 would have said 180 FPs. That's pretty close to the final number.
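To put a number on "pretty close", here is a quick sketch using the example figures above. The function names are my own invention for illustration, the count of 150 is a made-up value, and scaling an early count up like this assumes the late-submission pattern stays about the same from week to week.

```python
def completeness(early_count, final_count):
    """Fraction of the final FP total that was already visible early on."""
    return early_count / final_count

def project_final(early_count, expected_completeness):
    """Scale an early count up to a rough estimate of the final total."""
    return round(early_count / expected_completeness)

# Example figures from the post: 180 FPs counted on Dec 17, 197 in the end.
ratio = completeness(180, 197)
print(f"{ratio:.1%}")  # 91.4%

# Hypothetical: some other week shows 150 FPs at the same point in time.
print(project_final(150, ratio))  # 164
```

So in that example roughly 9 out of 10 FPs were already in hand by Dec 17.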
As I was saying, these lag times were estimates based on my experience and intuition. A couple of weeks ago, I finally got around to actually writing some scripts and tracking the data in databases. Here are the numbers with regard to false positives:
- 11% are submitted on the same day they occur.
- 20% are submitted the day after they occur.
- 50% are submitted in less than 3 days after they occur.
- The remaining 50% are submitted up to 12 weeks afterwards.
- 99% are submitted within 3 weeks after they occur.
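For anyone who wants to run the same kind of tally, here is a minimal sketch of how numbers like these could be computed. It is not the script I actually used; the function names and the sample records are made up for illustration, and it assumes you have, for each FP, the date it occurred and the date it was submitted.

```python
from datetime import date

def lag_days(occurred, submitted):
    """Whole days between when an FP occurred and when it was submitted."""
    return (submitted - occurred).days

def fraction_within(lags, max_days):
    """Fraction of FPs submitted at most max_days after they occurred."""
    if not lags:
        return 0.0
    return sum(1 for lag in lags if lag <= max_days) / len(lags)

# Made-up sample records: (date the FP occurred, date it was submitted).
records = [
    (date(2007, 12, 3), date(2007, 12, 3)),
    (date(2007, 12, 3), date(2007, 12, 4)),
    (date(2007, 12, 4), date(2007, 12, 6)),
    (date(2007, 12, 5), date(2007, 12, 19)),
]

lags = [lag_days(occurred, submitted) for occurred, submitted in records]

for label, days in [("same day", 0), ("by the next day", 1), ("within 3 weeks", 21)]:
    print(f"{label}: {fraction_within(lags, days):.0%}")
```

Note that this sketch reports cumulative fractions (everything submitted up to and including the cutoff), which is the easiest form to compare against a 3-week reporting window.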
So, the numbers back up my experience and intuition. This illustrates a point I have been talking about for months: my intuition for dealing with spam is often correct and is rarely contradicted by actual data.