You can use backups for things other than restoring


A customer wanted to know the internal file format of Visual SourceSafe databases. (That wasn't the actual question, but I've translated it into something equivalent that requires less explanation.) They explained why they wanted this information:

We are doing some code engineering analysis on our project, so we need to extract data about every single commit to the project since its creation. Things like who did the commit, the number of lines of code changed, the time of day... We can then crank on all this data to determine things like "What time of day are most bugs introduced?" and possibly even try to identify bug farms. Since our project is quite large, we found that generating all these queries against the database creates a high load on the server. To reduce the load on the server, we'd like to just access the database files directly, but in order to do that, we need to know the file format.

Oh great, directly accessing a program's internal databases while they're live. What could possibly go wrong?

I proposed an alternative:

Take a recent backup of your project and mount it on a temporary server as read-only. Run your data collection scripts against the temporary server. This will spike the load on the temporary server, but who cares? You're the only person using the temporary server; the main server is unaffected. After you collect all your data from the temporary server, you can then perform a much smaller number of queries against the live server to get data on the commits that took place since the last backup.
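
For concreteness, here is a minimal sketch of that split, assuming a hypothetical get_history() helper that wraps whatever source-control client you use. None of the names below (get_history, the server paths, the project name, the backup timestamp) are real VSS APIs or real values; they only illustrate the heavy pass against the restored backup and the light delta pass against the live server.

    from collections import Counter, namedtuple
    from datetime import datetime

    Commit = namedtuple("Commit", "author timestamp lines_changed")

    def get_history(server, project, since=None):
        """Hypothetical stub: return the commits for `project` on `server`,
        optionally limited to those made after `since`. In practice this
        would call into your source-control client."""
        return []  # stub so the sketch runs end to end

    BACKUP_SERVER = r"\\temp-restore"          # read-only restore of the backup
    LIVE_SERVER = r"\\vss-prod"                # production server
    BACKUP_TIME = datetime(2011, 5, 23, 2, 0)  # when the backup was taken

    # Heavy pass: every commit since the project was created,
    # answered entirely by the throwaway restore server.
    commits = list(get_history(BACKUP_SERVER, "$/BigProject"))

    # Light pass: only the commits made after the backup; this is the
    # only load the live server ever sees.
    commits += get_history(LIVE_SERVER, "$/BigProject", since=BACKUP_TIME)

    # The kind of analysis the customer described: which hour of the day
    # sees the most check-ins.
    checkins_by_hour = Counter(c.timestamp.hour for c in commits)
    print(checkins_by_hour.most_common(3))

The point of the split is that the expensive full-history pass never touches the production server; only the small since-the-backup delta does.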

Comments (23)
  1. Bruce says:

    Your example doesn't prove the headline, as placing a backup online (regardless of what server it's on) is still a restore operation.

  2. No One says:

    @Bruce: s/restoring/recovering lost data/ would be more specific to the usage Raymond meant but would be a far less compelling title.

    [Hooray, somebody who understands the importance of a snappy title. -Raymond]
  3. barbie says:

    Of course, the tricky bit is, did they actually have functional backups :D ?

  4. DysgraphicProgrammer says:

I guess the bonus is that this tests the backups before they are needed for an emergency.

  5. Bob says:

    What time of day are most bugs introduced? I don't understand why this question would be meaningful.

  6. @Bob says:

    > What time of day are most bugs introduced? I don't understand why this question would be meaningful.

I agree. Especially because it would not say "when are most bugs introduced" but "when are most bugs checked in", which is totally uncorrelated with when the bug was actually written.

  7. DWalker says:

    @Bob:  Management could mandate that all programmers stop working for the half-hour period around which most bugs are introduced.  Or make them go to lunch at that time.  :-)  Of course, after that change, there would be a new "time of day when most bugs are introduced".

  8. Adam Rosenfield says:

    I predict that the time of day when most bugs are introduced is the same time of day that most checkins occur.

  9. JohnL says:

@Adam:  Right, because until the code is checked in, it's not really "introduced".  I can do what I like as long as what I check in is correct.

    @Bob:  The time of day might be relevant – sometimes you don't know until you run the report.  Ever heard of the phrase "5 o'clock checkin"?

    [You do realize that I totally made up that example, right? -Raymond]
  10. Joshua says:

Which still means the question about the internal file format of Visual SourceSafe needs to be answered.

    Good idea about restoring to a second instance to take load off primary. It would be doubly true if the primary really were VSS, but that's beside the point right now.

  11. Leo Davidson says:

    << Of course, after that change, there would be a new "time of day when most bugs are introduced". >>

    That is fine. Once we eliminate each time period, one by one, we will achieve our goal of having no more bugs introduced to the project. Mission Accomplished!

  12. Not-Poe says:

Unfortunately, any example of management asking for crazy metrics falls under Poe's Law, even if you call out the fact that it's not real.

  13. JohnL says:

    > You do realize that I totally made up that example, right? -Raymond

It occurred to me, but you'd be surprised at the number of issues that are the result of a rushed checkin at 5PM on a Friday, or just before a build checkpoint deadline.

  14. > You do realize that I totally made up that example, right? -Raymond

    You do realize how good you are at making up examples, right?

    Actually, it's an interesting idea, and less off-topic than my asking if you'd ever buy a yarn called "Snuggly Wuggly ™".

  15. alegr1 says:

    Assuming that the scripts would just pull all/some diffs of all files of a particular project, I don't see how this would be more load on the server than a project backup.

  16. alegr1 says:

    Adding to that:

The SourceSafe "database" is just a bunch of badly named files in a badly named directory tree, and the SS "server" is just a file server. If it's a plain Windows Server, access to the files would be locally cached, and it's fast enough.

But if the customer had the misfortune of using a different server, the files might not be locally cached, and then the client becomes incredibly slow. I've been there: our SS database was stored on a non-Windows brand-name SMB filer, and it was the world's slowest SS server. A wire capture showed that the filer didn't support opportunistic locking, and that caused all file accesses to be non-cached, even though the network was 1 Gbps.

  17. Tony Mach says:

Might be OT, but hey, I need to get it off my chest:

You know what Obi-Wan said about Mos Eisley? Those were exactly my feelings when working with Visual SourceSafe. I love the Visual C++ compilers. I worked with 1.52, 6.0, 2003 and all the later ones. These are nice products. But I hate the guts of VSS. Am I glad I left it behind years ago.

    [Sigh. I should have made up a different example. -Raymond]
  18. Tony Mach says:

    Nice solution to the problem, though.

  19. Mark says:

    640k:

    (loads of crap about VSS)

    Read the article: "A customer wanted to know the internal file format of Visual SourceSafe databases. (That wasn't the actual question…)"

  20. Joshua says:

    "Like asking for sqlserver's internal file structure to be able to read data faster." Which might not be such a bad request under certain circumstances. E.G. A single user reader on its own copy is often faster than the database engine by an awful lot. Raymond here is emphasizing "own copy".

  21. 640k says:

If it was a programmer who suggested this approach, I suspect *HE* is the one committing the most bugs.

    1: The files are probably locked by VSS.

    2: If the files are not currently locked by VSS, VSS might need to lock them in order to write to them. Or else commits might FAIL.

3: Yes, you could use shadow copy (as backup software does), but who does that when the backup software has already served you the files?

This is one of the most stupid requests I've ever heard. Like asking for sqlserver's internal file structure to be able to read data faster.

If you want to pull this kind of trick, the solution is to migrate to a better SCM which saves its content in a more documented format. Like TFS.

  22. Ben Voigt says:

    How about NOT restoring the backup to a second instance?

    Grabbing the data from the innards of the backup file would seem to meet all requirements — it doesn't interfere with the live server, and it doesn't require all that performance-killing interprocess communication and locking.

    And, the backup format is likely to be much more stable than the live database format, since one expects to be able to use backup/restore when upgrading.

    [On the other hand, it means that you're going to spend several weeks writing the code to parse the backup file format. Code that you will run exactly once. -Raymond]
  23. ErikF says:

    The secondary benefit of Raymond's method is certainly not to be underestimated. We've had clients where I work that had tons of backups, but not one of them was usable! You obviously don't have to test every backup, but it's probably a really good idea to at least test your big full annual ones.

Comments are closed.
