Little learning exercise


i had an idea for a little bit of code i wanted to write that would give me an opportunity to learn about a lot of bits of technology that i am currently pretty ignorant about.  i wanted to create a little service that would help keep my system defragged, but would have better characteristics than my current solutions.  It makes sense to have a service like that, as it parallels some of the work that Apple has done to keep their system defragged (when it comes to small files at least).


First, i want this to be a service because i have no idea how that system works.  And i want to write a defragmenter that works even if i’m not logged on, while still being a well-behaved managed service.  This will allow my system to defragment overnight when it’s not doing anything (if i’ve left it on), and i won’t have to deal with pesky problems if it’s an application and i’m logged on to multiple sessions on my Win2k3 box.


Second, i want to learn a bit more about how file management is done on windows.  Should i just register listeners for file changes on all the volumes?  Or should i watch the system volume journal?  How do i actually use the defrag APIs i’ve found?  If i mess them up am i going to totally roach my HD and files (*eek*)?  How do i actually determine how many fragments are in a file?  My idea was to watch for when files are written to and, once the system is idle, check if they’re fragmented.  If so, defrag them and then wait until the system is idle again.  Is that a good way to do this?
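A rough sketch of the loop i have in mind, with stand-in callbacks for the notification, fragmentation check, and defrag steps (i haven’t picked the actual APIs yet, so none of this is the real Windows plumbing — only the control flow is):

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <set>
#include <string>

class DefragQueue
{
public:
    // Called from the change-notification side; a std::set collapses
    // duplicate notifications for the same path.
    void OnFileChanged(const std::string& path) { dirty_.insert(path); }

    // Called periodically.  Processes queued files only while the system
    // stays idle; returns how many files were examined this round.
    std::size_t RunIfIdle(const std::function<bool()>& isIdle,
                          const std::function<bool(const std::string&)>& isFragmented,
                          const std::function<void(const std::string&)>& defragOne)
    {
        std::size_t examined = 0;
        while(!dirty_.empty() && isIdle())   // re-check idleness before each file
        {
            std::string path = *dirty_.begin();
            dirty_.erase(dirty_.begin());
            if(isFragmented(path)) { defragOne(path); }
            ++examined;
        }
        return examined;
    }

private:
    std::set<std::string> dirty_;
};
```

The point of re-checking idleness before every file is that a burst of user activity stops the background work quickly instead of after a whole batch.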


Third, i want to know how to write an app that does a lot of disk IO but doesn’t affect the user experience.  Tools like MSN Desktop Search and Google Desktop Search do a great job of this.  They seem to watch the CPU and disk activity and don’t do anything if the system is in use.  When i run the regular system defragmenter everything crawls to a halt.  i want a system that doesn’t have that kind of negative user impact.  There are lots of ways to query for this info (like performance counters and WMI); i’m just not sure which is best, or how it’s exposed to managed code.
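For the idle check itself, my current thinking is just a rolling average over periodic CPU samples.  This sketch fakes the sampler — the real numbers would come from something like the "% Processor Time" performance counter — so only the averaging logic here is real:

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Decides "has the machine been idle for a while?" from a stream of CPU
// usage samples (percentages) fed in by the caller.  Where the samples
// come from (PDH, WMI, managed perf counters) is a separate question.
class IdleDetector
{
public:
    IdleDetector(std::size_t windowSize, double threshold)
        : windowSize_(windowSize), threshold_(threshold) {}

    void AddSample(double cpuPercent)
    {
        samples_.push_back(cpuPercent);
        if(samples_.size() > windowSize_) { samples_.pop_front(); }
    }

    // Idle == we have a full window of samples and their average is
    // below the threshold.  A single busy sample drags the average up,
    // so background work pauses soon after the user comes back.
    bool IsIdle() const
    {
        if(samples_.size() < windowSize_) { return false; }
        double sum = 0;
        for(double s : samples_) { sum += s; }
        return (sum / samples_.size()) < threshold_;
    }

private:
    std::size_t windowSize_;
    double threshold_;
    std::deque<double> samples_;
};
```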


Fourth, i want to figure out a good system for managing this service.  Maybe some sort of MMC plugin that would allow you to set tuning parameters.  A way to say “only start when the system hasn’t been used for ‘x’ minutes; only defragment files with more than ‘n’ fragments in them; etc.”  It would also be nice to be able to check in on it and see how much work it’s done, how much work it thinks it has left to do, and maybe basic stats along the way.


Fifth, i want to develop this with the community, or make my code accessible to the community.  Not sure what’s the best way to do this, but i think that it would be fun to try.


i’d like to do most of this in C#, but i think that my job will be easier if i write a small interop layer in C++/CLI.  On the C++ side i can easily call into the native APIs like DeviceIoControl and deal with structures like FILE_SYSTEM_INFORMATION.  Then i can wrap those in nice managed objects that can be exposed to the C# side where the rest of the logic can sit.


Note: the defragmentation portion of this is only a minor part of the actual project.  The main purpose is to learn about the APIs and methods for pulling off tasks like this in the future.  Maybe in the future i’ll use these skills to write a “C# Compiler Service” which sits there in the background and will keep all your C# sources compiled and ready to be spit out as IL once you tell it to :)


i think it’s going to be a lot of fun.  i just hope i don’t destroy my HD and data in the process!


Edit: Based on some great feedback, i’ve realized that if i do a lot of development under VirtualPC then i can screw things up without hosing my system.  Of course, at some point i’ll have to move it to a production machine, but this gives me a lot of leeway and assurance that i’ll be ok, by allowing me to test on such a machine and leave it running for a long time without it negatively impacting me otherwise.


Comments (21)

  1. Thomas Scheidegger says:

    if you use the official APIs

    http://msdn.microsoft.com/library/en-us/fileio/base/defragmenting_files.asp

    you can’t destroy your disk.

  2. virtualblackfox says:

    For the community part, a GotDotNet workspace or a SourceForge.net project could be cool, especially if there are blog posts associated with each problem / success that arises… It could be a good way to share knowledge.

  3. DrPizza says:

    "i had an idea for a little bit of code i wanted to write that would give me an opportunity to learn about a lot of bits of technology that i am currently pretty ignorant about. i wanted to create a little service that would help keep my system defragged, but would have better characteristics than my current solutions. It makes sense to have a service like that, as it parallels some of the work that Apple has done to keep their system defragged (when it comes to small files at least)."

    Mimicking their system (defrag-on-open) will be quite hard to do; I believe the only way to reliably hook file open events is to write a filesystem filter driver, rather than a service (well, it could feasibly defer to the service to perform the actual defragmentation, but I think that’d actually just be an added complication). This would be the best solution for scoring points in the Battlefront; it wouldn’t be much good for your goal of writing a service, though, and as you regard that as the primary purpose of the exercise it’d be rather self-defeating.

    "Second, i want to learn a bit more about how file management is done on windows. Should i just register listeners for file changes on all the volumes? Or should i watch the system volume journal?"

    Depends. If you want to do it exactly like the Mac does then, as mentioned, you’ll need a filter driver. If you’re happy to just watch the FS, you have essentially three routes to take:

    FindXxxChangeNotification

    ReadDirectoryChangesW

    USN Journal

    The first two are broadly equivalent; the notifications that they return are slightly different, but they’re more or less the same. ReadDirectoryChangesW is more flexible, FindXxxChangeNotification is a little simpler. These functions can both watch entire volumes (by creating a subtree watch of the root directory), though I don’t believe either of them will see "through" mount points, which may be something you want to bear in mind.

    The USN Journal is clumsier to use and can lose entries if it overflows. However, it has the benefit of recording changes even when your service isn’t running. It might be a good approach nonetheless, as your service need only bother to read the journal when it detected that the processor load/disk usage was below some threshold (so that it doesn’t interfere with your use of the machine). With the live notifications you’d have to manually maintain a list of changes yourself; the USN journal of course does that for you.

    "How do i actually use the defrag APIs i’ve found? If i mess them up am i going to totally roach my HD and files (*eek*)? How do i actually determine how many fragments are in a file?"

    RETRIEVAL_POINTERS_BUFFER.ExtentCount.

    Specifically:

    HANDLE file(::CreateFile("path\\to\\file", …));
    ::STARTING_VCN_INPUT_BUFFER svib = {0};
    size_t retrievalBufferSize(1024);
    boost::scoped_array<char> retrievalBuffer(new char[retrievalBufferSize]);
    ::RETRIEVAL_POINTERS_BUFFER* rpb(reinterpret_cast<::RETRIEVAL_POINTERS_BUFFER*>(retrievalBuffer.get()));
    DWORD bytesReturned = 0;
    ::SetLastError(ERROR_SUCCESS); // so a success on the first call isn’t misread as a stale error
    while(FALSE == ::DeviceIoControl(file, FSCTL_GET_RETRIEVAL_POINTERS, &svib, sizeof(svib), rpb, retrievalBufferSize, &bytesReturned, NULL)
          && ERROR_MORE_DATA == ::GetLastError())
    {
        // buffer was too small: double it and ask again
        retrievalBuffer.reset(new char[retrievalBufferSize *= 2]);
        rpb = reinterpret_cast<::RETRIEVAL_POINTERS_BUFFER*>(retrievalBuffer.get());
        ::SetLastError(0);
    }
    if(ERROR_SUCCESS != ::GetLastError()) { throw ::GetLastError(); }
    std::cout << "The file is in " << rpb->ExtentCount << " extents." << std::endl;

    "My idea was to watch for when files are written to and, once the system is idle, check if they’re fragmented. If so, defrag them and then wait until the system is idle again. Is that a good way to do this?"

    They’re as safe as any other FS activity that updates the disk. You can’t use the APIs to clobber one file with another or anything like that. They do update the metadata and rewrite the file, so can potentially cause data loss in power failure situations, but this is no worse than normal file activity. Once you’ve grasped how they work they’re not hugely complicated. The difficult bit is figuring out where to place everything.

    "Third, i want to know how to write an app that does a lot of disk IO but doesn’t affect the user experience. Tools like MSN Desktop Search and Google Desktop Search do a great job of this. They seem to watch the CPU and disk activity and don’t do anything if the system is in use. When i run the regular system defragmenter everything crawls to a halt. i want a system that doesn’t have that kind of negative user impact. There are lots of ways to query for this info (like System Counters and WMI) i’m just not sure what the best is, or how it’s exposed to managed code."

    I think both WMI and Perfmon counters are exposed to managed code. WMI is probably cooler.

    "Fourth, i want to figure out a good system for managing this service. Maybe some sort of MMC plugin that would allow you to set tuning parameters. A way to say “only start when the system hasn’t been used for ‘x’ minutes. Only defragment files with more than ‘n’ fragments in them, etc.” It would also be nice to be able to check in on it and see how much work it’s done, how much work it thinks it has left to do, and maybe basic stats along the way. "

    MMCs are just COM components so there’s nothing really interesting there; the service knows how many files it’s defragged, how many it has left to defrag, and so on, so it just needs to show those in the MMC.

    "Fifth, i want to develop this with the community, or make my code accessible to the community. Not sure what’s the best way to do this, but i think that it would be fun to try."

    Sourceforge or one of its clones would be the obvious mechanism.

    "i’d like to do most of this in C#, but i think that my job will be easier if i write a small interop layer in C++/CLI. On the C++ side i can easily call into the native APIs like DeviceIoControl and deal with structures like FILE_SYSTEM_INFORMATION. Then i can wrap those in nice managed objects that can be exposed to the C# side where the rest of the logic can sit. "

    Writing the MMC snapin is probably rather more fun to do with C#, but bit-twiddling to work with the defrag structures is certainly gonna be better suited to C++.

    "Note: the defragmentation portion of this is only a minor part of the actual project. The main purpose is to learn about the APIs and methods for pulling off tasks like this in the future. Maybe in the future i’ll use these skills to write a “C# Compiler Service” which sits there in the background and will keep all your C# sources compiled and ready to be spit out as IL once you tell it to :) "

    The thing is, the defragmentation portion is the portion that’s hard to do well, and IMO the part that’s interesting. Writing a service is just a question of running the wizard to create a service, and writing an MMC snap-in is just a question of running the wizard to create a COM component (implementing at a minimum IComponent, IComponentData, IDataObject, and IClassFactory); one can just read stuff on codeproject et al. to find out about that.

  4. virtualBlackfox: Great suggestion! Sourceforge would probably be my preferred choice. I’ll have to check in on how source control is done there and how i can plug into VS.

  5. DrPizza: "Mimicking their system (defrag-on-open) will be quite hard to do; I believe the only way to reliably hook file open events is to write a filesystem filter driver, rather than a service (well, it could feasibly defer to the service to perform the actual defragmentation, but I think that’d actually just be an added complication). This would be the best solution for scoring points in the Battlefront; it wouldn’t be much good for your goal of writing a service, though, and as you regard that as the primary purpose of the exercise it’d be rather self-defeating. "

    I’m not trying to mimic the actual way they accomplish this. I’m more trying to mimic the way that they keep the system incrementally defragged rather than requiring a heavyweight sledgehammer approach to the problem.

    I don’t see much in the way of superiority in either approach. And i’m certainly willing to defer things to idle since my system wastes so much time in that state.

    Again, this is not particularly about the defrag solution, or learning how to do just FS operations. i want to learn more about the peripheral stuff going on here, and the defrag task is nice because it can cover all these areas while keeping the core task fairly simple.

  6. DrPizza:

    "If you’re happy to just watch the FS the choices you have essentially three routes to take:

    FindXxxChangeNotification

    ReadDirectoryChangesW

    USN Journal "

    Yup. That’s what i figured. However, i was unsure which would be the better path. Are there perf costs associated with either? What if the service isn’t running for some reason? Maybe i want a hybrid approach where it can scan the journal for changes that occurred while it was offline, while getting notified about the changes that happen while it’s online.
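The catch-up half of that hybrid seems mechanically simple: persist the last USN i processed and read forward from it on startup. A toy version, with the journal faked as an in-memory vector (the real calls would be FSCTL_QUERY_USN_JOURNAL / FSCTL_READ_USN_JOURNAL against the volume handle):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Minimal stand-in for a journal record: a monotonically increasing
// sequence number plus the file it refers to.
struct UsnRecord
{
    std::uint64_t usn;
    std::string   fileName;
};

// Returns the records newer than lastProcessedUsn and advances the
// cursor, which the service would persist (registry, file, …) so a
// restart picks up exactly where it left off.
std::vector<UsnRecord> CatchUp(const std::vector<UsnRecord>& journal,
                               std::uint64_t& lastProcessedUsn)
{
    std::vector<UsnRecord> fresh;
    for(const UsnRecord& r : journal)
    {
        if(r.usn > lastProcessedUsn)
        {
            fresh.push_back(r);
            lastProcessedUsn = r.usn;
        }
    }
    return fresh;
}
```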

  7. Thomas: Yup, i’ve been looking at all of those. however, i’m curious why you think that i couldn’t destroy my data in the process.

    For example, if i misuse those APIs, don’t understand how a certain parameter is supposed to work, etc. etc., it’s possible that i’ll leave things in an indeterminate state where they can’t be recovered. Unless there is simply an FS call:

    "DefragFile(path)" which ensures that everything will be fine there is still the chance for me to error and screw something up.

    If i was dealing with normal IO calls i could roach my files; what makes these special?

    Thanks!

  8. DrPizza: As to your code examples. Some of the APIs concern me a bit. For example, they’re pretty explicit about saying that information you retrieve from them might be stale and can be updated as you’re doing the defragment (like bitmaps of free space and whatnot). So i want to be quite careful when writing this code that i can handle those situations cleanly and safely.

  9. DrPizza: "…then wait until the system is idle again. Is that a good way to do this?"

    They’re as safe as any other FS activity that updates the disk. You can’t use the APIs to clobber one file with another or anything like that. They do update the metadata and rewrite the file, so can potentially cause data loss in power failure situations, but this is no worse than normal file activity. Once you’ve grasped how they work they’re not hugely complicated. The difficult bit is figuring out where to place everything.

    I suppose, although some other APIs present a much simpler API than these. Especially the means of getting buffers back with values that you have to do some funky arithmetic on to determine structure sizes and where the next structure lies.

    It’s not very pretty stuff and i’d like to wrap it with nice clean things like iterators so that from managed code it’s simple and you won’t have to do any sort of unsafe { } code.

    You might consider this "not hugely complicated", but i’ve found that for me, this sort of funky structure twiddling is a pain and i like to avoid it whenever possible.

    That said, it’s good to get my hands into it since it will get me more comfortable with it and maybe i can write some nice libraries that will:

    a) abstract this stuff away

    b) be extensible and generic enough to work on other problems like this when you want to take funky native structures and pass them on to managed code.
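For the curious, the kind of offset chasing i’m complaining about looks roughly like this. The Record layout below is a made-up stand-in in the style of FILE_NOTIFY_INFORMATION (a fixed header, a variable-length payload, and a byte offset to the next entry), not any real Windows structure:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical variable-length record: the offset of the next record
// follows the header convention used by several Windows buffer APIs
// (0 marks the last entry in the chain).
struct Record
{
    std::uint32_t nextEntryOffset;
    std::uint32_t nameLength;   // bytes of name following the header
    char          name[1];      // actually variable length
};

// Walk the chain, hopping forward by nextEntryOffset each time.
std::vector<std::string> WalkRecords(const char* buffer)
{
    std::vector<std::string> names;
    const char* p = buffer;
    for(;;)
    {
        const Record* r = reinterpret_cast<const Record*>(p);
        names.push_back(std::string(r->name, r->nameLength));
        if(r->nextEntryOffset == 0) { break; }
        p += r->nextEntryOffset;
    }
    return names;
}

// Builds a record in the same layout, padded so every record starts on
// a 4-byte boundary (as the real structures require).
void AppendRecord(std::vector<char>& buf, const std::string& name, bool last)
{
    const std::size_t headerSize = offsetof(Record, name);
    const std::size_t recordSize = (headerSize + name.size() + 3) & ~std::size_t(3);
    const std::uint32_t next = last ? 0 : static_cast<std::uint32_t>(recordSize);
    const std::uint32_t len = static_cast<std::uint32_t>(name.size());
    const std::size_t base = buf.size();
    buf.resize(base + recordSize, '\0');
    std::memcpy(&buf[base], &next, sizeof(next));
    std::memcpy(&buf[base + sizeof(next)], &len, sizeof(len));
    std::memcpy(&buf[base + headerSize], name.data(), name.size());
}
```

Hiding WalkRecords behind an iterator is exactly the kind of abstraction i’d want the library to provide, so the managed side never sees the raw offsets.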

  10. DrPizza: "I think both WMI and Perfmon counters are exposed to managed code. WMI is probably cooler. "

    Yup. AFAIK WMI is all COM, so it should be available. I’m just wondering if it’s as simple as registering a callback that says "when CPU activity has averaged less than 5% for 5 minutes, call me back" or if i have to do some sort of polling and keep track of running averages myself.

    It would be useful to know how much functionality the stock system gives you and what you need to build on top.

  11. DrPizza:

    "The thing is, the defragmentation portion is the portion that’s hard to do well, and IMO the part that’s interesting."

    And it’s quite interesting to me as well. But just because something is hard doesn’t make it less interesting to me. There are areas here that i’m pretty unfamiliar with, and i really do enjoy expanding my knowledge and becoming comfortable with them. When the breadth of my knowledge expands it ends up being useful in all sorts of ways, because suddenly i can say: "a-ha, i know a great way to do Foo, and it will only take a couple of days because i’m already familiar with the system", whereas before i would have to go with "well… maybe Foo will work, but you gotta give me a lot of time to investigate it and really determine if it’s right".

    Don’t get me wrong, i also think the defrag part will be great fun, and i really want to learn about how all this low-level stuff works since i normally sit on top of really high levels of abstraction. But it’s only part of the fun. I would be really glad to work on the defrag stuff with you, since that seems to interest you, and on the other portions with other members of the community who like and know about that stuff.

    "Writing a service is just a question of running the wizard to create a service, and writing an MMC snap-in is just a question of running the wizard to create a COM component (implementing at a minimum IComponent, IComponentData, IDataObject, and IClassFactory); one can just read stuff on codeproject et al. to find out about that. "

    Sure. But it’s still uncharted territory for me, and that excites me. Go figure.

  12. Steve Hall says:

    For the defrag portion of the monitor, you can use the following quick-and-dirty command-line utility (sources included!) to experiment with and learn the defrag APIs:

    http://www.sysinternals.com/ntw2k/info/defrag.shtml

    There was also an accompanying article that appeared in WinNT Magazine:

    http://www.windowsitpro.com/Articles/Index.cfm?IssueID=20&ArticleID=304

    I don’t know if the source code needs to be updated for NTFS 5.x (i.e., for W2K/XP), but I’ll leave that reconciliation up to you… (I haven’t tried it myself since NT4.)

    Personally, when working on device drivers, FSDs, or even "innocent" utilities like defraggers, I ALWAYS dual-boot into an alternate Windows image for testing (and keep my production system as well as the alternate system backed up on tape). It doesn’t matter how careful you think you are, or the wonderful assurances that others may give you; playing with file-system drivers/utilities will almost always result in a corrupt file-system at some inopportune moment… (So says 30 years of writing file-system drivers on a half-dozen OSs… Remember: Tape backups are your FRIEND!)

    Have fun!

  13. Thanks Steve!

    Great resources. I’ll have to check and see if the licenses will be compatible with any license i want to release this under before i go rooting around in the source.

    And i definitely think that doing this under VirtualPC is the way to go so that i can play around without risk.

    The only fear will be when i finally think it’s ok and run it on a production system 😀

  14. AT says:

    Defrag is not a trivial task.

    First – there are technologies in WinXP (I do not remember exactly, but I feel there was something like this even in WinME) designed to lay out files on disk for fast startup of applications by using prefetching.

    For example, take a read of this short detail:

    http://www.microsoft.com/whdc/driver/kernel/XP_kernel.mspx#ELAA

    Second – there is no way to know the "optimal" layout of data for performance.

    For example, if you leave no free space after the current file, but the file grows over time, this will render all your defrag efforts useless, as new extents will be created.

    Another example – if after reordering you place files used by a single application far away from each other, the HDD heads will have to do long seeks.

    Third – there are some non-trivial HDD configurations, created using software (as well as hardware) RAID, where the rule "placed at the beginning of the disk – accessed faster" will not be true at all.

    Fourth – those defrag APIs were created by Executive Software (or for them by Microsoft); a Lite version of Diskeeper is included in all Windows OSes.

    I know nothing about the terms of the agreement between Executive and Microsoft – but what you plan to do will directly compete with the current Diskeeper 9.0 offering ( http://www.executive.com/diskeeper/servers.asp )

    P.S. If the defrag function were critical (yep, I’m aware of possible BSODs with a fragmented MFT like this one http://support.microsoft.com/kb/320397/en-us ) for Windows operation, it would have been implemented long ago at the OS level (where the IO subsystem decides which free segments to use).

    So – if you defrag daily – you are probably the victim of some "memory optimizers"-like hoax.

  15. Thomas Scheidegger says:

    Cyrus, the FSCTL_MOVE_FILE API does reject invalid parameters. You simply can’t relocate a file to an invalid position (physically & FS structural). Sure, if you pass wrong data, the new cluster positions could be even more -inefficient- (more fragmented, placed far away) than the original ones. But the disk data integrity is not compromised.

  16. Thomas: Thanks for the info! I’ll send off a mail to the NTFS guys to get some idea about what they think about this :)

  17. DrPizza says:

    "I don’t see much in the way of superiority in either approach. And i’m certainly willing to defer things to idle since my system wastes so much time in that state. "

    Well, here’s the thing. The way Apple does it (and the way a file system filter could do it) is to trap files being *opened*. This necessarily traps *every single* file open (which is ideally what we want). Change notifications might not trap file open operations; specifically, if last-accessed time tracking is disabled, they won’t catch a read-only file open. This is undesirable, because we’d still like to defrag those files. Just, we never know about them.

    There’s also the second issue that change notifications don’t see through mountpoints. This is again annoying; we would want to trap every single file open whether it be to a volume mounted as a drive letter or a volume mounted as a directory. This isn’t insurmountable–we can register a notification for every mountpoint–but it’s not as convenient.

    "Again, this is not particularly about the defrag solution, or learning how to do just FS operations. i want to learn more about the peripheral stuff going on here, and the defrag task is nice because it can cover all these areas while keeping the core task fairly simple. "

    But it doesn’t keep the core task fairly simple!

    "Yup. That’s what i figured. However, i was unsure which would be the better path. Are there perf costs associated with either?"

    Yes, but not so severe as to cause a problem. The USN journal has the least performance overhead, because the journal is updated *anyway*, so you can defer reading it until later.

    "DrPizza: As to your code examples. Some of the APIs concern me a bit. For example, they’re pretty explicit about saying that information you retrieve from them might be stale and can be updated as you’re doing the defragment (like bitmaps of free space and whatnot). So i want to be quite careful when writing this code that i can handle those situations cleanly and safely. "

    There’s no problem. If you try to move the extents of a file to space that’s now no longer empty, the call simply fails with the error STATUS_ALREADY_COMMITTED.
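In other words, the worst case is a retry. A sketch of that shape, where tryMoveTo and pickFreeLcn are stand-ins for FSCTL_MOVE_FILE and re-querying the volume’s free-space bitmap:

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// If a move fails because our free-space snapshot went stale (the target
// clusters were grabbed by other FS activity in the meantime), re-query
// free space and try a new target.  The file is never harmed by a failed
// attempt; it simply stays where it was.
bool MoveWithRetry(const std::function<bool(std::uint64_t)>& tryMoveTo,
                   const std::function<std::uint64_t()>& pickFreeLcn,
                   int maxAttempts)
{
    for(int attempt = 0; attempt < maxAttempts; ++attempt)
    {
        if(tryMoveTo(pickFreeLcn())) { return true; }
        // Failed: our view of free space was stale; loop re-queries it.
    }
    return false;   // give up for now; try again on a later idle pass
}
```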

    "I suppose, although some other APIs present a much simpler API than these."

    Other APIs do much simpler things.

    "Especially the means of getting buffers back with values that you have to do some funky arithmetic on to determine structure sizes and where the next structure lies. "

    It’s just the normal pointer (well, offset) chasing stuff. Nothing special, but not the kind of thing you want to do in safe code, because that just makes it too complicated.

    "That said, it’s good to get my hands into it since it will get me more comfortable with it and maybe i can write some nice libraries that will:

    a) abstract this stuff away

    b) be extensible and generic enough to work on other problems like this when you want to take funky native structures and pass them on to managed code. "

    Well, someone’s already made astonishingly crude wrappers (http://blogs.msdn.com/jeffrey_wall/archive/2004/09/13.aspx). I don’t care for the way they’ve done it (if you’re going to go for the C# route I’d prefer to define some structures with specified layouts and use them to read the information, as simply reading integers straight from the buffer seems to me rather error-prone and unclear), but I’m sure it works. That code doesn’t win you a great deal, though.

    "Yup. AFAIK WMI is all COM, so it should be available. I’m just wondering if it’s as simple as registering a callback that says "when CPU activity has averaged less than 5% for 5 minutes, call me back" or if i have to do some sort of polling and keep track of running averages myself. "

    You’ll have to poll. There are WMI events that you can use to wait for things, but I don’t believe that processor load is one of them. Still, a call every second or two to see how busy the machine is won’t be too bad.

    "It would be useful to know how much functionality the stock system gives you and what you need to build on top. "

    It gives you everything you need, but perhaps not everything you’d want.

    "The thing is, the defragmentation portion is the portion that’s hard to do well, and IMO the part that’s interesting."

    "And it’s quite interesting to me as well. But just because something is hard doesn’t make it less interesting to me. There are areas here that i’m pretty unfamiliar with, and i really do enjoy expanding my knowledge and becoming comfortable with them. When the breadth of my knowledge expands it ends up being useful in all sorts of ways because suddenly i can say: "a-ha, i know a great way to do Foo, and it will only take a couple of days because i’m already familiar with the system", whereas before i would have to go with "well… maybe Foo will work, but you gotta give me a lot of time to investigate it and really determine if it’s right". "

    Yeah, but frankly, if you’ve implemented one COM interface, you’ve implemented them all. And you’ve surely implemented at least one COM interface before…. And you *know* that doing "Foo" (where "Foo" in this case is "writing an MMC that talks to a service") is possible because Windows itself has so many MMC snap-ins that work by talking to services.

  18. Jeff Davis says:

    There is a project on SourceForge that is a managed .Net wrapper for creating MMC snap-ins.

    http://sourceforge.net/projects/mmclibrary/