If you want to use GUIDs to identify your files, then nobody’s stopping you


Igor Levicki proposes solving the problem of file extensions by using a GUID instead of a file name to identify a file.

You can do this already. Every file on an NTFS volume can have an object identifier, which is formally a 16-byte buffer, but let’s just call it a GUID. By default a file doesn’t have an object identifier, but you can ask for one to be created with FSCTL_CREATE_OR_GET_OBJECT_ID, which retrieves the existing object identifier associated with a file, or creates one if there isn’t one already. If you are a control freak, you can use FSCTL_SET_OBJECT_ID to specify the GUID you want to use as the object identifier. (The call fails if the file already has an object identifier.) And of course there is FSCTL_GET_OBJECT_ID to retrieve the object identifier, if any.

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <stdio.h>
#include <tchar.h>
#include <ole2.h>
#include <winioctl.h>

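int __cdecl _tmain(int argc, PTSTR *argv)
{
 if (argc < 2) {
  _tprintf(_T("Usage: pass the name of a file or directory\n"));
  return 1;
 }
 // Open the target without requesting any access rights.
 // FILE_FLAG_BACKUP_SEMANTICS is what lets CreateFile open a directory.
 HANDLE h = CreateFile(argv[1], 0,
                 FILE_SHARE_READ | FILE_SHARE_WRITE |
                 FILE_SHARE_DELETE, NULL,
                 OPEN_EXISTING,
                 FILE_FLAG_BACKUP_SEMANTICS, NULL);
 if (h != INVALID_HANDLE_VALUE) {
  FILE_OBJECTID_BUFFER buf;
  DWORD cbOut;
  // Retrieve the object identifier, creating one if there isn't one yet.
  if (DeviceIoControl(h, FSCTL_CREATE_OR_GET_OBJECT_ID,
                 NULL, 0, &buf, sizeof(buf),
                 &cbOut, NULL)) {
   // Treat the 16-byte object identifier as a GUID so it can be printed.
   GUID guid;
   CopyMemory(&guid, &buf.ObjectId, sizeof(GUID));
   WCHAR szGuid[39];
   StringFromGUID2(guid, szGuid, 39);
   _tprintf(_T("GUID is %ws\n"), szGuid);
  }
  CloseHandle(h);
 }
 return 0;
}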

This program takes a file or directory name as its sole parameter and prints the associated object identifier.
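If you are curious what happens to those 16 bytes on their way to becoming a string, here is a rough portable sketch. It assumes a little-endian machine, so that copying the buffer into a GUID makes the first three fields come out byte-swapped, and it assumes the braces-and-uppercase-hex layout that StringFromGUID2 is understood to produce. The name format_object_id is made up for illustration; it is not a Windows API.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: format a raw 16-byte object identifier the way a
   GUID is conventionally printed on a little-endian machine.
   Data1 (4 bytes), Data2 (2 bytes) and Data3 (2 bytes) are stored
   little-endian, so their bytes are emitted in reverse order.
   Data4 (8 bytes) is a plain byte array and is emitted in storage order.
   The output buffer must hold at least 39 characters (38 plus the
   terminator). Returns the value snprintf returns. */
static int format_object_id(const uint8_t id[16], char *out, size_t cch)
{
    return snprintf(out, cch,
        "{%02X%02X%02X%02X-%02X%02X-%02X%02X-"
        "%02X%02X-%02X%02X%02X%02X%02X%02X}",
        id[3], id[2], id[1], id[0],     /* Data1 */
        id[5], id[4],                   /* Data2 */
        id[7], id[6],                   /* Data3 */
        id[8], id[9],                   /* Data4, first two bytes */
        id[10], id[11], id[12], id[13], id[14], id[15]);
}
```

Feeding it the bytes 00 01 02 through 0F yields {03020100-0504-0706-0809-0A0B0C0D0E0F}, which makes the byte swapping in the first three groups easy to see.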

Big deal, now we have a GUID associated with each file.

The other half is, of course, using this GUID to open the file:

#define UNICODE
#define _UNICODE
#include <windows.h>
#include <stdio.h>
#include <tchar.h>
#include <ole2.h>

int __cdecl _tmain(int argc, PTSTR *argv)
{
 if (argc < 2) {
  _tprintf(_T("Usage: pass the object identifier as a GUID string\n"));
  return 1;
 }
 // Open something on the volume; it serves only as a volume hint.
 HANDLE hRoot = CreateFile(_T("C:\\"), 0,
                 FILE_SHARE_READ | FILE_SHARE_WRITE |
                 FILE_SHARE_DELETE, NULL,
                 OPEN_EXISTING,
                 FILE_FLAG_BACKUP_SEMANTICS, NULL);
 if (hRoot != INVALID_HANDLE_VALUE) {
  // Describe the file we want by its object identifier.
  FILE_ID_DESCRIPTOR desc;
  desc.dwSize = sizeof(desc);
  desc.Type = ObjectIdType;
  if (SUCCEEDED(CLSIDFromString(argv[1], &desc.ObjectId))) {
   HANDLE h = OpenFileById(hRoot, &desc, GENERIC_READ,
                 FILE_SHARE_READ | FILE_SHARE_WRITE |
                 FILE_SHARE_DELETE, NULL, 0);
   if (h != INVALID_HANDLE_VALUE) {
    // Proof of concept: read and print the first byte.
    BYTE b;
    DWORD cb;
    if (ReadFile(h, &b, 1, &cb, NULL)) {
     _tprintf(_T("First byte of file is 0x%02x\n"), b);
    }
    CloseHandle(h);
   }
  }
  CloseHandle(hRoot);
 }
 return 0;
}

To open a file by its GUID, you first need to open something—anything—on the volume the file resides on. Doesn’t matter what you open; the only reason for having this handle is so that OpenFileById knows which volume you’re talking about. In our little test program, we use the C: drive, which means that the file search will take place on the C: drive.

Next, you fill in the FILE_ID_DESCRIPTOR, saying that you want to open the file by its object identifier, and then it’s off to the races with OpenFileById. Just as a proof of concept, we read and print the first byte of the file that was opened as a result.

Notice that the file you open by its object identifier does not have to be in the current directory. It can be anywhere on the C: drive. As long as you have the GUID for a file, you can open it no matter where it is on the drive.

You can run these two programs just to enjoy the thrill of opening a file by its GUID. Notice that once you get the GUID for a file, you can move it anywhere on the drive, and OpenFileById will still open it.

(And if you want to get rid of those pesky drive letters, you can use the volume GUID instead. Now every file is identified by a pair of GUIDs: the volume GUID and the object identifier.)

So Igor’s dream world where all files are referenced by GUID already exists. Why isn’t everybody switching over to this utopia of GUID-based file identification?

You probably know the answer already: Because people prefer to name things with something mnemonic rather than a GUID. Imagine a file open dialog in this dream world. “Enter the GUID of the file you wish to open, or click Browse to see the GUIDs of all the files on this volume so you can pick from a list.” How long would this dialog survive?

For today, you don’t have to call me Raymond. You can call me {7ecf65a0-4b78-5f9b-e77c-8770091c0100}, or “91c” for short.

(And I’ve totally ignored the fact that using GUIDs to identify files does nothing to solve the problem of trying to figure out what program should be used to open a particular file.)

Bonus chatter: You can also open files by their file identifier, which is a volume-specific 64-bit value. But I chose to use the GUID both for the extra challenge, and just to show that Igor’s dream world already exists.
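For comparison, opening by file identifier might look something like the following. This is a minimal sketch, not a definitive implementation. It assumes the 64-bit value is the one GetFileInformationByHandle reports in nFileIndexHigh and nFileIndexLow, and the names make_file_id and reopen_by_file_id are made up for illustration. The Windows-specific part is guarded by _WIN32 so that the arithmetic helper compiles anywhere.

```c
#include <stdint.h>

/* Hypothetical helper: combine the two 32-bit halves of a file index
   into a single 64-bit file identifier. */
static uint64_t make_file_id(uint32_t high, uint32_t low)
{
    return ((uint64_t)high << 32) | low;
}

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: given a handle to a file, open a second handle to
   the same file by its 64-bit file identifier. The original handle also
   serves as the volume hint, since it is by definition on the right
   volume. Returns INVALID_HANDLE_VALUE on failure. */
static HANDLE reopen_by_file_id(HANDLE hFile)
{
    BY_HANDLE_FILE_INFORMATION info;
    FILE_ID_DESCRIPTOR desc;

    if (!GetFileInformationByHandle(hFile, &info)) {
        return INVALID_HANDLE_VALUE;
    }

    ZeroMemory(&desc, sizeof(desc));
    desc.dwSize = sizeof(desc);
    desc.Type = FileIdType;   /* a 64-bit file ID, not an object ID */
    desc.FileId.QuadPart =
        (LONGLONG)make_file_id(info.nFileIndexHigh, info.nFileIndexLow);

    return OpenFileById(hFile, &desc, GENERIC_READ,
                        FILE_SHARE_READ | FILE_SHARE_WRITE |
                        FILE_SHARE_DELETE, NULL, 0);
}
#endif
```

The only differences from the GUID version are the descriptor type and which member of the union gets filled in; the volume hint and the call to OpenFileById are the same.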

Comments (37)
  1. Falcon says:

    "(And I've totally ignored the fact that using GUIDs to identify files does nothing to solve the problem of trying to figure out what program should be used to open a particular file.)"

    Come on, Raymond – any intelligent human being knows that this could be solved by using program GUIDs!

  2. Tergiver says:

    That's cool. Is there a FindFirstFile/FindNextFile way to search for files by object identifier?

  3. Karellen says:

    Or, you could just choose to give all your files filenames that happen to be GUIDs, and store them all in the same directory. :-)

    "using GUIDs to identify files does nothing to solve the problem of trying to figure out what program should be used to open a particular file."

    If I understand Igor's original comment correctly(!), the idea is to store the mime-type or associated program for a file in an alternate datastream, which allows you to strip the file extension from the filename. The GUID-based file naming scheme is merely a "fix"[0] for the bug that without filename extensions you can no longer have two files of different types with the same (base)name in the same directory.

    [0] Deciding whether the proposed fix is worse than the original problem[1] is, naturally, an exercise for the reader ;-)

    [1] whatever that was… but it doesn't seem entirely relevant any more as my brain appears to have crawled out of my ear in an attempt to escape the insanity.

  4. Wladimir Palant says:

    Because people prefer to name things with something mnemonic

    That's one reason of course. But isn't there also the issue that a GUID might get lost? There are those pesky programs who save data by writing to a new file and replacing the old one by it instead of simply writing the data into the original file…

  5. Sunil Joshi says:

    For today, you don't have to call me Raymond

    I thought it was Mr Chen…

  6. laonianren says:

    "And I've totally ignored the fact that using GUIDs to identify files does nothing to solve the problem of trying to figure out what program should be used to open a particular file."

    Just use WriteClassStg to put the class GUID in the file. Of course, this doesn't work terribly well with files that aren't based on structured storage.

    [So opening an Explorer folder requires that every file be opened in order to determine its type. (And what if the file allows only Administrators to read it? Explorer won't know what program to elevate if you try to open it. And IT administrators will love the network traffic and tape recall.) -Raymond]
  7. Dan Bugglin says:

    @Tergiver No, because unique identifiers are by their very nature, unique.  You aren't supposed to have two files with the same identifier.

    Also I see an alternate (or probably not-so-alternate if you read enough TheDailyWTF) universe where someone actually did this but then it utterly failed when their client revealed they had systems that, for some asinine reason or another, required them to use FAT32.

  8. Tergiver says:

    Ah.. After checking the MSDN docs I see that the "object identifier" is an identifier unique to each file (that has one). So this will not help as a substitute for file extensions which would have to allow multiple files to have the same identifier.

  9. Adam Rosenfield says:

    Is OpenFileById an O(1) operation?  Or does it require some type of searching on the kernel's part which takes longer on a fuller disk?  Or more to the point, is there a performance reason for preferring good ol' CreateFile to OpenFileById?

  10. Vilx- says:

    One way of dealing away with the file-type-by-extension "problem" (it works rather well, actually) that I've never seen mentioned before is – "let's put a mime-type in the file's metadata". That would be the same metadata which contains file name, size, times, etc. So no need to open the actual file to see it. However this would (naturally) not work on FAT32, and would not survive an FTP trip. Would work just fine in emails/web though. :P

  11. pete.d says:

    Interesting! I agree that this is not a globally universal feature. But it's a very useful one for programs that have documents that relate to other documents, and want to be able to easily recover them if the user moves the other documents around.

  12. Gabe says:

    What I don't understand is how Igor would expect users to deal with two different files that have the same name. When a user is presented with a dialog box that has two files named "Foo" in it, how is the user to know which one is which? This is often a problem on default Windows configurations where foo.exe and foo.dll both get presented as "foo", but is easily solved by telling Explorer not to hide extensions.

    The easiest solution is to simply not allow multiple files to have the same name, thus preventing the problem from ever happening in the first place.

  13. Ian says:

    Is this a follow-on from last week's tip about serializing shortcuts? Both techniques could be used as a way to track a file that a user has renamed or moved to a different location.

    As far as I can see, the shortcut technique has the advantage that it works on FAT32 volumes as well as NTFS, and doesn't even require the moved/renamed file to be on the original volume. But are there disadvantages of relying on shortcuts? Presumably OpenFileById() will find the file much more quickly than resolving a wayward shortcut if speed is important and the file is still on the original NTFS volume?

  14. Joshua says:

    @Adam: There are no O(1) operations on disk. It's probably O(log n) with a really small constant.

  15. Mike Dimmick says:

    @Adam Rosenfield, Joshua: NTFS supports generic indexing – MS can fairly easily create a B+-Tree index of many different properties. Object IDs for a volume are indexed in the metadata file named $O in $Extend\$ObjId. Performance should be the same as any directory lookup (for a directory containing the same number of files as on the volume).

  16. chrismcb says:

    I don't understand how sticking a MIME type anywhere will solve the "extension hell" The extension is a piece of metadata to identify the file. The MIME type is a piece of metadata that identifies the file. Any problem you have with the extension you will ALSO have with the MIME type.  

  17. waleri says:

    If only that API existed in Windows 2000…

  18. configurator says:

    A few questions:

    When calling FSCTL_CREATE_OR_GET_OBJECT_ID with no write permission on a file that doesn't have an object id, is the object id created anyway?

    And can two volumes naturally have files with the same object id? This is assuming FSCTL_SET_OBJECT_ID wasn't used and no disk-cloning tools or some such was used either. (Of course you couldn't rely on it in actual code, I'm just wondering).

    When moving a file between two volumes, would the new file get the old file's object id? If not, I'd assume moving a file from C: to D: and then back would reset its object id, which stands to reason.

    And one warning: if you're using this, don't assume that C:\ and C:\SomeFolder are the same volume. I've seen this mistake made before. Remember, drives can be mounted in any directory, and junction points are not a myth.

    [I don't know either, but unlike you, I decided to try to find out. Here's the definition of FSCTL_CREATE_OR_GET_OBJECT_ID:
    #define FSCTL_CREATE_OR_GET_OBJECT_ID CTL_CODE(FILE_DEVICE_FILE_SYSTEM, 48, METHOD_BUFFERED, FILE_ANY_ACCESS) // FILE_OBJECTID_BUFFER
    You have now used up your lazy question quota. -Raymond
    ]
  19. Nick says:

    "You can call me {7ecf65a0-4b78-5f9b-e77c-8770091c0100}, or "91c" for short."

    I am not a GUID, I am a free man!

  20. James Schend says:

    @ChrisMcB: I quite liked the Mac Classic system where you had a "Type" and "Creator" meta-data. Since it was baked-in, all of the network file operations would cache/request/deliver it the way Windows does with the filename now. The Type told the system what type of file it was (for example, "TEXT") and the Creator told you which application opened the file by default (for example, "WORD").

    The beauty of this system is that you could have 400 text files on your system, some of which were opened by Netscape, others by MS Word, others by SimpleText… and you could switch the opening app at will. I wish I had that in Windows– there are some image files I always want to open with Paint.NET instead of Preview, but no way to tell Windows that.

    Of course, the weakness, as with all Mac Classic awesomeness, is when you need to interact with other systems over a network… it all falls apart when your Windows fileserver has no way of storing the Type and Creator meta-data, and even Mac Classic had to add an opener based on file extension long before they switched to the Unix-based OS X.

  21. Lasse V. Karlsen says:

    Question: If I store many small documents on a disk, divided up into many directories (to avoid one big directory with 100s of thousands of files), is there a performance benefit in opening the file through its "GUID" compared to opening it via its path? If I were to store the IDs in a database, would there be a noticeable speed difference between the two methods?

  22. Alex Grigoriev says:

    @James Schend:

    Alternate streams were designed into NTFS for a reason…and the original reason was (AFAIK): support Mac OS metadata streams for network fileserver.

  23. Nathaniel Mishkin says:

    That's an interesting bit of NTFS arcana that I wasn't aware of.

    FYI, the same principle–using unique IDs to identify files–was implemented in the early 1980s on the OS that ran on the engineering workstations built by Apollo Computer.  Apollo wasn't the originator of the idea of using fixed-length unique IDs that could be readily generated in non-centralized fashion for the purpose of aiding in the development and operation of distributed systems.  (I think Barbara Liskov gets credit for that.)  But Apollo's 64-bit UIDs were at the head of a chain of design that led to UUIDs, via the Open Software Foundation's (OSF) Distributed Computing Environment's (DCE), parts of which (including UUIDs and DCE's RPC protocol and API) were adopted by Microsoft.

    Anyway, the Apollo file system not only used UIDs as a stable file identifier (in addition to human-sensible names, of course).  It also used UIDs to identify the "type" of files.  This "type UID" was used to select the code that would interpret the file's raw content.  This mechanism was extensible in that (a) anyone could, of course, generate their own unique type UID, and (b) extend the streaming I/O system with a "type manager" that implemented the streaming I/O operations for that type of file.

    One interesting lesson from all this though was how hard it was to evolve the well-known and long-standing model of file I/O with concepts like "file type".  Programs were all too happy to think they could duplicate a file simply by opening it, reading out its bytes, and writing them to a new file without bothering with little details like remembering to preserve the source file's type.

  24. Cheong says:

    I found Active Directory user objects have both objectSid and objectGUID properties of different value, perhaps we should have called you yet another name… :P

  25. Worf says:

    Wladimir Palant:  That's one reason of course. But isn't there also the issue that a GUID might get lost? There are those pesky programs who save data by writing to a new file and replacing the old one by it instead of simply writing the data into the original file…

    Apps do that because an overwrite move is an atomic operation if the files are on the same filesystem. You see, while the program is writing out the new file, the power can go out, the CPU may decide to fry, or well, some piece of hardware may decide to tickle the bus the wrong way and cause a BSOD. If it happened while writing the new file, that file is corrupt, but the user still has the old version. If it happened during the move operation, then it depends how far the OS got – either it failed to replay and the user gets the old file, or it succeeded and the user gets the new file.

    So it's more to ensure at no point will the user lose all their data. If the app simply overwrote or appended their file, a corruption may cause garbage to occur and freak out the file parser, so the user loses all their data. Or even worse, the file is partially corrupt but opens fine, and the user fails to realize there's hidden corruption.

  26. Cheong says:

    @Worf: That's why a transactional file system is essential for system reliability.

    In your scenario, when the system reboots, the FS driver will see the file write isn't complete, and discard the node change transaction. The sequence of nodes in the original file is not updated, so the users will just see the old file untouched.

    Btw, in Wladimir's case, it'd be great to know if "tunneling"[blogs.msdn.com/…/439261.aspx] will work with file object identifier as well.

    [From what I can tell (remember, not authoritative) it does. -Raymond]
  27. Jules says:

    @Joshua: with the appropriate patch ( http://www.kernel.org/…/open-by-inode-rml-2.6.18-rc1-2.patch ) Linux's ext3 supports an O(1) open operation based on a file's inode number.  Inodes are stored in tables with static offsets at known locations on the disk.  There is a direct 1->1 inode number to block number mapping, which you can determine only by reading the filesystem's superblock (an O(1) operation).  Opening a file by inode therefore requires exactly 2 disk reads: superblock + inode block.  I don't know the structure of NTFS, but it's plausible that depending on how file IDs are allocated a similar approach may be taken.

  28. Ivan K says:

    "If you want to use GUIDs to identify your files, then nobody's stopping you"

    Woohoo!

  29. Gechurch says:

    @ChrisMcB

    There is one difference between the file extension and true metadata – the user can't easily change the other metadata, but wiping out or changing the file extension is easy.

    @James Schend

    "The beauty of this system is that you could have 400 text files on your system, some of which were opened by Netscape, others by MS Word, others by SimpleText…"

    You have *such* a different definition of beauty than I do! That's one of my primary hates of the old Mac OS. I can see how it would be handy occasionally, but if I have a preferred text editor, I want to use that to edit my text documents. I definitely don't want the file opening in whatever program the person who created the file preferred. It's also a big part of why I found Mac OS to be such a "messy" OS. I like my OS to be deterministic.

  30. James Schend says:

    @Alex Grigoriev: That's great, but it doesn't/didn't help with the hordes of Linux machines that came along with the Internet. All moot now anyway, Apple botched the next-gen Mac OS development and now it's just yet another Linux.

    @Gechurch: Maybe you didn't know you could change the file Creator at any time, and change which application opens it. Besides, it was completely deterministic: the file Creator didn't randomly change on its own, and the file's icon clearly communicated what program would be launched when you double-clicked it.

  31. Worf says:

    @Jules: one read operation, not two. The superblock read is free as the filesystem reads and caches it during the mount, so open by inode requires just one disk operation to look up the details.

    (The superblock is rarely touched – at best a bit is flipped to indicate the filesystem is dirty and not unmounted cleanly. But it's also important enough that an in-memory representation is used as it holds all the vital parameters for the filesystem.)

  32. Ashleigh says:

    Ahhhhh…. Takes me back to the days of VAX/VMS and opening files by FID (File ID).

    This was a wonderful way in a program of opening a file that was read/write, but buried inside a directory that was not accessible by other users. Voilà – system for updating a file but nobody else could see what the file was, or navigate to it, and dumping the exe did not even yield a path full of text. Of course you could always debug and single step it.

    And then (as mentioned above) the Apollo Domain/OS system had a similar idea as well. I even wrote one of those file system type managers – it took file system access and re-routed it to a device driver for a home made I/O card. All in user space with no need for kernel drivers or mods.

    Some of these neat ideas have a very long history behind them.

  33. GSerg says:

    > it'd be great to know if "tunneling" will work

    [From what I can tell (remember, not authoritative) it does. -Raymond]

    I was going to ask the same question!

    On a side note, what wonders me about tunneling is why Windows 7 suddenly stopped respecting it at the visual level.

    If you create an Excel file and place it in the middle of your desktop on WinXP, then make changes to this file and save, the file will remain in the middle of the desktop. WinXP knows it's the same file, despite Excel deleted the old file and created a new one.

    If you do the same on Win7, the file will jump to the first free cell of the icons grid on the desktop as soon as you save it. Win7 fails to recognize tunneling (at least at the visual level), so the "newly created" file gets into the first free slot.

    Same for file lists in Explorer. Daily I have to deal with folders filled with like forty Word documents each. I open them one by one, making changes and saving. The selected file in Explorer has been my bookmark all the time. I always knew the selection will not get lost when Word deletes the old version and creates a new file to save. With Windows 7, the bookmark won't work for me anymore. Whenever Word saves the currently selected file by deleting and creating, the selection resets. I now have to figure out where I was on the list.

    Is that a hint for us the tunneling gotta go soon?

  34. Alex Grigoriev says:

    @GSerg:

    Could it be that tunneling is tied to short name generation, and you disabled short names in your setup?

  35. GSerg says:

    @Alex Grigoriev:

    No, I didn't disable anything. I'm having this issue on both Windows 7 Professional 32-bit at my workplace and Windows 7 Home Basic 64-bit at home.

    NtfsDisable8dot3NameCreation on both PCs has the value of 2, and I'm fairly sure nobody has ever touched it. I know that allowed values are 0 and 1 (technet.microsoft.com/…/cc959352.aspx).

  36. GSerg says:

    Oh.

    There's a newer version of this article (technet.microsoft.com/…/cc778996(WS.10).aspx).

    fsutil 8dot3name query c:

    -> Yes, I've got 8dot3 names on for volume c:.

  37. Igor Levicki says:

    Having a unique GUID for each file is just one part of the equation. That would allow you to have several files with the same filename from the filesystem's point of view because filename would turn into a mere human readable _description_ of the file contents, instead of being the unique key by which the file is accessed.

    To differentiate between file types, you would need to use GUIDs instead of file extensions. You could specify the type/GUID when you CreateFile(), and if the file is called "blah" the system would still know the correct program to use to open it even if you had two "blah" files in the same folder (For example one AVI and one JPEG).

    [At this point, you're just changing the definition of "name" from "the thing used to identify the file" to "the thing shown to the user." (And what problem is using GUIDs supposed to solve? I don't recall file extension conflicts being a serious problem.) -Raymond]
