Building a large text file editor – Part I

The purpose of this series of two blog posts is to illustrate how an editor for large files can be implemented. The first part will address the model behind the editor whereas the second part will include an actual UserControl that allows users to view and edit large files.

The basic idea behind a large file viewer is that you can’t load a very large file into a TextBox or RichTextBox control because the memory usage as well as the time to load the file will be enormous. Moreover, the user can only see a very small portion of the file on the screen at once, so it doesn’t make sense to load all the data in the file at once.

As such, what is needed from a large file viewer is to only display some small parts of the file in order to fill at least a screen. This is pretty straightforward and will be addressed in the second post. What is a bit more complex is to enable the user to edit the file (i.e. make insertions and deletions). The problem with editing the file is that edits need to be taken into account when computing the position in the stream. If the user deletes say the first 10 characters and then wants to read two characters at offset 4, what actually needs to be read is the two characters at offset 14 from the original file:

  • Original string: "0123456789abcdefgh"
  • String after deleting 10 characters: "abcdefgh"
  • Character at offset 4 (zero-based): "e"
  • Offset in initial string: 14.

Obviously, things get more complicated when the user starts to randomly delete and insert text at various locations in the file.

The solution I’m proposing is the class RevisionStream (see attached code). This class is wrapper around the FileStream class and supports four public methods:

  • AddDeletion – the caller signals that a deletion has been made by the user
  • AddInsertion – the caller signals that the user has inserted some text
  • Read – reads some bytes from the revised stream
  • SaveTo – saves revised stream to file

The basic idea behind this approach is that normally the largest part of the text will remain identical to the original stream and only some minor changes will be done by the user. So it makes sense to leave the original text on disk and only load to memory the parts that we need together with the changes the user makes.

Under the hood, the RevisionStream uses a list of Segments. A segment is either a reference to a contiguous section of the original text or a newly added text. The really difficult part is to adjust the segments from the RevisionStream when an insertion or a deletion is made.

Getting back to the previous example, the RevisionStream initially contains all the text from the original file, which is "0123456789abcdefgh". The list of segments will contain only one segment, [0, 17] (where 17 = the offset of the last character).

Scenario #1: The user deletes the first 10 characters

Resulting list of segments: [10, 17].
Resulting string: "abcdefgh"

Scenario #2: The user deletes the character ‘b’, which is located at offset 11.

Resulting list of segments: [0, 10], [12, 17].
Resulting string: "0123456789acdefgh".

Scenario #3: The user deletes the character ‘b’, which is located at offset 11, and writes "ZZZ" at the same offset.

Resulting list of segments: [0, 10],["ZZZ"],[12, 17].
Resulting string: "0123456789aZZZcdefgh".

The real challenge here is to deal with all the possible scenarios of deleting and inserting text (delete from more consecutive segments, delete at end of segment, etc.). Hopefully I’ve covered all the possible scenarios in the implementations of the private methods AdjustForInsertion and AdjustForDeletion but if you find any case that hasn’t been covered, please post a comment here.

There’s still some work to do here: using only long indices to ensure really large files can be read, making the Insertions encoding-independent, etc. However, up to this point, we can go ahead and build a pretty decent Large File Editor based on this Revision Stream.

RevisionStream.cs