I move a lot of large, row-formatted text files into Azure Storage for work I do with HDInsight and other technologies. I land these files as block blobs, which means that individual files must stay below the 200 GB block blob size limit. Azure Data Lake Store does not have this limitation, and most of the technologies I work with support compression, but let's assume these are not options for the specific work I want to perform.
The way I typically work around this limitation is to split my files into smaller files that fall below the maximum allowed size. There are several freely available file splitters on the Internet, but they typically split files at a byte offset, which means the split will likely land in the middle of a row. Because these splitters do not recognize rows, they also do not recognize that my file may have a header row that I would like replicated in each output file. For these reasons, I wrote a custom PowerShell script to split my files.
The script requires you to target an input file, define a maximum number of bytes per output file, and indicate whether the file has a header row. It then reads the input file line by line and writes the lines to a series of output files, preserving the original row order and making sure each file stays below the specified maximum size. If a header row is present, it is written to each output file. The encoding of the original file is detected and preserved in the output files.
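To make the flow concrete, here is a minimal sketch of that splitting logic, not my actual script. The parameter names ($InputFile, $MaxBytes, $HasHeader, $Delimiter) are illustrative, and a row larger than the maximum would still be written whole:

```powershell
param(
    [string]$InputFile,
    [long]$MaxBytes,
    [bool]$HasHeader = $true,
    [string]$Delimiter = "`r`n"
)

# Let StreamReader detect the encoding from the byte-order mark
$reader = New-Object System.IO.StreamReader($InputFile, $true)
$reader.Peek() | Out-Null              # force the encoding detection
$encoding = $reader.CurrentEncoding

# Capture the header row, if there is one, so it can be replicated
$header = if ($HasHeader) { $reader.ReadLine() } else { $null }

$part = 0
$writer = $null
$bytesWritten = $MaxBytes              # forces a new file on the first row

while ($null -ne ($line = $reader.ReadLine())) {
    $lineBytes = $encoding.GetByteCount($line + $Delimiter)

    # Roll over to a new output file before this row would exceed the max
    if ($bytesWritten + $lineBytes -gt $MaxBytes) {
        if ($writer) { $writer.Close() }
        $part++
        $writer = New-Object System.IO.StreamWriter("$InputFile.$part", $false, $encoding)
        $writer.NewLine = $Delimiter
        $bytesWritten = 0
        if ($header) {
            $writer.WriteLine($header)
            $bytesWritten += $encoding.GetByteCount($header + $Delimiter)
        }
    }

    $writer.WriteLine($line)
    $bytesWritten += $lineBytes
}

if ($writer) { $writer.Close() }
$reader.Close()
```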
There are two things I'd like to add to the script but will simply need to return to at a later date. First, I'd like the script to detect the row delimiter used by the original text file and use it in the output files. As it currently stands, you must tell the script which row delimiter to use in your output files. Second, I'd like the script to assess the size of the original file and attempt to create output files that are more consistently sized. This one is tricky when rows vary widely in size, so I decided to keep the script simple for now.
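The core of that second idea is simple enough to sketch, though turning it into even output files is the hard part. Assuming the same illustrative $InputFile and $MaxBytes variables as above, a target size could be derived from the total file size rather than filling each file to the maximum:

```powershell
# Hypothetical sketch: split the total size evenly across the minimum
# number of files, so the last file is not a small remainder
$fileSize    = (Get-Item $InputFile).Length
$fileCount   = [math]::Ceiling($fileSize / $MaxBytes)
$targetBytes = [math]::Ceiling($fileSize / $fileCount)
```

With highly variable row sizes, actual output files would still drift from $targetBytes, which is why I kept the simpler fill-to-the-max approach for now.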
One other thing about the script: because I am using it with very large files, I made use of the .NET Framework to read and write lines of data. A quick search will turn up a ton of posts showing the performance impact of this choice. In some simple testing, I found that the performance benefits of using the .NET Framework instead of the built-in PowerShell cmdlets, i.e. Get-Content, Set-Content, and Add-Content, were tremendous. .NET adds complexity, but it was totally worth it in this scenario.
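If you want to see the difference for yourself, a rough comparison like the following (illustrative, not part of my script) times both approaches reading the same large file:

```powershell
$path = "C:\temp\bigfile.txt"   # substitute a large file of your own

# Built-in cmdlet: read every line through the pipeline
(Measure-Command {
    Get-Content $path | Out-Null
}).TotalSeconds

# .NET StreamReader: read every line directly
(Measure-Command {
    $reader = [System.IO.File]::OpenText($path)
    while ($null -ne $reader.ReadLine()) { }
    $reader.Close()
}).TotalSeconds
```

The gap widens as the file grows, since Get-Content emits each line as an object into the pipeline while the StreamReader loop stays inside .NET.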