Leveraging Azure Data Lake Partitioning to Recalculate Previously Processed Days

Article
05/03/2016

Many data flows will require partial reloading of U-SQL tables due to the need to recalculate a previously loaded result. Often times this is a full or partial recalculation of a previous day’s results. For this reason, partitioning your data by a time slice can make it more efficient to do this recalculation. Partitioning by time allows you to truncate a specific slice and rerun the load process for that time period in the event of a need to re-compute the data. This is fairly typical in large data warehousing scenarios also as deletes can be problematic from a performance standpoint over time.

The question is what is the right time period to use? The answer is it depends on the size of your partitions. Generally, for managed tables in U-SQL, you want to target about 1 GB per partition. So, if you are bringing in say 800 mb per day then daily partitions are about right. If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.

In this post, I’d like to take a look at two common scenarios that people run into. The first is full re-compute of partitions data and the second is a partial re-compute of a partition. The examples I will be using are based off of the U-SQL Ambulance Demo’s on Github and will be added to the solution for ease of your consumption.

We’ll be working off of a new table in the Ambulance Demo Database called vehiclesP. This table will be created with two partitions on the event_date column.
CREATE TABLE vehiclesP( vehicle_id int , entry_id long , event_date DateTime , latitude float , longitude float , speed int , direction string , trip_id int? , INDEX idx CLUSTERED (vehicle_id ASC) PARTITIONED BY (event_date) DISTRIBUTED BY HASH (vehicle_id) INTO 2 ); DECLARE @pdate1 DateTime = new DateTime(2014, 9, 14, 00,00,00,00,DateTimeKind.Utc); DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc); ALTER TABLE vehiclesP ADD PARTITION (@pdate1), PARTITION (@pdate2);
The scenario for this first example is that you have found that all data for 2014-09-15 needs to be reloaded due to some error in the previous data. We will accomplish this be extracting the new data from ADLS into a variable, truncating the partition that we want to update and then loading the data in the variable to that partition.
DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc); DECLARE @date2 DateTime = DateTime.Parse("2014-09-15"); DECLARE @ADL_DIR string = "[insert your path here]/Ambulance/"; DECLARE @ADL_FILESET string = @ADL_DIR + "vehicle{vid}_{date:MM}{date:dd}{date:yyyy}.{*}"; @data = EXTRACT vehicle_id int, entry_id long, event_date DateTime, latitude float, longitude float, speed int, direction string, trip_id int?, vid int // virtual file set column , date DateTime // virtual file set column FROM @ADL_FILESET USING Extractors.Csv(); TRUNCATE TABLE vehiclesP PARTITION (@pdate2); // Load the day statically. INSERT vehiclesP PARTITION (@pdate2) SELECT vehicle_id , entry_id // Do not specify: , event_date , latitude , longitude , speed , direction , trip_id FROM @data WHERE date == @date2;

As your data sizes and complexity of your ETL increase, it may be more efficient to reload only the data that needs updating instead of the entire partition. The next scenario is more common in that you may want to do a partial reload of a previous partition. In this example, we found that the data for Ambulance 3 on 2014-09-14 was incorrect and we need to reload that data only. We accomplish this by capturing the data from the existing table partition we want to keep into a variable, pull data the data we want to recalculate (Ambulance 3 on 2014-09-15) from ADLS into another variable, union them together and then truncate and reload the data into our target partition.
//assume you want to reload all the data for 9/15 associated with Vehicle 3. DECLARE @pdate2 DateTime = new DateTime(2014, 9, 15, 00,00,00,00,DateTimeKind.Utc); DECLARE @vehicleID int = 3; DECLARE @date2 DateTime = DateTime.Parse("2014-09-15"); DECLARE @date3 DateTime = DateTime.Parse("2014-09-16"); DECLARE @ADL_DIR string = "/[insert your path here]/Ambulance/"; DECLARE @ADL_FILESET string = @ADL_DIR + "vehicle{vid}_{date:MM}{date:dd}{date:yyyy}.{*}"; //get all data from partition except data we want to delete and reload @existingdata = SELECT vehicle_id, entry_id, event_date, latitude, longitude, speed, direction, trip_id FROM dbo.vehiclesP WHERE vehicle_id != @vehicleID AND event_date >= @date2 AND event_date = @date2 AND date < @date3 AND vehicle_id == @vehicleID; TRUNCATE TABLE vehiclesP PARTITION (@pdate2); //union together the data kept from the table and the data loaded from ADLS, insert @datatobeloaded = SELECT vehicle_id , entry_id // , event_date (don't specify) , latitude , longitude , speed , direction , trip_id FROM @existingdata UNION ALL SELECT vehicle_id , entry_id // , event_date (don't specify) , latitude , longitude , speed , direction , trip_id FROM @data; INSERT vehiclesP PARTITION ( @pdate2 ) SELECT vehicle_id, entry_id, latitude, longitude, speed, direction, trip_id FROM @datatobeloaded;

For more information around creating tables in Azure Data Lake, please take a look at the U-SQL reference document.

Leveraging Azure Data Lake Partitioning to Recalculate Previously Processed Days

Additional resources