The last two Azure Data Lake refreshes, in March and April, brought a lot of interesting new capabilities to U-SQL.
In this post I would like to call out some of the highlights.
Based on popular customer demand and frequent questions, we added
UNPIVOT expressions. As in SQL Server, they need a static list of values to pivot on. But unlike SQL Server,
UNPIVOT also gives you the ability to include rows that have
null values! If you need a more dynamic pivot, we still recommend that you use a
SqlMap-typed column for the “unknown” columns.
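A minimal sketch of the new capability (the rowset, column names, and output path are assumptions, and the syntax follows the SQL Server-style UNPIVOT shape described above):

```sql
// Hypothetical input: monthly amounts per store, some of them null.
@sales =
    SELECT * FROM
        (VALUES
            ("Store1", (int?) 10, (int?) 20, (int?) null),
            ("Store2", (int?) 5,  (int?) null, (int?) 15)
        ) AS S(Store, Jan, Feb, Mar);

// Rotate the month columns into rows. INCLUDE NULLS keeps the rows
// whose value is null; omitting it would drop them.
@unpivoted =
    SELECT Store, Month, Amount
    FROM @sales
        UNPIVOT INCLUDE NULLS (Amount FOR Month IN (Jan, Feb, Mar)) AS U;

OUTPUT @unpivoted TO "/output/unpivoted.csv" USING Outputters.Csv();
```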
The March refresh also added the ability to access U-SQL catalog objects such as tables and assemblies across different Azure Data Lake Analytics accounts. The only requirement is that they share the same Azure Active Directory context. This allows you, for example, to set up a global ADLA and ADLS account that contains files and U-SQL databases in one subscription and then give different teams read and execute access to those objects from their own ADLA accounts in their own subscriptions. These ADLA accounts in turn have their own default ADLS accounts, so team-specific objects are written (and billed) into the team-specific store.
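In practice this could look as follows; the account, database, and table names are hypothetical, and the account-qualified naming is an assumption based on extending U-SQL's usual three-part object names with the ADLA account:

```sql
// Read a shared table from a "global" ADLA account's catalog.
// Both accounts must live in the same Azure Active Directory tenant.
@rows =
    SELECT *
    FROM [GlobalAdlaAccount].[SharedDB].[dbo].[ReferenceData];

// Output lands in the calling account's default ADLS store,
// so team-specific results are billed to the team's own account.
OUTPUT @rows TO "/output/refdata.csv" USING Outputters.Csv();
```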
U-SQL packages are a way to bundle more complex assembly references, variable declarations, and resource deployments, so a script writer can import a single package and get all the references and variable declarations in a single statement. Packages also give developers an abstraction for script writers, behind which they can refactor their assembly structures without impacting the scripts.
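A sketch of the pattern (the package, database, and assembly names are hypothetical, and the exact set of statements allowed inside a package body should be checked against the release notes):

```sql
// A developer bundles references and declarations once:
CREATE PACKAGE IF NOT EXISTS MyDB.dbo.MyRefPkg
BEGIN
    REFERENCE ASSEMBLY MyDB.[MyFormatters];
    DECLARE @DefaultInputPath string = "/data/input/";
END;

// A script writer then pulls everything in with one statement:
IMPORT PACKAGE MyDB.dbo.MyRefPkg;
```

If the developer later splits `MyFormatters` into several assemblies, only the package body changes; importing scripts stay untouched.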
The easier C# integration with CROSS APPLY makes it simpler to use CROSS APPLY with set-based C# expressions, without having to write a custom applier or wrap everything into a
SqlArray data type.
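For example, the result of a C# expression such as `string.Split` can now feed EXPLODE directly (the input rowset and its columns are assumptions):

```sql
// Hypothetical input: one row per id, with semicolon-delimited tags.
@logs =
    SELECT * FROM
        (VALUES (1, "a;b;c"), (2, "x;y"))
        AS L(id, tags);

// tags.Split(';') is a plain C# array expression; EXPLODE turns it
// into one row per element — no custom applier, no SqlArray wrapping.
@expanded =
    SELECT id, tag
    FROM @logs
        CROSS APPLY EXPLODE(tags.Split(';')) AS T(tag);

OUTPUT @expanded TO "/output/tags.csv" USING Outputters.Csv();
```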
One problem we noticed with U-SQL tables was that many people used the
HASH distribution scheme without specifying an explicit number of hash buckets. This often led to over- or under-partitioned tables, which negatively impacted performance when loading or querying them. Therefore, in April we introduced a new algorithm that determines the number of hash buckets on the first insertion into a table (or table partition). The goal is a better distribution size and a better out-of-the-box performance experience.
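Concretely, this affects tables declared like the following (table and column names are hypothetical):

```sql
// No INTO clause: the new algorithm picks the number of hash buckets
// automatically on the first insertion into the table (or partition).
CREATE TABLE IF NOT EXISTS MyDB.dbo.Clicks
(
    UserId int,
    Url    string,
    INDEX idx CLUSTERED (UserId ASC)
        DISTRIBUTED BY HASH (UserId)
);
```

An explicit bucket count, e.g. `DISTRIBUTED BY HASH (UserId) INTO 16`, is still honored when you know your data well enough to choose one.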
Speaking of performance: we now also explain how to enable the preview of faster, more scalable file sets! If you use file sets in your U-SQL queries, there is no reason not to try it! If you do, please report back here in the comments or through other channels (Twitter, email, etc.) on how it works for you!
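A file-set query that benefits looks like this; the input path is hypothetical, and the feature-preview flag name below is a placeholder to be taken from the release documentation:

```sql
// Opt into the preview for this script (flag name is an assumption;
// check the release notes for the exact string).
SET @@FeaturePreviews = "FileSetV2Dot5:on";

// A file set with a {date} virtual column spanning many daily files.
@rows =
    EXTRACT user   string,
            amount int,
            date   DateTime
    FROM "/input/{date:yyyy}/{date:MM}/{date:dd}/log.csv"
    USING Extractors.Csv();
```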