Azure Data Catalog is a service that enables discovery of data assets in your organization. Users can search for data sources entered into the catalog in various ways, so they can (re)use the data sources instead of creating duplicate and potentially inconsistent sources for the same data. As explained in the Azure Data Catalog developer concepts, you can only have one Data Catalog per Azure account. If an Azure account contains multiple subscriptions, the same Data Catalog applies to all subscriptions. This means that if you use subscriptions to separate environments (e.g. dev/test/production or finance/marketing/manufacturing), these environments still share the same Data Catalog. For a Data Catalog that holds information on all data sources in your organization this makes sense, but it also raises a question: how do you separate data sources from different environments?
Let’s look at an example situation where we have three subscriptions under the same account: dev, test, and production. As services are being developed, the same data source may exist in dev, test, and production with slight variations because of version differences. You need to ensure that a developer doesn’t use the wrong data source (or the wrong version of it) and inadvertently write development data into a production system. It is important to realize that this is not in itself a discovery problem; it’s a rights management problem. If the rights to a production system allow a developer to manipulate production data, this issue exists regardless of whether you have a discovery mechanism in place. Separating information in the Data Catalog will not solve this issue. However, making clear in the Data Catalog whether a particular data source is for dev, test, or production can help reduce the chance of picking the wrong one.
How NOT to use Data Catalog
I’ve added two assets to the Data Catalog, both named Contacts. When searching for these, you get a result like this:
If you look at the details of both assets, you can see they are almost identical, except for the location:
For someone who is not familiar with your data sources and locations, the above doesn’t help. In this case you can still infer whether it’s a development asset, but with more typical data source names that is less likely.
Data Catalog Best Practices
To avoid situations like the one shown above, you should follow the guidelines below.
- Use a consistent naming scheme, for example contacts.dev and contacts.prod. The friendly name can be the same or something like Contacts (Development) and Contacts (Production).
- Use tags to identify the environment the asset belongs to.
- If a naming convention and tags are not sufficient, limit visibility to specific user groups, for example make development assets visible only to people in the developers group. You can do this as follows:
  - In Azure Active Directory, create the needed groups and add users.
  - Select an asset and Take Ownership*.
  - Set Visibility to Owners & These Users.
  - Click Add… to add the group(s) you want to grant visibility.
* The guidance in How to Manage Data Assets is to have at least two individuals as owners of an asset.
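
The naming and tagging guidelines above are easier to apply consistently if the names are generated rather than typed by hand. The following is a minimal sketch of that idea in Python. It does not call Data Catalog. The environment codes, the `env:` tag format, and the helper names `label_asset` and `environment_of` are assumptions made for illustration, so adjust them to your own convention.

```python
from dataclasses import dataclass

# Hypothetical environment codes and display labels; not defined by Data Catalog.
ENVIRONMENTS = {
    "dev": "Development",
    "test": "Test",
    "prod": "Production",
}


@dataclass(frozen=True)
class AssetLabels:
    name: str           # e.g. "contacts.dev"
    friendly_name: str  # e.g. "Contacts (Development)"
    tags: tuple         # e.g. ("env:dev",) -- the tag format is an assumption


def label_asset(base_name: str, env: str) -> AssetLabels:
    """Build a consistent name, friendly name, and environment tag."""
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    base = base_name.strip().lower()
    return AssetLabels(
        name=f"{base}.{env}",
        friendly_name=f"{base.capitalize()} ({ENVIRONMENTS[env]})",
        tags=(f"env:{env}",),
    )


def environment_of(asset_name: str):
    """Return the environment code encoded in an asset name, or None."""
    _, sep, suffix = asset_name.rpartition(".")
    return suffix if sep and suffix in ENVIRONMENTS else None


if __name__ == "__main__":
    print(label_asset("Contacts", "dev"))
    print(environment_of("contacts.prod"))
    print(environment_of("Contacts"))
```

A check like `environment_of` can also be run over a list of registered asset names to find ones that, like the two Contacts assets earlier, carry no environment marker at all.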