One of the major issues that people who want to get started with "Data Science" or "Big Data Analytics" face is finding datasets that are both compelling and diverse enough to provide a useful playground for exploration.
In my previous life, I blogged about this subject and wrote instructions on how to create a large SQL Server database from data available from the Home Mortgage Disclosure Act (HMDA). You can read that entry here although I'm not 100% positive that the steps I've laid out in that blog are still valid in terms of data locations and availability.
The purpose of this series of posts is to provide instructions on how to create a SQL database in Microsoft Azure that can be used as a data source for advanced analytics. I'll be posting future entries on this blog that refer back to this database. Also, to be clear, while I'm focusing on creating the database in Microsoft Azure, there is no reason why you couldn't use the same techniques to create the database on a local instance of SQL Server.
For those that want to use Azure but don't have a subscription, you can sign up for a 30-day trial here. If all you want to do is host a simple database in Azure, you can do so for around US $5 per month, up to 2GB in size. For a larger database, you'll want to look at the standard tier, which starts at US $15 per month for 250GB. If you follow the instructions that I lay out in this post, you'll end up with a database of about 6GB in size.
For this first article in the series, I will discuss setting up the database environment in Azure and downloading the initial data. I will also discuss the initial loading of the data into a temporary table in the database. Additional articles in this series will focus on turning the temporary table into a useful data structure as well as analyzing the data.
The Use Case
For this series, the use case I'll be working with is crime, based on data from the city of Chicago. The reason I chose this particular dataset is that it's a subject that most people can relate to in one way or another, and it does lend itself to some advanced analytics capabilities very well.
The City of Chicago maintains a data portal where they publish crime detail on a regular basis. This is significant because the level of detail published is very granular – down to the individual police report. When combined with other data, such as NOAA Weather data, there are a number of interesting analytics possibilities that can be realized. The solution that will be developed throughout this series will be surfaced using Microsoft PowerBI and will result in a dashboard that looks like this:
Configuring The Database Environment
The first step in building the database is to ensure that you're properly setup in the Microsoft Azure portal. Follow the instructions here to sign up for an account (a trial account will work just fine for this, but remember that it's only good for 30 days). Also remember that the instructions I'm providing here will also work with a local SQL Server instance, you'll just have to modify how and where you load your data appropriately.
Once you are logged in to the Azure environment, you should have a portal that looks something like this (likely with fewer pre-configured resources however):
To create a new database, choose SQL Databases from the left side navigation, and then select New at the bottom of the page, then select Data Services / SQL Database / Quick Create and then fill out the form, choosing a New SQL database server along with a region close to you.
Once the form is filled out appropriately, choose Create SQL Database which will then submit everything to Azure and begin the process of provisioning your new database. You will see a status message appear, and it will take a few minutes to complete the process.
If you receive a message about enabling auditing on the new database, you can choose to ignore it, unless you want to experiment with auditing throughout this process.
Once the database is created, you can click the details link to view the status of the job:
Managing the Database
Now that the database has been created, you'll want some tools to manage it. The good news there is that the same tools that manage local SQL Server instances work just fine with Azure SQL Database. If you don't already have a local instance of SQL Server Management Studio to work with, you can download a free version here (Click the link to download Express Edition, and then select the appropriate 32 or 64-bit Management Studio option – Follow the instructions to install it on your local machine). Make sure that you install all of the management tools, as you will be using the Import/Export Wizard to populate the initial table in the database.
The first step to enable managing your new database is to enable your client IP Address in the Firewall rules. From the Azure Management Portal, choose SQL Databases on the left side, then select the Servers tab, then select the server that you just created.
Select the Configure tab, and then select Add to the Allowed IP Addresses. Choose Yes next to Windows Azure Services (you will need this option later) and then choose Save at the bottom of the screen. This will add your local IP address to the firewall rules to enable your local machine to connect to the Azure SQL Database server.
Once the firewall rules are saved, you'll use SQL Server Management Studio (SSMS) to manage the database and server. To connect to the database, start SQL Server Management Studio, and when prompted, login to the new Database Server that you created above, using SQL Server Authentication and the username and password that you provided when you initially provisioned the database. (notice that you will use the fully-qualified name of the database server, which is <servername_you_provided>.database.windows.net)
Once connected, you should see your server in the Object Explorer window. You can expand Databases to see the database that you provisioned (note: My personal preference is to automatically open a new query window when I start SSMS. The default is to not open a new query. If you want to configure this option, it is available under Tools/Startup )
Once you have successfully connected to the database, you are ready to proceed to importing the data into the initial staging area.
Downloading the Crime Data
Downloading the data from the City of Chicago is a very easy process. For this initial load, we will download the entire dataset and load it in a single pass. Since the data is updated on a regular basis in the portal, later entries in this series will explain how to keep your data in sync with the portal.
This will open the Crimes dataset in the online explorer (which is very nicely done, and allows for a wide-range of analysis directly from the portal).
In the upper-right corner of the portal, choose the Export option and then choose CSV for Excel. This will eventually open a Save File dialog (it can take a few minutes to generate the export file)
Choose Save, and the file will begin downloading. This will take several minutes to download, depending on your Internet connection speed.
Now that the file is downloaded, you can import the data into your Azure SQL Database.
Importing The Data
Note: The process that we will use to import the data is a very simplistic process. There are more efficient ways to accomplish this task, but I wanted to use the simple and easy approach to load the initial data.
To load the initial data, start the SQL Server 2014 Import Export Wizard. (It was installed along with the SSMS tools above and can be found in your Start menu. Make sure you choose the appropriate version of the tool – 64-bit or 32-bit depending on your operating system)
When the wizard starts, click Next and then choose Flat File Source. Click the Browse button and select the file that you downloaded in the previous step (in order to see the file in the window, you will need to select the CSV files option next to the File textbox) and then choose Open.
In the Text Qualified textbox, enter a double-quote (").
Ensure that the options are configured as in the image above. Choose the Columns option on the left side to ensure that the fields are lining up properly.
Once you are sure that the columns are properly lined up, select the Advanced option on the left side and then choose the Description Column. Change the data type to a text stream (DT_TEXT)
Then choose the Next button and then select the SQL Server Native Client destination. Enter your Server Name and login information, then select the ChicagoCrime database in the drop down.
Click the Next button and change the destination table name to [dbo].[Crimes_New]
Choose the Edit Mappings button, and select the Edit SQL button, and then add PRIMARY KEY CLUSTERED to the [ID] column as shown below. Azure SQL Database likes to see Clustered indexes on tables.
Choose the OK button and then choose the Next button and then choose Finish twice to start the import operation.
This operation will run for several minutes as there are over 5 million rows of data to import.
Once the operation is complete, you can switch back to SSMS and verify that the initial table has been created and populated by executing the following query in the ChicagoCrimes database:
Preparing for the Next Step
In this blog post, we setup a new Azure SQL Database and imported some crime data into a temporary table in the SQL Database. You will note that we did not manipulate the data in any way, so all of the data types are currently inherited from the text file. In the next article in this series, I'll walk through how to create the permanent table and how to populate it with the data, converting the data types to the appropriate values.