An Introduction to DataMarket with PHP

Hi! I’m Jovana, and I’m currently interning on the DataMarket team. I come from sunny Western Australia, where I’ve almost finished a degree in Computer Science and Mechatronics Engineering. When I came here I noticed that there wasn’t too much available in the way of tutorials for users who wanted to use DataMarket data in a project, but weren’t C# programmers. I’d written a total of one function in C# before coming here, so I’d definitely classify myself in that category. The languages I’m most familiar with are PHP, Python and Java, so over the next few weeks I’ll do a series of posts giving a basic introduction to consuming data from DataMarket using these languages. I’ll refer to the 2006 – 2008 Crime in the United States (Data.gov) dataset for these posts, which is free to subscribe to, and allows unlimited transactions.

In this post I’ll outline two methods for using PHP to query DataMarket; using the PHP OData SDK, and using cURL to read and then parse the xml data feed. For either method, you’ll firstly need to subscribe to a dataset, and make a note your DataMarket account key. Your account key can be found by clicking “My Data” or “My Account” near the top of the DataMarket webpage, then choosing “Account Keys” in the sidebar.

 

The PHP OData SDK

DataMarket uses the OData protocol to query data, a relatively new format released under the Microsoft Open Specification Promise. One of the ways to query DataMarket with PHP is to use the PHP OData SDK, developed by Persistent Systems Ltd. This is freely available from CodePlex, however unfortunately there seems to be little developer activity on the project since its release in March 2010, and users report that they need to do some source code modifications to get it to work on Unix systems. Setting up the SDK also involves making some basic changes to the PHP configuration file, potentially a problem on some hosted web servers.

A word of warning: not all DataMarket datasets can be queried with the PHP OData SDK! DataMarket datasets can have one of two query types, fixed or flexible. To check which type a particular set is, click on the “Details” tab in the dataset description page. The SDK only supports datasets with flexible queries. Another way to check is to take a look at the feed’s metadata. Copy the service URL, also found under the “Details” tab into your browser’s address bar and add $metadata after the trailing slash. Some browsers have trouble rendering the metadata; if you get an error, save the page and open it up in notepad. Look for the tab containing <schema xmlns=”…”> (There will probably be other attributes, such as namespace, in this tab). The PHP OData SDK will only work with metadata documents specifying their schema xmlns ending in one of “/2007/05/edm”, “/2006/04/edm” or “/2008/09/edm”.

Generating a Proxy Class

The PHP OData SDK comes with a PHP utility to generate a proxy class for a given OData feed. The file it generates is essentially a PHP model of the feed. The command to generate the file is

php PHPDataSvcUtil.php /uri=[Dataset’s service URL]

/out=[Name out output file]

Once generated, check that the output file was created successfully. The file should contain at least one full class definition. Below is a snippet of the class generated for the Data.gov Crime dataset. The full class is around 340 lines long.

 /**
* Function returns DataServiceQuery reference for
* the entityset CityCrime
* @return DataServiceQuery
*/
public function CityCrime()
{
$this->_CityCrime->ClearAllOptions();
return $this->_CityCrime;
}
Using the Proxy class

With the hardest part complete, you are now ready to start consuming data! Insert a reference to the proxy class at the top of your PHP document.

require_once "datagovCrimesContainer.php";

Now you are ready to load the proxy. You’ll also need to pass in your account key for authentication.

$key = [Your Account Key];

$context = new datagovCrimesContainer();

$context->Credential = new WindowsCredential("key", $key);

The next step is to construct and run the query. There are a number of query functions available; these are documented with examples in the user guide. Keep in mind that queries can’t always be filtered by any of the parameters– for this particular dataset we can specify ROWID, State, City and Year. The valid input parameters can be found under the dataset’s “Details” tab. Note that some datasets have mandatory input parameters.

 try
{
$query = $context->CityCrime()
->Filter("State eq 'Washington' and Year eq 2007");
$result = $query->Execute();
}
catch (DataServiceRequestException $e)
{
echo "Error: " . $e->Response->getError();
}
$crimes = $result->Result;

(If you get a warning message from cURL that isn’t relevant to the current environment, try adding @ in front of $query to suppress warnings.)

In this example we’ll construct a table to display some of the result data.

 echo “<table>”;
foreach ($crimes as $row)
{
echo "<tr><td>" . htmlspecialchars($row->City) . "</td>";
echo "<td>" . htmlspecialchars($row->Population) . "</td>";
echo "<td>" . htmlspecialchars($row->Arson) . "</td></tr>";
}
echo "</table>";

DataMarket will return up to 100 results for each query, so if you expect more than 100 results you’ll need to execute several queries. We simply need to wrap the execute command in some logic to determine whether all results have been returned yet.

 $nextCityToken = null;
while(($nextCityToken = $result->GetContinuation()) != null)
{
$result = @$context->Execute($nextCityToken);
$crimes = array_merge($crimes, $result->Result);
}

The documentation provided with the SDK outlines a few other available query options, such as sorting. Some users have reported bugs arising if certain options are used together, so be sure to test that your results are what you expect.

Using cURL/libcurl

If the PHP OData SDK isn’t suitable for your purpose, another option is to assemble the URL to the data you are after, then send a request for it using cURL and parse the XML result. DataMarket’s built in query explorer can help you out here – add any required parameters to the fields on the left, then click on the blue arrow to show the URL that corresponds to the query. Remember that any ampersands or other special characters will need to be escaped.

The cURL request

We use cURL to request the XML feed that corresponds to the query URL from DataMarket. Although there are a number of options that can be set, the following are all that is required for requests to DataMarket.

 $ch = curl_init(); 
curl_setopt($ch, CURLOPT_URL, $queryUrl); 
curl_setopt($ch, CURLOPT_USERPWD, ":" . $key);  
curl_setopt($ch, CURLOPT_RETURNTRANSFER,  true); 

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
$response = curl_exec($ch); 

curl_close($ch); 

The $response variable now contains the XML result for the query.

Parsing the response

Before using the data, you’ll need to parse the XML. Because each XML feed is different, each dataset needs a parser tailored especially to it. There are a number of methods of putting together a parser, the example below uses xml_parser.

The first step is to create a new class to model each row in the result data.

 class CityCrime
{ 
var $City;
var $Population;
var $Arson;
public function __construct()
{
}
} 

I’m also going to wrap the all the parser functions in a class of their own. This function will be called with the query uri and account key. Firstly I’ll give it some class variables to store the data that has been parsed.

 class CrimeParser
{
var $entries = array();
var $count = 0;
var $currentTag = "";
var $key = "";
var $uri = "";
public function __construct($key, $uri) 
{
$this->key = $key;
$this->uri = $uri;
}
}

The parser requires OpenTag and CloseTag functions to specify what should happen when it reaches an open tag or close tag in the XML. In this case, we append or remove the tag name from the $currentTag string.

 private function OpenTag($xmlParser, $data)
{
$this->currentTag .= "/$data";
}
private function CloseTag($xmlParser, $data)
{
$tagKey = strrpos($this->currentTag, '/');
$this->currentTag = substr($this->currentTag, 0, $tagKey);
}

Now we are ready to write a handler function. Firstly declare the tags of all the keys that you wish to store. One method of finding the tags is to run the code using a basic handler function that simply prints out all tags as they are encountered.

 private function DataHandler($xmlParser, $data)
{
switch($this->currentTag){ 
default:
print "$this->currentTag <br/>";
break;
}
}

The switch statement in the handler needs a case for each key. We also need to let it know when it reaches a new object – from running the code with the previous handler, I knew that the properties for each row started and finished with the tag /FEED/ENTRY/CONTENT, so I’ll add a class variable to keep track of when the handler comes across that tag – every second time it comes across it I know that the result row has been fully processed.

 var $contentOpen = false;
const rowKey = '/FEED/ENTRY/CONTENT';
const cityKey = '/FEED/ENTRY/CONTENT/M:PROPERTIES/D:CITY';
const populationKey = '/FEED/ENTRY/CONTENT/M:PROPERTIES/D:POPULATION';
const arsonKey = '/FEED/ENTRY/CONTENT/M:PROPERTIES/D:ARSON';
private function DataHandler($xmlParser, $data)
{
switch(strtoupper($this->currentTag)){
case strtoupper(self::rowKey):
if ($this->contentOpen)
{
$this->count++;
$this->contentOpen = false;
}
else
{
$this->entries[$this->count] = new CityCrime();
$this->contentOpen = true;
}
break;
case strtoupper(self::cityKey):
$this->entries[$this->count]->City = $data;
break;
case strtoupper(self::populationKey): 
$this->entries[$this->count]->Population = $data;
break;
case strtoupper(self::arsonKey): 
$this->entries[$this->count]->Arson = $data;
break;
default:
break;
}
}

Now we create the parser, and parse the result from the cURL query.

 $xmlParser = xml_parser_create(); 
xml_set_element_handler($xmlParser, "self::OpenTag","self::CloseTag"); 
xml_set_character_data_handler($xmlParser, "self::DataHandler"); 
if(!(xml_parse($xmlParser, $xml)))
{ 
die("Error on line " . xml_get_current_line_number($xmlParser)); 
} 
xml_parser_free($xmlParser);

After the call to xml_parse, the $entries will be populated. A table of the data can now be printed using the same foreach code as the SDK example, or manipulated in any way you see fit.

Final Thoughts

The two methods of consuming data from DataMarket with PHP both have their strengths and weaknesses. A proxy class generated from the OData SDK is very easy to add to existing code, but setting up the library can be tedious, and there is not much support available for it. Using cURL and parsing the xml provides slightly more flexibility, but requires much more coding to set up.

Since it only requires an URL and an Account key, opening the connection to DataMarket is very straightforward, whichever method is chosen. If the dataset you’re connecting to is free, I suggest opening Service Explorer and trying out various queries to get a feel for the data. Both methods shown above will result in the dataset’s conversion to an associative array, from which data can be manipulated using any of the PHP functions available.

At this stage, if you want to access a flexible query dataset, and are able to modify your PHP configuration file, the PHP OData SDK is a good tool for accessing OData feeds. However, if you want access to a fixed query dataset, or are unable to modify the configuration file, using cURL and parsing the result is straightforward enough to still be a valid option.

- Jovana Taylor

datamarketPHPSample.zip