Media Capture API: Helping Web developers directly import image, video, and sound data into Web apps

Article
12/09/2011

Last March, we released a prototype implementation of the audio portion of a working draft of the W3C Media Capture API on HTML5 Labs. This prototype publicized some proposed API enhancements described in section 6.1 of Microsoft’s HTML Speech XG Speech API Proposal. We have now updated the prototype to include the image and video capture features described in the proposal to support scenarios we’ve heard are important for Web developers, as well as incorporating your feedback on audio.

As more and more consumers use mobile devices to take still pictures, videos, and sound clips, Web developers increasingly need support to capture and upload image, video, and sound from their Web sites and applications. A usable and standardized API for media capture means Web sites and apps will be able to access these features in a common way across all browsers in the future.

During this past year, the effort to standardize media capture has intensified. The WebRTC working group was formed and combined scenarios that support basic video and audio capture with the ability to share that media in real-time communication scenarios. A broad interest in both of these scenarios from industry partners and browser vendors alike shifted focus away from the Media Capture API and brought the WebRTC draft spec to the forefront.

This past November, we took our experience with the development of this prototype and interest in media capture for the browser to the W3C's technical plenary meeting (TPAC). Travis Leithead shared some of our feedback with the Device APIs (DAP) Working Group and we continued existing discussions within the HTML Speech Incubator Group. One result of our engagement was the formation of a media capture joint task force in order to bring the best of local media capture and real-time communication scenarios together. We are actively participating in the task force and support the getUserMedia approach to capture.

With the release of this prototype, we give Web developers early access to photo, video and audio media capture APIs in the browser. We anticipate evolving the prototype to share implementation feedback and experience with the new media capture task force. The end goal remains to create the best possible standard for the benefit of the whole Web community.

Let’s also look back at our earlier proposals, explain why we believe the scenarios are still important, and why we implemented them in this new version of the prototype:

Privacy of device capabilities

The prototype allows enumeration of the capture device's capabilities (its supported modes). In the old W3C Device Capture API spec, privacy-sensitive information about the device could be leaked to an application because the navigator.device.capture.supported* properties could be accessed without user intervention. Our prototype moves these APIs to an object that is only available after the user gives permission. The W3C's current getUserMedia API does not support enumeration of device capabilities, but we believe it is valuable to Web developers and should be done in a similar privacy-sensitive manner.

Multiple Devices

The W3C's current getUserMedia API is designed to support multiple devices, either via hints to the API or through user preference. This is an improvement for a scenario that was not supported in the old W3C Device Capture API spec.

Our current prototype also supports a conceptually similar design: navigator.device.openCapture() which returns asynchronously with the capture device the user prefers (through preference or UI).

Direct Capture

In the old Media Capture API spec, the Capture.capture[Image|Video|Audio] operations launch an asynchronous UI that returns one or more captures. This means the user has to do something in the Web app to launch the UI and then initiate the capture, which makes it impossible to build capture UI directly into the Web application. Not only would this be unusable for a speech recognition application, but it is also places unnecessary user interface constraints on other media capture scenarios.

Our prototype and the current getUserMedia capture API directly capture from the device and return a Blob. Note that for privacy reasons some user agents will choose to display a notification in their surrounding chrome or hardware to make it readily apparent to the user that capture is occurring, together with the option to cancel the capture.

Streaming

For applications like speech recognition, captured audio needs to be sent directly to the recognition service. However, the current getUserMedia API design only supports capturing to Blobs, which delay the ability to process the recorded data.

Our prototypeallows starting a capture asynchronously and returning a Stream object containing the captured data. Support for Streams would also be useful in video recording scenarios. For example, using a capture stream, an app could stream a recording to a video sharing site, as it is recorded.

Preview

In the case of video capture, live preview within the application is important and something that was missing from the old Media Capture API spec.

Both our prototype and the W3C's getUserMedia API, allow a preview of the recording to be created with URL.createObjectURL(). This URL can then be used as the value for the src attribute on an <audio> or <video> element.

End-Pointing

For applications like speech recognition, it's important to know when the user starts and stops talking. For example, if the app starts recording but the user doesn't start talking, the app may wish to indicate that it can't hear the user. More importantly, when the user stops talking, the app will generally want to stop recording, and transition into working on the recognition results. This sort of capability may also be of some use during non-speech scenarios, to provide prompts to users who are recording videos.

Neither the old Media Capture API, nor the current getUserMedia approach support end-pointing.

In order to support key speech/voice scenarios, we recommend adding end-pointing capability. The prototype provides this feature and allows Web developers to experiment and provide feedback on these capabilities which will be useful feedback for the W3C.

Looking Forward

We are supportive of the getUserMedia API, and note that it incorporates many of the points of feedback previously submitted. To avoid confusion about the future direction of media capture at the W3C, the DAP working group has officially deprecated the old Media Capture API, which now redirects to the media capture joint task force's current deliverable.

In addition to a prototype plugin that exposes the modified APIs, we have added to this package a functional demo app that makes use of them.

Building this prototype and listening to your feedback will help Microsoft and the other browser developers build a better and more interoperable Web. We look forward to continuing this discussion in W3C and helping to finalize the specifications.

—Claudio Caldato, Principal Program Manager, Interoperability Strategy Team