
Markerless Facial Motion Capture with a Kinect

In about two years' time, I think, the overwhelming majority of 3D character animation content will not be made by professional 3D artists using professional, high-end software like Maya.

3D character creation and animation are poised to become as accessible and ubiquitous as digital cameras and non-linear video editing. There are too many developers working on making character animation as easy as navigating a videogame. There will be consumer apps galore running on everything from phones to tablets to the increasingly rare “tower” computer.

Most of these new animators will be folks wanting to animate their chat avatars using their facial movements to deliver the grand promise of the Internet: appearing to be something you are not. App developers are poised to let iPad users comp pre-built animated CGI elements into their social media videos. Prepare for whole new genres of internet video born out of non-3D artists using easy-to-use 3D character animation tools.

Will it be as good as big budget 3D images created by highly skilled artists?

No. But who cares?

Not every person wanting to create 3D content needs it to look like Avatar. However, while these tools seem kind of gimmicky now (like a lip-synched chat avatar), they will undoubtedly find comfortable homes in the pipelines of professional media creators who never thought about adding CGI Character Animation to their menu of services until they suddenly could.

These newly minted Character Animators will be writer/directors like myself who have enough technical ability to put some pieces of hardware and software together. We will use tools like iPi, FaceShift and physics engines inside 3D apps to build keyframeless, performance-driven character animation. We will use Kinect cameras for facial motion capture and multiple low-res PS Eye cameras for markerless body motion capture in our tiny studio spaces. The economics of “off-the-shelf”, for better or worse, always wins in the end.

I got a chance to try out FaceShift’s debut application, the eponymous “FaceShift”: an affordable, cross-platform, Kinect-based markerless facial motion capture system. Let me break that last sentence down. The app runs on Mac OS X, Linux and Windows 7; it uses the Microsoft Kinect sensor and camera to track your facial movements without the need to put stickers on your face and look like a page out of Cinefex. Pricing is tiered: 150 dollars buys unlimited non-commercial usage. The professional pricing is an annual subscription (yes, a subscription), starting at 800 dollars a year for the “Freelance” version and going up to 1,500 dollars per year for the “Studio” version. You may be wondering how 1,500 dollars a year is “affordable”. Facial motion capture systems have, thus far, been proprietary, complex and technical, and they generally start at around 8 to 10 grand.

How Does It Work?

Surprisingly well, especially for a version I tested so early in the beta. I tested the app on both my MacBook Pro and my loaner BOXX desktop machine along with my Kinect camera. The app is unbelievably user-friendly: no need for a manual, just launch it and go.

The app has you go through a calibration phase, which is remarkably like a video game. An avatar on the screen does the requested facial gesture, and you can’t help but mimic it, like some kind of pre-verbal communication ritual buried deep in our primate brain. You hold the facial gesture, like a smile or eyebrow raise, and take a snapshot. As you complete each gesture request, the app shows your face map slowly being built, again a very game-like reward system. Pretty cool.

When you’re done building your personal face map, the app just works. When you turn your head, so does one of the pre-built models. Smile, arch an eyebrow, ditto. When you talk, so does the model. It’s fun and kind of unsettling.

The app has record, edit and playback functions, so you can capture different performance takes, chop them up and export them to Maya and MotionBuilder as data streams. You can export a performance as virtual marker data in FBX, BVH or C3D and use it in a bunch of other 3D apps. There’s also a FaceShift plug-in for Maya and MotionBuilder that you can use to drive custom characters in those apps.
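
If you’d rather inspect a take outside a 3D app, the BVH export is plain text and easy to poke at with a script. Here’s a minimal Python sketch that reads the MOTION block of a BVH file and reports how many frames and channels it contains; the filename is hypothetical, and the layout shown is the standard BVH structure rather than anything FaceShift-specific.

```python
# Minimal sketch: read the MOTION block of a BVH export and summarize it.
# BVH layout (HIERARCHY block, then "MOTION", "Frames:", "Frame Time:",
# one line of floats per frame) is standard to the format; the filename
# below is a hypothetical export from a capture take.

def read_bvh_motion(path):
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]

    start = lines.index("MOTION")                       # motion data follows the skeleton hierarchy
    num_frames = int(lines[start + 1].split(":")[1])    # e.g. "Frames: 120"
    frame_time = float(lines[start + 2].split(":")[1])  # e.g. "Frame Time: 0.0333333"

    frames = [
        [float(v) for v in line.split()]                # one row of channel values per frame
        for line in lines[start + 3 : start + 3 + num_frames]
    ]
    return frame_time, frames


if __name__ == "__main__":
    frame_time, frames = read_bvh_motion("take01.bvh")  # hypothetical file name
    print(f"{len(frames)} frames at {1.0 / frame_time:.1f} fps, "
          f"{len(frames[0])} channels per frame")
```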

The team behind FaceShift is a group of really smart and sharp computer scientists in Switzerland. I contributed very little as a participant in FaceShift AG’s beta, but I was happy to be along for the ride. What I did do was take some time to talk with FaceShift’s CEO, Thibaut Weise, and find out what FaceShift’s plan is.

Here’s the interview:

Eric Escobar: The Kinect camera has been criticized for its lack of fine-detail tracking, and yet you all built a markerless facial tracking system out of it. Have you found that the Kinect (and Asus camera) limit what you could do with your app?

Thibaut Weise: The accuracy of the Kinect and Asus cameras is indeed very low, but with faceshift we have put a lot of effort into using the minimal information we get from the cameras for accurate facial tracking. Nevertheless, we are looking forward to the next generation of sensors, as with improved sensors we will achieve even better tracking.

EE: How much did the Kinect, and the Kinect hacking community, influence the creation/development of your product?

TW: We have been on it from day one. In the first week after the release of the Kinect we had already developed a first prototype of the facial tracking system using our previous pipeline that we had developed for high quality 3D scanners. The Kinect was a great opportunity as it was the first affordable commercially available 3D sensor.

EE: Is your target market predominantly Character Animators, or do you see Faceshift being used by other markets?

TW: We are mainly targeting the character animation market, but we also have a lot of interest from people in research, art, HCI, and remote education. With its real-time capability, it is ideally suited for online interaction in multi-player games and services.

EE: Are you using OpenGL? Any thoughts about the role of GPU processing for realtime face tracking, puppet rendering?

TW: We are using OpenGL for the rendering pipeline, but the tracking itself is done purely on the CPU. We have used GPU computing before as there are quite a few parts in the algorithm that can be parallelized. However, there is typically only a performance gain for higher end graphics cards, and for compatibility reasons we decided to only use the CPU.

EE: Any plans to stream video out of Faceshift into Video chat apps like iChat or Skype?

TW: We will not stream the rendered images out of faceshift. Instead we stream the facial tracking parameters in an open format, so anyone can develop their own applications and plugins which use the faceshift tracking data. In 2013, we will also release an API that can then be directly integrated into other software without the need to have a separate faceshift application running.
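
To give a sense of what “develop their own applications and plugins” might look like in practice, here’s a small Python sketch of a client that listens for a streamed tracking feed. To be clear, the port, the newline-delimited JSON framing, and the field names below are all assumptions made purely for illustration; the actual faceshift streaming format isn’t spelled out in this article.

```python
# Hypothetical sketch of a client for a streamed facial-tracking feed.
# The wire format here (localhost TCP, one JSON object per line, fields
# like "timestamp" and "blendshapes") is assumed for illustration only,
# not documented faceshift behavior.

import json
import socket

HOST, PORT = "127.0.0.1", 33433   # assumed address of the tracking stream


def receive_tracking(host=HOST, port=PORT):
    with socket.create_connection((host, port)) as sock:
        buffer = b""
        while True:
            chunk = sock.recv(4096)
            if not chunk:                       # stream closed by the sender
                break
            buffer += chunk
            while b"\n" in buffer:              # assume one JSON object per line
                line, buffer = buffer.split(b"\n", 1)
                yield json.loads(line)          # e.g. {"timestamp": ..., "blendshapes": {...}}


if __name__ == "__main__":
    for frame in receive_tracking():
        # Drive your own rig with whatever coefficients the stream provides.
        print(frame.get("timestamp"), frame.get("blendshapes"))
```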

EE: Are there plans for other app plug-ins, other than Maya, for Faceshift? Which apps?

TW: We will support MotionBuilder with the first release of the software, but we are also planning to roll out plugins for 3DS Max and Cinema4D at a later stage. Besides, with our streaming format anyone can develop their own application and plugins for their favorite 3D software.

EE: Any plans for multicamera support, like iPi’s use of two Kinects or six PS Eye’s for a greater range of capture?

TW: Yes, we plan to support multiple 3D sensors in the future. We will also support other (offline) 3D capture systems such as high-quality stereo reconstructions based for example on Agisoft’s Photoscan.

EE: While I know the app is called Faceshift, any plans to do body capture? Hands?

TW: We are focusing on faces, while iPi for example is a great system for full-body capture. Currently there are no plans to develop our own technology, but we are looking into ways of combining the different systems.

EE: Are you all following the Leap development as a possible capture device?

TW: Yes, the development of the Leap is exciting, but the (first) device will not deliver 3D data as we need it for facial tracking.

EE: Why a subscription model?

TW: For the pricing, we’ve been considering it for a long time, and we believe an annual subscription is best for the customer, as it contains all updates and upgrades – and this will include compatibility with the next generation of sensors, as well as texturing. The alternative would be to have a more expensive one-time license (e.g. for $1600 instead of $800/year) and each major upgrade would then be $800.

Grand Philosophical Questions (Half-joking, but kind of not)

EE: Realtime face tracking with off-the-shelf hardware/software will necessarily bring about an era of untrustworthy video chat authenticity (remember the AT&T pay phone scene in Johnny Mnemonic?).

How far away, technically, do you see this happening? What are the technical hurdles before FaceShift makes video chat as virtually anonymous as text chat? And is this your goal?

TW: In order to animate a virtual character photo-realistically, several technical challenges still need to be overcome, including accurate tracking, expression transfer, photorealistic rendering, and audio distortion/analysis/synthesis. With faceshift we aim to solve the first two problems, and our goal is that people can use their avatars to express themselves emotionally. This does not necessarily mean a photorealistic character, but it can be any kind of virtual character, and this will lead to exciting possibilities in online communication. The issue of untrustworthy video chat authenticity will come up in the future, but I believe that we still have some time, given that Benjamin Button was so far the only character that truly seemed realistic.

DISCLOSURE: I was on the early beta for this app and was offered, like all other beta participants, a US$200 code that expires on December 1st (which I have not redeemed). I contacted the developers and requested to be on the beta as part of my research into the emerging field of real-time, keyframeless animation systems.
