Idea Friday #1: Neural networks for animation and acting

Hello everyone!
I'm constantly busy thinking about many things, and there are only 24 hours in a day.
This means that a lot of the ideas I have end up abandoned on the workbench.

Idea Friday is an attempt to solve this problem by simply writing these ideas down and sharing them.

Okay, for this article I'm assuming you (the reader) have a basic understanding of what a neural network is and how it works :)

2-minute summary (just in case)

A neural network is a series of operations arranged in a network configuration. Each of the "cells" or "nodes" or "neurons" does its thing and passes the result to the next one.
The idea is that each cell can have a very simple function, the simplest being a neuron that gets "excited" past a certain threshold of input and then passes the signal along to its neighbors.
These networks are very good at automating tasks that would be impractical to program with regular code and hand-written math.

Let's say we have a 2D space with data points that represent cats and dogs.
There is a linear boundary between cats and dogs. As more data points are added, the boundary becomes a better and better tool for telling cats and dogs apart.
The perceptron is the simplest form of neural network: it assigns a weight to each input feature, then adjusts those weights after every misclassified point to nudge the boundary toward a more accurate one.
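As a rough sketch (the feature values and labels here are made up for illustration), the perceptron learning rule can fit such a linear boundary in a few lines of NumPy:

```python
import numpy as np

# Toy 2D data: two linearly separable clusters ("cats" = 0, "dogs" = 1).
X = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.5],   # cats
              [4.0, 4.5], [5.0, 4.0], [4.5, 5.5]])  # dogs
y = np.array([0, 0, 0, 1, 1, 1])

w = np.zeros(2)  # one weight per feature
b = 0.0          # bias term

# Perceptron learning rule: nudge the boundary after every mistake.
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = 1 if xi @ w + b > 0 else 0
        error = yi - pred          # -1, 0, or +1
        w += error * xi            # move the boundary toward the mistake
        b += error

predictions = [1 if xi @ w + b > 0 else 0 for xi in X]
print(predictions)  # all six training points classified correctly
```

With separable data like this, the rule is guaranteed to converge; the interesting failures start when the data is messier than two tidy clusters.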

But immediately you can start thinking about the problem that can arise:
overfitting is when the boundary follows the training data too closely, noise included, and ends up doing crazy tricks to correctly classify every single cat and dog in the set.

It might seem like our network is doing a good job so far, but it's actually not going to be great at predicting whether or not your next data point is a "cat" or a "dog".

Now, to get past the simple straight-line boundary, smart people came up with hidden layers.
The math gets pretty hairy but the theory is simple: between the input and the output, you insert layers of neurons whose values you never observe directly; you only assign and adjust their weights.
Think of the network like a very young child: it doesn't need to understand the cultural details of actions, but eventually, through the weighted feedback of "YES" and "NO", the child builds an internal boundary of what to do and what not to do.
In the same way, the hidden layers build up internal, intermediate representations of the data without ever showing them to you, which lets the network draw boundaries far more complex than a straight line. (If you're interested in learning further, an adversarial neural network works in a related spirit, with two networks competing against each other: one judges the other's output.)
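A minimal illustration of what a hidden layer buys you (weights hand-picked for the example, not trained): the XOR pattern is the classic case a single perceptron cannot separate, but one hidden layer of two neurons can.

```python
import numpy as np

def step(x):
    # Threshold activation: the neuron gets "excited" above zero.
    return (x > 0).astype(int)

# Hand-set weights for XOR: hidden neuron 1 fires on "x OR y",
# hidden neuron 2 fires on "x AND y"; the output fires on "OR but not AND".
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])
w_out = np.array([1.0, -2.0])
b_out = -0.5

def predict(x):
    h = step(x @ W_hidden + b_hidden)  # hidden layer: values never observed directly
    return int(step(h @ w_out + b_out))

for x in [[0, 0], [0, 1], [1, 0], [1, 1]]:
    print(x, "->", predict(np.array(x)))
# XOR: 0, 1, 1, 0 -- impossible for any single linear boundary
```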

Okay... I hope I didn't scare anyone with all this infodump, let's get into the topic!

What does all this have to do with animation?

There are three big problems with motion capture animation that affect both games and VFX: price, cleanup, and blending.

"Since our method is data-driven, the character doesn't simply play back a jump animation, it adjusts its movements continuously based on the height of the obstacle."

This paper solved three problems related to blending animations: scalability (you can't just have 1.5 GB of mocap data sitting in memory at all times), ambiguity (different motions have the same inputs, so the system gets confused about which take to pick), and quality (making the result look natural).

Now, I will come back to this paper in a minute, but first let's look at the problems of price and cleanup in mocap.

Here's how it's done so far: you track 2D features (might be markers, might not be), then reconstruct to 3D using your known camera data (the cameras are usually fixed to the walls, or at measured distances). Once you have the cameras, you can place points in 3D space, then fit those points to a kinematic skeleton.
Right now we have two big problems with mocap: either we invest in a huge studio with hundreds of high-speed cameras, polarized lights and tracking software, or we go with inertial or markerless capture and end up with a lot of cleanup to do.
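The "reconstruct to 3D with known cameras" step is classic triangulation. As a sketch (the camera matrices here are invented for illustration), the Direct Linear Transform recovers a 3D point from two 2D observations:

```python
import numpy as np

def triangulate(P1, P2, uv1, uv2):
    """Recover a 3D point from two 2D observations and known
    3x4 camera projection matrices, via the Direct Linear Transform."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The smallest singular vector of A is the homogeneous 3D point.
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Two made-up cameras: one at the origin, one translated along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

point = np.array([0.3, -0.2, 4.0, 1.0])       # ground-truth marker
uv1 = (P1 @ point)[:2] / (P1 @ point)[2]      # projection into camera 1
uv2 = (P2 @ point)[:2] / (P2 @ point)[2]      # projection into camera 2

print(triangulate(P1, P2, uv1, uv2))  # recovers [0.3, -0.2, 4.0]
```

With perfect observations this is exact; the real cleanup pain comes from noisy 2D tracks, occlusions and mislabeled markers.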

Xsens is an inertial motion capture system, a little bit like the accelerometer sensors in your phone.
The results looked about as I expected from what basically amounts to 12 iPhones strapped together.

And by the way, your cleanup people HAVE to be highly skilled animators or it will look like shit.

Markerless feature tracking and 3D reconstruction sounds exactly like the type of challenge that would (over)fit a neural network, doesn't it?
Motion capture can be split into two goals: either we want a specific performance (film, cinematics), or we want a 3D motion dataset (to apply to a wide cast of characters).
In both cases, the holy grail is reliable monocular capture. The good thing is: we have the benchmark.
The MPII Human Pose dataset: a large-scale 2D human pose dataset that includes 25K single images extracted from YouTube videos, containing over 40K people and 410 activities.

The first paper attempting to do monocular motion capture using machine learning was published in January 2017, less than a year ago. Saying that this field is hot would be an understatement. (Please comment below if you know an earlier paper on the subject, even multi-view.)

This method uses a Convolutional Neural Network (CNN) to generate a heat map of possible joint locations from the video, then an expectation–maximization (EM) algorithm is run to find the most likely candidate from a 3D "pose dictionary".
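To give an intuition for the heat-map stage (toy numbers, not the paper's actual network), the 2D joint estimate is typically read off as the expected position under the heat map:

```python
import numpy as np

def expected_joint_position(heatmap):
    """Soft-argmax: the joint estimate is the probability-weighted
    average of all pixel coordinates under the heat map."""
    prob = heatmap / heatmap.sum()
    ys, xs = np.mgrid[0:heatmap.shape[0], 0:heatmap.shape[1]]
    return float((prob * xs).sum()), float((prob * ys).sum())

# Toy heat map: a Gaussian blob of "joint likelihood" centered at x=12, y=20.
ys, xs = np.mgrid[0:32, 0:32]
heatmap = np.exp(-((xs - 12) ** 2 + (ys - 20) ** 2) / (2 * 2.0 ** 2))

x, y = expected_joint_position(heatmap)
print(x, y)  # close to 12.0, 20.0
```

The lifting from these 2D estimates to a 3D pose is where the dictionary and the EM step come in.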

This paper is interesting, but you can clearly see in the tennis swing example that one thing is missing: physics. In their example you see the 3D motion correctly projected onto the image plane, but a secondary camera reveals that the skeleton model was overfitting.
During the swing, the arm contracts and then extends to deliver the highest kinetic energy at the point of impact, then the momentum carries the arm in the same direction as the energy is absorbed. But in the monocular tracking example, the arm stays contracted the whole time, giving a linear energy output (and, for us artists, mostly just an unrealistic swing!).

Inverse kinematics (IK) isn't new: the idea is to have an end effector drive the rest of the "chain" of joints, and it has been used extensively in games and VFX rigging.
The good thing with IK is that you can add values for mass, momentum and energy expenditure.
What that would allow us to do is have a hidden IK layer act as an adversarial network: the IK network tries to get better at spotting joints that are misplaced in 3D space relative to the laws of physics, while the CNN tries to get better at placing the joints.
My theory is that you would only need the IK solver during training, because by competing against the IK layers the CNN would have learned a "physics-based negative weight" against overfitting.
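To show what an IK solver in that loop would compute, here is a minimal sketch (my own toy example, not from any of the papers) of Cyclic Coordinate Descent IK for a two-joint planar arm reaching toward an end-effector target:

```python
import numpy as np

# Two-link planar arm: joint angles (radians) and segment lengths.
angles = np.array([0.0, 0.0])
lengths = np.array([1.0, 1.0])
target = np.array([0.5, 1.2])

def joint_positions(angles, lengths):
    """Forward kinematics: positions of the base, each joint, and the tip."""
    pts = [np.zeros(2)]
    total = 0.0
    for a, l in zip(angles, lengths):
        total += a
        pts.append(pts[-1] + l * np.array([np.cos(total), np.sin(total)]))
    return pts

# Cyclic Coordinate Descent: rotate each joint, last to first,
# so the tip swings toward the target; repeat until it converges.
for _ in range(50):
    for i in reversed(range(len(angles))):
        pts = joint_positions(angles, lengths)
        tip, pivot = pts[-1], pts[i]
        a_tip = np.arctan2(*(tip - pivot)[::-1])
        a_target = np.arctan2(*(target - pivot)[::-1])
        angles[i] += a_target - a_tip

tip = joint_positions(angles, lengths)[-1]
print(tip)  # the end effector lands on the (reachable) target
```

A physics-aware version would add per-segment mass and penalize joint configurations with implausible energy profiles, which is exactly the signal the adversarial setup above would feed back to the CNN.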

So... let's say we jump 5 years into the future and now we have an open source monocular motion capture system (using Blender as a front-end client); we've solved for price, cleanup and blending.

There's one problem left.

Many of those papers describe the downside of the method, which is artistic control.
What happens is that now you're in a situation where a huge, incomprehensible "black box" is pulling from a gigantic library of animations based on your inputs.
You simply lose a lot of control in the process.

Enter Semantic Image Segmentation.

Source : Semantic Video CNNs through Representation Warping

This is the process of separating classes of objects in images and videos depending on their shape, colors and patterns. Red = biker, purple = road, green = tree.
The industry hype for self-driving cars has pushed this technology forward with billion-dollar tech investments over the last three years. That means it's actually one of the most robust and researched types of CNN, because people's lives will soon depend on it.

It also means that if we know what is what in a video, we could feed the same kind of CNN a face, and it could identify emotions, actions and relationships between characters.

Coders and data scientists all over the world are looking into ways to generate accurate image descriptions. Some of them are looking into ways to generate more refined, emotionally accurate descriptions.
This is of obvious interest to Google and Facebook, letting them analyse their users' content qualitatively. After a phase of study, Facebook could choose to prioritize "happy" pictures and posts over sad ones in your news feed or Instagram feed. Or, if you've shown that you reliably like and engage with baseball content, it will be able to curate your feed with a much higher proportion of baseball-related shenanigans.
(This type of technology is at the core of the echo chamber effect btw).

This means that you could type an action like "He smiles sadly and looks down", and somewhere out there is a database containing thousands of video clips, or parts of clips, that are likely matches for your description.
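A crude sketch of that description-to-clip lookup (the clip filenames and captions are invented for illustration; a real system would use learned embeddings rather than word counts):

```python
from collections import Counter

# Hypothetical clip database: captions produced by an image-description model.
clips = {
    "clip_001.bvh": "a man smiles sadly and looks down at the floor",
    "clip_002.bvh": "a woman throws a javelin across the field",
    "clip_003.bvh": "a child runs and jumps over a small obstacle",
}

def similarity(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (sum(c * c for c in va.values()) * sum(c * c for c in vb.values())) ** 0.5
    return dot / norm if norm else 0.0

def best_match(query):
    # Return the clip whose caption best matches the typed description.
    return max(clips, key=lambda name: similarity(query, clips[name]))

print(best_match("He smiles sadly and looks down"))  # clip_001.bvh
```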

But what does that mean for us artists ?

It means that the entire pipeline for going directly from script to animation is moving into place as we speak.

You type in:
"She turns and throws a javelin, she tenses up while watching it fall down"
Say there are 3,000 distinct monocular video clips containing javelin throws, thousands more for "she turns", and a couple for "watching it fall down".
Now there are two options: if your request requires new animation, the monocular motion capture pipeline will go and generate 3D motion for your request.
But chances are, your animation can be done by blending existing BVH clips from the database.
The system should be semantically aware of the javelin, so we can model one and assign it as the end effector of the IK system.
The upper body is affected first by the "she turns", then the motion propagates down the IK chain. We can manually help the system a little bit, though, so let's put emphasis on energy expenditure.
"She turns and throws a javelin, she tenses up while watching it fall down"
Now the system will place the character's feet so as to maximize the kinetic energy of the javelin. And the character has a Look At function aimed at the javelin until the "down" condition is met.
After some (hopefully realtime) processing time, we see that our character keeps looking at the javelin as it penetrates through the ground and is lost in 3D space.
"She turns and throws a javelin, she tenses up while watching it fall down to the ground"
The user enables physics collision for the javelin. The shot plays back with the correct animation.
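That Look At constraint is easy to sketch (a toy version of my own, ignoring up-vectors and joint limits): aim the head's forward axis along the direction from the head to the javelin.

```python
import numpy as np

def look_at_rotation(head_pos, target_pos):
    """Return yaw and pitch (radians) that aim the head's forward
    axis (+z, by the convention used here) at the target."""
    d = np.asarray(target_pos, float) - np.asarray(head_pos, float)
    yaw = np.arctan2(d[0], d[2])                    # rotate around the vertical axis
    pitch = np.arctan2(d[1], np.hypot(d[0], d[2]))  # then tilt up or down
    return yaw, pitch

# Javelin straight ahead and above the head: zero yaw, positive (upward) pitch.
yaw, pitch = look_at_rotation([0, 1.7, 0], [0, 5.0, 10.0])
print(yaw, pitch)
```

In the shot above, the "until the condition down is met" clause would simply stop updating this rotation once the javelin's height reaches the ground plane.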

Animating this shot might have taken five minutes. Simply cleaning up a motion-capture shot of this kind might take five hours.

Now I can already hear some of you in the back "This will not replace hand-animation" or "This will not replace Mocap".

There's the network for markerless, sparse-view motion capture.
Then there's the one for blending multiple animations based on IK energy expenditure.
And finally the neural network for pairing animations with semantics, for acting control via narrative inputs.

RPG games and their complex dialogue and locomotion systems are the obvious candidates for implementation of this technology in the next few years.
It seems that the only industry that would be mostly unaffected by neural networks for 3D animation is high-end animated feature film, whose custom stylized animation (sparse inputs) is not very suited to a world of large databases.

I am personally very excited about taking part in this future where the means of storytelling are getting back into creators' hands, and low budget doesn't have to look low budget.
How do YOU see these technologies evolving and impacting your field? Is a neural network coming for your job? Let me know in the comments below!


  1. I think we are going to see CNN development tools come of age where each of us is able to easily build our own CNNs in a very plug-and-play interface, in order to quickly apply them anywhere we see appropriate in our daily lives and work. Right now it is still a bit painful to build a CNN. I think we will see this shift to a more artist-friendly tool and become accessible on a widespread level. It is already incredibly interesting currently. I think it will continue to be transformative to most of us.

    1. Absolutely! I also think we are going to see a lot of generative adversarial networks (GANs) on sparse datasets where CNNs would fail.
      But yes, it's like the internet moving from the 80s, where a query could only be made by a researcher who knows what he's looking for, to the current state of things, where your 5-year-old daughter can literally ask Google in her own words and most of the time it will work.

  2. Apologies in advance if the technology I’m about to bring up is already operational (I’m neither a gamer nor an animator), but couldn’t the links between the chain of joints be further refined by defining parameters based upon the CNN? In other words, if the position of the joints indicates that something is likely to be an upper arm, then from the position of that upper arm relative to the other parts of the anatomy, it (the program) would be able to tell how contracted or extended the upper arm is according to its set parameters, but also whether it was foreshortened in some way.
    Or to put it in sequence from a machine learning p.o.v.:
    - This looks like an upper arm based upon the position of the joints
    - An upper arm has biceps on the inside and triceps on the outside. Contracted, the bicep forces the end effector (in this case the forearm) to be in an angular position relative to the upper arm and vice-versa.
    - Based on that information (the position of the upper arm relative to the forearm) we can then determine what the action is, and from that, other likely body positions.
    I’m assuming here that having determined the body part, the program will have assigned a physical weight (which could be further divided into categories based on the physical state of the character, i.e. conscious, drowsy/drunk, dead etc.) to it in proportion to the rest of the body.
    I guess this is a long-winded way of saying that by adding real anatomical function parameters between joints (and on the range of motion of the joints themselves) in the bones tree, and how they relate to one another, plus the state of the character, you could conceivably reduce the number of joint positions needing to be identified in order to accurately extrapolate.


