bibliography

Markov process

A movement can be see as a succession of small “kinemes” (small moves think of the equivalent of phoneme related to movement). A kineme can be “foot forward”, “foot up”, “knee left”, etc. We could write a Markov process describing the probabilities for any given sequence of kinemes to occur using the product of all the probabilities for the transitions between kinemes. Let’s use a simple graph with the probabilities for each transitions :

markovProcess

According to the first-order Markov asumption, we consider that the transition probability only depends on the previous state. This means that “the probability of event A from the events f4,f3,f2,f1 is equivalent to the product of the probabilities of A given f4.

For example, the probability for “rest” given the sequence “rest -> foot up -> foot forward -> rest” is 40%. The probability for “foot forward” given the sequence “foot up -> rest -> rest -> foot forward” is 0% since the transition “foot forward -> foot forward” does not exist (probability 0).

This is fine to guess for the next move to come knowing precisely the last one that has occurred. But what if we want to know the probability for a given sequence given a set of observations of the kind “first kineme looked like foot up (80% sure), second seemed to be rest (60% sure), etc” ?

This is the kind of problems a hidden Markov model can solve. The term “hidden” indicates that the real states and transitions are not known. We are not sure about “foot up” and we do not know the probabilities of the transitions. All we know is a set of observations.

Hidden Markov Model

hmm

As you can see from the diagram above, we do not know the probabilities for the transitions, since we do not know the states for sure given some observations (signals A to E). Note that the sum of the probabilities leading to a specific kineme (rest, foot forward, foot up) is 100%.

For an HMM to be fully described, we need to know :

  • hidden states (kinemes): “rest”, “foot forward”, “foot up”.
  • observables states (signals): “signal A” ... “signal E”.
  • vector π (pi) : probabilities for the model to be in any of the hidden states (kinemes) on the first day of creation (t0). In our example, this could be [ 90%, 5%, 5% ] (the system was in “rest” state most probably).
  • state transition matrix A: all the ”??” above.
  • confusion matrix B: probabilistic relation between an observable state and a hidden state.

All probabilities from the vector π and the matrices A and B do not change as time passes. This is an assumption made by HMM that may (will most of the time) not reflect the behavior of the real world. For example, since doing “foot up”, “foot forward” is tiring, the dancer might do this transition less and less thus the probability should change to reflect this but in HMM it doesn’t. Once learned the model is not altered.

The three problems of HMM

The first two are problems you can solve with HMM. The last one is a problem you need to solve to have a working HMM.

  1. evaluation Find the probability for a given sequence of observations. This is the useful part of HMM for our dancing problem : we build one HMM model per movement (word) and then find which model most probably generated the set of observations (signals/sounds). In speech you might look at this as “given some sounds, what was the word ?”. The forward algorithm is used.
  2. decoding Find the hidden states that generated a set of observations. Often, the Viterbi algorithm is used.
  3. learning As you might have guessed this is what we need to solve: given some observations from known sequences, build the probability matrices. In speech, this is “given a database of sounds->words, build the probability matrices”. In our problem, it’s “given some signals from known kineme sequences (movements), build an HMM for each movement”. The Baum-Welch algorithm is used for this.

Evaluation

hmmEvaluation

Our goal is to compute P(“signal B”, “signal C”, “signal A” | HMM) (probability for the sequence of observed signals “B,C,A” given the HMM). We could execute a full-search in the hidden states’ space and sum all the possible successions to get the overall probability for the given sequence to appear in this HMM. From our example above, this means computing 17 paths. In this case, this is not so much, but if we have 14 internal states with 6 observations, this makes 14^6 = 7’529’536 paths: too much for real-time.

eval

Forward algorithm

Hopefully, there is a better way then brute force. By looking closely at the calculations that must be done, we see that we can reuse most of the calculations. On the right, you can see the result of the 3 step “optimized” calculation. We only need to compute 9 probabilities. In the case of 14 states with a sequence of 6, this makes 6*14 = 84 (much better then 7’529’536). The algorithm for this “step” calculation reduces the complexity from O(c^n) to O(n) ! (see big O notation if this O makes you think of a book ).

The first probability (18%) is the probability of this state at time t0 (π(rest) = 90%) multiplied by the probability of this state given the observation “signal B” (20%):

form2

0.9*0.2 = 0.18 = 18%

The same goes for all the initial states.

The state with probability 0.5% is calculated as the sum of the paths leading to this state multiplied by the probability of this state from observation “signal C” :

form3

When reading the formulas, remember that B is the probability matrix “observation->state” and A is the matrix “previous state->current state”.

0.05*(0.18*0.5 + 0.025*0.4 + 0.0025*0.2) = 0.005025 = 0.5%

The same goes for all other states. At the end we take the sum of the final states’ probabilities:

P = 0.007164 + 0.000306 + 0.005925 = 0.013395 = 1.3%

These probabilities are really tiny…

If we wanted to know which movement a set of signal best represents, we compute this probability for each movement (one HMM per movement). The one with the highest probability wins. We can keep the probability as an indicator to know if the result we have is “solid” or if its just noise being randomly recognized.

Decoding

decode

Remember that decoding means “find the most probable sequence of hidden states given some observations”. This can be traduced by “find the path with highest probability in the state/observation network”. Again, we could go through each path and take the one with highest probability, but this is not effective.

The solution is called the Viterbi algorithm. As you might have guessed, this algorithm is just like the forward algorithm described above but instead of computing the sum of all probabilities for each step, we keep the highest (with some indication from where it comes from). We do this for each step and walk back from the best result, following our bread crumbs. In our example, we would walk back from “foot up” to “foot forward” and finally “rest”.

Initial states (same as in previous section):
form2

For other states (we compute the maximal value instead of the sum) :
viterbi

The “bread curmbs” are formally defined as :

argmax
This is a lot of greek to say “remember the index i of the highest probability”.

Learning

Finally, the real, tricky part ! Learning the parameters for the HMM is done with the Baum-Welch algorithm. This is a gradient descent algorithm that starts with random values and tries to minimize the error (false labeling).

The algorithm computes two probabilities for each state. One with the forward algorithm (alpha values) and one going backward. This last probability tells if the current path is “safe” or not. It tries to answer the question “will the hero reach the castle through this path ?”. The last states have a probability of 100% (the hero survived) :

back1
(T is the number of elements in the sequence)

Other states have a probability calculated from the elements ahead in time. If there are many dangers (low transition probabilities), the hero will not make it :

back

[to be continued]