# HMM? Hidden Markov Models with Figaro

## I. Introduction

This is a short post about making a Hidden Markov Model (HMM) in Scala using Figaro. We’ll see how to code up our own HMM and do inference on an it. Along the way, we’ll see how functional programming fits naturally into programming with probability distributions. All of the code is here so have at it.

## II. Our HMM

An HMM is a probabilistic graphical model that looks like this

It has finite sets of hidden and observable states, each node $H_t$ is a random variable over the hidden states, and each $O_t$ a random variable over the observable states. We will freely conflate random variables–which are neither random nor variables: Discuss!–with their probability distributions. The graph embodies conditional independence relations: given $H_{t-1}$, $H_t$ is independent of everything else–so the sequence of $H_t$s is a Markov process–and, given $H_t$, $O_t$ is independent of everything else. This means that we can factor the joint probability distribution of a sequence of hidden and observable states as

$P(H_1,..., H_T, O_1,...,O_T) = P(H_1) P(O_1 | H_1) \prod\limits_{t=2}^T P(H_j| H_{t-1}) P(O_t | H_t)$

So everything depends on the a priori distribution $P(H_1)$, and the conditional probabilities $P(H_t | H_{t-1})$ and $P(O_t | H_t)$. These conditional probabilities–which by hypothesis, are independent of $t$–form the entries of the transition and emission matrices, the mechanics of which we’ll illustrate in our example. See the references below for some great introductions.

Our example will be the one from wikipedia: our friend Bob does one of three things, go shopping, go for a walk, or clean his house, depending on the weather which can be either rainy or sunny. We cannot observe the weather where Bob is, we just talk to him on the phone, hear what he has done on various days, and try to infer the weather or what he did on other days.

So our states are

sealed trait Hidden
case object Rainy extends Hidden
case object Sunny extends Hidden

sealed trait Observable
case object Shop extends Observable
case object Clean extends Observable
case object Walk extends Observable


We’ll use the convention that vectors are row vectors and multiply matrices from the left; this is consistent with pattern matching syntax as we’ll see. So, our transition matrix is 2 x 2, our emission matrix is 2 x 3, and both are row stochastic, that is, the rows are probability distributions (numbers between 0 and 1 that add up to 1). Abbreviating $H_t = \text{Rainy}$ and $H_t = \text{Sunny}$ to just $R_t$ and $S_t$, we use matrix multiplication to get the distribution at time $t$ from the distribution at time $t-1$ like so

$\left(\begin{array}{cc} P(R_{t}) & P(S_{t}) \end{array}\right) = \left( \begin{array}{cc} P(R_{t-1}) & P(S_{t-1}) \end{array} \right) \left( \begin{array}{cc} P(R_t | R_{t-1}) & P(S_t | R_{t-1}) \\ P(R_t | S_{t-1}) & P(S_t | S_{t-1}) \end{array} \right)$

The left-hand side is a vector, the distribution over states at time t, the right-hand side a vector times a matrix, the distribution over states at time t-1 times the transition matrix. The definition and mechanics of the emission matrix are completely analogous.

Let’s take a moment to bask in the warm glow of being able to make our own union types: rather than using Integers, say, to model our states and then having to remember which Integer corresponds to which state, we can just say what we mean.

We’ll see how to build our HMM, but first let’s make a short digression on programming with probability distributions.

## III. Probabilistic Functional Programming

Morally, a probability distribution of type A is a list of pairs

type Dist[A] = List[(A, Double)]


consisting of each of the values of A along with the probability of drawing that value when you sample from the distribution. So the Doubles should all be between 0 and 1 and sum to 1.

If you sample from the same distribution multiple times, you don’t necessarily get the same answer every time so Here Be Non-determinism which, to a functional programmer, says monad: non-determinism is icky and we had better protect our pure code from it at all cost (by putting it inside a monad). Dist[_] also clearly, and less ickily, says functor so let’s start there.

For Dist[_] to be a functor, we have to say how, given a function A => B, we make a function Dist[A] => Dist[B]. Here’s what we do: if we have a List[(A,Double)], and a function  f : A => B, we first apply f to the first component of every tuple to get a List[(B,Double)], then, if f maps multiple values of type A to the same value of type B, we combine tuples with the same first component by adding the probabilities. (In SQL, this is just a sum…group by…)

Beyond our visceral reaction to non-determinism, it’s reasonable that Dist[_] should be a monad because List[_] is a monad and Dist[_] is a list. Moreover, it’s a list of tuples of things whose second components we know how to add and multiply: we should be able to figure out some clever way to flatten a list of tuples of lists of tuples of such things into a list of tuples of such things. So, given a Dist[A] and a function A => Dist[B], we need to make a Dist[B]. Let’s do an example to see how. Suppose we have a function Hidden => Dist[Hidden]

def transitionGivenState(h : Hidden) : Dist[Hidden] = {
h match {
case Rainy => List((Rainy, 0.7), (Sunny, 0.3)) // List(P(R_t | R_{t-1}), P(S_t | R_{t-1})
case Sunny => List((Rainy, 0.2), (Sunny, 0.8)) // List(P(R_t | S_{t-1}), P(S_t | S_{t-1}))
}
}


which gives us the distribution over (hidden) states at time t given the state at time t-1. But, we don’t know the actual state at time t-1, we again have just a distribution

val d : Dist[Hidden] = List((Rainy, 0.1), (Sunny, 0.9)) // List(P(R_{t-1}), P(S_{t-1}))


We know how to do this because it is exactly our transition matrix example: monadic bind (flatMap) is matrix multiplication! In Scalamatical pseudocode meant to suggest vectors and matrices, we have:

d flatMap transitionGivenState

=                                     Rainy => List((Rainy, 0.7), (Sunny, 0.3))
List((Rainy, 0.1), (Sunny, 0.9)) fM Sunny => List((Rainy, 0.2), (Sunny, 0.8))

= List((Rainy, 0.1 * 0.7 + 0.9 * 0.2), (Sunny, 0.1 * 0.3 + 0.9 * 0.8))
= List((Rainy, 0.25), (Sunny, 0.75))


(There is no SQL analogue of this, but in Hive, we can do it with a lateral view explode.)

The punchline is that matrix multiplication is all we need to define a probability distribution as a monad. It’s pretty fantastic that monadic bind is exactly the thing to model dependencies between probability distributions. That is some deep math at work. This is what Wadler means when he says that FP is discovered, not invented.

## IV. An HMM in Figaro

The Figaro type for modeling a probability distribution is called Element[_]. It does more internal book-keeping than our simple type Dist[_], but it’s the same idea: a set of values and a probability for each. The Figaro way of writing

val d : Dist[Hidden] = List((Rainy, 0.1), (Sunny, 0.9))


is

val fd : Element[Hidden] = Select(0.1 -> Rainy, 0.9 -> Sunny)


Element is both a functor and a monad; in Figaro, map and flatMap are called Apply and Chain, but you can write each either way. Warning: Figaro is not a purely functional library, Elements are mutable as we’ll see in the next section. The Figaro machinery for sampling from distributions requires a little more from our types so we use

type H = Product with Serializable with Hidden
type O = Product with Serializable with Observable


We can define a function that takes the conditional probabilities which define the transition matrix as input, and makes the corresponding function for us

def transition(rainyTransition : Array[Double], sunnyTransition : Array[Double])(e : Element[H]) : Element[H] = {

def transitionGivenState(h : H) : Element[H] = {
h match {
case Rainy => Select(rainyTransition(0) -> Rainy, rainyTransition(1) -> Sunny) // P(H | Rainy)
case Sunny => Select(sunnyTransition(0) -> Rainy, sunnyTransition(1) -> Sunny) // P(H | Sunny)
} // these are the rows of the 2x2 transition matrix
}

e.flatMap(transitionGivenState) // P(H) = P(H | Rainy) P(Rainy) + P(H | Sunny) P(Sunny)

}


exactly as we did in our example. Given the rows of the transition matrix, this produces a function of type Element[H] => Element[H]. We can define a similar function to make the emission matrix

def emission(rainyEmission : Array[Double], sunnyEmission : Array[Double])(e: Element[H]) : Element[O] = {

def emissionGivenState(h : H) : Element[O] = {
h match {
case Rainy => Select(rainyEmission(0) -> Shop, rainyEmission(1) -> Clean, rainyEmission(2) -> Walk) // P(O | Rainy)
case Sunny => Select(sunnyEmission(0) -> Shop, sunnyEmission(1) -> Clean, sunnyEmission(2) -> Walk) // P(O | Sunny)
} // these are the rows of the 2x3 emission matrix
}

e.flatMap(emissionGivenState) // P(O) = P(O | Rainy) P(Rainy) + P(O | Sunny) P(Sunny)

}


which, given the rows of the emission matrix, produces a function of type Element[H] => Element[O]. To make a sequence of hidden states in which each is obtained from the previous by applying the transition matrix we need an a priori distribution, a transition matrix, and the length.

def makeHiddenSequence(length : Int, transitionMatrix : Element[H] => Element[H], aprioriArray : Array[Double]) : List[Element[H]] = {

val apriori = Select(aprioriArray(0) -> Rainy, aprioriArray(1) -> Sunny) // the apriori distribution over hidden states

def buildSequence(n : Int, l: List[Element[H]]) : List[Element[H]] = {
(n, l) match {
case (0,l) => l // Step 3: when we get down to n = 0 we are done
case (n, Nil) => buildSequence(n-1, List(apriori)) // Step 1: start with the apriori distribution
case (n, x :: xs) => buildSequence(n-1, transitionMatrix(x) :: x :: xs) // Step 2: apply the transitionMatrix to the head and cons it onto the list
}
}

// reverse so time increases from left to right.
buildSequence(length, Nil).reverse
}


Finally, to make the corresponding sequence of observable states, we just apply the emission matrix to each hidden state.

def makeObservableSequence(hiddenSequence : List[Element[H]], emissionMatrix : Element[H] => Element[O]) : List[Element[O]] = {
hiddenSequence.map(emissionMatrix)
}


We can put this all together into a class and make a companion object that lets us make an HMM with given or default parameters.

class HMM(
val length : Int
, val apriori : Array[Double]
, val rainyT : Array[Double]
, val sunnyT : Array[Double]
, val rainyE : Array[Double]
, val sunnyE : Array[Double]
) {
val hidden : List[Element[H]] = makeHiddenSequence(length, transition(rainyT, sunnyT), apriori)
val observable : List[Element[O]] = makeObservableSequence(hidden, emission(rainyE, sunnyE))
}

object HMM {
val apriori = Array(0.4, 0.6)
val rainyT = Array(0.6, 0.4)
val sunnyT = Array(0.3, 0.7)
val rainyE = Array(0.2, 0.7, 0.1)
val sunnyE = Array(0.4, 0.1, 0.5)
def apply(length : Int) : HMM = new HMM(length, apriori, rainyT, sunnyT, rainyE, sunnyE)
def apply(
length: Int
, apriori : Array[Double]
, rainyT : Array[Double]
, sunnyT : Array[Double]
, rainyE : Array[Double]
, sunnyE : Array[Double]
) : HMM = new HMM(length, apriori, rainyT, sunnyT, rainyE, sunnyE)
}


## V. Inference

Now that we’ve set everything up, we can use Figaro’s machinery for inference. The examples in this section are here.

First, we can do belief propagation: we can query our HMM to ask the probability that it was Sunny on day 2, say, and, if we learn that Bob took Walks on other days, we can use that evidence to update our belief. We do this with Figaro’s VariableElimination. See Bishop chapters 8 and 13 and/or Koller and Friedman chapter 9 for great discussions of Variable Elimination and the sum-product algorithm.

scala> VEexample()
Probability it is Sunny on day 2
Prior probability: 0.574
After observing Walk on day 2: 0.870752427184466
After observing Walk on day 1: 0.9073645970937912
After observing Walk on day 0: 0.9112495643081212
After observing Walk on day 3: 0.9375250668843456
After observing Walk on day 4: 0.9404138925801485
After observing Walk on day 5: 0.9407314643168659


As we see, the probability that it was Sunny on day 2 increases as we observe that Bob went for walks on various days, even on days later in the week. As warned, Elements are mutable. We incorporate new evidence with hmm.observable(2).observe(Walk), for example, which collapses the distribution on day 2 to the one with probability 1 of observing Walk and 0 of observing any other activity. We can always unobserve, though, so not all is lost.

Next, given a sequence of observations Clean, Shop, Clean, Walk, say, we can ask what the weather most likely was on those days. This can mean one of two things: either we find the sequence of hidden states which maximizes the probabilities individually, so maximizes $P(H_t)$ for each $t$, or we find the sequence which maximizes the joint probability of the whole sequence, $P(H_0, H_1, H_2, H_3)$. These don’t have to be the same. Let’s call the first mostLikelyHiddenStates and the second mostLikelyHiddenSequence.

To compute the mostLikelyHiddenStates, we just use VariableElimination again: we apply it for each t individually and choose the state with the highest probability.

To compute the mostLikelyHiddenSequence, we use Figaro’s Most Probable Explanation (MPE). In the literature, this is called the max-sum algorithm for tree-like PGMs or the Viterbi algorithm for HMMs. See the same references.

We have functions that do each and the function mostLikelyExample() compares the two

scala> mostLikelyExample()
For the observation sequence:
List(Clean, Shop, Clean, Walk)
The most likely hidden states are:
List(Rainy, Sunny, Rainy, Sunny)
which has probability 0.19539752430007473
The most likely hidden sequence is:
List(Rainy, Rainy, Rainy, Sunny)
which has probability 0.2930962864501122


This example comes from the Stamp tutorial where he works it out in detail so see there for careful calculations and discussion. See the code comments in mostLikelyExample() for the mapping between our states and Stamp’s.

So, we can do belief propagation, and, given an observation sequence, we can find the most likely corresponding hidden states two ways. What about inferring the parameters of the model from data? Parameter learning comes in two flavors: supervised and unsupervised.

In the supervised case, our training dataset would consist of a collection of observation sequences and the corresponding sequences of hidden states. We can just compute the parameters of the model by doing frequency counts: the probability that a given state follows another is just the fraction of times that happens in the training set. This is similar to a Naive Bayes classifier.

In the unsupervised case, our training set would consist just of a collection of observation sequences and we would use the Expectation Maximization algorithm to learn the coefficients of the a priori distribution and transition and emission matrices. Figaro implements EM, but using it requires changing some types to make them compatible with parameter learning. Hopefully, this will be the topic of a future post.

## Conclusion

This is not blazingly original work, it just comes from reading the excellent Figaro book and following ideas up in Bishop and other references. Nonetheless, I hope this has been a fun tour through the Math-Machine Learning-Functional Programming 2-simplex and that having code to play with brings the ideas home.

# Two Tricks for Program Optimization

Video and slides ‘n’ code from my talk at the January 2016 edition of the Bay Area Haskell Meetup entitled Two Tricks for Program Optimization: Difference Lists and the Codensity Monad.

This is a short, fast, and somewhat loose post to connect the dots between monads in math and monads in functional programming hopefully in a way that illuminates both and makes sense to readers from all parts of the math-programming spectrum.

Monads are an idea from category theory that have an equivalent but slightly different formulation in functional programming. A monoid is a set with a way to mush (multiply/mappend) two values together to make another value of the same type. A monad is a function with a way to mush two applications of itself together into one. So a monad is a function that behaves like a monoid. Monoids are ubiquitous: integers with addition, integers with multiplication, strings (of characters) with concatenation, booleans with or, booleans with and. The canonical example of a monad is List: List is a function that, given any datatype (set of values) d, returns another datatype List d whose values are lists of things of type d; List is a monad because we can always mush a list of lists into a list no matter what those lists are lists of. In general, a function M is a monad if it has a mushing map M M -> M, that is, a map M (M d) -> M d that works for all d.

Another monad is Maybe: a value of type Maybe Integer is either something of type Just Integer or it is Nothing. Since Nothing is one of its possible values, you can perform operations that fail, but still return legitimate values, and then handle them in your program without having to anticipate failure (if…else…) at every turn. The mushing map Maybe (Maybe d) -> Maybe d is obvious: Nothing of anything and anything of Nothing map to Nothing, Just (Just d) maps to Just d; you only succeed if everything succeeds.

Functional programmers use a slightly different formulation of monads because they want to compose functions that don’t exactly start and end in the right places. They want to do things like compose two functions f:a —> Maybe b and g:b —> Maybe c, each of which might fail, without having to repeat the logic each time of how to handle failure. In terms of a general monad M, this means that given functions f: a —> M b and g:b —> M c, we want a way to stick an M b into g even though g only knows how to eat a b. The trick is: apply M to g to make a new function M g: M b —> M (M c) which knows how to eat an M b, then use the mushing map M M —> M to turn the resulting M (M c) into an M c. This operation, called bind and written >>= in Haskell, is what functional programmers consider the defining characteristic of a monad.

A quick example before we see what “apply M to a function” means in general. Suppose readLog: LogFile -> List String is a function that takes a web log and produces a list of urls and tokenize: String -> List String tokenizes a url by splitting it on ‘/’ so tokenize 'yes.com/no/maybe?q=unsure' = ['yes.com','no','maybe?q=unsure']. Monad-composing these applies tokenize to every element of our list of urls–that’s what “apply M to tokenize” means when M is List–then mushes the resulting list of lists into a list. Monads aren’t fancy, they just abstract a common pattern so you can program at a high level without repeating yourself.

So, what does “apply M to a function” mean and how is this related to category theory? A category (in the sense of Eilenberg and  Mac Lane) consists of three things: objects, functions, and composition. A map between categories is called a functor; a functor F:C -> D has to map objects to objects, functions to functions, and composition to composition. The last two parts mean that F maps functions h: a -> b and k:b -> c in C to functions F h: F a -> F b and F k: F b -> F c in D and that this mapping of functions must satisfy F(k.h) = (F k).(F h): if you compose first in C and then apply F, you must get the same thing as if you first apply F, then compose in D. So, the punchline is: the set of all datatypes is a category and “functions on datatypes” like List and Maybe are functors from this category to itself. Since they are functors, you can apply them not only to datatypes, but also to functions between datatypes: that’s what “apply M to a function” means. A monad is a functor from a category to itself that has a mushing map M M —> M. That List and Maybe are functors means that they map functions like h and k to functions List h:List a —> List b and List k:List b —> List c (and similarly for Maybe)* that you can compose as usual because they, too, start and end in the right places, but the fact that they are monads means that you can also use them to “compose” functions like f:a —> Maybe b and g:b —> Maybe c that don’t.

* When M is List,  List h [x, y, z] = [h x, h y, h z]; when M is Maybe,  Maybe h (Just h) = Just (h x) and Maybe h Nothing = Nothing.
Notes
• Inspiration and references. This post is inspired by Beckman’s Don’t Fear the Monad, a very beginner-friendly video which goes more thoroughly through monoids and monads as a tool for composition. Another wonderfully intuitive and thorough introduction to monads in programming is You Could Have Invented Monads! (And Maybe You Already Have). For more technical connectings of dots between ideas in category theory and ideas in Haskell, see Notions of Computations as Monoids and Category Theory for Programmers.
• List again, List again, jiggity jig. Mathematicians: List d is a monoid–the free monoid on the set d–and List is the free monoid functor: that’s why we get a monad. Non-mathematicians: every free functor comes with a forgetful functor and the two together always make a monad.
• Haskell punchline. For g: b -> M c, if we call the mushing map join and write fmap g instead of M g, we said that (>>= g) = join . (fmap g). See Control.Monad and page 10 of  Notions of Computations as Monoids.
• Neglected. Monoids also have an identity element/mempty value mushing with which is the identity map. The analogous thing for monads is a map d -> M d, called return in Haskell, which is similarly an identity for mushing.
• Thank yous. This conversation started as an email exchange with my dad and continued at the white board with my colleague Scott. I reap maximum joy from these conversations so thank you both.

# PhD => Data Scientist

A few great posts about transitioning from academia to working in tech are

I finished my math PhD in May 2013, did a fantastic program called Insight Data Science in June and July, interviewed for a while, and started my current position at a small startup in November. Below are my own tips about making the transition from PhD to data scientist for those in fields like (pure) Math and theoretical Physics who suspect that data science is probably something they’d be good at, but, like me, did no programming, data analysis, or machine learning as part of their research. Figuring out how to sell yourself is very important, but articulating how your problem-solving and research skills are relevant in the large will only go so far: you have to amass some directly relevant skills and knowledge. Here are the things I think are most important to do and learn.

• Programming: learn Python. It’s easy to get started, everyone knows it, and it will do pretty much anything you want: build a web app (flask), interact with databases (mysqldb), data analysis (pandas and numpy), machine learning (scikit-learn), or make a gif. Pandas is R for Python, so don’t worry about learning R, too. Try Google, MIT, Coursera.
• Databases: learn MySQL. It is the lingua franca of databases. SQL was designed to query relational databases and even though nonrelational databases are the hippest-latest-freshest, it is the way that everyone knows how to query data: the first thing everyone wants to do with a new way of storing data is find a way to query it with SQL.
• Projects: do some small ones. Take a look herehere, or here and try building a spam filter or a recommendation engine, for example. There are lots of public data sets, but even better if you can figure out how  to gather your own by, say, querying an API (requests) or scraping the web (Beautiful Soup).
• Read: see what’s going on. Check out Hacker News and Data Tau, mathbabe, the engineering blogs of tech companies, follow people/companies on Twitter, read research papers. Even simple things like counting get tricky with huge volumes of data so you won’t get bored. Check out: reservoir sampling, mapreduce and mutual friendsprobabilistic counting, bloom filters.
• Interact: go to MeetUps. There are a lot. It’s easy to meet new people, learn new things, and drink free beer. Email discussion groups are another a great way to learn new ideas and ask questions. Take advantage of the fact that there are more than a handful of other people in the world who understand what you are working on!
• The Basics: brush up on and/or learn basic probability & statistics and algorithms & data structures.

In the future, I’ll write about things I learned interviewing and why the position I chose is perfect for me. For now, here are my tips on where to start. This process can be daunting, but it’s also a lot of fun! Enjoy!