Entity Embeddings of Categorical Variables in Neural Networks

NEED FOR ENTITY EMBEDDINGS

Neural networks have revolutionized computer vision, speech recognition, and natural language processing, and have replaced or are replacing the long-dominant methods in those fields. But unlike those fields, where the data is unstructured, neural networks are not as prominent in machine learning problems involving structured data. In principle a neural network can approximate any continuous function, as well as piecewise continuous functions. However, it is not suited to approximating arbitrary non-continuous functions, because in its general form it assumes a certain level of continuity.
Interestingly, the problems we usually face in nature are often continuous if we use the right representation of the data. But unlike unstructured data found in nature, structured data with categorical features may have no continuity at all, and even where it exists it may not be obvious. Therefore, naively applying neural networks to structured data with integer representations of the categorical features may not work well. A common way to circumvent this problem is one-hot encoding, but it has two shortcomings. First, when there are many high-cardinality features, one-hot encoding often requires an unrealistic amount of computational resources. Second, it treats the different values of a categorical variable as completely independent of each other and ignores the informative relations between them.
This motivates entity embedding, a method that automatically learns a representation of categorical features in a multi-dimensional space which places values with a similar effect on the function approximation problem close to each other. This reveals the intrinsic continuity of the data and helps neural networks, as well as other common machine learning algorithms, solve the problem.

WHAT IS STRUCTURED DATA?

By structured data we mean data collected and organized in a table format, with columns representing different features (variables) or target values and rows representing different samples. The most common variable types in structured data are continuous and discrete. Continuous variables such as temperature, price, and weight can be represented by real numbers. Discrete variables such as age, color, and city can be represented by integers. For example, if we use 1, 2, 3 to represent red, blue, and yellow, one cannot assume that "blue is bigger than red" or "the average of red and yellow is blue" or anything else that introduces additional information based on the properties of integers. Discrete values with no intrinsic ordering, such as colors, are called nominal; other times there is an intrinsic ordering in the integer index, such as age or month of the year, and these are called ordinal numbers. The task of entity embedding is to map discrete values to a multi-dimensional space where values with a similar effect on the function output are close to each other.
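
As a quick illustration of this integer indexing, here is a minimal pandas sketch (the table and column names are made up for illustration):

import pandas as pd

# A toy structured-data table with one continuous and one categorical column.
df = pd.DataFrame({
    "temperature": [21.5, 18.0, 25.3],   # continuous variable (real numbers)
    "color": ["red", "blue", "yellow"],  # categorical variable
})

# Map each category to an integer index; the integers themselves carry no
# arithmetic meaning ("blue" is not "bigger" than "red").
df["color_idx"] = df["color"].astype("category").cat.codes
print(df[["color", "color_idx"]])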

ENTITY EMBEDDING

The general machine learning problem is to approximate the function:
$y = f(x_1, x_2, ....,x_n) \tag{1}$
To learn this approximation, we map each state of a discrete variable to a vector as:
$e_i : x_i \mapsto {\bf x_i} \tag{2}$
This mapping is equivalent to an extra layer of linear neurons on top of the one-hot encoded input. To show this, we write the one-hot encoding of $x_i$ as:
$u_i:x_i\mapsto \delta_{x_i, \alpha},\tag{3}$
where $\delta_{x_i, \alpha}$ is the Kronecker delta and the possible values for $\alpha$ are the same as those for $x_i$. If $m_i$ is the number of values of the categorical variable $x_i$, then $\delta_{x_i, \alpha}$ is a vector of length $m_i$ whose only non-zero element occurs at $\alpha = x_i$. The output of the extra layer of linear neurons given the input $x_i$ is defined as:
${\bf x_i}\equiv \sum_{\alpha}{w_{\alpha\beta}\delta_{x_i\alpha}}=w_{x_i\beta} \tag{4}$
where $w_{\alpha\beta}$ is the weight connecting the one-hot encoding layer to the embedding layer and $\beta$ is the index of the embedding layer. Now we can see that the mapped embeddings are just the weights of this layer and can be learned in the same way as the parameters of other neural network layers.
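
As a small sanity check of equation (4), here is a numpy sketch showing that multiplying the one-hot vector by the weight matrix simply selects one row of it (the category count and embedding dimension below are arbitrary):

import numpy as np

m_i, D_i = 5, 3                  # number of categories and embedding dimension (arbitrary)
w = np.random.randn(m_i, D_i)    # weight matrix w_{alpha, beta} of the extra linear layer

x_i = 2                          # an observed category, as an integer index
delta = np.eye(m_i)[x_i]         # one-hot vector delta_{x_i, alpha}

# Equation (4): sum_alpha w_{alpha, beta} * delta_{x_i, alpha} = w_{x_i, beta},
# i.e. the embedding of x_i is just row x_i of the weight matrix.
x_embedded = delta @ w
assert np.allclose(x_embedded, w[x_i])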

The dimensions of the embedding layers, $D_i$, are hyper-parameters that need to be pre-defined. The dimension of each entity embedding is bounded between 1 and $m_i - 1$, where $m_i$ is the number of values of the categorical variable $x_i$.

Let's understand this using an example. Let the discrete variable $x_i$ represent the day of the week. For each day of the week (Mon, Tue, Wed, ...) we create a one-hot encoded vector (represented by $\delta_{x_i\alpha}$), as shown in Fig. 1.
Fig. 1. One-hot encoding of each day of the week

Since there are seven days in a week, $m_i = 7$, and as seen in Fig. 1 the length of each one-hot encoded vector $\delta_{x_i\alpha}$ is also 7.
Fig. 2. One-hot encoded input and its corresponding Embedding layer
For this example the entity embedding is chosen to be 4-dimensional, so the embedding layer is a 7x4 matrix. This is mathematically represented by the weight matrix $w_{\alpha \beta}$ (see equation 4), with $\alpha$ running over the 7 days and $\beta$ over the 4 embedding dimensions. Initially the values of this weight matrix are chosen at random. So, instead of representing each day of the week by its one-hot encoded vector, it is represented by its corresponding entity embedding vector, i.e. a row of the weight matrix. This operation is equivalent to the product of the one-hot encoded vector and the weight matrix. Fig. 3 shows the product of the one-hot encoding of "Sun" and the embedding layer matrix; the resulting output vector (represented as $w_{x_i\beta}$ in equation 4) is the highlighted row of the embedding layer matrix.
Fig. 3. Product of the one-hot encoding of category "Sun" with the Embedding layer matrix results in a vector equal to the highlighted row in the embedding layer matrix
Since the one-hot encoded vectors of all the categories stacked together form an identity matrix, their product with the entity embedding matrix yields the entity embedding matrix itself. These embedding vectors are then fed to the neural network, and the embedding weights are updated during back-propagation (as depicted in Fig. 4).
Fig. 4. The weights of the entity embedding matrix, $w_{x_i\beta}$, get updated through back-propagation after being fed to the neural network.
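
To make the lookup concrete, here is a minimal Keras sketch of the 7x4 day-of-week embedding. The integer index chosen for "Sun" is an assumption; any consistent assignment of days to indices works.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# 7 categories (days of the week) embedded into 4 dimensions,
# mirroring the 7x4 weight matrix in Figs. 2-4.
model = Sequential()
model.add(Embedding(7, 4, input_length=1))
model.compile(optimizer='sgd', loss='mse')   # compiled only so we can call predict

sun = np.array([[6]])                        # "Sun" mapped to index 6 (assumed)
embedding_matrix = model.get_weights()[0]    # randomly initialised 7x4 matrix
sun_vector = model.predict(sun)[0, 0]        # the row selected by the lookup

assert np.allclose(sun_vector, embedding_matrix[6])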

BUILDING EMBEDDING LAYERS USING KERAS

For this post, I'm using the Kaggle Rossmann Store Sales dataset, just to create a high-level understanding of how to create embedding layers. The code follows the approach described by Cheng Guo and Felix Berkhahn, who used entity embeddings in this Kaggle competition [1]. There are slight modifications to the code below due to changes in the Keras API.


from keras.models import Sequential
from keras.layers import Dense, Activation, Reshape, Merge, Embedding

# One small Sequential model per input feature; they are concatenated later.
models = []

# Store id: 1115 stores embedded into 10 dimensions.
model_store = Sequential()
model_store.add(Embedding(1115, 10, input_length=1))
model_store.add(Reshape(target_shape=(10,)))
models.append(model_store)

# Day of week: 7 values embedded into 6 dimensions.
model_dow = Sequential()
model_dow.add(Embedding(7, 6, input_length=1))
model_dow.add(Reshape(target_shape=(6,)))
models.append(model_dow)

# Promo is passed through a plain Dense layer instead of an embedding.
model_promo = Sequential()
model_promo.add(Dense(1, input_dim=1))
models.append(model_promo)

# Year: 3 distinct values embedded into 2 dimensions.
model_year = Sequential()
model_year.add(Embedding(3, 2, input_length=1))
model_year.add(Reshape(target_shape=(2,)))
models.append(model_year)

# Month: 12 values embedded into 6 dimensions.
model_month = Sequential()
model_month.add(Embedding(12, 6, input_length=1))
model_month.add(Reshape(target_shape=(6,)))
models.append(model_month)

# Day of month: 31 values embedded into 10 dimensions.
model_day = Sequential()
model_day.add(Embedding(31, 10, input_length=1))
model_day.add(Reshape(target_shape=(10,)))
models.append(model_day)

# German state: 12 values embedded into 6 dimensions.
model_germanstate = Sequential()
model_germanstate.add(Embedding(12, 6, input_length=1))
model_germanstate.add(Reshape(target_shape=(6,)))
models.append(model_germanstate)

# Concatenate all feature models and add fully connected layers on top.
model = Sequential()
model.add(Merge(models, mode='concat'))
model.add(Dense(1000, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(500, kernel_initializer='uniform'))
model.add(Activation('relu'))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(loss='mean_absolute_error', optimizer='adam')

The full code can be found in [3]. There are six categorical variables: store, day of week, year, month, day of month, and German state; the promo feature is fed through a simple Dense layer instead. For each categorical variable we create an Embedding layer and append it to the list "models". For each Embedding layer we define the number of input categories and the embedding dimension. For example, for the dow (day of week) embedding layer we provide 7 as the number of input categories and 6 as the embedding dimension, which creates a 7x6 embedding matrix with one row per day of the week. All these embedding layers are then concatenated using the Merge layer. Now we can add Dense layers, Dropout, and other layers as needed. In this way our categorical features are treated similarly to continuous variables, and the neural network learns the relations between the discrete values of each categorical feature.
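
In recent Keras versions the Merge layer has been removed, so the same architecture would instead be built with the functional API and a Concatenate layer. Below is a minimal sketch of that variant (only the store, day-of-week, and promo branches are shown; the remaining branches follow the same pattern):

from keras.models import Model
from keras.layers import Input, Dense, Reshape, Embedding, Concatenate

# Store branch: 1115 categories embedded into 10 dimensions.
input_store = Input(shape=(1,))
emb_store = Reshape((10,))(Embedding(1115, 10)(input_store))

# Day-of-week branch: 7 categories embedded into 6 dimensions.
input_dow = Input(shape=(1,))
emb_dow = Reshape((6,))(Embedding(7, 6)(input_dow))

# Promo branch: plain numeric input, no embedding.
input_promo = Input(shape=(1,))
dense_promo = Dense(1)(input_promo)

# The year, month, day and German-state branches would follow the same pattern.
x = Concatenate()([emb_store, emb_dow, dense_promo])
x = Dense(1000, activation='relu', kernel_initializer='uniform')(x)
x = Dense(500, activation='relu', kernel_initializer='uniform')(x)
output = Dense(1, activation='sigmoid')(x)

model = Model(inputs=[input_store, input_dow, input_promo], outputs=output)
model.compile(loss='mean_absolute_error', optimizer='adam')

When training such a multi-input model, the data is passed as a list of arrays, one per input branch, in the same order as the inputs list.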

REFERENCES

[1] Cheng Guo and Felix Berkhahn, "Entity Embeddings of Categorical Variables", Neokami Inc., April 25, 2016.
[2] Fast.ai forums: http://forums.fast.ai/t/deeplearning-lec4notes/8146
[3] Entity embedding code for the Rossmann dataset: https://github.com/entron/entity-embedding-rossmann




Comments

  1. Hi, do you know how to use the trained embedding layers to feed into an xgboost model?

    Replies
    1. You can extract the embedding matrix (weights) from the trained embedding layer:

      X = model.layers[i].get_weights()[0], where i is the index of the Embedding layer in the model summary. Once you extract the weights, you can supply them directly to the xgboost model as input, since the weight matrix is an array. Remember to convert the array to an xgboost DMatrix.
    2. What's the point? xgboost and other tree-based models work on binned data and do not appreciate continuous values.
  2. I don't see any continuous variables in the input. Are we not supposed to add them along with the embedding layers?


  3. Yes, you can combine continuous variables along with the embedded inputs. In this example we were demonstrating how categorical variables can be represented through embeddings, so I didn't include the continuous variables in the code. But you can definitely combine the embedding matrix with the continuous variables and provide them together as the input.

  4. Thanks for your write-up. Is learning the embeddings supervised? If so, what is the response variable (e.g. is it the sales in your example)?

    Couldn't one do a sparse matrix decomposition on the one-hot encoding and feed that in as a projected feature?
