Optimized parallel training of word vectors on multi-core CPU and GPU

Material Information

Optimized parallel training of word vectors on multi-core CPU and GPU
Simonton, Trevor McDonald ( author )
Place of Publication:
Denver, Colo.
University of Colorado Denver
Publication Date:
Physical Description:
1 electronic file (59 pages) : ;

Thesis/Dissertation Information

Master's ( Master of science)
Degree Grantor:
University of Colorado Denver
Degree Divisions:
Department of Computer Science and Engineering, CU Denver
Degree Disciplines:
Computer science


Subjects / Keywords:
Neural networks (Computer science) -- Design ( lcsh )
Information retrieval ( lcsh )
bibliography ( marcgt )
theses ( marcgt )
non-fiction ( marcgt )


Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven to be useful in a variety of machine learning tasks. In this thesis, we explore existing approaches to parallelizing Word2Vec vector training, and propose new methods to increase the speed of the Word2Vec skip gram with hierarchical softmax architecture on multi-core shared memory systems. We accomplish this by batching training operations to increase thread locality and reduce accesses to shared memory. We also take advantage of highly opti- mized linear algebra libraries to further increase performance. We then explore existing GPU implementations, and propose new GPU implementations that utilize shared mem- ory registers and in-warp inter-thread reduction operations. Our GPU implementations produce a higher quality of word vectors than previous GPU implementations. The pro- posed GPU training approach works for both the hierarchical softmax architecture, as well as the negative sampling method. We then apply similar parallelization techniques to Word2Vec's successor, FastText, and achieve greater scalability and speed of execution than previously published models.
Includes bibliographic resource.
System Details:
System requirements: Adobe Reader.
Statement of Responsibility:
by Trevor McDonald Simonton.

Record Information

Source Institution:
University of Colorado Denver
Holding Location:
Auraria Library
Rights Management:
All applicable rights reserved by the source institution and holding location.
Resource Identifier:
on10204 ( NOTIS )
1020495082 ( OCLC )
LD1193.E52 2017m S56 ( lcc )


This item has the following downloads:

Full Text
TREVOR MCDONALD SIMONTON B.A, Colorado State University, 2009
A thesis submitted to the Faculty of the Graduate School of the University of Colorado in partial fulfillment of the requirements for the degree of Master of Science Computer Science Program

This thesis for the Master of Science degree by Trevor McDonald Simonton has been approved for the Computer Science Program by
Gita Alaghband Farnoush Banaei-kashani Tom Altman
Date: May 13, 2017

Simonton, Trevor McDonald (M.S., Computer Science Program)
Optimized Parallel Training of Word Vectors on Multi-core CPU and GPU Thesis directed by Professor Gita Alaghband
Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven to be useful in a variety of machine learning tasks. In this thesis, we explore existing approaches to parallelizing Word2Vec vector training, and propose new methods to increase the speed of the Word2Vec skip gram with hierarchical softmax architecture on multi-core shared memory systems. We accomplish this by batching training operations to increase thread locality and reduce accesses to shared memory. We also take advantage of highly optimized linear algebra libraries to further increase performance. We then explore existing GPU implementations, and propose new GPU implementations that utilize shared memory registers and in-warp inter-thread reduction operations. Our GPU implementations produce a higher quality of word vectors than previous GPU implementations. The proposed GPU training approach works for both the hierarchical softmax architecture, as well as the negative sampling method. We then apply similar parallelization techniques to Word2Vecs successor, Fast Text, and achieve greater scalability and speed of execution than previously published models.
The form and content of this abstract are approved. I recommend its publication.
Approved: Gita Alaghband

To my wife, Lauren.

I. INTRODUCTION......................................................... 1
II. WORD2VEC HISTORY AND MOTIVATIONS.................................... 3
Neural Probabilistic Language Modeling ....................... 4
The hierarchical probabilistic model ..................... 5
The Core Word2Vec Algorithm ..................................... 6
Collecting training data from a corpus.................... 7
Continuous Bag of Words (CBOW)............................... 8
Skip gram.................................................... 8
Word vectors as a neural network weight matrix............... 9
Hierarchical softmax......................................... 9
Negative sampling........................................... 11
Original Parallelization Scheme ................................ 12
Faster Word2Vec Models.......................................... 13
III. WORD MATRIX BATCHES: WOMBAT....................................... 18
New Wombat hierarchical softmax method.......................... 19
Batching Operations for GPU..................................... 20
Data dependencies........................................... 21
Proposed batching method ................................... 22
GPU Kernel Design........................................... 25
Fast matrix kernel...................................... 28
Fast vector kernel...................................... 30
Experiments and results ........................................ 30
CPU Batched Hierarchical Softmax........................ 32
GPU Experiments......................................... 34
IV. FASTER FASTTEXT................................................... 36

Fast Text skip gram with negative sampling..................... 37
Negative sampling in Fast Text......................... 37
Proposed improvements to Fast Text............................. 38
Experiments.................................................... 41
Results ................................................... 43
Conclusion and future work .................................... 46
V. Conclusion.......................................... 47
REFERENCES................................................................ 48

1 Evaluation of hierarchical softmax accuracy in new Wombat model compared
to Cythnn and original Word2Vec....................................... 33
2 Execution time and accuracy of Wombat vs. BIDMach..................... 35
3 Evaluation of model accuracy in Fast Text vs. newly proposed batch method
of Fast Text and Word2Vec models ..................................... 46

1 Illustration of the meaning of nodes in hierarchical softmax.................. 10
2 Original parallelization scheme .............................................. 12
3 Declining per-thread performance as the number of threads increases........... 13
4 Updated parallelization scheme................................................ 16
5 Matrix multiplication: (words from W0L0cai) x (words from WfLocal) in Intels
batched network activation method............................................. 17
6 Implicit label matrix used in Intels batch model............................. 17
7 Matrix multiplication: (words from W0private) x (words from W^Private) in the
proposed batched network activation method.................................. 20
8 Adjusted label matrix for hierarchical softmax. Example word has Huffman
code 1011...01 21
9 Data dependencies between training steps...................................... 22
10 CPU/GPU heterogeneous batching diagram........................................ 23
11 CPU/GPU batch data transfer................................................... 24
12 Loading data to shared memory................................................. 28
13 Wombat matrix multiply........................................................ 29
14 Wombat update calculation illustration........................................ 29
15 Speedup for batched hierarchical softmax CPU method compared to leading
implementations............................................................... 32
16 Vector-to-matrix batch showing duplicate sample vectors in bold............... 39
17 Vector-to-matrix batch with duplicates removed. A single set of k samples
remains....................................................................... 40
18 Multiply sample gradients by n rather than calculate their dot product n
times......................................................................... 41

19 Performance comparison of original Fast Text compared to proposed optimized
versions and Word2Vec. Evaluated on the enwiki9 word set.

There are many approaches to embedding word meanings into computer programs. One such method is to define for each word what is called a distributed representation in vector space. A distributed word representation is a dense vector of numbers that describes a unique word. This kind of word representation is perfect for use as input to machine learning models that are designed to take numeric feature vectors as input. A dense vector representation of a word can be composed of as much as several hundred floating point numbers, each 4 bytes in size. This representation of a word can hold much more information than a simple sequence of letters, but defining a set of word vectors for a vocabulary of hundreds of thousands to millions of words can be difficult and time consuming.
Word2Vec is a set of algorithms that offers several methods to define these word vectors. The algorithms are in a class of statistical methods that use a neural network to learn the word vector representations. The network learns from documents that contain examples of words used properly in natural language. For example, all of the articles on Wikipedia cleaned and reduced to plain text is a typical input used to train this kind of a network.
Once learned, these vector representations of words can be re-used by other computer programs. Even simple vector addition and subtraction programs that use the vectors learned by Word2Vec have yielded entertaining and exciting results. A widely publicized example of this is the operation v(King) v(Man) + v(Woman) ~ v(Queen) where v(vjord) returns the learned word vector for word [7]. The vectors also work well in machine learning models that provide language translation and sentiment analysis [9] [10] [11].
The Word2Vec algorithms are not the first of their kind. There is a long history of similar models that use neural networks to learn these word representations. Word2Vec

is particularly exciting because it has simplified the training process to the point that its possible to generate useful word vectors from billions of words in only a few hours. Recent research that has focused on improving these algorithms has reduced that time even further.
This thesis explores the Word2Vec core algorithms, and presents some of the leading implementations of these algorithms that have drastically reduced the amount of time it takes to train the Word2Vec neural network. These implementations use clever parallel computing techniques to optimize the training on modern CPU and GPU hardware. In the latter half of this thesis, new original techniques are presented to further speed up the learning process for Word2Vec, as well as its recently released successor, Fast Text.
The rest of the thesis is organized as follows. In Chapter II, the history, motivations and inner workings of Word2Vec are discussed. At the end of Chapter II, some of the leading parallel implementations of the algorithms are reviewed. In Chapter III, new parallel approaches are presented, and in Chapter IV, the newer library FastText is discussed, and improvements to this library are presented.

In some cases, the only structure found in a large collection of data is the natural language used to form the sentences of which the data is composed. One tool that is often used to extract information from such data is statistics. Statistical approaches to understanding natural language data often involve counting the number of key words in a document [I]. The goal of this counting is to create numeric values that can be analyzed to better understand the contents of documents that contain only unstructured word data.
Word counts alone can provide simple word vectors. The simplest word vector is the one hot vector. One hot vectors are of size \V\, where V is the vocabulary. All entries in this one hot vector are 0, with the exception of the index of the vector that corresponds to the one word that the vector represents. An extension to one hot vectors is the bag of words vector, which can be used to describe an entire document. To form a bag of words vector, the count of each word in a sentence or document is entered to the corresponding index of the word in a vector, again of size \V\. This is how sentences, paragraphs or entire documents can be transformed to simple feature vectors.
Example one hot and bag of words vectors are as follows:
One-hot vector:
00... I ...000
Bag of words vector:
02... 3... 100
The problem with these vectors is that they are sparse. In the case of a one hot vector all but \V\ 1 values in the vector would be 0. Bag of words approaches would only be slightly more dense for sentences and paragraphs, but do not help describe individual

words. Another issue with these simpler models is that they completely lack context. They do not make associations between neighboring words.
Another technique used to generate statistical models of natural language involves the use of probability. The appearance of a word can be considered as an event with a probability. Counting the number of times each word occurs can help calculate this probability, but the probability of this event can also be calculated as the conditional probability of words appearance given a preceding sequence of words. Where wt is the i-th word, and w\ is the sequence Wi, Wi+1, ... ,Wj-1, Wj.
p(wT) = X\p(wtWi~l)
t= 1
Note that it is obviously time consuming to calculate the probability of each word as a conditional probability of a sequence of words starting at the beginning of a document. These probabilities are typically calculated for some sub-sequence of n words that start not at the beginning of the document, but at an index wt-n that is a relatively short distance in front of the given word wt.
This conditional probability starts to provide context to the words. Words with similar probabilities given the same sequence of preceding words tend to have similar meaning. For example, the cat sat on the might precede mat, bed or anything else suitable for a cats seat. The probability of the next word creates a similarity between any word that makes sense at the end of the sequence. A statistical model that describes the joint probability function of sequences of words in a language can effectively capture these word similarities and word relationships.
Neural Probabilistic Language Modeling
The trouble with statistical models that rely on conditional probabilities of words appearing in sequence is the curse of dimensionality. Words that do not appear in specific

sequences found in the training data should not have a zero probability of appearing in sequences found outside of the training data. Fighting this problem was the key motivation behind the introduction of the first Neural Probabilistic Language Model (NPLM) [2].
The goal of a neural probabilistic language model (NPLM) is to use a neural network to learn a distributed representations for words [2]. These dense word vectors contain embedded context information of each word extracted from large amounts of training data. The model is built from observed sequences of words in training data.
In such a model, each word in a vocabulary is associated with an initially random word feature vector. The feature vectors are then adjusted in a neural network that is trained by examining sequences of words in a training hie. The network learns to express the joint probability function of word sequences in terms of the initially random feature vectors, which are simultaneously adjusted to become the parameters of that probability function [2], The neural network learns the function and its parameters by maximizing the log-likelihood of the word sequences that appear in the training data. The result is a set of continuous feature vectors that represent words in the training vocabulary.
The problem with such a model is the amount of time it takes to update feature vectors for an entire vocabulary. The amount of training data can be in the billions of words, and each training step involves updates to matrices that represent vocabularies of hundreds of thousands to millions of words. However, several alternate approaches to network training have been introduced since the idea of NPLMs was introduced in 2003.[3] [4] [6] [7] [8]
The hierarchical probabilistic model
One way to reduce training time is to cluster words into a hierarchy of related words. The training time can be sped up exponentially by reducing each training operation to require updates of only log(\V\) classes of words as opposed to updating \V\ word vectors at each step [4],

The motivation for grouping words to classes comes from that fact that computing directly P{Y\X) involves normalization across all the values that Y can take. By defining a clustering partition for the Y into classes C, such that c(.) maps Y to C,
P(Y = y\X = x) = P(Y = y\C = C(y),X)P(C = C(y)\X = x)
Then by reducing the words to a hierarchy of classes that can be represented in a balanced binary tree, the amount of normalizations is reduced from \ V\ to log(\V\). Each word v can be represented as a bit vector (b\(v),..., bm(v)) (where m depends on v). For example, b\{v) = 1 represents that v belongs to class 1 at the top of the tree, as opposed to the neighbor to class 1, class 0. Subsequent 62(f) bm(v) indicate a walk from the top-level class 1 down through its binary sub-groups. The next-word conditional probabilty can then be computed as
-1 (v),wt~ i,...,wt-n+i)
In [4] the tree construction depended on previously existing clusters of words, specifically, the WordNet resource [5]. [6] introduces new ways to cluster words for better performance of the hierarchical model, and points out that generating these hierarchies is possible to do at the same time as learning words. Word2Vecs hierarchical softmax method partitions the vocabulary into a Huffman binary tree.
The Core Word2Vec Algorithm
A set of algorithms is introduced in the original Word2Vec. Within this set are two different approaches to collecting words for training from a corpus: skip gram and continuous bag of words. Both methods share a similar basic approach to extracting

data from a training file. This shared basic core algorithm is described in the next section. There are also two different methods introduced to learn word embeddings from the training sets that are collected: hierarchical softmax, which is similar to hierarchical methods described above, and the negative sampling approach, which is introduced and described at the end of this section.
Collecting training data from a corpus
Word2Vecs algorithms use examples of word usage as input data to a neural network training procedure. Word usage examples come from large text hies. The hies are parsed, and the words are assigned tokens. The integer tokens tie the words to vectors in a neural network weight matrix. Once the tokens are obtained, short sub-sequences of words are examined to train the network. These sub-sequences will herein be referred to as context windows. A context windows size is configurable, but typically ranges from 5 to 10 words in either direction of a center target word, meaning that each context window typically contains a total of 11 to 21 words. The size of the window does not relate to the size of the word vectors.
Each window contains a of a set of words that appear in close proximity to each other in the training data. The set can be considered to contain a single target word, with a set of neighboring context words on either side of the target. For example, the training sentence the dog was cute but stinky offers a number of context windows. One such set would put cute as a target to context set {dog, was, but, stinky}.
The dog was cute but stinky
contexti context2
contexts context4
In such words, that
an example the context window is of size 2. This means a total of 4 is 2 on each side of the target, are considered as neighbors to the target

word cute. This window can be dragged from the start of the sentence to the end to generate other training sets, where each set contains a single target word surrounded by context words. The size of the context window is a parameter that can be passed to Word2Vec. Another context word generated from the same sentence could feature but as the target to the context word set {was, cute, stinky}.
For both the skip gram and the continuous bag of words methods, the input matrix is trained with the output matrix to promote activations of the network that represent a high likelihood that target words appear within a window of context words.
Continuous Bag of Words (CBOW)
In the continuous bag of words (CBOW) method, the vectors for each context word are first summed to one single vector. The network is activated with the dot product of this single condensed sum vector and the target word vector. The error is calculated, and then the gradient is used to apply an update to each of the original context words uniformly. Through this process, the target word predicts its surrounding context words. Because of the initial combination of vectors, there is less work to do in the CBOW approach, which makes CBOW faster than skip gram. The effect is a smoothing of the distributional information. CBOW tends to work best on smaller datasets, where the skip gram approach is better suited for larger datasets.
Skip gram
The skip gram method is just the opposite of the CBOW method. Instead of training the context words as a single vector, each context word vector is trained individually against the target vector. This method takes longer to train, but generally provides more accurate word vectors. The effect is that each context word is used to predict a target.

Word vectors as a neural network weight matrix
The basic idea shared by both hierarchical softmax and negative sampling is to use two weight matrices in a simple single-layer neural network to represent the words in a vocabulary. The first matrix, called the input matrix, or IT*, is of size V x w, where V is the size of the vocabulary and w is a configurable parameter. The parameter w defines the size of the word vectors. Each word in V will be represented by a w-dimensional vector in the input matrix. When the training is complete, this input matrix is a trained set of word vectors that can be saved for use in other machine learning models. The second weight matrix, the output matrix, or Wa, is similar in size. It is also V x w, but these vectors have a slightly different meaning depending on which of the two Word2Vec methods is used. Exactly what WQ represents depends on which loss calculation method is selected by the user: either hierarchical softmax or negative sampling.
Hierarchical softmax
Word2Vecs hierarchical softmax model is a hierarchical probabilistic language model, similar to those used in [4] and [6], described in As mentioned above, these hierarchical models require the vocabulary to be broken down into a binary tree. Word2Vecs hierarchical softmax breaks the vocabulary into a Huffman binary tree [7]. A Huffman tree groups words into a tree based on their frequency of appearance. It is often used to assign the shortest coding in a compression scheme to the most frequent words or characters in collection. It is used here for hierarchical softmax to achieve the 0(log2(\V\)) training complexity improvement over the otherwise 0(|H|) complexity without having a knowledge of the vocabulary (e.g. a pre-existing tree) ahead of time.
Leaf nodes in the tree represent words in the vocabulary. A binary code is assigned to every leaf in the tree. A 0 at the start of the code represents a step from the root of the tree down to the left, while a 1 at the start represents a step down to the right. Subsequent 0s represent left side descent, and ls right side descent in a walk down to

leaf nodes. The walk down from the root of the tree to each leaf passes a subset of inner tree nodes. Each of these inner nodes is represented by a vector in WQ. As the model is trained, each training step maximizes the probability that given a context word, we walk down the correct path from the root node to the leaf that represents the target word. The sigmoid of the dot product of the context word vector from Wi and an inner tree node vector from WQ determine which way to walk. Values close to 0 would take a left turn, and values close to 1 would descend to the right. Each training step updates at most log-2(V) inner tree node vectors, rather than all of the vectors in W0l as would be necessary in a neural network model that did not use this hierarchical classification approach.
An example is presented in Figure 1. The figure illustrates the classification tree for a vocabulary of 4 words. Each of the four words in the vocabulary can be represented by a walk from the root to the leaves of the tree. The target words in the tree are labeled to-3, and the nodes in the tree are represented by vectors Wa0_o2. The highlighted nodes are in the walk from the root to the word t\, which has the code 01. In this case, the walk to t\ passes through two nodes, represented by vectors Wo0 and Wol. Where a is the sigmoid function, the result of to
a(Woo Wu) 0
Figure 1: Illustration of the meaning of nodes in hierarchical softmax.

Algorithm 1: Original Word2Vec skip gram with negative sampling training steps
1 foreach training words do
8 9
// global memory
for i to number of context words around target do
/ W0[target] Wi[contexti\] // global memory, vector-to-vector accumulate contexti error using /; update Wi[target\ using /; generate k negative samples; for j to k negatiue samples do / W0[sarnplej\ Wi[contexti\; accumulate context error using /; update Wi[samplej] using /;
update W0[contexti\ with accumulated error; 1
// vector-to-vector
// global memory // global memory
Negative sampling
When using the negative sampling method, the W0 matrix is a second set of word vectors. The context words are represented by vectors in the IT* matrix, and the target words are represented as vectors in Wa. A configurable amount of random words are sampled from Wa at each training step, and the network is trained to predict the absence of the negative sample in the context window, while predicting the presence of the actual target word. The random sample represents words that are not found in the context window around the target. In this case, the sigmoid of the dot product of a vector in Wa that is the actual target word and the context word vector from IT* should be close to
1, while the sigmoid of the dot product of a vector in W0 that is a negative sample and the context word vector from Wi should be close to 0. This sampling technique prevents the need to update all words in WQ that do not appear in the context window at each

training step. The basic skip gram with negative sampling (SGNS) training steps are described in Algorithm 1.
Original Parallelization Scheme
With all of the original Word2Vec architectures, each individual thread operates independently during the training process. After the vocabulary and network initialization is complete, each thread opens up its own hie handle, and trains using a segment of the input hie. As the number of threads increases, the number of hie handles increases. Each hie handle is evenly spaced throughout the training document. During training, each thread hrst goes to disk, and reads into memory a single sentence. Once that sentence is fully read, the thread will begin to hnd context windows within the sentence, and generate training data based on the words found within each context window. The thread then uses these context windows to train the shared-memory matrix structures. Once all of the windows are processed in a single sentence, the process repeats by going back to disk for the next sentence. This is depicted in Figure 2.
Input file on disk
Training function
Main memory
Figure 2: Original parallelization scheme

Figure 3: Declining per-thread performance as the number of threads increases.
The updates to to the shared-memory matrix are not synchronized. This unsynchronized learning technique is called Hogwild [14], The idea is that there are enough training steps to compensate for the loss of information that occurs when threads overwrite ea-chothers updates in memory. The advantage to this technique is that there is no time lost locking and unlocking the shared-memory weight matrices.
Even though there is no locking or memory protection in this approach, there is still a performance problem when sharing the weight matrix structure in multiple cores. For example, some thread, Thread 1, might be reading from a vector in memory at the same time that some other thread, Thread 2, is writing to that vector. This will cause a false sharing issue in local cache and Thread 1 will be slowed by cache coherence enforcement. Its this problem that starts to degrade the effectiveness of using multiple threads in the original implementation, as seen in Figure 3. In this figure, throughput is measured as words processed per second, per thread.
Faster Word2Vec Models
As shown in Figure 3, the original implementation of Word2Vec was not optimized for large-scale multicore systems. The per-thread performance starts to decline rapidly

as the number of threads increases on multi-core, shared memory systems.
The original Word2Vec implementation uses shared writable data structures throughout the execution. These structures do not lend themselves well for optimized performance in current multi-core shared-memory architectures with private level one and two cache organizations. For example, the current vector-to-vector operations are handled in loops that interact with shared memory. When multiple threads are writing updates to the same vectors in shared memory, the thread execution is repeatedly interrupted by cache block invalidations to enforce coherence [12] [13]. By first copying small batches to thread-local memory, the effect of these conflicts is largely reduced. The batches can then be trained with a set of optimized matrix-to-matrix operations.
Algorithm 2: Intel matrix to matrix matrix training i foreach training words do
8 9
generate k negative samples;
for i to number of context words around target do WiLocai[i\ e- Wi[contextj\;
WoLoCai[o] t- Wo[target\] for j to k negatiue samples do WoLocai[ 1 + j] t- Wolsamplef;
G Ui GT X W0L0cai J Uo < G X WiLocah update Wi from Up update W0 from UQ\
' iLocal't
// global memory // global memory
// global memory // local memory, matrix to matrix
// local memory, // local memory,
matrix to matrix matrix to matrix // global memory // global memory
Recent work on speeding up Word2Vec has focused on the skip gram with negative sampling (SGNS) variation of the algorithm [12] [18]. When using SGNS Word2Vec,

re-use of the same negative sample set across an entire sentence has been shown to speed up the training process [12] [18]. By reducing the amount of random sample generation there is an immediate reduction of work. Furthermore, by using a small subset of the same samples across all target/context sets in a sentence, the amount of cache coherence conflicts drops as a fewer number of threads are concurrently working on the same vector.
BIDMach is a machine learning library that includes a variety of different supervised and unsupervised models. It is written in Scala, and allows for quick development and optimization of complex machine learning models. One such model is a custom end to end GPU implementation of Word2Vec. BIDMachs implementation introduces the idea to batch training steps into matrix primitives, and re-use negative samples to cut back on memory bandwidth. These ideas are what led Intel [12] to great success with their own CPU implementation.
BIDMach [18] and Intels work [12] have demonstrated that by batching the training word vector data into separate, thread-local spaces in CPU memory, the cache coherence conflicts are greatly reduced and multi-core operation speeds up. By batching training operations into local memory, network activation, softmax and error gradients can be calculated without going back to shared memory for updated data. The result is a drastic reduction in cache coherence conflicts. The batched training operation is illustrated in Figure 4.
This batching method also allows the training dot product operations to be reduced to a single set of matrix-to-matrix operations [12]. The matrix-to-matrix operations provide greater opportunity to achieve speedup through parallelization and greater optimization of local cache, especially on hardware that is built specifically to optimize such operations. The altered algorithm developed by Intel is shown in Algorithm 2.
In the negative sample and context word batches used by BIDMach and by Intels approach, there is an implicit label matrix associated with the result of the batched matrix multiplication. Although this matrix does not exist in memory, it appears as

Input file on disk
Thread 1
Thread 2
Thread 3
Thread 4
Training function
Main memory
Figure 4: Updated parallelization scheme
Figure 6.
The rows of this matrix correspond to the target word (row 1), and the selected negative sample words that are used by the batch (rows 2 to m). The matrix represents the ideal result of the W im*Won matrix activation. Error is calculated as the difference of the softmax of the result of this activation and the ideal matrix label. It is implicit in the BIDMach and Intel models because each algorithm simply checks to see if the word is the target or not in order to decide the 0 or 1 label.

4word, vector size^t
target sample i sample2
T -

O so so so so ho
X Q 0 0 0.0
^ QJ

o o o o
ooo O

3 1
i -
Figure 5: Matrix multiplication: (words from W0L0cai) x (words from WfLocal) in Intels batched network activation method.
cl c2 c3 ... c(n)
ill... 1 0 0 0... 0 0 0 0... 0 0 0 0... 0
ooo... o
Figure 6: Implicit label matrix used in Intels batch model.

In this chapter we present new and improved Word2Vec skip gram algorithms packaged in our own parallel, object-oriented C++ implementation that is optimized for multi-core shared memory CPU, and NVIDIA GPU. The ideas explored while developing this code have all been focused on batching word vectors into matrices. Hence the name of the package: Wombat short for word matrix batches.
The contributions described in this paper are:
New CPU implementation of the skip gram with hierarchical softmax architecture (SGHS) that achieves as much as 4x speedup over the original Word2Vec SGHS implementation on multi-core shared memory machines. This method does not require special hardware, but will show an even more dramatic increase of performance of up to almost lOx the original Word2Vecs speed on Intel machines with the Intel MKL library support.
Original CUDA kernels developed for speed and accuracy of the skip gram with negative sampling approach (SGNS) that offer a great improvement of accuracy over the current state of the art GPU Word2Vec implementation, BIDMach [18]. Our GPU implementation of SGNS is 8x faster than the original Word2Vec, without a loss of accuracy.
A fast GPU SGHS method, which is the fastest and most flexible GPU implementation of this method that we know of. Our GPU implementation of SGHS is lOx faster than the original Word2Vec implementation, and 4x faster than leading improvements.
The code was developed with flexibility in mind. It can be compiled with GCC, ICPC for Intel MKL support, or with CUDA for NVIDIA GPU support. Note that all variations of the Word2Vec algorithms have yet been implemented in this code the

CBOW methods are currently absent. The focus of this library is the two different skip gram techniques: hierarchical softmax (SGHS) and negative sampling (SGNS).
New Wombat hierarchical softmax method
We propose a batched training approach to hierarchical softmax in which the set of vectors representing the hierarchy of nodes defining the target word are trained with the set of context word vectors in a single set of matrix operations. The batching algorithm is shown in Algorithm 3. Each thread will be executing these steps in parallel.
Algorithm 3: Wombats hierarchical matrix to matrix training
1 foreach training words do
for i to context words around target do WiPrivate[i\ Wi[contexti];
8 9
for j to k vocabulary tree node vectors do ^VoPrivate[j\ ^ Wo\jl0dej\]
G < WoPrivate x >
calculate error in G;
U% ^ G X WoPrivatei UQ i G X \ViPrivatei update Wi from Ug, update W0 from U0\
The vectors in Wa that represent the nodes in the target words node set are batched into matrix WoPrivate, and the vectors in Wi that represent the individual context words are batched into matrix WiPrivate- The work done by activating each individual pair of node vector to context word vector can now be done with a single call to WoPrmate x WiPrivate- illustration of this network activation operation is shown in Figure 7.

4word, vector size^t
node i node2 nodes
T -

1 2 3 n
O ho -fO HO HO HO
'A' <13 HO HO HO HO
£ £ £ £
o o o o

i -
Figure 7: Matrix multiplication: (words from W0private) x (words from W[Private) in the proposed batched network activation method.
The result of this operation is a kxn matrix, where k is the number of vocabulary tree nodes that define the target word, and n is the number of context words surrounding the target word in the context window. The sigmoid of each individual entry in the matrix is calculated, and error is determined from the difference between the sigmoid value and the label of the given row and column in the label matrix. In the case of hierarchical softmax, the label matrix matches the codes of the Huffman binary tree associated with the target word. An example label matrix is exemplified in Figure 8.
Batching Operations for GPU
There are numerous GPU implementations of Word2Vec, but none of them we have found offer a flexible, fast skip gram hierarchical softmax (SGHS). The GPU implementations we have found all focus on either the CBOW method or the negative sampling variant of the skip gram method [18], and/or place limitations on the sizes of word vectors that can be trained [19].
Our work will also focus on the skip gram Word2Vec architecture. We present both a hierarchical softmax (SGHS) and negative sampling (SGNS) implementation within this

cl c2 c3 ...
nodel 111... 1
node2 0 0 0... 0
nodeS 111... 1
node 4 111... 1
node(m 1) 0 0 0... 0
node(m) 111... 1
Figure 8: Adjusted label matrix for hierarchical softmax. Example word has Huffman code 1011...01
architecture. We also present a new heterogeneous method of batching and training the network for the negative sampling approach that yields higher quality word vectors.
Data dependencies
Every sentence in the training data will yield multiple target/context word sets that are used for training. As a context window is dragged from the start of the sentence to the end of the sentence, each individual word will be a target word once, but will be a context word multiple times for different neighboring target words. This means proximate positions of the context window yield repeated updates to the same context word vector in IF*. For this reason, the updates cannot happen concurrently without a loss of information.
This problem is illustrated in Figure 9. In this diagram, each row represents the same 14 training words found in a single sentence. The circled words, w7, w8 and wg are target words in the first, second and third training steps, respectively. The underlined words represent the context words that surround the targets. The arrows show the overlap in updates that occur between subsequent positions of the context window. Although the

w 14

Figure 9: Data dependencies between training steps
target words do not overlap, the context words do. Target/context sets from the same sentence cannot be batched into a single concurrent training step.
When batching context window training on the multi-core, shared memory CPU as described in Section there is no such overlap. The multi-core CPU implementation has each thread training independently with separate hie handles. Each hie handle is used to read a different part of the same input hie, so each context window trained in parallel is coming from a different part of the input, and a different sentence. Plowever, an NVIDIA Tesla P100 GPU can execute 64 cores in parallel on each of its 56 streaming multiprocessors. In order to keep all 3584 CUDA cores well-fed with training operations, more batches than hie handles are required. Batching training operations for this level of parallelism requires careful consideration of this data dependency. Batching subsequent context windows from the same sentence will cause a loss of information between training steps. This problem is also described in [19].
Proposed batching method
Data transfer from CPU to GPU is an expensive operation. Reducing the amount of transfer is important for performance. When CUDA kernels are launched, they should have enough data to work with to make the transfer worthwhile, but larger training batches run the risk of hitting data dependency issues described in the previous section.
We also want to make the most of our CPUs when the GPU is busy. If we offload 100% of the work to GPU, our CPUs become idle and are a wasted resource. However,

Input file on disk CPU parsing GPU Batch Data GPU Global Memory
Figure 10: CPU/GPU heterogeneous batching diagram
because of the expense of transferring data from CPU to GPU and back again, it is difficult to have the two processing environments working together.
We propose a heterogeneous method of Word2Vec training. The input and output weight matrices are kept completely in GPU memory, but the multi-core CPU is still responsible for parsing the input hie and generating the training batches. The multiple CPU cores will parse the input hie and prepare training batches that are carefully constructed to avoid data dependency issues. These batches are then sent to GPU to be processed. Each CPU thread will work with its own CUDA stream to send data and queue kernel launches in the GPU. Each batch prepared on CPU is a set of integer references to vectors in the GPU weight structures. By transferring smaller integer references, rather than complete vectors, we reduce the amount of CPU to GPU data transfer. The batching model is shown in Figure 10. The initial step of tokenizing and sorting the vocabulary, and preparing the required data structures for the training phase is still done on the CPU before heterogeneous training begins.
Recent releases of CUDA have featured the ability to transmit data and execute kernels in parallel. We optimize GPU execution by parsing the input hie on the CPU at the same time that the GPU is using its boating point units to process training batches. The CPU threads each start by parsing part of the file, as was done in the original C

implementation of Word2Vec. However, instead of working with only a single part of the hie, and calculating vector updates, each CPU thread works with multiple hie parts and fills a buffer of tar get/context sets. Each target/context set contains a set of target word integer tokens, a set of context word integer tokens and a set of labels for each target/context word pair in the batch. These target/context integer token sets reference indices in the GPU weight matrices. The target word tokens are indices of the output matrix, and the context word token set are indices of the input matrix. Each training batch also contains a few other integers that point to boundaries of the training set. An example batch is illustrated in Figure 11.
CPU prepares batch index data
targets: 1023, 10120, 534, ... contexts: 1231, 853, 2342, 12312... labels: 1, 0, 0, 0, 1, 0 .... batch meta data
GPU unpacks batch
W on GPU
Figure 11: CPU/GPU batch data transfer
Sending tokenized sentences to the GPU for processing would be a more efficient method of data transfer. There are (x x 2c) + B integers in each training batch, where x is the number of words in a sentence, c is the context window size, and B is the amount of meta data (labels and boundary data) sent in each batch. In [19], only x integers are sent, and the target/context sets are parsed in the GPU. However, by parsing target/context

sets on CPU, the CPUs can be actively participating in the training operations at the same time as the GPU. The GPU is left to do only the floating point operations, which is still the bulk of the work in this application. The CPU to GPU memory transfer latency is hidden and overlapped by the amount of time it takes to complete the floating point operations.
The target/context integer sets are buffered into pinned CPU memory. The number of sets to be buffered is a parameter that the program takes as input. Each thread must make sure that every target/context set in a single batch comes from a different sentence, so the batch size must be small enough relative to the number of hie parts being parsed to ensure a diversity of sentences in each batch.
Once an individual CPU threads buffer is filled, the thread starts a memory transfer to the GPU. In recent releases of CUDA, this is a non-blocking transfer that can be queued into a CUDA stream. Once the transfer is initiated, the thread can go back to its duty of parsing the input hie and collecting the next training batch. While the CPU threads collect the next training batch, the GPU can actively work on the previous batch. The kernel execution can be queued into multiple streams per CPU thread to prevent any blocking of the CPU or GPU during the exchange of target/context set data.
This approach does not require preprocessing and therefore does not hide the need to repeatedly revisit the input training data from the count of words per second. The training hie is parsed at the same time that training steps are processed on the GPU. Furthermore, by having a diverse set of training data in each batch, the issues relating to accuracy are addressed.
GPU Kernel Design
For both SGHS and SGNS, the work done during training is similar. A sub-matrix is collected from WQ and a sub-matrix is collected from W*. This similarity allows us to re-use the same CUDA kernels for both the hierarchical softmax method and the

negative sampling method. There is a difference in how the label matrix is constructed and applied, but the concepts described here are consistent for both SGHS and SGNS.
The cuBLAS library provides the fastest implementations of simple matrix multiply operations. However, by using library functions, we give up the ability to keep vector data in GPU shared register memory as we move from the matrix multiply operation to the other actions required during each training step. For this reason, we have designed our own GPU kernels to handle each target/context set training operation.
Its important to note that the matrix sizes are not consistent in each training set. The sentence sizes are not always the same, not all words in the vocabulary tree have the same number of nodes, and there is a random dropping of training words that occurs during parsing. This complicates efficient use of the CUDA cores. The most efficient implementation of CUDA should not allow warp divergence.
We developed a kernel that executes Algorithm 3, lines 2 to 11, in blocks of exactly 32 threads for each training step. By using 32 threads per block, we can drastically reduce or completely eliminate warp divergence, depending on the size of the word vectors. We can then utilize GPU shared memory registers and shuffle operations to reduce interaction with GPU global memory, thus increasing performance.
In order to get uneven matrix multiplications into evenly sized blocks of 32 threads, some pre-processing is required on the CPU. As target/context sets are reduced to training operations, there is a variable number of target vectors and context vectors. In SGNS, the number of target vectors is 1 + k where k is the number of negative samples. In SGHS, the number of target vectors is 1 to MAX_CODE_LEN where MAX_CODE_LEN is the longest binary code assignable to a word in the classification tree. The number of context vectors ranges from 2 to 2c where c is the context window size. There is a variable number of context vectors because Word2Vec randomly drops context words during training.
Since there are typically at least 4 target vectors, and at least 8 context vectors, we can batch training operations into 4x8 sub-operations. These operations can be

carried out efficiently in a matrix-based kernel. The remaining training operations can be handled with a separate vector-based kernel.
The program takes a batch size parameter. This parameter is used to determine how many target/context sets should be sent to GPU at once. Out of this number, we batch as many 4x8 sets of training sub-batches as we can, and then batch the remaining training operations separately. Once the given number of target/context sets are buffered into either the matrix batch or the vector batch, we launch 2 kernels. The first kernel launch handles the matrix batches, and the second handles the vector batches. The two combined kernels handle the total given number of target/context sets. The step-by-step process is illustrated in Algorithm 4. Each CPU thread executes this sequence in parallel.
Algorithm 4: Wombat GPU kernel actions
1 foreach training words do
2 collect context window batch data on CPU;
3 split batch into 4x8 matrix;
4 append matrix batch to matrix batch;
5 append remaining dot-products to vector batch;
6 if have given number of context windows in buffer then
7 send buffered batches to GPU;
foreach matrix kernel block do Execute fast matrix kernel
foreach dotproduct kernel block do Execute fast vector kernel

+4 +4
Figure 12: Loading data to shared memory.
Fast matrix kernel
The fast matrix kernel uses shared memory and shuffle operations to quickly perform all of the batch operations described in Algorithm 3, lines 2 to 11. The block size is 32 threads. These threads can be considered to be a 4 x 8 grid and a 8 x 4 grid. The first step is to take the GPU batch data from the CPU kernel launch, and to load vector data.
Figure 12 illustrates the shared memory load operations. First, as a 4 x 8 grid in w/8 steps, the threads work together to load 4 WQ vectors from GPU global memory to shared block memory. Then, as a 8 x 4 grid in wj4 steps, the threads work to load 8 Wi vectors from GPU global memory, where w is the word vector size. There is no synchronization required during these first two operations.
The next step is to perform a matrix multiply of the 4 vectors from WQ1 and the 8 vectors from Wi. Note that these 32 vectors are all now in shared register memory as WgPrivate an<4 bhiprivate, respectively. Each thread will perform a single dot product in a loop of w steps. This is illustrated in Figure 13. This approach is simpler than some more advanced matrix multiplication techniques, but works well because the complete set of 32 vectors will fit into shared memory. The size of w typically ranges between 100 and 300.
After each thread independently calculates its proper dot-product of vectors in shared register memory, leaving the result in thread-local memory. Each thread then applies the

sigmoid function and calculates error for its local value given its position in the label matrix.
... w

Figure 13: Wombat matrix multiply.
The next step involves two matrix multiplications. The 4x8 block of threads can be thought of as a single matrix, G. Each thread holds a single value in its local memory representing the value of G at its respective row and column. The trick now is to perform GT x WoPrivate and G x Wiprivate, without putting the individual row and column values of G into shared or global memory. We already have what we need in each threads local memory. Using a handful of shuffle operations, we can avoid using the next level of memory hierarchy. We can also do both matrix multiplications at once with a single loop of size w. The process is illustrated in Figure 14.
. w
I... w
... w

Figure 14: Wombat update calculation illustration.
In the case of the (4 x 8 G matrix) x (the 8 x w Wiprivate matrix), the result is a 4 x w matrix. We can calculate this in w steps. In the first step, each row of threads in G performs an element-wise multiplication with the elements in the first column of

Wipriva,te Each thread in the row does a single scalar multiplication with an element in the shared memory Wiprivate column that corresponds to the threads column. For example, row 1, column 4 of the block of threads will do a single multiplication of its single local Gi;4 value with column 1, row 4 of Wiprivate. The result of this first step is that each of the 4 rows in our block of threads now has a 8 threads each containing the single product of an element in G and the corresponding element in Wiprivate. The next step in a matrix multiplication is to sum these 8 elements to form an entry in each of the 4 rows in our result matrix. This summation is done with a series of shuffle down operations that pass data between threads within our single warp. The reduction sum operation across our 4 rows yields 4 entries to the result matrix. This operation is repeated w times to complete the matrix multiplication. Within the same loop of w steps, the transpose operation is done to perform GT x W0private-
Fast vector kernel
The vector kernel behaves similarly to the matrix kernel. Each kernel uses a block of 32 threads. The threads first load two vectors into shared memory from global memory. Each of the threads then works on a subset of elements in each vector to perform elementwise multiplication, and then the threads use a shuffle operation to reduce the individual products down to a single dot product. The single dot product is broadcast to all 32 threads, which then update segments of the global memory vectors in parallel. The vector kernel does not perform as fast as the matrix kernel, but it is necessary to handle the training data that does not fit nicely into 4x8 sub-matrices.
Experiments and results
The experiments presented below were performed on a single node of the Heracles cluster in our Parallel and Distributed Systems lab ( that has 2 Intel Xeon E5-2650v4 Broadwell-EP CPUs with 24 total cores (12 cores per processor).

The CPUs run at 2.20 GHz, and the node has 128GB RAM. The operating system is CentOS Linux release 7.2.1511. Code was compiled with the Intel C++ Compiler version 17.0.1 and Intel MKL version 17.0 Update 1 for testing Wombat + MKL, and GCC version 4.8.5 was used to compile and test Wombat without MKL. Also on this node are 4 NVIDIA Tesla P100 16GB Pascal SXM2 GPUs. One GPU was used for all Wombat and BIDMach experiments. GPU code was compiled with NVCC V8.0.61. The CUDA driver is version 8.0, with compute capability 6.0.
Gensim is a popular Python NLP framework that has one of the most popular implementations of Word2Vec [20]. Its popular because it integrates with a rich Python ecosystem of math and machine learning libraries, and there is a large amount of online documentation that explains how to use the library. We compare our results to the performance of Gensim, as well as to Cythnn and the original C implementation. Cythnn contains a Word2Vec implementation that attempts to speed up hierarchical softmax by reducing the amount of cache coherence conflicts [13], but the technique proposed does not batch training sets into matrix-to-matrix operations, which is a big part of what has made Intels work in [12] with the negative sampling so successful. Instead, Cythnn simply copies sets of training data into thread-local memory. The local data is updated in place of shared memory updates, and the local copies are written back to shared memory only occasionally. The result is a 2x to 4x speed up, as the cache coherence conflicts are indeed reduced. However, the technique does not scale well to multiple processors, and does not provide the opportunity to use highly optimized matrix-to-matrix library calls.
The BIDMach implementation of Word2Vec claims to have the fastest GPU implementation of SGNS Word2Vec [18]. However, our analysis and tests of the BIDMach model have revealed flaws in the implementation that affect the accuracy of the model output. We have adjusted the parameters of BIDMach to get useable output, and found speeds to be lower than what has been reported. We compare our implementation to the speed and accuracy of BIDMach.

Gensim Wombat
Number of cores
Figure 15: Speedup for batched hierarchical softmax CPU method compared to leading implement at ions.
Accuracy was evaluated for word similarity scores and word analogy scores for baseline comparison with various models. The WS-353 dataset [22] contains a large set of word pairs. Each word pair is given a similarity score that is sourced from human judgement. To evaluate the accuracy of our model, we use the Spearmans rank correlation coefficient [24] between the human judgement and the cosine similarity of our word vectors that represent the words in each pair. Values of this similarity that measure close to 100 represent a model that can detect similarities between words as well as a human. We also used the Google analogy dataset [7] to evaluate the performance of the vectors in answering ffil-in-the-blank analogy questions. The score ranges from 0 to 100 and represents the percentage of analogies correctly answered.
CPU Batched Hierarchical Softmax
The Word2Vec parameters used were as follows: window size was 100, training hie is the texts dataset from Matt Mahoneys website1 (this is a truncated and cleaned export of Wikipedia articles that is often used to test the performance of Word2Vec), the
1 http: / / .zip

Table 1: Evaluation of hierarchical softmax accuracy in new Wombat model compared to Cythnn and original Word2Vec
Program Similarity score Analogy score
Wombat 68.7 31.5
Cythnn with top-31 cache 67.7 31.1
Original Word2Vec 68.0 34.0
downsampling parameter is le-4, 10 training iterations are used and the window size is 8. Cythnn was configured to cache the top 31 nodes in the hierarchy.
The results show more than a 4x speedup when using Wombats batched training method without Intel MKL over the original C implementation on all 24 cores. The seal-ability is linear across the two processors for the non-MKL batched method. Cythnn does not scale to the second processor. This is likely due to restrictions of Python itself. The Global Interpreter Lock (GIL) in Python prevents the threads from scaling to multiple processors. The same problem exists for Gensim. The proposed C++ implementation does not suffer from this problem, which allows it to reach greater speeds on multiple processors. Although Cythnn outperforms the Wombat method on a single processor, Wombats MKL-enhanced implementation is much faster in all numbers of cores and processors.
There is a dip in Wombats performance when the threads start to use the second processor in the batched version with MKL. Each processor has 12 cores, and each set of 12 cores shares an L3 cache. When less than 13 cores are used, all of the work is done in a single processor. When 13 or more cores are used, a second L3 cache and a second processor are introduced. There is a lot of data being shared by each processor, so there is an additional layer of cache coherence conflicts between the two processors. These conflicts are slower to resolve because the two L3 caches reside in separate processors. An increase of threads running in the second processor yields an increase of cache conflicts,

and an increase of data transfers through main memory. These increases result in a degradation of overall performance. When MKL is absent, this degradation is hidden by the extra time it takes to do the matrix operations. This problem only becomes visible when those operations are fast enough to leave behind the memory transfer as the primary bottleneck to performance. At 12 cores, Wombat with MKL is more than lOx faster than the original C implementation. As far as we know, this is the fastest implementation of skip gram hierarchical softmax Word2Vec to date.
The accuracy of each implementation is extremely similar. There is a slight loss of accuracy in both Cythnn and the Wombat model in the analogy scores, but the word similarity scores remain high. The increased speed comes at a minimal loss of accuracy.
GPU Experiments
The Word2Vec parameters are as follows: word vector size 100, context window is 8, 5 negative samples are used, downsampling set to le-4, 10 iterations and a learning rate of 0.025.
The Wombat program also takes the number of CPU threads as a parameter. These experiments were executed with 8 CPU threads. This increases the rate at which GPU training batches are collected from the training hie.
Batch sizes greater than 512 for Wombat do not increase speed, and start to cause problems with the way that sentences are batched, so all experiments for Wombat use a batch size of 512 target/context sets (split into a variable number of matrix and vector batches). The BIDMach batch sizes are not documented. Its not clear what this means, but we discovered that reducing the batch size does seem to have an effect on accuracy and speed. Reducing the batch size to values smaller than 100,000 causes the program to generate errors.
Note that the speeds reported here for the BIDMach GPU implementation are much lower for BIDMach than what has been widely reported in other research. In [18] a figure

Table 2: Execution time and accuracy of Wombat vs. BIDMach.
Program Batch size Word similarity score Analogy score Words/sec
BIDMach GPU SGNS 1,000,000 24 2.9 3.236M
BIDMach GPU SGNS 100,000 33.5 1.6 1.492M
Wombat GPU SGNS 512 68.2 31.9 4.633M
Wombat GPU SGHS 512 67.2 32.2 1.942M
of 8.5 million words/second is reported. This speed is possible with the default learning rate and vexp parameters, but the output model returns 0% accuracy in all tests with these parameters. The exact function of the vexp parameter is not documented. Only by setting this to 0 and reducing the learning rate does the model actually produce useful vectors. It then produces them at the slower speeds reported in Table 2. Also note that BIDMachs words per second reports do not factor in the required preprocessing of the training data that is unique to BIDMach. Wombat does not require any preprocessing. The Wombat GPU implementations for SGNS are similar in speed to BIDMach or faster, and they provide much greater accuracy.

Fast Text is an open source library developed by Facebook AI Research. The library has shown that use of sub-word information greatly improves the accuracy of the vector representations of words made popular by Word2Vec [16]. However, the per-thread performance of the Fast Text library degrades as the number of threads increases on multi-core, shared-memory systems. Our analysis shows this to be largely due to cache coherence conflicts. Addressing this issue is the goal of this work.
We restructure the implementation and enhance the level of parallelism and the algorithms scalability by batching training steps into isolated vector-to-matrix operations in thread-local memory. The existing implementation uses shared writable data structures throughout the execution. These structures do not lend themselves well for optimized performance in current multi-core shared-memory architectures with private level one and two cache organizations. For example, the current vector-to-vector operations are handled in loops that interact with shared memory. When multiple threads are writing updates to the same vectors in shared memory, the thread execution is repeatedly interrupted by cache block invalidations to enforce coherence [12] [13]. By first copying small batches to thread-local memory, the effect of these conflicts is largely reduced. The batches can then be trained with a set of optimized vector-to-matrix operations.
Use of highly optimized vector and matrix libraries can greatly reduce execution time of applications like Fast Text that have linear algebra routines in their core [15]. We have restructured Fast Text to use a linear algebra library. This alone speeds up Fast Text, and provides even greater improvement to speed when coupled with our proposed vector-to-matrix training batches. Matrix-to-matrix operations would provide further optimization [12], but we have found that it is not possible to reduce the operations of each Fast Text training step to a matrix-to-matrix operation without introducing significant overhead, or reduced accuracy of the output model.

The reduction in performance created by cache conflicts can be further reduced by re-using the negative samples collected for each training step. Negative sample re-use reduces the number of shared-memory updates that are necessary during each training operation [12] [18]. We also propose a way to reduce repeated negative sample use to a simpler scalar multiplication of single floating point values. This reduces the amount of calculations required at each training step, and also further reduces the chances of cache coherence conflicts.
FastText skip gram with negative sampling
In addition to learning vectors for individual words, FastText also learns vectors for the n-grams that are found within each target word. At each training step in FastText, the mean of the target word vector and its component n-gram vectors are used to activate the hidden layer. The gradient vector that is calculated in each training step is then used to uniformly update each of the vectors that were combined to form the target. Going back to the example sentence above, the target word cute would yield a hidden layer that is the mean of the n-gram vectors for cut, ute, and cute. These 3 vectors would then be updated with the same gradient vector after the error is calculated in the training step.
This adds a lot of additional computation to each training step, especially for longer words that are composed of more than just 3 n-grams. At each point, a word needs to sum and average its n-gram component parts. The trade-off is a set of word-vectors that contain embedded sub-word information. These vectors have been shown to be more accurate than Word2Vec vectors by a number of different measures [16].
Negative sampling in FastText
In FastText, the negative words are sampled from the context word vector matrix rather than the target word matrix. The context words and negative samples reside in WQ1

not Wi. Target words are a summation and average of the target word vector in IF* and the target words set of n-gram vectors in IT*. At each training step, a vector-to-vector dot product is used to calculate the network activation for context/sample words from W0 and the mean of the vectors from Wi. The mean is stored in thread-local memory, but the W0 context/sample vectors used in this operation remain in shared memory.
Since both the samples and context words are coming from WQ in Fast Text, it is difficult if not impossible to create a matrix-to-matrix operation. In order to reduce the training step to a matrix-to-matrix operation, the negative words would have to be sampled from the same source as the target word. The two sets of word vectors in the training step must come from different weight matrices, or else the training procedure is reduced to updating only a single weight matrix in the neural network. The input matrix, Wi, is the set of word vectors that will be the output word vectors of the algorithm. It makes no sense to create extra work in training only Wa.
It is also not useful to try to create a set of negative samples from Wi in Fast Text. Because the vectors taken from Wi are summed and averaged, sampling from Wi would add a considerable amount of computation to the procedure. Each negative sample would have to be summed and averaged with its sub-word n-gram vectors. Skipping the sum-and-average step for negative samples could prevent the need for extra work, but doing so has a negative effect on the models accuracy.
Proposed improvements to FastText
We propose a set of vector-to-matrix operations as an alternative to the SGNS Fast-Text core training loop. Rather than train each target/context word pair individually, the complete set of context words surrounding a target word within a context window are trained in a single batch in thread-local memory.
This is accomplished by first aggregating the set of context word vectors from the global Wa weight matrix into a local context word matrix, W0L0Cai For each vector

appended to W0Locai, we need to consider k negative samples, where k is the number of negative samples used when training each target/context word pair. The word vectors in W0 that represent our k negative samples should also be added to the context word matrix, WoLoccU.
Rather than sample a new set of words for each target/context word pair, we propose to instead use a single set of negative samples for all target/context word pairs that appear in the batched context window. However, because we are putting context vectors and negative samples into the same matrix, this sample re-use results in duplicate vectors in WoLocai, as shown in Figure 16.
4word, vector size^t
^2 targets count{targets)

HO 0)
CD *+o CD
' H
CO a. a CO a. a
e S:S e S:S '
o cc rt o nj nj
^ m vi V) m m
£ rH X
^ CD 03
qj a ^ a
a a
o rt 03
Figure 16: Vector-to-matrix batch showing duplicate sample vectors in bold.
This duplication of vectors in the matrix would result in duplication of work in the vector-to-matrix operation. Removing these duplicate vectors would greatly affect the ratio of negative samples to context words in each training batch. This negatively affects the accuracy of the models output. The sample/context ratio should remain the same as it is in the original Fast Text implementation to ensure that the effect of the negative samples on the model accuracy is not reduced. The effect of using the same

samples repeatedly can be reproduced with a simpler multiplication operation of the
error gradient, in place of keeping duplicate vectors in the context matrix.
4word, vector size^t
^2 targets count (tar gets)


-to -to
CO CO -to -to
o o
£ rH X
ho CD
oj a ^ a
a a
o rt 03

Figure 17: Vector-to-matrix batch with duplicates removed. A single set of k samples remains.
When the target vector is multiplied by the context word matrix, as is shown in Figure 17, the result is a 1 x (n + A;) vector, where n is the number of context words in the training set, and k is the number of negative samples. Applying the sigmoid function to each element in this vector yields a vector of values that can be used to calculate updates to the target vector and to each context word vector. This vector can be adjusted to reproduce the effect of each negative sample appearing n times for each of the n context words. The elements in the gradient vector representing samples simply need to be multiplied by n. The result, when using this vector to calculate updates to Wi and Wa, is mathematically equivalent to having duplicate columns in the context matrix, without generating the duplicate work. Experiments show that including this operation does make the resulting output more accurate than simply eliminating the duplicate vectors. This operation is illustrated in Figure 18.

contexti ... contextn sample ... sample^
y. targets count (targets)
/ / nx f ... nx f
Figure 18: Multiply sample gradients by n rather than calculate their dot product n times.
The complete training algorithm used is described in Algorithm 5.
Hardware: The experiments presented in this paper were performed on a single compute node of the Heracles cluster in our Parallel and Distributed Systems lab ( that has 2 Intel Xeon E5-2650v4 Broadwell-EP CPUs with 24 total cores (12 cores per processor). The CPUs run at 2.20 GHz, and the node has 128GB RAM.
Software: The operating system is CentOS Linux release 7.2.1511. Code was compiled with the Intel C++ Compiler version 17.0.1 and Intel MKL version 17.0 Update 1.
Eigen for BLAS, MKL: Eigen is a C++ template library for linear algebra. It has no external dependencies other than the C++ standard library. It works as an interface to Intel MKL, Apples Accelerate framework on OSX, OpenBLAS, Netlib LAPACK and other BLAS back ends. The library offers an intuitive syntax that makes it easy to integrate a number of different linear algebra optimizations with a variety of compilers [17].

Algorithm 5: Proposed vector-to-matrix training
1 foreach training words do
generate k negative samples;
n number of context words around target; for i to n do
WoLocai[i\ r- Wa[contextj\;
8 9
for j r- to k negatiue samples do
WoLocai[n + j] r- W0[samplej];
ViLocai = zeros
foreach subword of target do ViLocai = ViLocai + VPj [subword] ]
// global memory
// global memory
// global memory
iLocal number of subwords
G t- Woiocoj x lG[ocap // local memory,
calculate error in G;
for j r- to k negatiue samples do
Wol.ornl | O I j ^ IPoLocaZ I j X 11,
// FastText vector-to-matrix
18 19
Ui < GT X IPoLocaZ; Ua < G X ViLocai J update VF0 from [/*; update VP* from [70;
// local memory, vector-to-matrix // local memory, vector-to-vector // global memory // global memory

Fast Texts first release already uses matrix and vector classes that were written in pure C++. By rewriting the core operations in these C++ classes to call the Eigen library, it is possible to achieve immediate speedup even for the vector-to-vector operations. For the experiments in this paper, all matrix and vector calls in the original Fast Text implementation have been replaced with calls to the Eigen library, version 3.3.3. The Fast Text library has then been compiled with Intels ICPC compiler with the macro EIGEN_USE_MKL_ALL defined. As a result, all vector and matrix operations are compiled to use Intels MKL subroutines.
The speedup that comes from a result of using Eigen is reported separately from the speedup achieved through batched training steps in Figure 19.
Training corpora: Code was tested using the enwik9 Wikipedia dump hosted on Matt Mahoneys website. This test set is the same hie linked from Facebooks example shell scripts.
Evaluation: As with the majority of literature on Word2Vec, the speed results are measured in terms of words processed per second, and accuracy was evaluated in terms of a word similarity test and a word analogy test. The tests and results are explained in greater detail in Section .
Code: Experiments were conducted with a modified version of Facebooks open source Fast Text code. The modifications will be released in a forked branch on GitHub, and a pull request will be issued on publication.
FastText parameters: The parameters are similar to those used by Facebooks initial report of FastText [16]. Learning rate is 0.025, w is 100, window size is 5, 5 training epochs are used, 5 negative samples, 2000000 n-grams, 24 threads and n-grams between 3 and 6 characters were used.
Without a loss of speed with a smaller number of cores, the new approach to training the FastText model shows a speedup of about 2.5. Some of this speedup comes merely

from introducing Eigen and MKL to the matrix and vector operations. By replacing vector operations with Eigen calls, which in turn call Intel MKLs subroutines, there is a speedup of approximately 1.4, this is labeled in Figure 19 as FastText with Eigen. By batching the operations into local memory and using vector-to-matrix operations, theres an additional speedup of 1.78, this is labeled in Figure 19 as Batched SGNS With Eigen. These speedups are calculated based on numbers achieved with all 24 physical cores working. A large percentage of the speedup here is coming from a reduction of cache coherence conflicts. Our results show an almost linear increase in the number of words processed per second indicating further improvement and scale with additional cores is expected.
The batch method even outperforms the original Word2Vec code, but all implementations of FastText are dwarfed by the performance of Intels matrix-to-matrix batch approach to Word2Vec. The matrix-to-matrix batch operations are highly optimized, but there are also fewer vectors to update in the Word2Vec model than there are in FastText. It makes sense that a more optimized algorithm that ultimately has less work to do would be faster, but this faster Word2Vec model lacks the sub-word information contained in the FastText models.
The accuracy of our models is reported in Table 3. The WS-353 dataset [22] and the rare word (WS-RW) dataset [23] both contain a large set of word pairs. Each word pair is given a similarity score that is sourced from human judgement. To evaluate the accuracy of our model, we use the Spearmans rank correlation coefficient [24] between the human judgement and the cosine similarity of our word vectors that represent the words in each pair. Values of this similarity that measure close to 100 represent a model that can detect similarities between words as well as a human. We also used the Google analogy dataset [7] to evaluate the performance of the vectors in answering fill-in-the-blank analogy questions. The score ranges from 0 to 100 and represents the percentage of analogies correctly answered.


^Batched SGNS with Eigen Fast Text with Eigen -a- Fast Text
Intel Batched W2V Original W2V
Number of cores
Figure 19: Performance comparison of original Fast Text compared to proposed optimized versions and Word2Vec. Evaluated on the enwiki9 word set.
The rare words test shows the usefulness of the n-gram sub-word information. Fast-Text models outperform the Word2Vec models on this test because the sub-word data drawn from more common words enriches the vector representations of the rare words. We use this test to show that our proposed training methods do not degrade this important benefit of the Fast Text models.
In Table 3, Batched SGNS 1 is the batched method that uses scalar multiplication on the error gradients (described in Section ). Batched SGNS 2 shows the results of the same technique, without the gradient multiplication. The speeds of these two methods are approximately the same. The word similarity and word analogy scores are both increased with the multiplication operation, but it seems the rare words test actually shows a decrease in accuracy when the gradients are adjusted with scalar multiplication. Overall, both batched methods show a decrease in word analogy scores from the original Fast Text implementation and from Word2Vec. Re-using a smaller number of negative

Table 3: Evaluation of model accuracy in Fast Text vs. newly proposed batch method of Fast Text and Word2Vec models
Program WS-353 Word Analogy WS-RW
Fast Text SGNS 69.9 49.1 44
Batched SGNS 1 70.1 39.3 42
Batched SGNS 2 67.4 35.7 44
Word2Vec SGNS 70.7 47.6 37
Intels Batch SGNS 71.2 48.0 37
samples decreases the diversity in the training steps, which would explain the loss of accuracy. The rare words scores remain high for the batched Fast Text implementation, which suggests that the sub-word information continues to enrich the vectors.
Conclusion and future work
By restructuring the training operations to thread-local vector-to-matrix operations, and batching negative samples, a clear increase in scalability and performance of Fast Text is obtained. If possible, batching operations to enable matrix-to-matrix operations that would use a more optimized level-3 BLAS routine could yield even better performance.
The kind of batching proposed here has also proven to work well on GPU and multinode systems [12] [18]. Extending this idea to a multi-node approach or to a GPU would likely produce even further speedup.
Lastly, the batching method used here for SGNS Fast Text would likely work well with hierarchical softmax, too. The speedup from re-using negative samples would be absent from a batched hierarchical softmax, but the thread-local, vector-to-matrix training method would certainly improve performance.

In this work we have presented new methods for calculating Word2Vec word vectors at speeds faster than previously reported. We have applied these techniques to both the CPU and the GPU with success. While researching these techniques we found and reported flaws in a popular machine learning library, and provided suggestions for improvement. We have also applied our techniques to the newer Fast Text library with success.
Compiling this body of work has provided great insight into what techniques are available for speeding up machine learning models. Increasing thread locality plays a great role in reducing cache coherence conflicts, which in turn provides enormous speedup. Use of highly optimized linear algebra routines can also provide a quick and easy route to speeding up applications that have linear algebra in their core. Careful analysis of all aspects of a given program problem are necessary to prevent overlooking the accuracy of the program in hopes of achieving higher speeds.
We have compiled our Word2Vec ideas into a single framework called Wombat, and will release it on GitHub to the public. We hope to continue developing this to extend our methods to other Word2Vec approaches, and to extend the GPU speedup to the newer Fast Text library.

[1] Manning C. and Shutze, H. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999.
[2] Bengio, Y., Ducharme, R., Vincent, P. and Jauvin, C. A Neural Probabilistic Language Model. Journal of Machine Learning Research, 3 (2003). 1137-1155.
[3] Collobert, R. and Weston, J.. A unified architecture for natural language processing: deep neural networks with multitask learning, in Proceedings of the 25th international conference on machine learning, (Helsinki, Finland, 2008), Omnipress, 160-167.
[4] Mnih, A. and Hinton, G. A scalable hierarchical distributed language model. Advances in neural information processing systems, 21 (2009). 1081-1088.
[5] Fellbaum C. WordNet: an electronic lexical database. MIT Press, Cambridge, MA, USA, 1998.
[6] Bengio M. and Bengio Y. Hierarchical probabilistic neural network language model, in Proceedings of the international workshop on artificial intelligence and statistics, (Barbados, 2005). Society of Artificial Intelligence and Statistics, 246-252.
[7] Mikolov, T., Chen, K., Corrado, G. and Dean, J., Efficient estimation of word representations in vector space, in ICLR Workshop, (Scottsdale, AZ, USA, 2013).
[8] Mikolov, T., Sutskever, K., Chen, K., Corrado, G. and Dean, J. Distributed Representation of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 26 (2013). 3111-3119.
[9] Wang, Z., Ma, L. and Zhang, Y. A novel method for document summarization using Word2Vec. in ICCI*CC (Standford University, CA., USA, 2016), IEEE.
[10] Mayank, D., Padmanabhan, K. and Pal, K. Multi-sentiment modeling with scalable systematic labeled data generation via Word2Vec clustering, in ICDMW (Barcelona, Spain, 2016), IEEE.
[11] Mikolov, T., Le, Q. and Sutskever, I., Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.fl68, 2013.

[12] Ji, S., Satish, N., Li, S. and Dubey P. Parallelizing Word2Vec in shared and distributed memory, preprint arXiv:1604-04 661v2, 2016.
[13] Vuurens, J., Eickhoff, C. and DE Vries, A. Efficient Parallel Learning of Word2Vec. preprint arXiv:1606.07822, 2016.
[14] Niu, F., Recht, B., Re, C. and Wright, S., Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent, in Advances in Neural Information Processing Systems, (Granada, Spain, 2011), Neural Information Processing Systems Foundation, Inc, 693-701.
[15] Blackford, L.S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K. and Whaley, R.C. An updated set of basic linear algebra subprograms (bias). ACM Trans. Mathematical Software, 28 (2) 135-151, 2002.
[16] Bojanowski, P., Grave, E., Joulin, A., Mikolov, T. Enriching Word Vectors with Subword Information, preprint arXiv:1607.04 606, 2016.
[17] Guennebaud, G., Jacob, B. and others. Eigen v3. 2010.
[18] Canny J., Zhao H., Chen Y., Jaros B. and Mao J., Machine learning at the limit, in IEEE International Conference on Big Data, (Santa Clara, CA, USA, 2015), IEEE, 233-242.
[19] BAE, S. and Yl, Y. Acceleration of Word2vec Using GPUsA. in Hirose A., Ozawa S., Doya K., Ikeda K., Lee M., Liu D. (eds) Neural Information Processing. ICONIP 2016. Lecture Notes in Computer Science, vol 9948, (2016), Springer, 269-279.
[20] Rehurek, R. and Sojka, P., Software Framework for Topic Modelling with Large Corpora, in Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, (Vallettta, Malta, 2010), ELRA, 45-50.
[21] Levy, O. Goldberg, Y. and Dagan, I. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3. 211-225, 2015.
[22] Finkelstein, L., Gabrilovich E., Mafias Y., Rivlin E., Solan Z., Wolfman G. and Ruppin E., Placing search in context: The concept revisited, in ACM Transactions on Information Systems, 20 (2002) 116-131.

[23] Luong T., Socher R. and Manning C., Better word representations with recursive neural networks for morphology, in Proceedings of the Seventeenth Conference on Computational Natural Language Learning (Sofia, Bulgaria, 2013), Association for Computational Linguistics.
[24] Spearman, C., The proof and measurement of association between two things. American Journal of Psychology, 15 (1904).

Full Text


OPTIMIZEDPARALLELTRAININGOFWORDVECTORSONMULTI-CORE CPUANDGPU by TREVORMCDONALDSIMONTON B.A,ColoradoStateUniversity,2009 Athesissubmittedtothe FacultyoftheGraduateSchoolofthe UniversityofColoradoinpartialfulllment oftherequirementsforthedegreeof MasterofScience ComputerScienceProgram 2017


ThisthesisfortheMasterofSciencedegreeby TrevorMcDonaldSimonton hasbeenapprovedforthe ComputerScienceProgram by GitaAlaghband FarnoushBanaei-kashani TomAltman Date:May13,2017 ii


Simonton,TrevorMcDonald(M.S.,ComputerScienceProgram)OptimizedParallelTrainingofWordVectorsonMulti-coreCPUandGPUThesisdirectedbyProfessorGitaAlaghband ABSTRACT Word2Vecisapopularsetofmachinelearningalgorithmsthatuseane uralnetwork togeneratedensevectorrepresentationsofwords.Thesevec torshaveproventobeuseful inavarietyofmachinelearningtasks.Inthisthesis,weexploreexist ingapproachesto parallelizingWord2Vecvectortraining,andproposenewmethodsto increasethespeed oftheWord2Vecskipgramwithhierarchicalsoftmaxarchitecture onmulti-coreshared memorysystems.Weaccomplishthisbybatchingtrainingoperations toincreasethread localityandreduceaccessestosharedmemory.Wealsotakeadvan tageofhighlyoptimizedlinearalgebralibrariestofurtherincreaseperformance.Wet henexploreexisting GPUimplementations,andproposenewGPUimplementationsthatutiliz esharedmemoryregistersandin-warpinter-threadreductionoperations.Ou rGPUimplementations produceahigherqualityofwordvectorsthanpreviousGPUimplemen tations.TheproposedGPUtrainingapproachworksforboththehierarchicalsoft maxarchitecture,as wellasthenegativesamplingmethod.Wethenapplysimilarparallelizatio ntechniques toWord2Vec'ssuccessor,FastText,andachievegreaterscalab ilityandspeedofexecution thanpreviouslypublishedmodels. Theformandcontentofthisabstractareapproved.Irecommen ditspublication. Approved:GitaAlaghband iii


Tomywife,Lauren. iv


TABLEOFCONTENTS CHAPTER I.INTRODUCTION...............................1II.WORD2VECHISTORYANDMOTIVATIONS...............3 NeuralProbabilisticLanguageModeling................4 Thehierarchicalprobabilisticmodel................5 TheCoreWord2VecAlgorithm.....................6 Collectingtrainingdatafromacorpus...............7ContinuousBagofWords(CBOW).................8Skipgram...............................8Wordvectorsasaneuralnetworkweightmatrix..........9Hierarchicalsoftmax.........................9Negativesampling..........................11 OriginalParallelizationScheme.....................12FasterWord2VecModels.........................13 III.WORDMATRIXBATCHES:WOMBAT..................18 NewWombathierarchicalsoftmaxmethod...............19BatchingOperationsforGPU......................20 Datadependencies..........................21Proposedbatchingmethod.....................22GPUKernelDesign.........................25 Fastmatrixkernel........................28Fastvectorkernel........................30 Experimentsandresults.........................30 CPUBatchedHierarchicalSoftmax..............32GPUExperiments........................34 IV.FASTERFASTTEXT.............................36 v


FastTextskipgramwithnegativesampling...............37 NegativesamplinginFastText.................37 ProposedimprovementstoFastText...................38Experiments................................41 Results................................43 Conclusionandfuturework.......................46 V.Conclusion...................................47 REFERENCES.....................................48 vi


LISTOFTABLES TABLE1EvaluationofhierarchicalsoftmaxaccuracyinnewWombatmodel compared toCythnnandoriginalWord2Vec........................33 2ExecutiontimeandaccuracyofWombatvs.BIDMach.......... ...35 3EvaluationofmodelaccuracyinFastTextvs.newlyproposedbat chmethod ofFastTextandWord2Vecmodels.......................46 vii


LISTOFFIGURES FIGURE1Illustrationofthemeaningofnodesinhierarchicalsoftmax..... .....10 2Originalparallelizationscheme.........................123Decliningper-threadperformanceasthenumberofthreadsincr eases.....13 4Updatedparallelizationscheme.........................165Matrixmultiplication:(wordsfrom W oLocal ) (wordsfrom W T iLocal )inIntel's batchednetworkactivationmethod....................... 17 6ImplicitlabelmatrixusedinIntel'sbatchmodel................1 7 7Matrixmultiplication:(wordsfrom W oPrivate ) (wordsfrom W T iPrivate )inthe proposedbatchednetworkactivationmethod............... ..20 8Adjustedlabelmatrixforhierarchicalsoftmax.Examplewordha sHuman code 1011...01 ..................................21 9Datadependenciesbetweentrainingsteps................. ..22 10CPU/GPUheterogeneousbatchingdiagram................ ..23 11CPU/GPUbatchdatatransfer......................... 24 12Loadingdatatosharedmemory......................... 28 13Wombatmatrixmultiply.............................2914Wombatupdatecalculationillustration..................... 29 15SpeedupforbatchedhierarchicalsoftmaxCPUmethodcompar edtoleading implementations..................................32 16Vector-to-matrixbatchshowingduplicatesamplevectorsinbold ......39 17Vector-to-matrixbatchwithduplicatesremoved.Asinglesetof k samples remains......................................40 18Multiplysamplegradientsby n ratherthancalculatetheirdotproduct n times........................................41 viii


19PerformancecomparisonoforiginalFastTextcomparedtopro posedoptimized versionsandWord2Vec.Evaluatedontheenwiki9wordset....... ...45 ix


CHAPTERI INTRODUCTION Therearemanyapproachestoembeddingwordmeaningsintocompu terprograms. Onesuchmethodistodeneforeachwordwhatiscalledadistributed representation invectorspace.Adistributedwordrepresentationisadensevect orofnumbersthat describesauniqueword.Thiskindofwordrepresentationisperfec tforuseasinputto machinelearningmodelsthataredesignedtotakenumericfeaturev ectorsasinput.A densevectorrepresentationofawordcanbecomposedofasmuc hasseveralhundred roatingpointnumbers,each4bytesinsize.Thisrepresentationof awordcanholdmuch moreinformationthanasimplesequenceofletters,butdeningase tofwordvectors foravocabularyofhundredsofthousandstomillionsofwordscanb edicultandtime consuming. Word2Vecisasetofalgorithmsthatoersseveralmethodstode netheseword vectors.Thealgorithmsareinaclassofstatisticalmethodsthatu seaneuralnetwork to\learn"thewordvectorrepresentations.Thenetworklearns fromdocumentsthat containexamplesofwordsusedproperlyinnaturallanguage.Fore xample,allofthe articlesonWikipediacleanedandreducedtoplaintextisatypicalinput usedtotrain thiskindofanetwork. Oncelearned,thesevectorrepresentationsofwordscanbereusedbyothercomputer programs.Evensimplevectoradditionandsubtractionprogramst hatusethevectors learnedbyWord2Vechaveyieldedentertainingandexcitingresults. Awidelypublicized exampleofthisistheoperation v ( King ) v ( Man )+ v ( Woman ) v ( Queen )where v ( word )returnsthelearnedwordvectorfor word [7].Thevectorsalsoworkwellin machinelearningmodelsthatprovidelanguagetranslationandsentim entanalysis[9] [10][11]. TheWord2Vecalgorithmsarenottherstoftheirkind.Thereisalon ghistoryof similarmodelsthatuseneuralnetworkstolearnthesewordrepres entations.Word2Vec 1


isparticularlyexcitingbecauseithassimpliedthetrainingprocessto thepointthat it'spossibletogenerateusefulwordvectorsfrombillionsofwordsin onlyafewhours. Recentresearchthathasfocusedonimprovingthesealgorithmsh asreducedthattime evenfurther. ThisthesisexplorestheWord2Veccorealgorithms,andpresentss omeoftheleading implementationsofthesealgorithmsthathavedrasticallyreducedt heamountoftimeit takestotraintheWord2Vecneuralnetwork.Theseimplementatio nsusecleverparallel computingtechniquestooptimizethetrainingonmodernCPUandGPU hardware.In thelatterhalfofthisthesis,neworiginaltechniquesarepresente dtofurtherspeedup thelearningprocessforWord2Vec,aswellasitsrecentlyreleased successor,FastText. Therestofthethesisisorganizedasfollows.InChapterII,thehis tory,motivations andinnerworkingsofWord2Vecarediscussed.AttheendofChapt erII,someofthe leadingparallelimplementationsofthealgorithmsarereviewed.InCh apterIII,new parallelapproachesarepresented,andinChapterIV,thenewer libraryFastTextis discussed,andimprovementstothislibraryarepresented. 2


CHAPTERII WORD2VECHISTORYANDMOTIVATIONS Insomecases,theonlystructurefoundinalargecollectionofdata isthenatural languageusedtoformthesentencesofwhichthedataiscomposed .Onetoolthatis oftenusedtoextractinformationfromsuchdataisstatistics.St atisticalapproachesto understandingnaturallanguagedataofteninvolvecountingthen umberofkeywords inadocument[1].Thegoalofthiscountingistocreatenumericvalues thatcanbe analyzedtobetterunderstandthecontentsofdocumentsthat containonlyunstructured worddata. Wordcountsalonecanprovidesimplewordvectors.Thesimplestwor dvectoristhe onehot vector. Onehot vectorsareofsize j V j ,where V isthevocabulary.Allentriesin this onehot vectorare0,withtheexceptionoftheindexofthevectorthatco rresponds totheonewordthatthevectorrepresents.Anextensionto onehot vectorsisthe bagof words vector,whichcanbeusedtodescribeanentiredocument.Toform a bagofwords vector,thecountofeachwordinasentenceordocumentisenter edtothecorresponding indexofthewordinavector,againofsize j V j .Thisishowsentences,paragraphsor entiredocumentscanbetransformedtosimplefeaturevectors. Example onehot and bagofwords vectorsareasfollows: One-hotvector: 00...1...000 Bagofwordsvector: 02...3...100 Theproblemwiththesevectorsisthattheyaresparse.Inthecas eofa onehot vector allbut j V j 1valuesinthevectorwouldbe0. Bagofwords approacheswouldonly beslightlymoredenseforsentencesandparagraphs,butdonoth elpdescribeindividual 3


words.Anotherissuewiththesesimplermodelsisthattheycomplete lylackcontext. Theydonotmakeassociationsbetweenneighboringwords. Anothertechniqueusedtogeneratestatisticalmodelsofnatura llanguageinvolves theuseofprobability.Theappearanceofawordcanbeconsidered asaneventwith aprobability.Countingthenumberoftimeseachwordoccurscanhe lpcalculatethis probability,buttheprobabilityofthiseventcanalsobecalculatedas theconditional probabilityofword'sappearancegivenaprecedingsequenceofwor ds.Where w t isthe t -thword,and w j i isthesequence w i w i +1 ,..., w j 1 w j P ( w T 1 )= T Y t =1 P ( w t j w t 1 1 ) Notethatitisobviouslytimeconsumingtocalculatetheprobabilityofe achwordas aconditionalprobabilityofasequenceofwordsstartingatthebeg inningofadocument. Theseprobabilitiesaretypicallycalculatedforsomesub-sequenceo f n wordsthatstart notatthebeginningofthedocument,butatanindex w t n thatisarelativelyshort distanceinfrontofthegivenword w t Thisconditionalprobabilitystartstoprovidecontexttothewords .Wordswith similarprobabilitiesgiventhesamesequenceofprecedingwordstend tohavesimilar meaning.Forexample,\thecatsatonthe"mightprecede\mat",\ bed"oranything elsesuitableforacat'sseat.Theprobabilityofthenextwordcreat esasimilaritybetween anywordthatmakessenseattheendofthesequence.Astatistic almodelthatdescribes thejointprobabilityfunctionofsequencesofwordsinalanguageca neectivelycapture thesewordsimilaritiesandwordrelationships. NeuralProbabilisticLanguageModeling Thetroublewithstatisticalmodelsthatrelyonconditionalprobabilit iesofwords appearinginsequenceisthecurseofdimensionality.Wordsthatdon otappearinspecic 4


sequencesfoundinthetrainingdatashouldnothaveazeroprobab ilityofappearing insequencesfoundoutsideofthetrainingdata.Fightingthisproble mwasthekey motivationbehindtheintroductionoftherstNeuralProbabilisticL anguageModel (NPLM)[2]. Thegoalofaneuralprobabilisticlanguagemodel(NPLM)istousean euralnetwork tolearnadistributedrepresentationsforwords[2].Thesedensew ordvectorscontain embeddedcontextinformationofeachwordextractedfromlarge amountsoftraining data.Themodelisbuiltfromobservedsequencesofwordsintrainin gdata. Insuchamodel,eachwordinavocabularyisassociatedwithaninitiallyr andom wordfeaturevector.Thefeaturevectorsarethenadjustedin aneuralnetworkthatis trainedbyexaminingsequencesofwordsinatrainingle.Thenetwor klearnstoexpress thejointprobabilityfunctionofwordsequencesintermsoftheinitia llyrandomfeature vectors,whicharesimultaneouslyadjustedtobecometheparame tersofthatprobability function[2].Theneuralnetworklearnsthefunctionanditsparame tersbymaximizing thelog-likelihoodofthewordsequencesthatappearinthetrainingd ata.Theresultis asetofcontinuousfeaturevectorsthatrepresentwordsinthe trainingvocabulary. Theproblemwithsuchamodelistheamountoftimeittakestoupdate feature vectorsforanentirevocabulary.Theamountoftrainingdatacan beinthebillionsof words,andeachtrainingstepinvolvesupdatestomatricesthatre presentvocabularies ofhundredsofthousandstomillionsofwords.However,severala lternateapproaches tonetworktraininghavebeenintroducedsincetheideaofNPLM'swa sintroducedin 2003.[3][4][6][7][8]Thehierarchicalprobabilisticmodel Onewaytoreducetrainingtimeistoclusterwordsintoahierarchyof relatedwords. Thetrainingtimecanbespedupexponentiallybyreducingeachtrainin goperationto requireupdatesofonly log ( j V j )classesofwordsasopposedtoupdating j V j wordvectors ateachstep[4]. 5


Themotivationforgroupingwordstoclassescomesfromthatfact thatcomputing directly P ( Y j X )involvesnormalizationacrossallthevaluesthat Y cantake.Bydening aclusteringpartitionforthe Y intoclasses C ,suchthat c ( : )maps Y to C P ( Y = y j X = x )= P ( Y = y j C = C ( y ) ;X ) P ( C = C ( y ) j X = x ) Thenbyreducingthewordstoahierarchyofclassesthatcanbere presentedina balancedbinarytree,theamountofnormalizationsisreducedfrom j V j to log ( j V j ).Each word v canberepresentedasabitvector( b 1 ( v ) ;:::;b m ( v ))(where m dependson v ).For example, b 1 ( v )=1representsthat v belongstoclass1atthetopofthetree,asopposed totheneighbortoclass1,class0.Subsequent b 2 ( v ) :::b m ( v )indicatea\walk"from thetop-levelclass1downthroughitsbinarysub-groups.Thenex t-wordconditional probabiltycanthenbecomputedas m Y j =1 P ( b j ( v ) j b 1( v ) ;:::;b j 1( v ) ;w t 1 ;:::;w t n +1 ) In[4]thetreeconstructiondependedonpreviouslyexistingcluste rsofwords,specifically,theWordNetresource[5].[6]introducesnewwaystoclusterw ordsforbetter performanceofthehierarchicalmodel,andpointsoutthatgener atingthesehierarchies ispossibletodoatthesametimeaslearningwords.Word2Vec'shierar chicalsoftmax methodpartitionsthevocabularyintoaHumanbinarytree. TheCoreWord2VecAlgorithm AsetofalgorithmsisintroducedintheoriginalWord2Vec.Withinthiss etare twodierentapproachestocollectingwordsfortrainingfromacor pus:skipgramand continuousbagofwords.Bothmethodsshareasimilarbasicapproa chtoextracting 6


datafromatrainingle.Thissharedbasiccorealgorithmisdescribed inthenext section.Therearealsotwodierentmethodsintroducedtolearnw ordembeddingsfrom thetrainingsetsthatarecollected:hierarchicalsoftmax,whichis similartohierarchical methodsdescribedabove,andthenegativesamplingapproach,wh ichisintroducedand describedattheendofthissection.Collectingtrainingdatafromacorpus Word2Vec'salgorithmsuseexamplesofwordusageasinputdatatoa neuralnetwork trainingprocedure.Wordusageexamplescomefromlargetextles .Thelesareparsed, andthewordsareassignedtokens.Theintegertokenstiethewor dstovectorsina neuralnetworkweightmatrix.Oncethetokensareobtained,sho rtsub-sequencesof wordsareexaminedtotrainthenetwork.Thesesub-sequencesw illhereinbereferred toas contextwindows .Acontextwindow'ssizeiscongurable,buttypicallyranges from5to10wordsineitherdirectionofacentertargetword,mean ingthateachcontext windowtypicallycontainsatotalof11to21words.Thesizeofthewin dowdoesnot relatetothesizeofthewordvectors. Eachwindowcontainsaofasetofwordsthatappearincloseproximit ytoeachother inthetrainingdata.Thesetcanbeconsideredtocontainasingletar getword,witha setofneighboringcontextwordsoneithersideofthetarget.For example,thetraining sentence\thedogwascutebutstinky"oersanumberofcontex twindows.Onesuch setwouldput\cute"asatargettocontextset f dog,was,but,stinky g The dog wascutebut stinky target context 3 context 2 context 4 context 1 Insuchanexamplethe\contextwindow"isofsize2.Thismeansatot alof4 words,thatis2oneachsideofthetarget,areconsideredasneigh borstothetarget 7


word\cute."Thiswindowcanbedraggedfromthestartofthesent encetotheendto generateothertrainingsets,whereeachsetcontainsasingletar getwordsurroundedby contextwords.Thesizeofthe\contextwindow"isaparameterth atcanbepassedto Word2Vec.Anothercontextwordgeneratedfromthesamesent encecouldfeature\but" asthetargettothecontextwordset f was,cute,stinky g Forboththeskipgramandthecontinuousbagofwordsmethods,t heinputmatrix istrainedwiththeoutputmatrixtopromoteactivationsofthenetw orkthatrepresenta highlikelihoodthat\target"wordsappearwithinawindowof\contex t"words. ContinuousBagofWords(CBOW) Inthecontinuousbagofwords(CBOW)method,thevectorsfore achcontextword arerstsummedtoonesinglevector.Thenetworkisactivatedwith thedotproduct ofthissinglecondensedsumvectorandthetargetwordvector.T heerroriscalculated, andthenthegradientisusedtoapplyanupdatetoeachoftheorigin alcontextwords uniformly.Throughthisprocess,thetargetwordpredictsitssur roundingcontextwords. Becauseoftheinitialcombinationofvectors,thereislessworktod ointheCBOW approach,whichmakesCBOWfasterthanskipgram.Theeectisas moothingofthe distributionalinformation.CBOWtendstoworkbestonsmallerdata sets,wherethe skipgramapproachisbettersuitedforlargerdatasets.Skipgram TheskipgrammethodisjusttheoppositeoftheCBOWmethod.Inst eadoftraining thecontextwordsasasinglevector,eachcontextwordvectoris trainedindividually againstthetargetvector.Thismethodtakeslongertotrain,but generallyprovides moreaccuratewordvectors.Theeectisthateachcontextwor disusedtopredicta target. 8


Wordvectorsasaneuralnetworkweightmatrix Thebasicideasharedbybothhierarchicalsoftmaxandnegativesa mplingistouse twoweightmatricesinasimplesingle-layerneuralnetworktorepres entthewordsina vocabulary.Therstmatrix,calledthe\inputmatrix,"or W i ,isofsize V w ,where V isthesizeofthevocabularyand w isacongurableparameter.Theparameter w denes thesizeofthewordvectors.Eachwordin V willberepresentedbya w -dimensional vectorintheinputmatrix.Whenthetrainingiscomplete,thisinputma trixisatrained setofwordvectorsthatcanbesavedforuseinothermachinelear ningmodels.The secondweightmatrix,the\outputmatrix,"or W o ,issimilarinsize.Itisalso V w ,but thesevectorshaveaslightlydierentmeaningdependingonwhichof thetwoWord2Vec methodsisused.Exactlywhat W o representsdependsonwhichlosscalculationmethod isselectedbytheuser:eitherhierarchicalsoftmaxornegativesa mpling. Hierarchicalsoftmax Word2Vec's hierarchicalsoftmax modelisahierarchicalprobabilisticlanguagemodel, similartothoseusedin[4]and[6],describedin.Asmentionedabove,th esehierarchicalmodelsrequirethevocabularytobebrokendownintoabinarytr ee.Word2Vec's hierarchicalsoftmaxbreaksthevocabularyintoaHumanbinaryt ree[7].AHuman treegroupswordsintoatreebasedontheirfrequencyofappear ance.Itisoftenusedto assigntheshortestcodinginacompressionschemetothemostfre quentwordsorcharactersincollection.Itisusedherefor hierarchicalsoftmax toachievethe O ( log 2 ( j V j )) trainingcomplexityimprovementovertheotherwise O ( j V j )complexitywithouthaving aknowledgeofthevocabulary(e.g.apre-existingtree)aheadoft ime. Leafnodesinthetreerepresentwordsinthevocabulary.Abinary codeisassigned toeveryleafinthetree.A0atthestartofthecoderepresentsa stepfromtherootof thetreedowntothe\left,"whilea1atthestartrepresentsastep downtothe\right." Subsequent0'srepresentleftsidedescent,and1'srightsidedesc entinawalkdownto 9


leafnodes.Thewalkdownfromtherootofthetreetoeachleafpas sesasubsetofinner treenodes.Eachoftheseinnernodesisrepresentedbyavector in W o .Asthemodel istrained,eachtrainingstepmaximizestheprobabilitythatgivenaco ntextword,we walkdownthecorrectpathfromtherootnodetotheleafthatrep resentsthetarget word.Thesigmoidofthedotproductofthecontextwordvectorf rom W i andaninner treenodevectorfrom W o determinewhichwaytowalk.Valuescloseto0wouldtakea leftturn,andvaluescloseto1woulddescendtotheright.Eachtra iningstepupdates atmost log 2 ( V )innertreenodevectors,ratherthanallofthevectorsin W o ,aswould benecessaryinaneuralnetworkmodelthatdidnotusethishierar chicalclassication approach. AnexampleispresentedinFigure1.Thegureillustratestheclassic ationtreefor avocabularyof4words.Eachofthefourwordsinthevocabularyc anberepresentedby awalkfromtheroottotheleavesofthetree.Thetargetwordsint hetreearelabeled t 0 3 ,andthenodesinthetreearerepresentedbyvectors W o 0 o 2 .Thehighlightednodes areinthewalkfromtheroottotheword t 1 ,whichhasthecode 01 .Inthiscase,the walkto t 1 passesthroughtwonodes,representedbyvectors W o 0 and W o 1 .Where is thesigmoidfunction,theresultof ( W o 0 W i 1 )shouldbecloseto0,sincetheproper walkgoes\left"fromtheroot.Theresultof ( W o 1 W i 1 )shouldbecloseto1,sincethe walkdescendstothe\right"downtotheleafat t 1 ( W o 0 W i1 ) 0 0 0 1 11 0 t 0 t 1 t 2 t 3 W o 1 W o 2 W o 0 ( W o 1 W i1 ) 1 Figure1:Illustrationofthemeaningofnodesinhierarchicalsoftma x. 10


Algorithm1: OriginalWord2Vecskipgramwithnegativesamplingtrainingsteps 1 foreach trainingwords do 2 for i to numberofcontextwordsaroundtarget do 3 f W o [ target ] W i [ context i ]; //globalmemory,vector-to-vector 4 accumulate context i errorusing f ; 5 update W i [ target ]using f ; //globalmemory 6 generate k negativesamples ; 7 for j to knegativesamples do 8 f W o [ sample j ] W i [ context i ]; //vector-to-vector 9 accumulate context i errorusing f ; 10 update W i [ sample j ]using f ; //globalmemory 11 update W o [ context i ]withaccumulatederror; //globalmemory Negativesampling Whenusingthenegativesamplingmethod,the W o matrixisasecondsetofword vectors.Thecontextwordsarerepresentedbyvectorsinthe W i matrix,andthetarget wordsarerepresentedasvectorsin W o .Acongurableamountofrandomwordsare sampledfrom W o ateachtrainingstep,andthenetworkistrainedtopredicttheabs ence ofthenegativesampleinthecontextwindow,whilepredictingthepre senceoftheactual targetword.Therandomsamplerepresentswordsthatarenotf oundinthecontext windowaroundthetarget.Inthiscase,thesigmoidofthedotprod uctofavectorin W o thatistheactualtargetwordandthecontextwordvectorfrom W i shouldbecloseto 1,whilethesigmoidofthedotproductofavectorin W o thatisanegativesampleand thecontextwordvectorfrom W i shouldbecloseto0.Thissamplingtechniqueprevents theneedtoupdateallwordsin W o thatdonotappearinthecontextwindowateach 11


trainingstep.Thebasicskipgramwithnegativesampling(SGNS)train ingstepsare describedinAlgorithm1. OriginalParallelizationScheme WithalloftheoriginalWord2Vecarchitectures,eachindividualthr eadoperatesindependentlyduringthetrainingprocess.Afterthevocabularyan dnetworkinitialization iscomplete,eachthreadopensupitsownlehandle,andtrainsusing asegmentofthe inputle.Asthenumberofthreadsincreases,thenumberofleha ndlesincreases.Each lehandleisevenlyspacedthroughoutthetrainingdocument.Durin gtraining,each threadrstgoestodisk,andreadsintomemoryasinglesentence. Oncethatsentenceis fullyread,thethreadwillbegintondcontextwindowswithinthesen tence,andgeneratetrainingdatabasedonthewordsfoundwithineachcontextw indow.Thethread thenusesthesecontextwindowstotraintheshared-memorymat rixstructures.Onceall ofthewindowsareprocessedinasinglesentence,theprocessrep eatsbygoingbackto diskforthenextsentence.ThisisdepictedinFigure2. FILEPART1FILEPART2FILEPART3FILEPART4 Thread1 Thread2 Thread3 Thread4 Inputleondisk Trainingfunction Mainmemory Figure2:Originalparallelizationscheme 12


01020304050 0 20 40 60 80 100 NumberofcoresThousandwords/second/thread HierarchicalSoftmax NegativeSampling Figure3:Decliningper-threadperformanceasthenumberofthre adsincreases. Theupdatestototheshared-memorymatrixarenotsynchronize d.ThisunsynchronizedlearningtechniqueiscalledHogwild[14].Theideaisthattherearee noughtraining stepstocompensateforthelossofinformationthatoccurswhen threadsoverwriteeachother'supdatesinmemory.Theadvantagetothistechniqueisth atthereisnotime lostlockingandunlockingtheshared-memoryweightmatrices. Eventhoughthereisnolockingormemoryprotectioninthisapproac h,thereisstill aperformanceproblemwhensharingtheweightmatrixstructurein multiplecores.For example,somethread, Thread1 ,mightbereadingfromavectorinmemoryatthe sametimethatsomeotherthread, Thread2 ,iswritingtothatvector.Thiswillcause afalsesharingissueinlocalcacheand Thread1 willbeslowedbycachecoherence enforcement.It'sthisproblemthatstartstodegradetheeect ivenessofusingmultiple threadsintheoriginalimplementation,asseeninFigure3.Inthisgu re,throughputis measuredaswordsprocessedpersecond,perthread. FasterWord2VecModels AsshowninFigure3,theoriginalimplementationofWord2Vecwasnot optimized forlarge-scalemulticoresystems.Theper-threadperformance startstodeclinerapidly 13


asthenumberofthreadsincreasesonmulti-core,sharedmemory systems. TheoriginalWord2Vecimplementationusessharedwritabledatastr ucturesthroughouttheexecution.Thesestructuresdonotlendthemselveswellf oroptimizedperformanceincurrentmulti-coreshared-memoryarchitectureswithpr ivateleveloneandtwo cacheorganizations.Forexample,thecurrentvector-to-vect oroperationsarehandledin loopsthatinteractwithsharedmemory.Whenmultiplethreadsarew ritingupdatesto thesamevectorsinsharedmemory,thethreadexecutionisrepea tedlyinterruptedby cacheblockinvalidationstoenforcecoherence[12][13].Byrstcop yingsmallbatches tothread-localmemory,theeectoftheseconrictsislargelyred uced.Thebatchescan thenbetrainedwithasetofoptimizedmatrix-to-matrixoperations Algorithm2: Intelmatrixtomatrixmatrixtraining 1 foreach trainingwords do 2 generate k negativesamples ; 3 for i to numberofcontextwordsaroundtarget do 4 W iLocal [ i ] W i [ context i ]; //globalmemory 5 W oLocal [0] W o [ target ]; //globalmemory 6 for j to knegativesamples do 7 W oLocal [1+ j ] W o [ sample j ]; //globalmemory 8 G W oLocal W T iLocal ; //localmemory,matrixtomatrix 9 calculateerrorin G ; 10 U i G T W oLocal ; //localmemory,matrixtomatrix 11 U o G W iLocal ; //localmemory,matrixtomatrix 12 update W i from U i ; //globalmemory 13 update W o from U o ; //globalmemory RecentworkonspeedingupWord2Vechasfocusedontheskipgram withnegative sampling(SGNS)variationofthealgorithm[12][18].WhenusingSGNSWo rd2Vec, 14


re-useofthesamenegativesamplesetacrossanentiresentence hasbeenshowntospeed upthetrainingprocess[12][18].Byreducingtheamountofrandoms amplegeneration thereisanimmediatereductionofwork.Furthermore,byusingasm allsubsetofthe samesamplesacrossalltarget/contextsetsinasentence,thea mountofcachecoherence conrictsdropsasafewernumberofthreadsareconcurrentlywo rkingonthesamevector. BIDMachisamachinelearninglibrarythatincludesavarietyofdieren tsupervised andunsupervisedmodels.ItiswritteninScala,andallowsforquickde velopmentand optimizationofcomplexmachinelearningmodels.Onesuchmodelisacu stomendto endGPUimplementationofWord2Vec.BIDMach'simplementationintro ducestheidea tobatchtrainingstepsintomatrixprimitives,andre-usenegatives amplestocutback onmemorybandwidth.TheseideasarewhatledIntel[12]togreats uccesswiththeir ownCPUimplementation. BIDMach[18]andIntel'swork[12]havedemonstratedthatbybatc hingthetraining wordvectordataintoseparate,thread-localspacesinCPUmemo ry,thecachecoherence conrictsaregreatlyreducedandmulti-coreoperationspeedsup. Bybatchingtraining operationsintolocalmemory,networkactivation,softmaxander rorgradientscanbe calculatedwithoutgoingbacktosharedmemoryforupdateddata. Theresultisadrastic reductionincachecoherenceconricts.Thebatchedtrainingoper ationisillustratedin Figure4. Thisbatchingmethodalsoallowsthetrainingdotproductoperations tobereducedto asinglesetofmatrix-to-matrixoperations[12].Thematrix-to-mat rixoperationsprovide greateropportunitytoachievespeedupthroughparallelizationan dgreateroptimization oflocalcache,especiallyonhardwarethatisbuiltspecicallytooptim izesuchoperations. ThealteredalgorithmdevelopedbyIntelisshowninAlgorithm2. InthenegativesampleandcontextwordbatchesusedbyBIDMach andbyIntel's approach,thereisanimplicitlabelmatrixassociatedwiththeresulto fthebatched matrixmultiplication.Althoughthismatrixdoesnotexistinmemory,ita ppearsas 15


FILEPART1 FILEPART2FILEPART3 FILEPART4 Thread1 Thread2 Thread3 Thread4 Inputleondisk Trainingfunction Mainmemory Localtemporarymemory Figure4:Updatedparallelizationscheme Figure6. Therowsofthismatrixcorrespondtothetargetword(row1),an dtheselected negativesamplewordsthatareusedbythebatch(rows2to m ).Thematrixrepresents the"ideal"resultofthe Wi m Wo n matrixactivation.Erroriscalculatedasthedierence ofthesoftmaxoftheresultofthisactivationandtheidealmatrixla bel.Itisimplicitin theBIDMachandIntelmodelsbecauseeachalgorithmsimplychecks toseeiftheword isthetargetornotinordertodecidethe0or1label. 16


2666666666664 wordvectorsize target sample 1 sample 2 ... sample k 3777777777775 266666666664 wordvectorsize context 1 context 2 context 3 ... context n377777777775 Figure5:Matrixmultiplication:(wordsfrom W oLocal ) (wordsfrom W T iLocal )inIntel's batchednetworkactivationmethod. 2666666666666664 c 1 c 2 c 3 :::c ( n ) target 111 ::: 1 noise 1 000 ::: 0 noise 2 000 ::: 0 noise 3 000 ::: 0 ... ... ... ... noise ( m ) 000 ::: 0 3777777777777775 Figure6:ImplicitlabelmatrixusedinIntel'sbatchmodel. 17


CHAPTERIII WORDMATRIXBATCHES:WOMBAT InthischapterwepresentnewandimprovedWord2Vecskipgramalg orithmspackagedinourownparallel,object-orientedC++implementationthatiso ptimizedfor multi-coresharedmemoryCPU,andNVIDIAGPU.Theideasexplored whiledevelopingthiscodehaveallbeenfocusedonbatchingwordvectorsintoma trices.Hencethe nameofthepackage: Wombat {shortfor wo rd m atrix bat ches. Thecontributionsdescribedinthispaperare: NewCPUimplementationoftheskipgramwithhierarchicalsoftmaxar chitecture (SGHS)thatachievesasmuchas4xspeedupovertheoriginalWord 2VecSGHS implementationonmulti-coresharedmemorymachines.Thismethodd oesnot requirespecialhardware,butwillshowanevenmoredramaticincre aseofperformanceofuptoalmost10xtheoriginalWord2Vec'sspeedonIntelma chineswith theIntelMKLlibrarysupport. OriginalCUDAkernelsdevelopedforspeedandaccuracyoftheskip gramwith negativesamplingapproach(SGNS)thatoeragreatimprovement ofaccuracy overthecurrentstateoftheartGPUWord2Vecimplementation,B IDMach[18]. OurGPUimplementationofSGNSis8xfasterthantheoriginalWord2V ec,without alossofaccuracy. AfastGPUSGHSmethod,whichisthefastestandmostrexibleGPUimp lementationofthismethodthatweknowof.OurGPUimplementationof SGHSis 10xfasterthantheoriginalWord2Vecimplementation,and4xfast erthanleading improvements. Thecodewasdevelopedwithrexibilityinmind.ItcanbecompiledwithGCC ICPCforIntelMKLsupport,orwithCUDAforNVIDIAGPUsupport .Notethatall variationsoftheWord2Vecalgorithmshaveyetbeenimplementedint hiscode{the 18


CBOWmethodsarecurrentlyabsent.Thefocusofthislibraryisthe twodierentskip gramtechniques:hierarchicalsoftmax(SGHS)andnegativesamp ling(SGNS). NewWombathierarchicalsoftmaxmethod Weproposeabatchedtrainingapproachtohierarchicalsoftmaxin whichthesetof vectorsrepresentingthehierarchyofnodesdeningthetarget wordaretrainedwiththe setofcontextwordvectorsinasinglesetofmatrixoperations.Th ebatchingalgorithm isshowninAlgorithm3.Eachthreadwillbeexecutingthesestepsinpa rallel. Algorithm3: Wombat'shierarchicalmatrixtomatrixtraining 1 foreach trainingwords do 2 for i to contextwordsaroundtarget do 3 W iPrivate [ i ] W i [ context i ]; 4 for j to kvocabularytreenodevectors do 5 W oPrivate [ j ] W o [ node j ]; 6 G W oPrivate W T iPrivate ; 7 calculateerrorin G ; 8 U i G T W oPrivate ; 9 U o G W iPrivate ; 10 update W i from U i ; 11 update W o from U o ; Thevectorsin W o thatrepresentthenodesinthetargetword'snodesetarebatch ed intomatrix W oPrivate ,andthevectorsin W i thatrepresenttheindividualcontextwords arebatchedintomatrix W iPrivate .Theworkdonebyactivatingeachindividualpairof nodevectortocontextwordvectorcannowbedonewithasingleca llto W oPrivate W T iPrivate .AnillustrationofthisnetworkactivationoperationisshowninFigure 7. 19


2666666666664 wordvectorsize node 1 node 2 node 3 ... node k 3777777777775 266666666664 wordvectorsize context 1 context 2 context 3 ... context n377777777775 Figure7:Matrixmultiplication:(wordsfrom W oPrivate ) (wordsfrom W T iPrivate )inthe proposedbatchednetworkactivationmethod. Theresultofthisoperationisa k n matrix,where k isthenumberofvocabularytree nodesthatdenethetargetword,and n isthenumberofcontextwordssurroundingthe targetwordinthecontextwindow.Thesigmoidofeachindividualent ryinthematrix iscalculated,anderrorisdeterminedfromthedierencebetweent hesigmoidvalueand thelabelofthegivenrowandcolumninthelabelmatrix.Inthecaseo fhierarchical softmax,thelabelmatrixmatchesthecodesoftheHumanbinary treeassociatedwith thetargetword.AnexamplelabelmatrixisexempliedinFigure8. BatchingOperationsforGPU TherearenumerousGPUimplementationsofWord2Vec,butnoneof themwehave foundoerarexible,fastskipgramhierarchicalsoftmax(SGHS). TheGPUimplementationswehavefoundallfocusoneithertheCBOWmethodorthene gativesampling variantoftheskipgrammethod[18],and/orplacelimitationsonthesiz esofwordvectors thatcanbetrained[19]. OurworkwillalsofocusontheskipgramWord2Vecarchitecture.We presentbotha hierarchicalsoftmax(SGHS)andnegativesampling(SGNS)implemen tationwithinthis 20


2666666666666666664 c 1 c 2 c 3 :::c ( n ) node 1 111 ::: 1 node 2 000 ::: 0 node 3 111 ::: 1 node 4 111 ::: 1 ... ... ... ... node ( m 1) 000 ::: 0 node ( m ) 111 ::: 1 3777777777777777775 Figure8:Adjustedlabelmatrixforhierarchicalsoftmax.Example wordhasHuman code 1011...01 architecture.Wealsopresentanewheterogeneousmethodofba tchingandtrainingthe networkforthenegativesamplingapproachthatyieldshigherqualit ywordvectors. Datadependencies Everysentenceinthetrainingdatawillyieldmultipletarget/context wordsetsthat areusedfortraining.Asacontextwindowisdraggedfromthestar tofthesentence totheendofthesentence,eachindividualwordwillbeatargetwor donce,butwill beacontextwordmultipletimesfordierentneighboringtargetwor ds.Thismeans proximatepositionsofthecontextwindowyieldrepeatedupdatest othesamecontext wordvectorin W i .Forthisreason,theupdatescannothappenconcurrentlywitho uta lossofinformation. ThisproblemisillustratedinFigure9.Inthisdiagram,eachrowrepres entsthesame 14trainingwordsfoundinasinglesentence.Thecircledwords, w 7 w 8 and w 9 aretarget wordsintherst,secondandthirdtrainingsteps,respectively.T heunderlinedwords representthecontextwordsthatsurroundthetargets.Thea rrowsshowtheoverlapin updatesthatoccurbetweensubsequentpositionsofthecontex twindow.Althoughthe 21


w 1 w 2 w 3 w 4 w 5 w 6 w 7 w 8 w 9 w 10 w 11 w 12 w 13 w 14 w 1 w 2 w 3 w 4 w 5 w 6 w 7 w 8 w 9 w 10 w 11 w 12 w 13 w 14 w 1 w 2 w 3 w 4 w 5 w 6 w 7 w 8 w 9 w 10 w 11 w 12 w 13 w 14 Figure9:Datadependenciesbetweentrainingsteps targetwordsdonotoverlap,thecontextwordsdo.Target/con textsetsfromthesame sentencecannotbebatchedintoasingleconcurrenttrainingstep Whenbatchingcontextwindowtrainingonthemulti-core,sharedme moryCPUas describedinSection,thereisnosuchoverlap.Themulti-coreCPUimp lementation haseachthreadtrainingindependentlywithseparatelehandles.E achlehandleis usedtoreadadierentpartofthesameinputle,soeachcontext windowtrainedin paralleliscomingfromadierentpartoftheinput,andadierentse ntence.However, anNVIDIATeslaP100GPUcanexecute64coresinparalleloneachof its56streaming multiprocessors.Inordertokeepall3584CUDAcoreswell-fedwith trainingoperations, morebatchesthanlehandlesarerequired.Batchingtrainingoper ationsforthislevelof parallelismrequirescarefulconsiderationofthisdatadependency .Batchingsubsequent contextwindowsfromthesamesentencewillcausealossofinforma tionbetweentraining steps.Thisproblemisalsodescribedin[19].Proposedbatchingmethod DatatransferfromCPUtoGPUisanexpensiveoperation.Reducing theamountof transferisimportantforperformance.WhenCUDAkernelsarelau nched,theyshould haveenoughdatatoworkwithtomakethetransferworthwhile,bu tlargertraining batchesruntheriskofhittingdatadependencyissuesdescribedin theprevioussection. WealsowanttomakethemostofourCPUswhentheGPUisbusy.Ifwe ooad 100%oftheworktoGPU,ourCPUsbecomeidleandareawastedreso urce.However, 22


FILEPART1FILEPART2...FILEPART n Thread1 Thread2 ... Thread n Inputleondisk CPUparsing GPUGlobalMemory GPUBatchData GPU Figure10:CPU/GPUheterogeneousbatchingdiagram becauseoftheexpenseoftransferringdatafromCPUtoGPUand backagain,itis diculttohavethetwoprocessingenvironmentsworkingtogether WeproposeaheterogeneousmethodofWord2Vectraining.Theinp utandoutput weightmatricesarekeptcompletelyinGPUmemory,butthemulti-cor eCPUisstill responsibleforparsingtheinputleandgeneratingthetrainingbat ches.ThemultipleCPUcoreswillparsetheinputleandpreparetrainingbatchesth atarecarefully constructedtoavoiddatadependencyissues.Thesebatchesar ethensenttoGPUto beprocessed.EachCPUthreadwillworkwithitsownCUDAstreamto senddata andqueuekernellaunchesintheGPU.EachbatchpreparedonCPU isasetofintegerreferencestovectorsintheGPUweightstructures.Bytran sferringsmallerinteger references,ratherthancompletevectors,wereducetheamou ntofCPUtoGPUdata transfer.ThebatchingmodelisshowninFigure10.Theinitialstepo ftokenizingand sortingthevocabulary,andpreparingtherequireddatastructu resforthetrainingphase isstilldoneontheCPUbeforeheterogeneoustrainingbegins. RecentreleasesofCUDAhavefeaturedtheabilitytotransmitdata andexecute kernelsinparallel.WeoptimizeGPUexecutionbyparsingtheinputleon theCPUat thesametimethattheGPUisusingitsroatingpointunitstoprocesst rainingbatches. TheCPUthreadseachstartbyparsingpartofthele,aswasdone intheoriginalC 23


implementationofWord2Vec.However,insteadofworkingwithonlya singlepartof thele,andcalculatingvectorupdates,eachCPUthreadworkswit hmultipleleparts andllsabueroftarget/contextsets.Eachtarget/contexts etcontainsasetoftarget wordintegertokens,asetofcontextwordintegertokensandas etoflabelsforeach target/contextwordpairinthebatch.Thesetarget/contextin tegertokensetsreference indicesintheGPUweightmatrices.Thetargetwordtokensareindice softheoutput matrix,andthecontextwordtokensetareindicesoftheinputmat rix.Eachtraining batchalsocontainsafewotherintegersthatpointtoboundarieso fthetrainingset.An examplebatchisillustratedinFigure11. targets:1023,10120,534,...contexts:1231,853,2342,12312...labels:1,0,0,0,1,0....batchmetadataW o [1023], W o [10120],... W i [1231], W i [853],... CPUpreparesbatchindexdataGPUunpacksbatch W o onGPU W i onGPU Figure11:CPU/GPUbatchdatatransfer SendingtokenizedsentencestotheGPUforprocessingwouldbeam oreecient methodofdatatransfer.Thereare( x 2 c )+ B integersineachtrainingbatch,where x isthenumberofwordsinasentence, c isthecontextwindowsize,and B istheamountof metadata(labelsandboundarydata)sentineachbatch.In[19],on ly x integersaresent, andthetarget/contextsetsareparsedintheGPU.However,by parsingtarget/context 24


setsonCPU,theCPU'scanbeactivelyparticipatinginthetrainingope rationsatthe sametimeastheGPU.TheGPUislefttodoonlytheroatingpointopera tions,whichis stillthebulkoftheworkinthisapplication.TheCPUtoGPUmemorytra nsferlatency ishiddenandoverlappedbytheamountoftimeittakestocompleteth eroatingpoint operations. Thetarget/contextintegersetsarebueredintopinnedCPUmem ory.Thenumber ofsetstobebueredisaparameterthattheprogramtakesasinp ut.Eachthreadmust makesurethateverytarget/contextsetinasinglebatchcomesf romadierentsentence, sothebatchsizemustbesmallenoughrelativetothenumberoflep artsbeingparsed toensureadiversityofsentencesineachbatch. OnceanindividualCPUthread'sbuerislled,thethreadstartsame morytransfer totheGPU.InrecentreleasesofCUDA,thisisanon-blockingtrans ferthatcanbe queuedintoaCUDAstream.Oncethetransferisinitiated,thethre adcangobackto itsdutyofparsingtheinputleandcollectingthenexttrainingbatch .WhiletheCPU threadscollectthenexttrainingbatch,theGPUcanactivelyworko nthepreviousbatch. ThekernelexecutioncanbequeuedintomultiplestreamsperCPUth readtoprevent anyblockingoftheCPUorGPUduringtheexchangeoftarget/cont extsetdata. Thisapproachdoesnotrequirepreprocessingandthereforedoe snothidetheneed torepeatedlyrevisittheinputtrainingdatafromthecountof\wor dspersecond."The trainingleisparsedatthesametimethattrainingstepsareproces sedontheGPU. Furthermore,byhavingadiversesetoftrainingdataineachbatch ,theissuesrelating toaccuracyareaddressed.GPUKernelDesign ForbothSGHSandSGNS,theworkdoneduringtrainingissimilar.Asub -matrix iscollectedfrom W o andasub-matrixiscollectedfrom W i .Thissimilarityallowsus tore-usethesameCUDAkernelsforboththehierarchicalsoftma xmethodandthe 25


negativesamplingmethod.Thereisadierenceinhowthelabelmatrix isconstructed andapplied,buttheconceptsdescribedhereareconsistentforb othSGHSandSGNS. ThecuBLASlibraryprovidesthefastestimplementationsofsimplema trixmultiply operations.However,byusinglibraryfunctions,wegiveuptheabilit ytokeepvector datainGPUsharedregistermemoryaswemovefromthematrixmultip lyoperationto theotheractionsrequiredduringeachtrainingstep.Forthisreas on,wehavedesigned ourownGPUkernelstohandleeachtarget/contextsettrainingop eration. It'simportanttonotethatthematrixsizesarenotconsistentinea chtrainingset. Thesentencesizesarenotalwaysthesame,notallwordsinthevoc abularytreehave thesamenumberofnodes,andthereisarandomdroppingoftrainin gwordsthatoccurs duringparsing.ThiscomplicatesecientuseoftheCUDAcores.The mostecient implementationofCUDAshouldnotallowwarpdivergence. WedevelopedakernelthatexecutesAlgorithm3,lines2to11,inbloc ksofexactly32 threadsforeachtrainingstep.Byusing32threadsperblock,wec andrasticallyreduceor completelyeliminatewarpdivergence,dependingonthesizeofthewo rdvectors.Wecan thenutilizeGPUsharedmemoryregistersandshueoperationstor educeinteraction withGPUglobalmemory,thusincreasingperformance. Inordertogetunevenmatrixmultiplicationsintoevenlysizedblockso f32threads, somepre-processingisrequiredontheCPU.Astarget/contexts etsarereducedtotrainingoperations,thereisavariablenumberoftargetvectorsandco ntextvectors.InSGNS, thenumberoftargetvectorsis1+ k where k isthenumberofnegativesamples.InSGHS, thenumberoftargetvectorsis1to MAX CODE LEN where MAX CODE LEN isthe longestbinarycodeassignabletoawordintheclassicationtree.Th enumberofcontext vectorsrangesfrom2to2 c where c isthecontextwindowsize.Thereisavariablenumber ofcontextvectorsbecauseWord2Vecrandomlydropscontextw ordsduringtraining. Sincetherearetypicallyatleast4targetvectors,andatleast8co ntextvectors, wecanbatchtrainingoperationsinto4 8sub-operations.Theseoperationscanbe 26


carriedoutecientlyinamatrix-basedkernel.Theremainingtraining operationscan behandledwithaseparatevector-basedkernel. Theprogramtakesa\batchsize"parameter.Thisparameterisus edtodetermine howmanytarget/contextsetsshouldbesenttoGPUatonce.Out ofthisnumber,we batchasmany4 8setsoftrainingsub-batchesaswecan,andthenbatchtherema ining trainingoperationsseparately.Oncethegivennumberoftarget/ contextsetsarebuered intoeitherthematrixbatchorthevectorbatch,welaunch2kerne ls.Therstkernel launchhandlesthematrixbatches,andthesecondhandlesthevec torbatches.Thetwo combinedkernelshandlethetotalgivennumberoftarget/contex tsets.Thestep-by-step processisillustratedinAlgorithm4.EachCPUthreadexecutesthiss equenceinparallel. Algorithm4: WombatGPUkernelactions 1 foreach trainingwords do 2 collectcontextwindowbatchdataonCPU ; 3 splitbatchinto 4 8 matrix ; 4 appendmatrixbatchtomatrixbatch ; 5 appendremainingdot-productstovectorbatch ; 6 if havegivennumberofcontextwindowsinbuer then 7 sendbueredbatchestoGPU ; 8 OnGPU: 9 foreach matrixkernelblock do 10 Executefastmatrixkernel 11 foreach dotproductkernelblock do 12 Executefastvectorkernel 27


+8 4 8 +8 ... w +4 4 8 +4 ... w Figure12:Loadingdatatosharedmemory. Fastmatrixkernel Thefastmatrixkernelusessharedmemoryandshueoperations toquicklyperform allofthebatchoperationsdescribedinAlgorithm3,lines2to11.The blocksizeis32 threads.Thesethreadscanbeconsideredtobea4 8gridanda8 4grid.Therst stepistotaketheGPUbatchdatafromtheCPUkernellaunch,and toloadvectordata. Figure12illustratesthesharedmemoryloadoperations.First,asa 4 8gridin w= 8steps,thethreadsworktogethertoload4 W o vectorsfromGPUglobalmemory tosharedblockmemory.Then,asa8 4gridin w= 4steps,thethreadsworktoload 8 W i vectorsfromGPUglobalmemory,where w isthewordvectorsize.Thereisno synchronizationrequiredduringthesersttwooperations. Thenextstepistoperformamatrixmultiplyofthe4vectorsfrom W o ,andthe8 vectorsfrom W i .Notethatthese32vectorsareallnowinsharedregistermemory as W oPrivate and W iPrivate ,respectively.Eachthreadwillperformasingledotproductin aloopof w steps.ThisisillustratedinFigure13.Thisapproachissimplerthansom e moreadvancedmatrixmultiplicationtechniques,butworkswellbeca usethecomplete setof32vectorswilltintosharedmemory.Thesizeof w typicallyrangesbetween100 and300. Aftereachthreadindependentlycalculatesitsproperdot-produ ctofvectorsinshared registermemory,leavingtheresultinthread-localmemory.Eacht hreadthenappliesthe 28


sigmoidfunctionandcalculateserrorforitslocalvaluegivenitsposit ioninthelabel matrix. 4 ... w 8... w Figure13:Wombatmatrixmultiply. Thenextstepinvolvestwomatrixmultiplications.The4 8blockofthreadscan bethoughtofasasinglematrix, G .Eachthreadholdsasinglevalueinitslocalmemory representingthevalueof G atitsrespectiverowandcolumn.Thetricknowistoperform G T W oPrivate and G W iPrivate ,withoutputtingtheindividualrowandcolumnvalues of G intosharedorglobalmemory.Wealreadyhavewhatweneedineacht hread's localmemory.Usingahandfulofshueoperations,wecanavoidus ingthenextlevel ofmemoryhierarchy.Wecanalsodobothmatrixmultiplicationsatonc ewithasingle loopofsize w .TheprocessisillustratedinFigure14. 4 ... w 8 ... w 4 8 = 8 4 = 4 ... w 8 ... w Figure14:Wombatupdatecalculationillustration. Inthecaseofthe(4 8 G matrix) (the8 wW iPrivate matrix),theresultisa 4 w matrix.Wecancalculatethisin w steps.Intherststep,eachrowofthreads in G performsanelement-wisemultiplicationwiththeelementsintherstc olumnof 29


W iPrivate .Eachthreadintherowdoesasinglescalarmultiplicationwithanelemen t inthesharedmemory W iPrivate columnthatcorrespondstothethread'scolumn.For example,row1,column4oftheblockofthreadswilldoasinglemultiplica tionofits singlelocal G 1 ; 4 valuewithcolumn1,row4of W iPrivate .Theresultofthisrststep isthateachofthe4rowsinourblockofthreadsnowhasa8threads eachcontaining thesingleproductofanelementin G andthecorrespondingelementin W iPrivate .The nextstepinamatrixmultiplicationistosumthese8elementstoforman entryineach ofthe4rowsinourresultmatrix.Thissummationisdonewithaserieso f\shue down"operationsthatpassdatabetweenthreadswithinoursingle warp.Thereduction sumoperationacrossour4rowsyields4entriestotheresultmatrix .Thisoperation isrepeated w timestocompletethematrixmultiplication.Withinthesameloopof w steps,thetransposeoperationisdonetoperform G T W oPrivate Fastvectorkernel Thevectorkernelbehavessimilarlytothematrixkernel.Eachkern elusesablockof 32threads.Thethreadsrstloadtwovectorsintosharedmemor yfromglobalmemory. Eachofthethreadsthenworksonasubsetofelementsineachvec tortoperformelementwisemultiplication,andthenthethreadsuseashueoperationtore ducetheindividual productsdowntoasingledotproduct.Thesingledotproductisbro adcasttoall32 threads,whichthenupdatesegmentsoftheglobalmemoryvecto rsinparallel.Thevector kerneldoesnotperformasfastasthematrixkernel,butitisnece ssarytohandlethe trainingdatathatdoesnottnicelyinto4 8sub-matrices. Experimentsandresults Theexperimentspresentedbelowwereperformedonasinglenodeo ftheHeracles clusterinourParallelandDistributedSystemslab(http://PDS.ucde 2IntelXeonE5-2650v4Broadwell-EPCPUswith24totalcores(12 coresperprocessor). 30


TheCPUsrunat2.20GHz,andthenodehas128GBRAM.Theoperatin gsystem isCentOSLinuxrelease7.2.1511.CodewascompiledwiththeIntelC++ Compiler version17.0.1andIntelMKLversion17.0Update1fortestingWomba t+MKL,and GCCversion4.8.5wasusedtocompileandtestWombatwithoutMKL.Als oonthis nodeare4NVIDIATeslaP10016GB\Pascal"SXM2GPUs.OneGPUwas usedforall WombatandBIDMachexperiments.GPUcodewascompiledwithNVCCV 8.0.61.The CUDAdriverisversion8.0,withcomputecapability6.0. GensimisapopularPythonNLPframeworkthathasoneofthemostp opularimplementationsofWord2Vec[20].It'spopularbecauseitintegratesw itharichPython ecosystemofmathandmachinelearninglibraries,andthereisalarge amountofonline documentationthatexplainshowtousethelibrary.Wecompareour resultstotheperformanceofGensim,aswellastoCythnnandtheoriginalCimplement ation.Cythnn containsaWord2Vecimplementationthatattemptstospeeduphier archicalsoftmaxby reducingtheamountofcachecoherenceconricts[13],butthetec hniqueproposeddoes notbatchtrainingsetsintomatrix-to-matrixoperations,whichisa bigpartofwhathas madeIntel'sworkin[12]withthenegativesamplingsosuccessful.Ins tead,Cythnnsimplycopiessetsoftrainingdataintothread-localmemory.Thelocal dataisupdatedin placeofsharedmemoryupdates,andthelocalcopiesarewrittenb acktosharedmemory onlyoccasionally.Theresultisa2xto4xspeedup,asthecachecohe renceconrictsare indeedreduced.However,thetechniquedoesnotscalewelltomult ipleprocessors,and doesnotprovidetheopportunitytousehighlyoptimizedmatrix-tomatrixlibrarycalls. TheBIDMachimplementationofWord2Vecclaimstohavethefastest GPUimplementationofSGNSWord2Vec[18].However,ouranalysisandtestso ftheBIDMach modelhaverevealedrawsintheimplementationthataecttheaccu racyofthemodel output.WehaveadjustedtheparametersofBIDMachtogetuse ableoutput,andfound speedstobelowerthanwhathasbeenreported.Wecompareourim plementationtothe speedandaccuracyofBIDMach. 31


24681012141618202224 0 1 ; 000 2 ; 000 3 ; 000 NumberofcoresThousandwords/second(total) C Gensim Wombat Wombat+MKL Cythnn Figure15:SpeedupforbatchedhierarchicalsoftmaxCPUmethod comparedtoleading implementations. Accuracywasevaluatedforwordsimilarityscoresandwordanalogy scoresforbaselinecomparisonwithvariousmodels.TheWS-353dataset[22]contain salargesetof wordpairs.Eachwordpairisgivenasimilarityscorethatissourcedfr omhumanjudgement.Toevaluatetheaccuracyofourmodel,weusetheSpearman srankcorrelation coecient[24]betweenthehumanjudgementandthecosinesimilarit yofourwordvectorsthatrepresentthewordsineachpair.Valuesofthissimilarityt hatmeasureclose to100representamodelthatcandetectsimilaritiesbetweenword saswellasahuman. WealsousedtheGoogleanalogydataset[7]toevaluatetheperform anceofthevectors inansweringll-in-the-blankanalogyquestions.Thescorerangesf rom0to100and representsthepercentageofanalogiescorrectlyanswered.CPUBatchedHierarchicalSoftmax TheWord2Vecparametersusedwereasfollows:windowsizewas100 ,trainingle isthe\text8"datasetfromMattMahoney'swebsite 1 (thisisatruncatedandcleaned exportofWikipediaarticlesthatisoftenusedtotesttheperforma nceofWord2Vec),the 1 32


Table1:EvaluationofhierarchicalsoftmaxaccuracyinnewWombat modelcompared toCythnnandoriginalWord2Vec ProgramSimilarityscoreAnalogyscore Wombat68.731.5Cythnnwithtop-31cache67.731.1OriginalWord2Vec68.034.0 downsamplingparameteris1e-4,10trainingiterationsareusedand thewindowsizeis 8.Cythnnwasconguredtocachethetop31nodesinthehierarch y. Theresultsshowmorethana4xspeedupwhenusingWombat'sbatch edtraining methodwithoutIntelMKLovertheoriginalCimplementationonall24 cores.Thescalabilityislinearacrossthetwoprocessorsforthenon-MKLbatched method.Cythnndoes notscaletothesecondprocessor.Thisislikelyduetorestrictionso fPythonitself.The GlobalInterpreterLock(GIL)inPythonpreventsthethreadsf romscalingtomultiple processors.ThesameproblemexistsforGensim.TheproposedC+ +implementation doesnotsuerfromthisproblem,whichallowsittoreachgreatersp eedsonmultiple processors.AlthoughCythnnoutperformstheWombatmethodo nasingleprocessor, Wombat'sMKL-enhancedimplementationismuchfasterinallnumbers ofcoresand processors. ThereisadipinWombat'sperformancewhenthethreadsstarttous ethesecond processorinthebatchedversionwithMKL.Eachprocessorhas12 cores,andeachset of12coressharesanL3cache.Whenlessthan13coresareused, alloftheworkis doneinasingleprocessor.When13ormorecoresareused,asecon dL3cacheanda secondprocessorareintroduced.Thereisalotofdatabeingshar edbyeachprocessor,so thereisanadditionallayerofcachecoherenceconrictsbetweent hetwoprocessors.These conrictsareslowertoresolvebecausethetwoL3cachesresideins eparateprocessors.An increaseofthreadsrunninginthesecondprocessoryieldsanincre aseofcacheconricts, 33


andanincreaseofdatatransfersthroughmainmemory.Theseinc reasesresultina degradationofoverallperformance.WhenMKLisabsent,thisdeg radationishidden bytheextratimeittakestodothematrixoperations.Thisproblemo nlybecomes visiblewhenthoseoperationsarefastenoughtoleavebehindtheme morytransferas theprimarybottlenecktoperformance.At12cores,Wombatwith MKLismorethan 10xfasterthantheoriginalCimplementation.Asfarasweknow,th isisthefastest implementationofskipgramhierarchicalsoftmaxWord2Vectodate Theaccuracyofeachimplementationisextremelysimilar.Thereisaslig htlossof accuracyinbothCythnnandtheWombatmodelintheanalogyscore s,buttheword similarityscoresremainhigh.Theincreasedspeedcomesataminimallo ssofaccuracy. GPUExperiments TheWord2Vecparametersareasfollows:wordvectorsize100,co ntextwindowis8, 5negativesamplesareused,downsamplingsetto1e-4,10iteration sandalearningrate of0.025. TheWombatprogramalsotakesthenumberofCPUthreadsasapar ameter.These experimentswereexecutedwith8CPUthreads.Thisincreasesthe rateatwhichGPU trainingbatchesarecollectedfromthetrainingle. Batchsizesgreaterthan512forWombatdonotincreasespeed,a ndstarttocause problemswiththewaythatsentencesarebatched,soallexperime ntsforWombatusea batchsizeof512target/contextsets(splitintoavariablenumber ofmatrixandvector batches).TheBIDMachbatchsizesarenotdocumented.It'snot clearwhatthismeans, butwediscoveredthatreducingthebatchsizedoesseemtohavea neectonaccuracy andspeed.Reducingthebatchsizetovaluessmallerthan100,000ca usestheprogram togenerateerrors. NotethatthespeedsreportedherefortheBIDMachGPUimplemen tationaremuch lowerforBIDMachthanwhathasbeenwidelyreportedinotherrese arch.In[18]agure 34


Table2:ExecutiontimeandaccuracyofWombatvs.BIDMach. ProgramBatchsizeWordsimilarityscoreAnalogyscoreWords/sec BIDMachGPUSGNS1,000,000242.93.236MBIDMachGPUSGNS100,00033.51.61.492M WombatGPUSGNS51268.231.94.633M WombatGPUSG HS 51267.232.21.942M of8.5millionwords/secondisreported.Thisspeedispossiblewiththed efaultlearning rateand vexp parameters,buttheoutputmodelreturns0%accuracyinalltes tswith theseparameters.Theexactfunctionofthe vexp parameterisnotdocumented.Onlyby settingthisto0andreducingthelearningratedoesthemodelactu allyproduceuseful vectors.Itthenproducesthemattheslowerspeedsreportedin Table2.Alsonotethat BIDMach'swordspersecondreportsdonotfactorintherequired preprocessingofthe trainingdatathatisuniquetoBIDMach.Wombatdoesnotrequirean ypreprocessing. TheWombatGPUimplementationsforSGNSaresimilarinspeedtoBIDMa chorfaster, andtheyprovidemuchgreateraccuracy. 35


CHAPTERIV FASTERFASTTEXT FastTextisanopensourcelibrarydevelopedbyFacebookAIResea rch.Thelibrary hasshownthatuseofsub-wordinformationgreatlyimprovesthea ccuracyofthevectorrepresentationsofwordsmadepopularbyWord2Vec[16].Howe ver,theper-thread performanceoftheFastTextlibrarydegradesasthenumberoft hreadsincreaseson multi-core,shared-memorysystems.Ouranalysisshowsthistobe largelyduetocache coherenceconricts.Addressingthisissueisthegoalofthiswork. Werestructuretheimplementationandenhancethelevelofparalle lismandthealgorithmsscalabilitybybatchingtrainingstepsintoisolatedvector-tomatrixoperationsin thread-localmemory.Theexistingimplementationusessharedwrit abledatastructures throughouttheexecution.Thesestructuresdonotlendthemse lveswellforoptimized performanceincurrentmulti-coreshared-memoryarchitecture swithprivatelevelone andtwocacheorganizations.Forexample,thecurrentvector-t o-vectoroperationsare handledinloopsthatinteractwithsharedmemory.Whenmultiplethre adsarewriting updatestothesamevectorsinsharedmemory,thethreadexecu tionisrepeatedlyinterruptedbycacheblockinvalidationstoenforcecoherence[12][1 3].Byrstcopying smallbatchestothread-localmemory,theeectoftheseconric tsislargelyreduced.The batchescanthenbetrainedwithasetofoptimizedvector-to-mat rixoperations. Useofhighlyoptimizedvectorandmatrixlibrariescangreatlyreduce executiontime ofapplicationslikeFastTextthathavelinearalgebraroutinesintheir core[15].Wehave restructuredFastTexttousealinearalgebralibrary.Thisalonesp eedsupFastText,and providesevengreaterimprovementtospeedwhencoupledwithour proposedvector-tomatrixtrainingbatches.Matrix-to-matrixoperationswouldprovid efurtheroptimization [12],butwehavefoundthatitisnotpossibletoreducetheoperation sofeachFastText trainingsteptoamatrix-to-matrixoperationwithoutintroducings ignicantoverhead, orreducedaccuracyoftheoutputmodel. 36


Thereductioninperformancecreatedbycacheconrictscanbefu rtherreducedby re-usingthenegativesamplescollectedforeachtrainingstep.Neg ativesamplere-use reducesthenumberofshared-memoryupdatesthatarenecess aryduringeachtraining operation[12][18].Wealsoproposeawaytoreducerepeatednegat ivesampleusetoa simplerscalarmultiplicationofsingleroatingpointvalues.Thisreduces theamountof calculationsrequiredateachtrainingstep,andalsofurtherreduc esthechancesofcache coherenceconricts. FastTextskipgramwithnegativesampling Inadditiontolearningvectorsforindividualwords,FastTextalsole arnsvectorsfor then-gramsthatarefoundwithineachtargetword.Ateachtrain ingstepinFastText, themeanofthetargetwordvectoranditscomponentn-gramvec torsareusedtoactivate thehiddenlayer.Thegradientvectorthatiscalculatedineachtrain ingstepisthenused touniformlyupdateeachofthevectorsthatwerecombinedtofor mthetarget.Going backtotheexamplesentenceabove,thetargetword"cute"wou ldyieldahiddenlayer thatisthemeanofthen-gramvectorsfor"cut,ute,andcute."T hese3vectorswould thenbeupdatedwiththesamegradientvectoraftertheerrorisc alculatedinthetraining step. Thisaddsalotofadditionalcomputationtoeachtrainingstep,espe ciallyforlonger wordsthatarecomposedofmorethanjust3n-grams.Ateachpo int,awordneedsto sumandaverageitsn-gramcomponentparts.Thetrade-oisase tofword-vectorsthat containembeddedsub-wordinformation.Thesevectorshavebee nshowntobemore accuratethanWord2Vecvectorsbyanumberofdierentmeasur es[16]. NegativesamplinginFastText InFastText,thenegativewordsaresampledfromthecontextwo rdvectormatrix ratherthanthetargetwordmatrix.Thecontextwordsandnega tivesamplesresidein W o 37


not W i .Targetwordsareasummationandaverageofthetargetwordve ctorin W i and thetargetword'ssetofn-gramvectorsin W i .Ateachtrainingstep,avector-to-vector dotproductisusedtocalculatethenetworkactivationforcontex t/samplewordsfrom W o andthemeanofthevectorsfrom W i .Themeanisstoredinthread-localmemory, butthe W o context/samplevectorsusedinthisoperationremaininsharedmem ory. Sinceboththesamplesandcontextwordsarecomingfrom W o inFastText,itis dicultifnotimpossibletocreateamatrix-to-matrixoperation.Ino rdertoreduce thetrainingsteptoamatrix-to-matrixoperation,thenegativewo rdswouldhavetobe sampledfromthesamesourceasthetargetword.Thetwosetsof wordvectorsinthe trainingstepmustcomefromdierentweightmatrices,orelsethet rainingprocedureis reducedtoupdatingonlyasingleweightmatrixintheneuralnetwork .Theinputmatrix, W i ,isthesetofwordvectorsthatwillbetheoutputwordvectorsof thealgorithm.It makesnosensetocreateextraworkintrainingonly W o Itisalsonotusefultotrytocreateasetofnegativesamplesfrom W i inFastText. Becausethevectorstakenfrom W i aresummedandaveraged,samplingfrom W i would addaconsiderableamountofcomputationtotheprocedure.Each negativesamplewould havetobesummedandaveragedwithitssub-wordn-gramvectors .Skippingthesumand-averagestepfornegativesamplescouldpreventtheneedfo rextrawork,butdoing sohasanegativeeectonthemodel'saccuracy. ProposedimprovementstoFastText Weproposeasetofvector-to-matrixoperationsasanalternativ etotheSGNSFastTextcoretrainingloop.Ratherthantraineachtarget/contextw ordpairindividually, thecompletesetofcontextwordssurroundingatargetwordwith inacontextwindow aretrainedinasinglebatchinthread-localmemory. Thisisaccomplishedbyrstaggregatingthesetofcontextwordve ctorsfromthe global W o weightmatrixintoalocalcontextwordmatrix, W oLocal .Foreachvector 38


appendedto W oLocal ,weneedtoconsider k negativesamples,where k isthenumberof negativesamplesusedwhentrainingeachtarget/contextwordpa ir.Thewordvectors in W o thatrepresentour k negativesamplesshouldalsobeaddedtothecontextword matrix, W oLocal Ratherthansampleanewsetofwordsforeachtarget/contextw ordpair,wepropose toinsteaduseasinglesetofnegativesamplesforalltarget/conte xtwordpairsthatappear inthebatchedcontextwindow.However,becauseweareputtingc ontextvectorsand negativesamplesintothesamematrix,thissamplere-useresultsind uplicatevectorsin W oLocal ,asshowninFigure16. wordvectorsize P targets count ( targets ) 266666666664 wordvectorsize context 1 sample 1 ... sample k context 2 sample 1 ... sample k:::context n sample 1 ... sample k377777777775 Figure16:Vector-to-matrixbatchshowingduplicatesamplevecto rsinbold. Thisduplicationofvectorsinthematrixwouldresultinduplicationofwo rkin thevector-to-matrixoperation.Removingtheseduplicatevecto rswouldgreatlyaect theratioofnegativesamplestocontextwordsineachtrainingbatc h.Thisnegatively aectstheaccuracyofthemodel'soutput.Thesample/contextr atioshouldremainthe sameasitisintheoriginalFastTextimplementationtoensurethatth eeectofthe negativesamplesonthemodelaccuracyisnotreduced.Theeect ofusingthesame 39


samplesrepeatedlycanbereproducedwithasimplermultiplicationope rationofthe errorgradient,inplaceofkeepingduplicatevectorsinthecontext matrix. wordvectorsize P targets count ( targets ) 266666666664 wordvectorsize context 1 context 2:::context n sample 1 ... sample k377777777775 Figure17:Vector-to-matrixbatchwithduplicatesremoved.Asing lesetof k samples remains. Whenthetargetvectorismultipliedbythecontextwordmatrix,asis shownin Figure17,theresultisa1 ( n + k )vector,where n isthenumberofcontextwords inthetrainingset,and k isthenumberofnegativesamples.Applyingthesigmoid functiontoeachelementinthisvectoryieldsavectorofvaluesthat canbeusedto calculateupdatestothetargetvectorandtoeachcontextword vector.Thisvectorcan beadjustedtoreproducetheeectofeachnegativesampleappe aring n timesforeach ofthe n contextwords.Theelementsinthegradientvectorrepresenting samplessimply needtobemultipliedby n .Theresult,whenusingthisvectortocalculateupdatesto W i and W o ,ismathematicallyequivalenttohavingduplicatecolumnsinthecontex t matrix,withoutgeneratingtheduplicatework.Experimentsshowt hatincludingthis operationdoesmaketheresultingoutputmoreaccuratethansimp lyeliminatingthe duplicatevectors.ThisoperationisillustratedinFigure18. 40


context 1 :::context n sample 1 :::sample k P targets count ( targets ) f:::fn f:::n f Figure18:Multiplysamplegradientsby n ratherthancalculatetheirdotproduct n times. ThecompletetrainingalgorithmusedisdescribedinAlgorithm5. Experiments Hardware: TheexperimentspresentedinthispaperwereperformedonasinglecomputenodeoftheHeraclesclusterinourParallelandDistribut edSystemslab ( adwell-EPCPUswith 24totalcores(12coresperprocessor).TheCPUsrunat2.20GH z,andthenodehas 128GBRAM. Software: TheoperatingsystemisCentOSLinuxrelease7.2.1511.Codewascom piledwiththeIntelC++Compilerversion17.0.1andIntelMKLversion1 7.0Update 1. EigenforBLAS,MKL: EigenisaC++templatelibraryforlinearalgebra.Ithas noexternaldependenciesotherthantheC++standardlibrary.I tworksasaninterface toIntelMKL,Apple'sAccelerateframeworkonOSX,OpenBLAS,Ne tlibLAPACKand otherBLASbackends.Thelibraryoersanintuitivesyntaxthatma kesiteasyto integrateanumberofdierentlinearalgebraoptimizationswithavar ietyofcompilers [17]. 41


Algorithm5: Proposedvector-to-matrixtraining 1 foreach trainingwords do 2 generate k negativesamples ; 3 n numberofcontextwordsaroundtarget ; 4 for i to n do 5 W oLocal [ i ] W o [ context i ]; //globalmemory 6 for j to knegativesamples do 7 W oLocal [ n + j ] W o [ sample j ]; //globalmemory 8 V iLocal = zeros ; 9 foreach subwordoftarget do 10 V iLocal = V iLocal + W i [ subword ]; //globalmemory 11 V iLocal = V iLocal numberofsubwords ; //FastText 12 G W oLocal V T iLocal ; //localmemory,vector-to-matrix 13 calculateerrorin G ; 14 for j to knegativesamples do 15 W oLocal [ n + j ] W oLocal [ n + j ] n ; 16 U i G T W oLocal ; //localmemory,vector-to-matrix 17 U o G V iLocal ; //localmemory,vector-to-vector 18 update W o from U i ; //globalmemory 19 update W i from U o ; //globalmemory 42


FastText'srstreleasealreadyusesmatrixandvectorclassesth atwerewrittenin pureC++.ByrewritingthecoreoperationsintheseC++classestoc alltheEigen library,itispossibletoachieveimmediatespeedupevenforthevecto r-to-vectoroperations.Fortheexperimentsinthispaper,allmatrixandvectorcalls intheoriginal FastTextimplementationhavebeenreplacedwithcallstotheEigenlibr ary,version 3.3.3.TheFastTextlibraryhasthenbeencompiledwithIntel'sICPCco mpilerwiththe macroEIGEN USE MKL ALLdened.Asaresult,allvectorandmatrixoperationsare compiledtouseIntel'sMKLsubroutines. ThespeedupthatcomesfromaresultofusingEigenisreportedsep aratelyfromthe speedupachievedthroughbatchedtrainingstepsinFigure19. Trainingcorpora: Codewastestedusingtheenwik9Wikipediadumphostedon MattMahoney'swebsite.ThistestsetisthesamelelinkedfromFace book'sexample shellscripts. Evaluation: AswiththemajorityofliteratureonWord2Vec,thespeedresultsa re measuredintermsofwordsprocessedpersecond,andaccuracy wasevaluatedinterms ofawordsimilaritytestandawordanalogytest.Thetestsandresu ltsareexplainedin greaterdetailinSection. Code: ExperimentswereconductedwithamodiedversionofFacebook'so pen sourceFastTextcode.Themodicationswillbereleasedinaforked branchonGitHub, andapullrequestwillbeissuedonpublication. FastTextparameters: TheparametersaresimilartothoseusedbyFacebook's initialreportofFastText[16].Learningrateis0.025, w is100,windowsizeis5,5 trainingepochsareused,5negativesamples,2000000n-grams,2 4threadsandn-grams between3and6characterswereused.Results Withoutalossofspeedwithasmallernumberofcores,thenewappro achtotraining theFastTextmodelshowsaspeedupofabout2.5.Someofthisspe edupcomesmerely 43


fromintroducingEigenandMKLtothematrixandvectoroperations .Byreplacing vectoroperationswithEigencalls,whichinturncallIntelMKL'ssubr outines,thereisa speedupofapproximately1.4,thisislabeledinFigure19as FastTextwithEigen .By batchingtheoperationsintolocalmemoryandusingvector-to-ma trixoperations,there's anadditionalspeedupof1.78,thisislabeledinFigure19as BatchedSGNSWith Eigen .Thesespeedupsarecalculatedbasedonnumbersachievedwithall 24physical coresworking.Alargepercentageofthespeeduphereiscomingfr omareductionof cachecoherenceconricts.Ourresultsshowanalmostlinearincrea seinthenumberof wordsprocessedpersecondindicatingfurtherimprovementands calewithadditional coresisexpected. ThebatchmethodevenoutperformstheoriginalWord2Veccode, butallimplementationsofFastTextaredwarfedbytheperformanceofIntel'sma trix-to-matrixbatch approachtoWord2Vec.Thematrix-to-matrixbatchoperationsa rehighlyoptimized, buttherearealsofewervectorstoupdateintheWord2Vecmodel thantherearein FastText.Itmakessensethatamoreoptimizedalgorithmthatultim atelyhaslesswork todowouldbefaster,butthisfasterWord2Vecmodellacksthesu b-wordinformation containedintheFastTextmodels. TheaccuracyofourmodelsisreportedinTable3.TheWS-353datas et[22]and therareword(WS-RW)dataset[23]bothcontainalargesetofwor dpairs.Eachword pairisgivenasimilarityscorethatissourcedfromhumanjudgement. Toevaluatethe accuracyofourmodel,weusetheSpearmansrankcorrelationcoe cient[24]between thehumanjudgementandthecosinesimilarityofourwordvectorst hatrepresentthe wordsineachpair.Valuesofthissimilaritythatmeasurecloseto100r epresentamodel thatcandetectsimilaritiesbetweenwordsaswellasahuman.Wealso usedtheGoogle analogydataset[7]toevaluatetheperformanceofthevectorsin answeringll-in-theblankanalogyquestions.Thescorerangesfrom0to100andrepre sentsthepercentage ofanalogiescorrectlyanswered. 44


24681012141618202224 0 2 4 6 8 10 NumberofcoresMillionwords/second BatchedSGNSwithEigen FastTextwithEigen FastText IntelBatchedW2V OriginalW2V Figure19:PerformancecomparisonoforiginalFastTextcompare dtoproposedoptimized versionsandWord2Vec.Evaluatedontheenwiki9wordset. Therarewordstestshowstheusefulnessofthen-gramsub-wor dinformation.FastTextmodelsoutperformtheWord2Vecmodelsonthistestbecause thesub-worddata drawnfrommorecommonwordsenrichesthevectorrepresentat ionsoftherarewords. Weusethistesttoshowthatourproposedtrainingmethodsdonot degradethisimportantbenetoftheFastTextmodels. InTable3, BatchedSGNS1 isthebatchedmethodthatusesscalarmultiplication ontheerrorgradients(describedinSection). BatchedSGNS2 showstheresultsofthe sametechnique,withoutthegradientmultiplication.Thespeedsoft hesetwomethods areapproximatelythesame.Thewordsimilarityandwordanalogysco resareboth increasedwiththemultiplicationoperation,butitseemstherarewor dstestactually showsadecreaseinaccuracywhenthegradientsareadjustedwit hscalarmultiplication. Overall,bothbatchedmethodsshowadecreaseinwordanalogysco resfromtheoriginal FastTextimplementationandfromWord2Vec.Re-usingasmallernum berofnegative 45


Table3:EvaluationofmodelaccuracyinFastTextvs.newlypropos edbatchmethodof FastTextandWord2Vecmodels ProgramWS-353WordAnalogyWS-RW FastTextSGNS69.949.144BatchedSGNS170.139.342BatchedSGNS267.435.744Word2VecSGNS70.747.637Intel'sBatchSGNS71.248.037 samplesdecreasesthediversityinthetrainingsteps,whichwouldex plainthelossof accuracy.TherarewordsscoresremainhighforthebatchedFas tTextimplementation, whichsuggeststhatthesub-wordinformationcontinuestoenrich thevectors. Conclusionandfuturework Byrestructuringthetrainingoperationstothread-localvector -to-matrixoperations, andbatchingnegativesamples,aclearincreaseinscalabilityandperf ormanceofFastText isobtained.Ifpossible,batchingoperationstoenablematrix-to-m atrixoperationsthat woulduseamoreoptimizedlevel-3BLASroutinecouldyieldevenbetter performance. ThekindofbatchingproposedherehasalsoproventoworkwellonG PUandmultinodesystems[12][18].Extendingthisideatoamulti-nodeapproacho rtoaGPUwould likelyproduceevenfurtherspeedup. Lastly,thebatchingmethodusedhereforSGNSFastTextwouldlike lyworkwell withhierarchicalsoftmax,too.Thespeedupfromre-usingnegat ivesampleswouldbe absentfromabatchedhierarchicalsoftmax,butthethread-loc al,vector-to-matrixtrainingmethodwouldcertainlyimproveperformance. 46


CHAPTERV Conclusion InthisworkwehavepresentednewmethodsforcalculatingWord2V ecwordvectors atspeedsfasterthanpreviouslyreported.Wehaveappliedthese techniquestoboth theCPUandtheGPUwithsuccess.Whileresearchingthesetechniqu eswefoundand reportedrawsinapopularmachinelearninglibrary,andprovidedsug gestionsforimprovement.WehavealsoappliedourtechniquestothenewerFastT extlibrarywith success. Compilingthisbodyofworkhasprovidedgreatinsightintowhattechn iquesare availableforspeedingupmachinelearningmodels.Increasingthread localityplaysa greatroleinreducingcachecoherenceconricts,whichinturnprov idesenormousspeedup. Useofhighlyoptimizedlinearalgebraroutinescanalsoprovideaquicka ndeasyroute tospeedingupapplicationsthathavelinearalgebraintheircore.Car efulanalysisofall aspectsofagivenprogramproblemarenecessarytopreventove rlookingtheaccuracyof theprograminhopesofachievinghigherspeeds. WehavecompiledourWord2VecideasintoasingleframeworkcalledWom bat,and willreleaseitonGitHubtothepublic.Wehopetocontinuedevelopingth istoextend ourmethodstootherWord2Vecapproaches,andtoextendtheG PUspeeduptothe newerFastTextlibrary. 47


REFERENCES [1] ManningC.andSh utze,H. FoundationsofStatisticalNaturalLanguageProcessing.MITPress,Cambridge,MA,USA,1999. [2] Bengio,Y.,Ducharme,R.,Vincent,P.andJauvin,C. ANeuralProbabilisticLanguageModel. JournalofMachineLearningResearch,3 (2003).1137-1155. [3] Collobert,R.andWeston,J. .Auniedarchitecturefornaturallanguage Proceedingsofthe25th internationalconferenceonmachinelearning ,(Helsinki,Finland,2008),Omnipress, 160-167. [4] Mnih,A.andHinton,G. Ascalablehierarchicaldistributedlanguagemodel. Advancesinneuralinformationprocessingsystems,21 (2009).1081-1088. [5] FellbaumC. WordNet:anelectroniclexicaldatabase.MITPress,Cambridge, MA,USA,1998. [6] BengioM.andBengioY. Hierarchicalprobabilisticneuralnetworklanguage Proceedingsoftheinternationalworkshoponarticialint elligenceand statistics ,(Barbados,2005).SocietyofArticialIntelligenceandStatistic s,246-252. [7] Mikolov,T.,Chen,K.,Corrado,G.andDean,J. ,Ecientestimationof ICLRWorkshop, (Scottsdale,AZ,USA, 2013). [8] Mikolov,T.,Sutskever,K.,Chen,K.,Corrado,G.andDean,J DistributedRepresentationofWordsandPhrasesandtheirComposit ionality. Advances inNeuralInformationProcessingSystems,26 (2013).3111-3119. [9] Wang,Z.,Ma,L.andZhang,Y. Anovelmethodfordocumentsummarization ICCI*CC (StandfordUniversity,CA.,USA,2016),IEEE. [10] Mayank,D.,Padmanabhan,K.andPal,K. Multi-sentimentmodelingwith scalablesystematiclabeleddatagenerationviaWord2Vecclustering .in ICDMW (Barcelona,Spain,2016),IEEE. [11] Mikolov,T.,Le,Q.andSutskever,I. ,Exploitingsimilaritiesamonglanguagesformachinetranslation. arXivpreprintarXiv:1309.4168 ,2013. 48


[12] Ji,S.,Satish,N.,Li,S.andDubeyP. ParallelizingWord2Vecinsharedand distributedmemory. preprintarXiv:1604.04661v2 ,2016. [13] Vuurens,J.,Eickhoff,C.anddeVries,A. EcientParallelLearningof Word2Vec. preprintarXiv:1606.07822 ,2016. [14] Niu,F.,Recht,B.,Re,C.andWright,S. ,Hogwild!:ALock-FreeApproach AdvancesinNeuralInformation ProcessingSystems ,(Granada,Spain,2011),NeuralInformationProcessingSyste ms Foundation,Inc,693-701. [15] Blackford,L.S.,Demmel,J.,Dongarra,J.,Duff,I.,Hammarl ing,S., Henry,G.,Heroux,M.,Kaufman,L.,Lumsdaine,A.,Petitet,A.,P ozo, R.,Remington,K.andWhaley,R.C. Anupdatedsetofbasiclinearalgebra subprograms(blas). ACMTrans.MathematicalSoftware,28 (2)135-151,2002. [16] Bojanowski,P.,Grave,E.,Joulin,A.,Mikolov,T. EnrichingWordVectors withSubwordInformation. preprintarXiv:1607.04606 ,2016. [17] Guennebaud,G.,Jacob,B.andothers. Eigenv3. 2010. [18] CannyJ.,ZhaoH.,ChenY.,JarosB.andMaoJ. ,Machinelearningat IEEEInternationalConferenceonBigData ,(SantaClara,CA,USA, 2015),IEEE,233-242. [19] BAE,S.andYi,Y. HiroseA.,Ozawa S.,DoyaK.,IkedaK.,LeeM.,LiuD.(eds)NeuralInformation Processing.ICONIP 2016.LectureNotesinComputerScience,vol9948 ,(2016),Springer,269-279. [20] Reh u rek,R.andSojka,P. ,SoftwareFrameworkforTopicModellingwithLarge ProceedingsoftheLREC2010WorkshoponNewChallengesforN LP Frameworks ,(Vallettta,Malta,2010),ELRA,45-50. [21] Levy,O.Goldberg,Y.andDagan,I. ImprovingDistributionalSimilarity withLessonsLearnedfromWordEmbeddings. TransactionsoftheAssociationfor ComputationalLinguistics,3 .211-225,2015. [22]Finkelstein,L.,GabrilovichE.,MatiasY.,RivlinE.,SolanZ.,WolfmanG.a nd RuppinE., ACMTransactions onInformationSystems,20 (2002)116-131. 49


[23]LuongT.,SocherR.andManningC.,Betterwordrepresentatio nswithrecursive ProceedingsoftheSeventeenthConferenceon ComputationalNaturalLanguageLearning (Soa,Bulgaria,2013),Associationfor ComputationalLinguistics. [24]Spearman,C.,Theproofandmeasurementofassociationbetw eentwothings. AmericanJournalofPsychology,15(1904). 50