Overview
========

What is NERV?
-------------

NERV is a general-purpose deep learning toolkit designed to be simple,
lightweight, extensible and complete. The name "NERV" comes from the German
word "Nerv", meaning "nerve" in English. It is also the name of a fictional
organization in *Neon Genesis Evangelion*, a famous anime.


Introduction and Philosophy
---------------------------

In recent years, the invention and prevalence of the deep neural network (DNN)
and related deep learning techniques have given rise to many tools and toolkits
designed for constructing and training neural networks, which facilitate and
routinize the research cycle of deep learning applied to areas such as speech
processing, computer vision, natural language processing and so on. Such tools
or toolkits can be divided into two categories by design: task-specific or
general-purpose.

The first category tries to address deep learning in a direct way. These tools
usually have a very specific goal: they support a certain type of neural
network model and provide peripheral facilities dedicated to one specific
task, such as image classification or phone state prediction. Not only is the
network model hard-coded into the very core of the tool, but modifications or
tricks specific to a certain task are also wired into the code.

The effect of this approach is two-fold. On the one hand, such tools are
complete and tightly designed systems that provide a simple user interface, so
researchers who are familiar with that specific area can use them easily.
Also, because the network is hard-coded and task-specific, the implementation
can be simple and the optimization effective, which offers very efficient
running speed (such as CURRENNT). On the other hand, this usability and
efficiency come at the cost of reusability and flexibility. People need to
hack the code and make modifications to adapt the tool to another network
structure, and it is very difficult and tricky to use a tool designed for one
specific area (for example, image classification) in another area (like speech
recognition). Engineering details and implementation tricks deter people from
doing so.

Caffe, which is designed for computer vision, has a comprehensive variety of
tools for processing images and training convolutional neural networks (CNNs).
But it cannot be directly applied to speech processing tasks. Luckily, there is
Kaldi, a counterpart in speech processing that can process waveform files,
extract acoustic features, and train GMM-HMM models, fully-connected DNN
models, LSTM models and so on.

The second category strives to be general-purpose. As mentioned above, there
are two types of generality:

- General among different network structures
- General among different tasks

The advantage of such a general approach is obvious: we can train different
network structures as long as the required basic computation units are provided
by the toolkit. Besides, general-purpose toolkits usually have a unified
interface for data input and output (I/O), which is the most task-specific
part. Therefore, the task-specific implementation details are confined to
separate I/O modules as much as possible. This design concept is also present
in operating systems, where device-specific code is modularized and isolated
from the core and several abstraction layers are often used. By this approach,
general-purpose toolkits can potentially be adapted to various kinds of tasks
without hacking or changing core code. Admittedly, this generality comes at the
cost of implementation simplicity and efficiency. However, the overhead brought
by abstraction and a relatively complex design is not a huge problem given the
importance of those benefits. In recent years, many new network structures have
been proposed, examined, and applied to various kinds of tasks. Given this
trend and the difficulty of hacking or modifying task-specific tools, the
benefits of generality outweigh those concerns.

There are some well-known and successful general-purpose deep learning
toolkits. They all have their strengths and weaknesses.

Theano is a numerical computation library for Python. It supports mainstream
neural network structures such as the fully-connected DNN, the CNN, the
recurrent neural network (RNN) and its variants like the long short-term
memory (LSTM) network. It also has a short learning curve once the user
becomes familiar with symbolic computation, and it appears friendly to new
users. However, some concepts (such as "scan") are not very user-friendly, the
compilation time of the network can be very long for complex models, the
Python-based environment means the toolkit cannot be lightweight, and the
large runtime footprint makes it hard to port the toolkit to embedded
environments (where resources are very limited). So here comes a strong
competitor, Torch.

Torch is an open source machine learning library whose initial release dates
back to 2002. Its design goal is similar: to provide a "MATLAB-like" low-level
computation library with separate functional blocks that users can combine to
build and train their own networks. Torch has a steeper learning curve than
Theano, but it is more lightweight, using Lua as the scripting language and
implementing only the time-consuming operations in C.

However, Torch is not perfect. Its limitation comes with its advantage: there
are few general training or network-building patterns inside the toolkit.
Torch plays a role more like MATLAB, so users need to write their own code to
put everything together: data I/O, mini-batching, training scheduling and so
on, none of which is a trivial task, yet each user repeats them in their own
scripts. This leads to a situation where each user has her own code base, and
on this level, it degrades Torch to a task-specific tool.

Imagine a user wants to build and train a state-of-the-art LSTM model for
acoustic modeling, and she needs to read from pre-existing feature files
extracted by a popular speech processing framework like HTK or Kaldi. She has
to implement the data I/O all on her own. Moreover, she has to implement
mini-batching, network unrolling, BPTT, etc., to deal with loops and schedule
the training. What's worse, when another user wants to train a different model
for the same task or train the same model for a different task, he has two
choices: to write his own training script or to copy the script from the
previous person. Neither choice is ideal, because it turns scripting into
reinventing the wheel or hacking someone else's code, which goes against the
goal of a general-purpose toolkit. In fact, Torch seemingly encourages a more
distributed and isolated development style, which can ease project management,
but it also implies less collaboration, because people no longer work together
and instead tend to write their own scripts with duplicate functionality. As a
result, there is less and less common code shared among users' scripts.

CNTK, developed by Microsoft Research, is an attempt to address these
problems. It is a complete solution that has the advantages of task-specific
tools, but also supports various kinds of neural networks by introducing a
scripting language called NDL (now being superseded by a new language,
BrainScript). It can be adapted to different tasks from speech processing to
computer vision. Nevertheless, CNTK does not satisfy all researchers. Due to
its highly coupled design, which is the opposite of the choice made by Torch,
researchers find it extremely hard to add new functionality. Unfortunately,
this need always exists as brand-new network structures or components are
proposed. Extensibility is a real problem for CNTK. Also, not being a
general-purpose programming language, the custom scripting language has
limited expressive power. To increase that power, more syntax and semantics
have to be introduced, which makes the learning curve even steeper.

Finally, we propose NERV, a simple but effective deep learning toolkit that
tries to address the issues above. It is designed with the following
characteristics in mind:

- simplicity: the learning curve is not steep and the code is straightforward

- extensibility: users can quickly become developers and add the missing
  modules or tailor the toolkit to their needs; the major building blocks in
  NERV are modularized and the interfaces are standardized, so users can plug
  in their own implementations and even use modules implemented by others

- lightweight: NERV strives to keep its core code base and dependencies
  minimal, which makes it fairly easy to embed into other task-specific tools,
  such as Kaldi (in speech processing)

- completeness: at the same time, NERV provides a complete pipeline for
  training and managing a neural network

Abstraction
-----------

NERV has a decoupled design, like Torch. It also uses Lua as its main
scripting language. Lua is extensively used in the video game industry because
of its short learning curve and the tiny footprint of its
interpreter/compiler. LuaJIT is a just-in-time compiler for Lua that provides
very high efficiency. Therefore, our design strives to confine the
performance-demanding parts (such as matrix operations and low-level I/O) to
C/C++ modules; by giving them Lua interfaces, the rest of the project stands
entirely upon pure Lua, so that users never have to code in C/C++.

To achieve the goal of being decoupled but complete, NERV uses a bottom-up
style of development and has several levels of abstraction. The following
figure depicts this abstraction from low (at the top of the figure) to high
(at the bottom of the figure) level.

.. image:: _static/nerv-hierarchy.svg

As shown in the figure, the brown boxes are implemented in pure C. Parts of
the classes inheriting from ``DataReader`` (such as ``HTKReader`` and
``KaldiReader``) and part of ``ChunkFile`` (which defines the standard storage
format) are also implemented in C/C++. Everything else is written in pure Lua.

This hierarchical design is the solution to the previously mentioned conflict
between extensibility and completeness. People who are merely interested in
using the current functionality do not have to write much code to get a
running example. They only need to invoke the ``Scheduler`` and provide a Lua
script containing declarative information about the model to be trained, the
data to be used and the training policy, which are task-specific. The rest,
the non-task-specific jobs, are accomplished by the corresponding components
and modules in NERV. When a researcher wants more control over the training
process, she may write a custom trainer that inherits from the general trainer
or even write her own trainer, and that is the whole story. When she needs
even more control, or tries to hack an exotic model that temporarily does not
fit into the current pipeline, she may build her own script using lower-level
components. In short, the hierarchical design gives people choices and is
expected to satisfy different user requirements. The following subsections
briefly introduce the major categories of components in NERV.

Matrix
******

This part corresponds to the brown boxes in the figure. Matrices are the basic
data containers for computation. Although matrices are not assumed to be the
only type of data container in NERV, the only currently available subclass of
``nerv.Param`` is ``nerv.MatrixParam`` (other classes such as
``nerv.BiasParam`` are currently aliases of ``nerv.MatrixParam``), meaning
that parameters stored as matrices are sufficient for most tasks. The base
class for all matrices is ``nerv.Matrix``, which defines the common behavior
of a matrix regardless of its value type (float or double) or storage location
(on host or on device). Inheriting from it, there are two major groups of
matrix types: ``nerv.MMatrix`` and ``nerv.CuMatrix``. As the names suggest,
``nerv.MMatrix`` defines matrices in main memory (on host), whereas
``nerv.CuMatrix`` defines matrices in GPU memory (on device) using CUDA as the
programming interface (there may be a class like ``nerv.CLMatrix`` in the
future to support other GPU programming interfaces). For each storage
category, there are subclasses implementing matrices with different value
types:

- ``nerv.MMatrixFloat``
- ``nerv.MMatrixDouble``
- ``nerv.CuMatrixFloat``
- ``nerv.CuMatrixDouble``
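
For illustration, the following minimal sketch shows how these classes might
be instantiated. The constructor form ``nerv.MMatrixFloat(nrow, ncol)`` and
the ``nrow``/``ncol`` accessors are assumptions made for this sketch and may
not match the actual interface exactly.

.. code-block:: lua

   -- a minimal sketch; constructor and method names are assumed here,
   -- not quoted verbatim from the NERV sources
   local mh = nerv.MMatrixFloat(64, 429)   -- single precision, in main memory
   local md = nerv.CuMatrixFloat(64, 429)  -- single precision, in GPU memory

   -- both inherit the common interface defined by nerv.Matrix,
   -- e.g. querying the shape
   print(mh:nrow(), mh:ncol())
   print(md:nrow(), md:ncol())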

Layer
*****

A layer (``nerv.Layer``) in NERV conceptually represents a computation node
that declaratively defines the computation needed to produce the output from
the input. This means a layer itself is "oblivious" in the sense that its
computation is time-invariant when the hyper-parameters are fixed (some layers
maintain auditing information, but this does not change the output). This
concept is important for understanding the relationship between layers and
parameters in NERV, and it differs from other toolkits. In some toolkits, a
layer or computation node stands for the computation logic together with the
contained parameters. This coupled design causes inconvenience in network
model management when users want to manipulate or duplicate the parameters,
and it also makes parameter tying (different layers sharing the same group of
parameters) difficult to implement. In NERV, layers are abstract concepts that
define the *structure* of a neural network, whereas parameters define its
*state*.

Layers fall into two categories: *graph layers* and *primitive layers*. The
former are a special kind of layer that acts as a container for a set of
sub-level layers, specifying both the temporal and structural connections
between these sub-level layers in the configuration of the containing graph
layer. The latter, primitive layers, define basic computation without any
sub-level structure. Graph layers can be nested to form a complex hierarchy,
but the innermost contained layers must be primitive.
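
To make the distinction concrete, the following plain-Lua sketch shows how
such a nested structure could be described declaratively. The table layout and
the names (``affine0``, ``sigmoid0``, ``sub``, ``connections``) are
hypothetical and are not NERV's actual layer configuration syntax; the point
is only that a graph layer names its sub-level layers and their connections,
while primitive layers carry no sub-level structure.

.. code-block:: lua

   -- hypothetical declarative description of a graph layer
   -- (not NERV's actual configuration syntax)
   local graph_layer = {
       -- sub-level layers contained in this graph layer
       sub = {
           affine0  = {type = "AffineLayer",  dim_in = 429,  dim_out = 2048},
           sigmoid0 = {type = "SigmoidLayer", dim_in = 2048, dim_out = 2048},
       },
       -- structural (and possibly temporal) connections between them
       connections = {
           {"<input>[1]",  "affine0[1]"},
           {"affine0[1]",  "sigmoid0[1]"},
           {"sigmoid0[1]", "<output>[1]"},
       },
   }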

Network
*******

Layers alone only declare the structure and behavior of the intended network.
To turn a layer into a trainable model, we need to *compile* the layer into a
*network* instance (``nerv.Network``). Compilation flattens out the nested
graph layers according to the specified structure and allocates the space for
the intermediate activations between the layers. Although the internal
representation of a network is flattened and may lose the nested structure, it
is designed to be opaque to users: users should not concern themselves with
the compiled network, where many technical tricks are applied, but should
access the state of computation or change the hyper-parameters via the layer
abstraction (the instances of subclasses of ``nerv.Layer``), where the
structure is preserved.
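
The flattening step itself can be pictured with a short generic sketch. This
is illustrative plain Lua operating on the hypothetical ``graph_layer`` table
from the previous section, not the actual ``nerv.Network`` implementation:
compilation walks the nested graph layers recursively and collects the
primitive layers into a flat list, which is what the network actually
executes.

.. code-block:: lua

   -- illustrative sketch of the flattening idea (not the real nerv.Network
   -- code): expand graph layers until only primitive layers remain
   local function flatten(layer, acc)
       acc = acc or {}
       if layer.sub ~= nil then                -- a graph layer: recurse into it
           for _, sublayer in pairs(layer.sub) do
               flatten(sublayer, acc)
           end
       else                                    -- a primitive layer: keep it
           acc[#acc + 1] = layer
       end
       return acc
   end

   local primitives = flatten(graph_layer)     -- graph_layer as sketched above
   print(#primitives)                          -- 2 primitive layers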

Parameter
*********

Parameters (``nerv.Param``) represent the state of layers (``nerv.Layer``) in
NERV. They are time-variant during training because of updates. They can be
read from and written to files (in the NERV ``nerv.ChunkFile`` format). Take a
fully-connected linear layer in a neural network as an example: the layer
defines the calculation :math:`y = W x + b`, where :math:`x` is the input,
whereas the parameters :math:`W` and :math:`b` provide the state of the layer.
The association between layers and parameters is called *binding*. For a layer
computing :math:`y_1 = W x_1 + b`, when it is bound to parameters :math:`W_1`
and :math:`b_1`, the calculation becomes :math:`y_1 = W_1 x_1 + b_1`. However,
when the network needs to roll back to a previous state, the layer may bind to
parameters :math:`W_0` and :math:`b_0`, and the calculation becomes
:math:`y_1 = W_0 x_1 + b_0`. The concept of binding decouples the layer and
its corresponding parameters in a clear way.
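
The following plain-Lua sketch (not NERV's actual API) illustrates the idea
with scalar "parameters": the layer only knows how to compute
:math:`y = W x + b`, while the concrete :math:`W` and :math:`b` it uses are
decided by whichever parameter set is currently bound to it, so rolling back
or tying parameters is just a matter of rebinding.

.. code-block:: lua

   -- illustrative sketch of binding with scalar "parameters"
   -- (not NERV's actual interface)
   local layer = {
       bound = nil,                             -- currently bound parameters
       propagate = function (self, x)
           return self.bound.W * x + self.bound.b   -- y = W x + b
       end,
   }

   local params1 = {W = 2.0, b = 0.5}           -- state after some updates
   local params0 = {W = 1.0, b = 0.0}           -- an earlier, saved state

   layer.bound = params1
   print(layer:propagate(3.0))                  -- uses W_1, b_1: prints 6.5

   layer.bound = params0                        -- roll back by rebinding
   print(layer:propagate(3.0))                  -- uses W_0, b_0: prints 3.0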

Buffer
******

Buffers (``nerv.DataBuffer``), as the name suggests, connect I/O ends that
work at different speeds (granularities). Buffers in NERV accept
variable-length sequences of samples (frames) from readers
(``nerv.DataReader``) and produce a regularized sequence of data to feed as
the input to a network (``nerv.Network``). This process is often called
"mini-batching", which roughly means stacking the samples together and cutting
them into mini-batches.
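
A rough plain-Lua sketch of this idea (not the actual ``nerv.DataBuffer``
implementation) is shown below: frames from utterances of different lengths
are pooled and then emitted in fixed-size mini-batches.

.. code-block:: lua

   -- illustrative sketch of mini-batching (not the real nerv.DataBuffer):
   -- pool frames from variable-length utterances, then cut fixed-size batches
   local batch_size = 4
   local utterances = {
       {"f1", "f2", "f3", "f4", "f5"},   -- 5 frames
       {"g1", "g2", "g3"},               -- 3 frames
       {"h1", "h2", "h3", "h4"},         -- 4 frames
   }

   local pool = {}
   for _, utt in ipairs(utterances) do
       for _, frame in ipairs(utt) do
           pool[#pool + 1] = frame
       end
   end

   for i = 1, #pool - batch_size + 1, batch_size do
       local batch = {}
       for j = i, i + batch_size - 1 do
           batch[#batch + 1] = pool[j]
       end
       print(table.concat(batch, " "))   -- one regularized mini-batch
   end
   -- a real buffer would also handle any leftover frames at the end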

Scheduler (Trainer)
*******************

Schedulers refer to the top-level scripts that implement the main training
loop and drive the training process. A general-purpose scheduler typically
takes a Lua script written by an end user that contains a description of the
network, the task-specific processing for reading data and some
hyper-parameters. According to this metadata, it instantiates the major
components that make up the training pipeline and builds up the pipeline by
connecting them through their interfaces. Schedulers are user-friendly and
intended to be the main tool used by end users. The end user can also change
the behavior of the scheduler by overriding functions in the Lua script it
reads. Experienced users can also write their own schedulers directly to train
exotic neural networks that temporarily do not fit into the current pipeline.
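
To give a feel for what such a user-supplied script might look like, here is a
hypothetical sketch. The hook names (``make_layer_repo``, ``make_buffer``) and
the hyper-parameter fields in ``gconf`` are illustrative assumptions rather
than the exact interface a NERV scheduler expects; the point is that the
script declares *what* to train (network, data, hyper-parameters) while the
scheduler decides *how* to run the training loop.

.. code-block:: lua

   -- hypothetical user script read by a scheduler (hook names and gconf
   -- fields are illustrative assumptions, not NERV's exact interface)
   gconf = {
       lrate      = 0.8,     -- learning rate
       momentum   = 0.9,
       batch_size = 256,
       max_iter   = 20,      -- number of training epochs
   }

   -- declare the network structure (task-specific)
   function make_layer_repo(param_repo)
       -- build and return the layers that form the model
   end

   -- declare how the training data is read (task-specific)
   function make_buffer(reader)
       -- wrap a task-specific reader into a data buffer
   end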