Overview
========

What is NERV?
-------------

NERV is a general-purpose deep learning toolkit designed to be simple, lightweight, extensible and complete. The name "NERV" comes from the German word "Nerv", which means "nerve" in English. NERV is also a fictional organization in *Neon Genesis Evangelion*, a famous anime series.

Introduction and Philosophy
---------------------------

In recent years, the invention and prevalence of deep neural networks (DNNs) and related deep learning techniques have given rise to many tools and toolkits designed for constructing and training neural networks, which facilitate and routinize the research cycle of deep learning applied to areas such as speech processing, computer vision, natural language processing and so on. Such tools or toolkits fall into two categories by design: task-specific and general-purpose.

The first category tries to address deep learning in a direct way. These tools usually have a very specific goal: to support a certain type of neural network model and to provide peripheral facilities dedicated to one specific task, such as image classification or phone state prediction. Not only is the network model hard-coded into the very core of the tool, but modifications and tricks specific to a certain task are also wired into the code. The effect of this approach is two-fold. On the one hand, these tools form a complete and tightly designed system that provides a simple user interface, so researchers who are familiar with that specific area can use them easily. Also, because the network is hard-coded and task-specific, the implementation can be simple and the optimization effective, which yields very efficient running speed (such as CURRENNT). On the other hand, this usability and efficiency come at the cost of reusability and flexibility. People need to hack the code and make modifications to adapt the tool to another network structure, and it is very difficult and tricky to use a tool designed for one specific area (for example, image classification) in another area (like speech recognition); engineering details and implementation tricks deter people from doing so. Caffe, which is designed for computer vision, has a comprehensive variety of tools for processing images and training convolutional neural networks (CNNs), but it cannot be directly applied to speech processing tasks. Luckily, there is Kaldi, a counterpart in speech processing that can process wave-form files, extract acoustic features, and train GMM-HMM models, fully-connected DNN models, LSTM models and so on.

The second category strives to be general-purpose. As mentioned above, there are two types of generality:

- generality among different network structures
- generality among different tasks

The advantages of such a general approach are obvious: we can train different network structures as long as the required basic computation units are provided by the toolkit. Besides, general-purpose toolkits usually have a unified interface for data input and output (I/O), which is the most task-specific part, so the task-specific implementation details are confined to separate I/O modules as much as possible. This design concept is also present in operating systems, where device-specific code is modularized and isolated from the core, and several abstraction layers are often used.
Therefore, by this approach, general-purpose toolkits can potentially be adapted to various kinds of tasks without hacking or changing the core code. Admittedly, this generality is accomplished at the cost of implementation simplicity and efficiency. However, the overhead brought by abstraction and a relatively complex design is not a huge problem given the importance of those benefits. Over recent years, many new network structures have been proposed, examined, and applied to various kinds of tasks. Given this trend and the difficulty of hacking or modifying task-specific tools, the benefits of generality outweigh the concerns.

There are some well-known and successful general-purpose deep learning toolkits, each with its own strengths and weaknesses. Theano is a numerical computation library for Python. It supports mainstream neural network structures such as fully-connected DNNs, CNNs, recurrent neural networks (RNNs) and RNN variants like the long short-term memory (LSTM) network. It has a short learning curve once the user understands symbolic computation, and it appears friendly to new users. However, some concepts (such as ``scan``) may not be very friendly to users, the compilation time of a network can be very long for complex models, the Python-based environment implies the toolkit cannot be lightweight, and the large runtime footprint makes it hard to port the toolkit to embedded environments (where resources are very limited).

Torch is a strong competitor. It is an open source machine learning library whose initial release dates back to 2002. Its design goals are similar to Theano's: to provide a "MATLAB-like" low-level computation library that comes with separate functional blocks which users can combine to build and train their own networks. Torch has a steeper learning curve than Theano, but it is more lightweight than the latter, using Lua as the scripting language and implementing only the time-consuming operations in C. However, Torch is not perfect. Its limitation comes with its advantage: there are few general training or network-building patterns inside the toolkit. Torch plays a role more like MATLAB, so users need to write their own code to put everything together: data I/O, mini-batching, training scheduling and so on. Many of these are not trivial tasks, yet they are re-implemented by each user in their own scripts. This can lead to a situation where each user maintains her own code base, and on this level, it degrades Torch to a task-specific tool. Imagine a user who wants to build and train a state-of-the-art LSTM model for acoustic modeling, reading from pre-existing feature files extracted by a popular speech processing framework like HTK or Kaldi. She has to implement the data I/O all on her own. Moreover, she has to implement mini-batching, network unrolling, back-propagation through time (BPTT), and so on, to deal with loops and to schedule the training. What's worse, when another user wants to train a different model for the same task, or the same model for a different task, he has two choices: write his own training script or copy the script from the previous person. Neither choice is ideal, because it turns scripting into reinventing the wheel or hacking someone else's code, which goes against the goal of a general-purpose toolkit.
In fact, Torch seemingly moves towards a more distributed and isolated development style. This can ease project management, but it also implies less collaboration, because people no longer work on shared code and instead tend to write their own scripts with duplicate functionality, so there is less and less common code base among users' scripts.

CNTK, developed by Microsoft Research, is an attempt to address these problems. It is a complete solution that has the advantages of task-specific tools but also supports various kinds of neural networks by introducing a scripting language called NDL (it is now moving to a new language, BrainScript). It can be adapted to different tasks, from speech processing to computer vision. Nevertheless, CNTK does not satisfy all researchers. Due to its highly coupled design, which is the opposite of the choice made by Torch, researchers find it extremely hard to add new functionality. Unfortunately, this need always exists as brand-new network structures or components are proposed, so extensibility is a real problem for CNTK. Also, not being a general-purpose programming language, the custom scripting language has limited expressive power; to increase that power, more syntax or semantics must be introduced, which makes the learning curve even steeper.

Finally, we propose NERV, a simple but effective deep learning toolkit that tries to address the issues above. It is designed with the following characteristics in mind:

- simplicity: the learning curve is not steep and the code is straightforward
- extensibility: users can quickly become developers and add missing modules or tailor the toolkit to their needs; the major building blocks in NERV are modularized and their interfaces are standardized, so users can plug in their own implementations and even use modules implemented by others
- lightweight: NERV strives to keep its core code base and dependencies minimal, which makes it fairly easy to embed into other task-specific tools, such as Kaldi (in speech processing)
- completeness: at the same time, NERV provides a complete pipeline for training and managing a neural network

Abstraction
-----------

NERV has a decoupled design, like Torch. It also uses Lua as its main scripting language. Lua is extensively used in the video game industry because of its short learning curve and the tiny footprint of its interpreter/compiler, and LuaJIT is a just-in-time compiler for Lua that provides very high efficiency. Therefore, our design strives to confine the performance-demanding parts (such as matrix operations and low-level I/O) to pure C/C++ modules, and by giving them Lua interfaces, the rest of the project stands fully upon pure Lua, so that users do not have to code in C/C++.

To achieve the goal of being decoupled but complete, NERV uses a bottom-up style of development and has several levels of abstraction. The following figure depicts the abstraction from low (at the top of the figure) to high (at the bottom of the figure) level.

.. image:: _static/nerv-hierarchy.svg

As shown in the figure, the brown boxes are implemented in pure C. Parts of the classes inheriting from ``DataReader`` (such as ``HTKReader`` and ``KaldiReader``) and parts of ``ChunkFile`` (which defines the standard storage format) are also implemented in C/C++. Apart from these, everything else is written in pure Lua. This hierarchical design is the solution to the previously mentioned conflict between extensibility and completeness.
People who are merely interested in using the current functionalities do not have to write much code in order to have a running example. They only need to invoke the ``Scheduler`` and provide a Lua script containing declarative information about the model to be trained, the data to be used and the training policy, which are task-specific. The rest, the non-task-specific jobs, are accomplished by the corresponding components/modules in NERV. When a researcher wants more control over the training process, she may write a custom trainer that inherits from the general trainer, or even write her own trainer, and that is the whole story. When she needs even more control, or tries to hack an exotic model that temporarily does not fit into the current pipeline, she may build her own script using lower-level components. In short, the hierarchical design gives people choices and is expected to satisfy different user requirements. The following subsections briefly introduce the major categories of components in NERV.

Matrix
******

This part corresponds to the brown boxes in the figure. Matrices are the basic data containers for computation. Although matrices are not assumed to be the only type of data container in NERV, the currently available subclass of ``nerv.Param`` is ``nerv.MatrixParam`` (other classes such as ``nerv.BiasParam`` are currently aliases of ``nerv.MatrixParam``), meaning that parameters stored in matrices are sufficient for most tasks. The base class for all matrices is ``nerv.Matrix``, which defines the common behavior of a matrix regardless of its value type (float or double) or storage location (on host or on device). Inheriting from it, there are two major groups of matrix types: ``nerv.MMatrix`` and ``nerv.CuMatrix``. As the names suggest, ``nerv.MMatrix`` defines matrices in main memory (on the host), whereas ``nerv.CuMatrix`` defines those in GPU memory (on the device) using CUDA as the programming interface (there may be classes like ``nerv.CLMatrix`` in the future, to support other GPU programming interfaces). Then, for each storage category, there are subclasses implementing matrices with different value types; a brief usage sketch follows the list:

- ``nerv.MMatrixFloat``
- ``nerv.MMatrixDouble``
- ``nerv.CuMatrixFloat``
- ``nerv.CuMatrixDouble``
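To give a feel for how these classes fit together, here is a minimal sketch. The ``(rows, cols)`` constructor signature is an assumption for illustration and may differ from the actual API:

.. code-block:: lua

    -- Minimal sketch; assumes the nerv module is loaded and that matrix
    -- constructors take (rows, cols), which may differ from the actual API.
    local a = nerv.MMatrixFloat(2, 3)   -- single-precision matrix in host memory
    local b = nerv.CuMatrixFloat(2, 3)  -- single-precision matrix in GPU memory

    -- Both inherit the common interface defined by nerv.Matrix, so code
    -- written against the base class need not care where the values live.

Because host and device matrices share the interface of ``nerv.Matrix``, higher-level components can be written once and run against either storage location.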
Layer
*****

A layer (``nerv.Layer``) in NERV conceptually represents a computation node that declaratively defines the computation logic needed to produce the output from the input. This means a layer itself is "oblivious" in the sense that its computation is time-invariant when the hyper-parameters are fixed (except that some layers maintain auditing information, which does not change the output). This concept is important for understanding the relationship between layers and parameters in NERV, and it differs from some other toolkits. In those toolkits, a layer or a computation node stands for the computation logic together with the contained parameters. This coupled design causes inconvenience in network model management when users want to manipulate or duplicate the parameters, and it also makes parameter tying (different layers sharing the same group of parameters) difficult to implement. In NERV, layers are abstract concepts that define the *structure* of a neural network, whereas the parameters define the *state* of the neural network.

Layers fall into two categories: *graph layers* and *primitive layers*. The former are a special kind of layer that acts as a container for a set of sub-level layers, specifying both the temporal and structural connections between these sub-level layers in the layer configuration of the containing graph layer. The latter, primitive layers, are layers that define basic computation without any sub-level structure. Graph layers can be nested to form a complex hierarchy, but the innermost contained layers must be primitive.

Network
*******

Layers alone only declare the structure and behavior of the intended network. To turn a layer into a trainable model, we need to *compile* the layer into a *network* instance (``nerv.Network``). Compilation flattens out the nested graph layers according to the specified structure and allocates the space for the intermediate activations between the layers. Although the internal implementation of a network is flattened out and may lose the nested structure, it is designed to be opaque to users: users should not have to bother with the compiled network, where many technical tricks are applied, but should access the state of the computation, or change the hyper-parameters to alter the behavior of the network, via the layer abstraction (the instances of subclasses of ``nerv.Layer``), where the structure is preserved.

Parameter
*********

Parameters (``nerv.Param``) represent the state of layers (``nerv.Layer``) in NERV. They are time-variant during training because of the updates. They can be read from and written to files (in the NERV ``nerv.ChunkFile`` format). Take a fully-connected linear layer in a neural network as an example: the layer defines the calculation :math:`y = W x + b`, where :math:`x` is the input, whereas the parameters :math:`W` and :math:`b` provide the state of the layer.

The association between layers and parameters is called *binding*. For a layer like :math:`y_1 = W x_1 + b`, when it is bound to the parameters :math:`W_1` and :math:`b_1`, the calculation becomes :math:`y_1 = W_1 x_1 + b_1`. However, when a network needs to roll back to a previous state, the layer may instead be bound to the parameters :math:`W_0` and :math:`b_0`, so the calculation becomes :math:`y_1 = W_0 x_1 + b_0`. The concept of binding decouples the layer from the corresponding parameters in a clear way, as the sketch below illustrates.
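The following self-contained toy (plain Lua, not the actual NERV API; the names ``bind`` and ``propagate`` are hypothetical) illustrates why binding makes rollback trivial, using scalars to stand in for matrices:

.. code-block:: lua

    -- Toy illustration of binding (hypothetical names, not the NERV API):
    -- the layer holds only computation logic; the parameters are a separate
    -- object that can be attached (bound) and swapped at any time.
    local layer = {}

    function layer:bind(params)
        self.params = params        -- binding: associate a parameter state
    end

    function layer:propagate(x)
        -- y = W * x + b, with plain numbers standing in for matrices
        return self.params.W * x + self.params.b
    end

    local p0 = {W = 2, b = 1}       -- an earlier snapshot of the parameters
    local p1 = {W = 3, b = -1}      -- the parameters after an update

    layer:bind(p1)
    print(layer:propagate(5))       --> 14, i.e. 3 * 5 - 1
    layer:bind(p0)                  -- rolling back is just re-binding
    print(layer:propagate(5))       --> 11, i.e. 2 * 5 + 1

Because the layer never owns its parameters, duplicating a model or tying parameters across layers reduces to binding several layers to the same parameter objects.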
Buffer
******

Buffers (``nerv.DataBuffer``), as the name suggests, connect I/O ends with different speeds (granularities). Buffers in NERV accept variable-length sequences of samples (frames) from readers (``nerv.DataReader``) and produce a regularized sequence of data to feed as the input to a network (``nerv.Network``). This process is often called "mini-batching", which roughly means stacking the samples together and cutting them into mini-batches.

Scheduler (Trainer)
*******************

Schedulers are the top-level scripts that implement the main training loop and drive the training process. A general-purpose scheduler typically takes in a Lua script written by an end user that contains a description of the network, the task-specific processing for reading data and some hyper-parameters. Then, according to this metadata, it instantiates the major components that make up the training pipeline and builds up the pipeline by connecting them according to their interfaces. Schedulers are user-friendly and are intended to be the main tool used by end users. An end user can also change the behavior of the scheduler by overriding functions in the Lua script read by the scheduler (a hypothetical sketch of such a script is given below). Experienced users can directly write their own schedulers to train exotic neural networks that temporarily do not fit into the current pipeline.
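To make the division of labor concrete, here is a hypothetical sketch of the kind of declarative script an end user might hand to a general-purpose scheduler. Every field name and value in it is illustrative only, not the actual NERV schema:

.. code-block:: lua

    -- Hypothetical end-user script for a scheduler (illustrative field
    -- names only; consult the NERV documentation for the actual schema).
    return {
        -- declarative description of the model to be trained
        network = "models/my_dnn.lua",

        -- task-specific data readers, e.g. for features extracted by HTK/Kaldi
        train_reader = {type = "HTKReader", scp = "data/train.scp"},
        cv_reader    = {type = "HTKReader", scp = "data/cv.scp"},

        -- training policy and hyper-parameters
        batch_size    = 256,
        learning_rate = 0.1,
        max_epoch     = 20,
    }

The user supplies only this declarative, task-specific information; the scheduler instantiates the readers, buffers and network, connects them, and runs the training loop.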