Diffstat (limited to 'nerv/doc/source/overview.rst')
-rw-r--r--  nerv/doc/source/overview.rst  206
1 file changed, 189 insertions, 17 deletions
diff --git a/nerv/doc/source/overview.rst b/nerv/doc/source/overview.rst
index ccdd2ef..f28be89 100644
--- a/nerv/doc/source/overview.rst
+++ b/nerv/doc/source/overview.rst
@@ -5,13 +5,13 @@ What is NERV?
-------------
NERV is a general-purpose deep learning toolkit designed to be simple,
-lightweight and extensible. The name "NERV" comes from the German word "nerv"
-for "nerve" in English. It is also a fictional organization in the *Neon
-Genesis Evangelion*, a famous anime.
+lightweight, extensible and complete. The name "NERV" comes from the German
+word "Nerv", meaning "nerve" in English. It is also the name of a fictional
+organization in *Neon Genesis Evangelion*, a famous anime.
-Why NERV?
----------
+Introduction and Philosophy
+---------------------------
In recent years, the invention and prevalence of the deep neural network (DNN)
and related deep learning techniques have given rise to many tools and toolkits
@@ -26,8 +26,10 @@ of neural network model and provides with peripheral facilities dedicated to
one specific task, such as image classification or phone state prediction. Not
only is the network model hard-coded into the very core of the tool, but some
modifications or tricks that are specific to a certain task are also wired into
-the code. The effect of this approach is two-fold. On the one hand, they have a
-complete and tightly designed system that appears to provide with a simple user
+the code.
+
+The effect of this approach is two-fold. On the one hand, they have a complete
+and tightly designed system that appears to provide a simple user
interface. Thus, researchers who are familiar with that specific area can use
it easily. Also, because the network is hard-coded and task-specific, the
implementation could be easy and optimization could be effective, which offers
@@ -37,13 +39,14 @@ flexibility. People need to hack the code and make modifications to suit the
tool to a different neural network structure, and it is very difficult and
tricky to use a tool designed for one specific area (for example, image
classification) in another area (like speech recognition). Engineering details
-and implementation tricks refrain people from doing so. Caffee, which is
-designed for computer vision, has a comprehensive variety of tools for
-processing images and training convolution neural networks (CNN). But it cannot
-be directly applied to speech processing tasks. Luckily, there is Kaldi, a
-counterpart in speech processing that can process wave-form files, extract the
-acoustic features, train GMM-HMM models, fully-connected DNN models, LSTM
-models and so on.
+and implementation tricks deter people from doing so.
+
+Caffe, which is designed for computer vision, has a comprehensive variety of
+tools for processing images and training convolutional neural networks (CNNs).
+But it cannot be directly applied to speech processing tasks. Luckily, there is
+Kaldi, a counterpart in speech processing that can process waveform files,
+extract acoustic features, and train GMM-HMM models, fully-connected DNN
+models, LSTM models and so on.
The second category strives to be general-purpose. As mentioned above, there
are two types of generality:
@@ -56,7 +59,7 @@ by the toolkit. Besides, general-purpose toolkits usually have a unified
interface for data input and output (I/O), which is the most task-specific
part. Therefore, the task-specific implementation details are confined to
separate I/O modules as much as possible. This design concept is also present in
-operating systems where device-specific code are modularized and isolated from
+operating systems where device-specific code is modularized and isolated from
the core part and several abstraction layers are often used. With this
approach, general-purpose toolkits can potentially be adapted to various kinds
of tasks without hacking or changing the core code. Admittedly, the
@@ -116,12 +119,181 @@ because people no longer work together at all but tend to write their own
scripts with duplicated functionality. So there will be less and less common
code shared among users' scripts.
-- simplicity: the learning curve is not steep and code are straight forward
+CNTK, developed by Microsoft Research, is an attempt to address these problems.
+It is a complete solution that has the advantages of task-specific tools, but
+also supports various kinds of neural networks by introducing a scripting
+language called NDL (it is now moving to a new language, BrainScript). It can
+be adapted to different tasks, from speech processing to computer vision.
+Nevertheless, CNTK does not satisfy all researchers. Due to its highly coupled
+design, which is the opposite of the choice made by Torch, researchers find it
+extremely hard to add new functionalities. Unfortunately, this need always
+exists, as brand new network structures or components keep being proposed.
+Extensibility is a real problem for CNTK. Also, not being a general-purpose
+programming language, the custom scripting language has limited expressive
+power. To increase that power, more syntax or semantics need to be introduced,
+which makes the learning curve even steeper.
+
+Finally, we propose NERV, a simple but effective deep learning toolkit that
+tries to address the issues above. It is designed with the following
+characteristics in mind:
+
+- simplicity: the learning curve is not steep and the code is straightforward
+
- extensibility: users can quickly become developers and add the missing
  modules or tailor the toolkit to their needs; major building blocks in NERV
  are modularized and the interfaces are standardized so that users can plug in
  their own implementations and even use modules implemented by others
+
- lightweight: NERV strives to keep a minimal core code base and minimal
  dependencies, making it fairly easy to embed into other task-specific tools, such
- as Kaldi (in speech-processing).
+ as Kaldi (in speech-processing)
+
+- completeness: at the same time, NERV provides a complete pipeline for
+  training and managing a neural network
+
+Abstraction
+-----------
+
+NERV has a decoupled design, like Torch. It also uses Lua as its main scripting
+language. Lua is extensively used in the video game industry because of its
+short learning curve and the tiny footprint of its interpreter/compiler. LuaJIT
+is a just-in-time compiler for Lua that provides very high efficiency.
+Therefore, our design strives to confine the performance-demanding parts (such
+as matrix operations and low-level I/O) to pure C/C++ modules and to expose
+them through Lua interfaces, so that the rest of the project stands upon pure
+Lua and users do not have to code in C/C++.
+
+To achieve the goal of being decoupled but complete, NERV uses a bottom-up
+style of development and has several levels of abstraction. The following
+figure depicts these levels from low (at the top of the figure) to high (at
+the bottom).
+
+.. image:: _static/nerv-hierarchy.svg
+
+As shown in the figure, the brown boxes are implemented in pure C. Parts of the
+classes inheriting from ``DataReader`` (such as ``HTKReader`` and
+``KaldiReader``) and parts of ``ChunkFile`` (which defines the standard storage
+format) are also implemented in C/C++. Everything else is written in pure Lua.
+
+This hierarchical design is the solution to the previously mentioned conflict
+between extensibility and completeness. People who are merely interested in
+using the current functionalities do not have to write much code in order to
+have a running example. They only need to invoke the ``Scheduler`` and provide
+a Lua script containing declarative information about the model to be trained,
+the data to be used and the training policy, which are task-specific. The
+remaining non-task-specific jobs are accomplished by the corresponding
+components/modules in NERV. When a researcher wants more control over the
+training process, she may write a custom trainer that inherits from the
+general trainer or even write her own trainer, and that is the whole story.
+When she needs even more control or tries to hack an exotic model that
+temporarily does not fit into the current pipeline, she may build her own
+script using lower-level components. In short, the hierarchical design gives
+people choices and is expected to satisfy different user requirements. The
+following subsections briefly introduce the major categories of components in
+NERV.
+
+Matrix
+******
+
+This part corresponds to the brown boxes in the figure. Matrices are the basic
+data containers for computation. Although in NERV matrices are not assumed to
+be the only type of data container, the only currently available subclass of
+``nerv.Param`` is ``nerv.MatrixParam`` (other classes such as
+``nerv.BiasParam`` are currently aliases of ``nerv.MatrixParam``), meaning
+that parameters stored as matrices are sufficient for most tasks. The base
+class for all matrices is ``nerv.Matrix``, which defines the common behavior
+of a matrix regardless of its value type (float or double) or storage location
+(on host or on device). Inheriting from it, there are two major groups of
+matrix types: ``nerv.MMatrix`` and ``nerv.CuMatrix``. As the names suggest,
+``nerv.MMatrix`` defines matrices in main memory (on host), whereas
+``nerv.CuMatrix`` defines those in GPU memory (on device) using CUDA as the
+programming interface (there may be classes like ``nerv.CLMatrix`` in the
+future to support other GPU programming interfaces). For each storage
+category, there are subclasses implementing matrices with different value
+types:
+- ``nerv.MMatrixFloat``
+- ``nerv.MMatrixDouble``
+- ``nerv.CuMatrixFloat``
+- ``nerv.CuMatrixDouble``
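+
+To make the hierarchy concrete, the following is a minimal, illustrative
+sketch of creating a host matrix and copying it to the GPU. The constructors,
+the 0-based element access and the ``copy_fromh`` helper are assumptions for
+illustration and may differ from the actual NERV API.
+
+.. code-block:: lua
+
+   -- Illustrative sketch only: the constructors, 0-based indexing and the
+   -- copy_fromh helper are assumptions and may differ from the real API.
+   local h = nerv.MMatrixFloat(2, 3)     -- 2x3 single-precision host matrix
+   for i = 0, 1 do
+       for j = 0, 2 do
+           h[i][j] = i * 3 + j           -- fill with some values
+       end
+   end
+
+   local d = nerv.CuMatrixFloat(2, 3)    -- same shape, in GPU memory
+   d:copy_fromh(h)                       -- assumed host-to-device copy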
+
+Layer
+*****
+
+A layer (``nerv.Layer``) in NERV conceptually represents a computation node
+that declaratively defines the computation logic needed to produce the output
+from the input. This means a layer itself is "oblivious" in the sense that its
+computation is time-invariant when the hyper-parameters are fixed (except that
+some layers maintain auditing information, which does not change the behavior
+of the output). This concept is important for understanding the relationship
+between layers and parameters in NERV, and it differs from the design of some
+other toolkits. In some toolkits, a layer or a computation node stands for the
+computation logic together with the contained parameters. This coupled design
+causes inconvenience in network model management when users want to manipulate
+or duplicate the parameters, and also makes parameter tying (different layers
+sharing the same group of parameters) difficult to implement. In NERV, layers
+are abstract concepts that define the *structure* of a neural network, whereas
+the parameters define its *state*.
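+
+The decoupling can be pictured with the following toy sketch in plain Lua. It
+is not NERV's actual API; it only shows how two layer descriptions can be
+bound to the same parameter object (parameter tying), so that updating that
+single object affects both layers.
+
+.. code-block:: lua
+
+   -- Toy illustration of layer/parameter decoupling and parameter tying;
+   -- plain Lua only, NOT the NERV API.
+   local params = {
+       shared_w = {0.1, 0.2, 0.3},   -- one parameter object (the *state*)...
+       b1 = {0.0},
+       b2 = {0.5},
+   }
+
+   -- ...referenced by two different layer descriptions (the *structure*)
+   local layer1 = {type = "affine", w = "shared_w", b = "b1"}
+   local layer2 = {type = "affine", w = "shared_w", b = "b2"}
+
+   -- resolving the binding yields the same underlying table for both layers
+   local function bind(layer)
+       return params[layer.w], params[layer.b]
+   end
+
+   local w1 = bind(layer1)
+   local w2 = bind(layer2)
+   assert(w1 == w2)   -- tied: both layers share the same parameter state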
+
+Layers fall into two categories: *graph layers* and *primitive layers*. A
+graph layer is a special kind of layer that acts as a container for a set of
+sub-level layers, specifying both the temporal and structural connections
+between these sub-level layers in the configuration of the containing graph
+layer. Primitive layers, in contrast, define basic computation without any
+sub-level structure. Graph layers can be nested to form a complex hierarchy,
+but the innermost contained layers must be primitive.
+
+Network
+*******
+
+Layers alone only declare the structure and behavior of the intended network.
+To turn a layer into a trainable model, we need to *compile* the layer into a
+*network* instance (``nerv.Network``). Compilation flattens out the nested
+graph layers according to the specified structure and allocates space for the
+intermediate activations between layers. Although the internal representation
+of a network is flattened and may lose the nested structure, it is designed to
+be opaque to users: users should not bother much with the compiled network,
+where many technical tricks are applied, but should access the state of the
+computation or change the hyper-parameters via the layer abstraction (the
+instances of subclasses of ``nerv.Layer``), where the structure is preserved.
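+
+As a hypothetical sketch of what compilation might look like (the constructor
+arguments and the ``init`` call below are assumptions for illustration, not
+the confirmed NERV API):
+
+.. code-block:: lua
+
+   -- Hypothetical sketch; constructor arguments and method names are
+   -- assumptions for illustration, not the confirmed NERV API.
+   local gconf = {use_cpu = false, batch_size = 256}
+
+   -- `top_layer` is assumed to be a (possibly nested) graph layer built earlier
+   local net = nerv.Network("main_net", gconf, {network = top_layer})
+
+   -- flatten the nested graph layers and allocate space for the intermediate
+   -- activations, given the batch size and the chunk (unrolled time) length
+   net:init(gconf.batch_size, 1)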
+
+Parameter
+*********
+
+Parameters (``nerv.Param``) represent the state of layers (``nerv.Layer``) in
+NERV. They are time-variant during training because of the updates. They can be
+read from and written to files (in the NERV ``nerv.ChunkFile`` format). Take a
+fully-connected linear layer in a neural network as an example: the layer
+defines the calculation :math:`y = W x + b`, where :math:`x` is the input,
+whereas the parameters :math:`W` and :math:`b` provide the state of the layer.
+The association between layers and parameters is called *binding*. For a layer
+like :math:`y_1 = W x_1 + b`, when it is bound to parameters :math:`W_1` and
+:math:`b_1`, the calculation becomes :math:`y_1 = W_1 x_1 + b_1`. However, when
+a network needs to roll back to a previous state, the layer may instead bind to
+parameters :math:`W_0` and :math:`b_0`, so the calculation becomes
+:math:`y_1 = W_0 x_1 + b_0`. The concept of binding decouples the layer and the
+corresponding parameters in a clear way.
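+
+As a hypothetical sketch of parameter I/O and rebinding (the ``nerv.ChunkFile``
+method names and arguments below are assumptions for illustration, not the
+confirmed API):
+
+.. code-block:: lua
+
+   -- Hypothetical sketch; the ChunkFile method names and arguments are
+   -- assumptions for illustration, not the confirmed NERV API.
+   local gconf = {}
+
+   -- write the current parameters (e.g. W_1 and b_1) to a chunk file;
+   -- `W` and `b` are assumed to be nerv.MatrixParam instances
+   local fout = nerv.ChunkFile("params_epoch1.nerv", "w")
+   fout:write_chunk(W)
+   fout:write_chunk(b)
+   fout:close()
+
+   -- to roll back, read a previous snapshot (W_0, b_0) and rebind it to the
+   -- layer, so the calculation becomes y_1 = W_0 x_1 + b_0
+   local fin = nerv.ChunkFile("params_epoch0.nerv", "r")
+   local W0 = fin:read_chunk("W", gconf)
+   local b0 = fin:read_chunk("b", gconf)
+   fin:close()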
+
+Buffer
+******
+
+Buffers (``nerv.DataBuffer``), as the name suggests, connect I/O ends that
+operate at different speeds (granularities). Buffers in NERV accept
+variable-length sequences of samples (frames) from readers
+(``nerv.DataReader``) and produce a regularized sequence of data to feed as the
+input to a network (``nerv.Network``). This process is often called
+"mini-batching", which roughly means stacking the samples together and cutting
+them into mini-batches.
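+
+The idea of mini-batching can be illustrated with a toy sketch in plain Lua
+(this is not the NERV buffer API, only the concept): frames arrive from
+readers in chunks of varying length, and the buffer emits fixed-size
+mini-batches.
+
+.. code-block:: lua
+
+   -- Toy illustration of mini-batching in plain Lua; NOT the NERV buffer API.
+   local batch_size = 4
+   local pending = {}             -- frames accepted but not yet emitted
+
+   local function emit(batch)
+       print("mini-batch: " .. table.concat(batch, " "))
+   end
+
+   local function push_frames(frames)
+       for _, f in ipairs(frames) do
+           pending[#pending + 1] = f
+           if #pending == batch_size then
+               emit(pending)      -- a regularized batch ready for the network
+               pending = {}
+           end
+       end
+   end
+
+   -- readers may deliver chunks of different lengths
+   push_frames({1, 2, 3})         -- nothing emitted yet
+   push_frames({4, 5, 6, 7, 8})   -- emits "1 2 3 4", then "5 6 7 8"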
+
+Scheduler (Trainer)
+*******************
+
+Schedulers refer to the top-level scripts that implement the main training
+loop and drive the training process. A general-purpose scheduler typically
+takes in a Lua script written by an end user that contains a description of
+the network, task-specific processing for reading data and some
+hyper-parameters. Then, according to this metadata, it instantiates the major
+components that make up the training pipeline and builds up the pipeline by
+connecting them according to their interfaces. Schedulers are user-friendly
+and intended to be the main tool used by the end user. The end user can also
+change the behavior of the scheduler by overriding functions in the Lua script
+read by the scheduler. Experienced users can also directly write their own
+schedulers to train exotic neural networks that temporarily do not fit into
+the current pipeline.
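+
+To make this concrete, the following is a minimal, hypothetical sketch of such
+a user script. Every field and function name here is an assumption for
+illustration; it is not NERV's confirmed configuration schema.
+
+.. code-block:: lua
+
+   -- Hypothetical user script read by a scheduler; all field and function
+   -- names are assumptions for illustration, not the confirmed NERV schema.
+   gconf = {
+       lrate      = 0.8,       -- hyper-parameters
+       batch_size = 256,
+       momentum   = 0.9,
+   }
+
+   -- declarative description of the model to be trained
+   model = {
+       param_file  = "init_params.nerv",
+       layers_file = "network.lua",
+   }
+
+   -- task-specific hook: how the training data should be read
+   function make_readers()
+       return { {scp = "train.scp", frm_ext = 5} }
+   end
+
+   -- optional override of the scheduler's default behavior, e.g. halving
+   -- the learning rate after the tenth epoch
+   function adjust_lrate(epoch)
+       if epoch > 10 then gconf.lrate = gconf.lrate / 2 end
+   end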