You can get the 'theory' by simply searching for it online.
Basically it is a way to preprocess certain types of data (like largish pixel images) by repeatedly running a 'small' feature-filter NN in parallel locally (with overlaps) across the whole image.
Each local area of the regular grid (image) is processed to extract/integrate generic patterns/trends (like line/boundary detection or spotting a solid blob) from the basic data. Further layers then (in parallel) integrate those first-order results, detecting larger patterns/trends (like spotting a 'corner'). Later layers then look for the super-patterns which classify the picture.
The advantage is that the lower 'detail' filter NNs are fairly small (some as small as 5x5 local groupings) and can be well trained for their task. They can be run in a massively parallel manner (you apply that layer's same filter in an array-scanning fashion) and integrate/collapse each next layer's input data until the final classification stage (several layers itself), which detects combinations of the macro patterns.
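To make the 'array scanning' idea concrete, here is a minimal sketch of one filter pass in plain Python: a small kernel slid across an image, computing a weighted sum at each position. The image and kernel values are hypothetical toy examples, not from any real network.

```python
def convolve2d(image, kernel):
    """Apply the kernel at every valid position (no padding, stride 1)."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for y in range(ih - kh + 1):
        row = []
        for x in range(iw - kw + 1):
            # Weighted sum of the local area under the kernel
            s = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    s += image[y + ky][x + kx] * kernel[ky][kx]
            row.append(s)
        out.append(row)
    return out

# A tiny vertical-edge detector applied to a 4x4 image whose right
# half is bright: the response is strong along the boundary column.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 1],
    [-1, 1],
]
print(convolve2d(image, kernel))  # peaks in the middle column
```

The same shared kernel is reused at every position, which is what makes the per-layer parameter count so small.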
A 'divide and conquer' solution that eliminates/minimizes a lot of the NxN input weights (in the lower layers) that such large data input arrays would require if done monolithically.
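The weight savings are easy to see with some back-of-the-envelope arithmetic (the layer sizes below are hypothetical, just for illustration):

```python
# Compare a fully connected (monolithic) layer against one shared
# convolutional filter, for a 32x32 grayscale image feeding a
# 1024-unit layer. Sizes are illustrative only.
pixels = 32 * 32
dense_weights = pixels * 1024   # every pixel wired to every unit
conv_weights = 5 * 5            # one shared 5x5 filter, reused everywhere
print(dense_weights, conv_weights)
```

Over a million weights versus 25 per filter; even with dozens of filters per layer, the convolutional approach stays several orders of magnitude smaller.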
40+ years ago, anatomical research showed that the retina of the eye performs operations like this (the low-level feature detection).