Published by Microsoft Research Sep 18, 2017
How NNs on small devices differ: NNs in gadgets compared to NNs in datacenters
- Often safety-critical (smartphones excepted) vs rarely safety-critical in datacenters
- Low-power is required vs nice-to-have
- Real-time is required vs preferable
Desirable properties of NNs on gadgets:
- Sufficiently high accuracy
- low computational complexity
- low energy usage
- small model size
Advantages of small models:
- Fewer parameters mean bigger opportunities for scaling training: 145x speedup on 256 GPUs with FireCaffe (CVPR 2016), 47x speedup for GoogLeNet
- Enables complete on-chip integration of the CNN model and its weights (no off-chip memory), which dramatically reduces inference energy and enables up-close/personal data gathering and tight integration with the sensor
- Enables continuous wireless updates of models if retraining is required
Seven ways to squeeze:
1. Replace fully-connected (FC) layers with convolutional layers (see the first sketch after this list)
2. Kernel reduction: shrink the height x width of filters, e.g. 3x3 -> 1x1
3. Channel reduction: reduce the number of filters and channels (ideas 2 and 3 are combined in the Fire-module sketch below)
4. Evenly spaced downsampling: choose where the strided/pooling layers go; early downsampling shrinks activations and saves compute, late downsampling preserves accuracy, and evenly spaced (gradual) downsampling trades off between the two
5. Depthwise separable convolutions: each spatial filter is applied to only a subset of the channels (in the depthwise limit, one channel per filter), followed by 1x1 convolutions that mix the channels back together (sketch below)
6. Shuffle layer: combine idea 2 (1x1 kernels) and idea 5 (grouped convolutions), then permute the channels so that channels from different groups get to talk to each other for the first time (see the channel-shuffle sketch below)
7. Distillation & compression: see the Deep Compression paper (pruning + trained quantization + Huffman coding); many ways to do it. A pruning sketch follows.
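
A minimal PyTorch sketch of idea 1: replacing an FC classifier head with a 1x1 convolution plus global average pooling (the SqueezeNet-style head). The 512x7x7 feature size and 1000 classes are illustrative assumptions, not numbers from the talk.

```python
import torch
import torch.nn as nn

num_classes = 1000

# FC head: flatten 512x7x7 features into a dense layer
# -> 512*7*7*1000 ~= 25M weights.
fc_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(512 * 7 * 7, num_classes),
)

# Conv head: 1x1 conv maps 512 channels -> num_classes (~0.5M weights),
# then average each class map over all spatial positions.
conv_head = nn.Sequential(
    nn.Conv2d(512, num_classes, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x = torch.randn(1, 512, 7, 7)
print(fc_head(x).shape, conv_head(x).shape)  # both: torch.Size([1, 1000])
```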
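Ideas 2 and 3 appear together in SqueezeNet's Fire module: a 1x1 "squeeze" layer cuts the channel count before the expensive 3x3 filters, and most of the expand filters are 1x1 rather than 3x3. A minimal sketch, with layer widths borrowed from SqueezeNet's fire2 module for illustration:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    def __init__(self, in_ch, squeeze_ch, expand1x1_ch, expand3x3_ch):
        super().__init__()
        # Channel reduction (idea 3): 1x1 conv shrinks the channel count.
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        # Kernel reduction (idea 2): most expand filters are 1x1, not 3x3.
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand1x1_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        # Expand with a mix of cheap 1x1 and a few 3x3 filters, then concat.
        return torch.cat([
            self.relu(self.expand1x1(x)),
            self.relu(self.expand3x3(x)),
        ], dim=1)

fire = Fire(in_ch=96, squeeze_ch=16, expand1x1_ch=64, expand3x3_ch=64)
print(fire(torch.randn(1, 96, 56, 56)).shape)  # torch.Size([1, 128, 56, 56])
```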
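A minimal sketch of idea 5: factoring a dense KxK convolution into a per-channel (depthwise, groups=in_ch) KxK convolution plus a 1x1 pointwise convolution that mixes channels. Channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, k=3):
    return nn.Sequential(
        # Spatial filtering, one filter per channel.
        nn.Conv2d(in_ch, in_ch, kernel_size=k, padding=k // 2, groups=in_ch),
        # Pointwise 1x1 conv recombines the channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1),
    )

# Dense 3x3, 256 -> 256 channels: 256*256*9 ~= 590K weights.
# Separable version: 256*9 + 256*256 ~= 68K weights.
block = depthwise_separable(256, 256)
print(block(torch.randn(1, 256, 14, 14)).shape)  # torch.Size([1, 256, 14, 14])
```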
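A minimal sketch of the shuffle layer of idea 6, as in ShuffleNet: reshape the channels into (groups, channels_per_group), transpose, and flatten, so the next grouped convolution sees channels from every group.

```python
import torch

def channel_shuffle(x, groups):
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)

x = torch.arange(8.0).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten())  # 0, 4, 1, 5, 2, 6, 3, 7
```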
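For idea 7, a minimal sketch of the pruning step of Deep Compression: zero out the smallest-magnitude weights. A single per-tensor threshold with no retraining is a simplifying assumption here; the paper prunes iteratively and retrains to recover accuracy.

```python
import torch

def magnitude_prune(weight, sparsity=0.9):
    # Keep only the largest-magnitude (1 - sparsity) fraction of weights.
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask, mask

w = torch.randn(256, 256)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(mask.float().mean().item())  # ~0.1 of the weights survive
```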