Digital Design Entry Methods

posted Aug 13, 2018, 9:51 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 13, 2018, 11:10 PM ]

  • Schematic capture: for small designs
  • Hardware description language (HDL) such as Verilog and VHDL: medium to large designs
  • High-level synthesis (HLS) such as Vivado HLS: the design is entered in C and converted to HDL.
  • OpenCL: framework for writing parallel programs for CPU, GPU & FPGA. Based on C99 and C++11.
The most popular system-level development tools are Vivado HLS and the Altera SDK for OpenCL. Vivado HLS requires more hardware knowledge, while Altera OpenCL is easier for software programmers but uses more FPGA resources.


Qin, S., & Berekovic, M. (2015). A Comparison of High-Level Design Tools for SoC-FPGA on Disparity Map Calculation Example. In 2nd International Workshop on FPGAs for Software Programmers (FSP 2015).

Array Multiplier

posted Aug 13, 2018, 8:54 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 13, 2018, 9:34 PM ]

Digital multiplication is one of the most extensively used operations (especially in signal processing), so designers of digital signal processors sacrifice a lot of chip area to make multiplication as fast as possible.

Parallel Multiplication
  • Partial products are generated simultaneously
  • Parallel implementations are used for high-performance machines, where computation latency needs to be minimized
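The simultaneous generation of partial products can be sketched in Python (a behavioural model only; in hardware each row is an AND-gate array and the rows are summed by an adder array rather than sequentially):

```python
def array_multiply(a, b, n=4):
    """Unsigned n-bit array multiplication via partial products.

    Each partial-product row is (a AND b_i) shifted left by i,
    mirroring the AND-gate array in hardware; in a parallel
    multiplier all rows are generated at once and summed by a
    carry-save adder array.
    """
    assert 0 <= a < 2 ** n and 0 <= b < 2 ** n
    rows = []
    for i in range(n):
        bit = (b >> i) & 1            # i-th bit of the multiplier
        rows.append((a * bit) << i)   # partial-product row, shifted
    return sum(rows)

print(array_multiply(13, 11))  # 143
```

Each row depends only on the inputs, which is why all of them can be generated simultaneously in hardware.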

AI Accelerators

posted Aug 12, 2018, 2:49 AM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 12, 2018, 3:40 AM ]

AI chip design projects:

 Company     Project

AI implementations and main weakness of each:
  • ASIC: slow to market, must find a large enough market to justify cost
  • CPU: very energy inefficient
  • GPU: great for training but very inefficient for inference
  • DSP: not enough performance, high cache miss rate

When to Use FPGAs

  • Transistor Efficiency & Extreme Parallelism
    • Bit-level operations
    • Variable-precision floating point
  • Power-Performance Advantage
    • >2x compared to Multicore (MIC) or GPGPU
    • Unused LUTs are powered off
  • Technology Scaling better than CPU/GPU
    • FPGAs are not frequency or power limited yet
    • 3D has great potential
  • Dynamic reconfiguration
    • Flexibility for application tuning at run-time vs. compile-time
  • Additional advantages when FPGAs are network connected ...
    • allows network as well as compute specialization

When to Use GPGPUs

  • Extreme FLOPS & Parallelism
    • Double-precision floating point leadership
    • Hundreds of GPGPU cores
  • Programming Ease & Software Group Interest
    • CUDA & extensive libraries
    • OpenCL
    • IBM Java (coming soon)
  • Bandwidth Advantage on Power
    • Start w/PCIe gen3 x16 and then move to NVLink
  • Leverage existing GPGPU eco-system and development base
    • Lots of existing use-Cases to build on
    • Heavy HPC investment in GPGPU


Raspberry Pi GPU

posted Aug 12, 2018, 12:51 AM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 12, 2018, 12:56 AM ]

The Raspberry Pi is a great platform for embedded computer vision, but its CPU is slow. Using the onboard GPU can accelerate video operations. Broadcom calls its GPU cores Quad Processing Units (QPUs). The QPU is a vector processor developed by Broadcom, with instructions that operate on 16-element vectors of 32-bit integer or floating-point values.
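A rough NumPy sketch of the QPU's lane-parallel behaviour (illustrative only; real QPU code is written in its own assembly, or via third-party libraries such as py-videocore):

```python
import numpy as np

# One QPU instruction applies the same operation to all 16 lanes of a
# vector; NumPy models that per-lane behaviour on 16-element float32
# arrays.
a = np.arange(16, dtype=np.float32)      # one 16-element vector
b = np.full(16, 2.0, dtype=np.float32)   # another 16-element vector
c = a * b + 1.0                          # per-lane multiply-add
print(c[:4])                             # [1. 3. 5. 7.]
```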

Spiking Neural Networks

posted Aug 7, 2018, 1:05 AM by MUHAMMAD MUN`IM AHMAD ZABIDI

Fig 1. Machine learning encompasses a range of algorithms.

Spiking neural networks (SNNs) are a form of neural network that more closely matches the behavior of biological neurons. SNNs, which use feed-forward training, have low computational and power requirements (Fig. 2) compared to CNNs.

Fig 2. Leveraging feed-forward training, spiking neural networks have low computational and power requirements compared to CNNs.

SNN models also work differently from CNNs because of their spiking nature. Information flows through CNN models in a wavelike fashion; information is modified by weights associated with the nodes in each network layer. SNNs emit spikes in a somewhat similar fashion, but spikes aren’t always generated at each point, depending on the data.

SNN training and hardware requirements are significantly different from CNNs. There are applications where one is much better than the other and areas where they overlap.
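The spiking behaviour described above can be sketched with a toy leaky integrate-and-fire (LIF) neuron (an illustrative model only; the threshold and leak values are arbitrary):

```python
def lif_neuron(inputs, threshold=1.0, leak=0.9):
    """Minimal leaky integrate-and-fire neuron (illustrative sketch).

    The membrane potential accumulates input, decays by a leak
    factor each step, and emits a spike (1) when it crosses the
    threshold, after which it resets. Unlike a CNN activation, the
    output is a sparse spike train, not a continuous value.
    """
    v, spikes = 0.0, []
    for x in inputs:
        v = v * leak + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0           # reset after spiking
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.3, 0.3, 0.6, 0.1, 0.9]))  # [0, 0, 1, 0, 0]
```

Note how a spike is emitted only when enough input accumulates, which is the source of the sparsity (and hence power savings) mentioned above.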

Posit Number System

posted Aug 6, 2018, 9:33 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 7, 2018, 5:16 AM ]

Posit numbers are a new way to represent real numbers for computers, an alternative to the standard IEEE floating point formats. The primary advantage of posits is the ability to get more precision or dynamic range out of a given number of bits.

A conventional floating point number (IEEE 754) has a sign bit, a set of bits to represent the exponent, and a set of bits called the significand (formerly called the mantissa). For a given size number, the lengths of the various parts are fixed. A 64-bit floating point number, for example, has 1 sign bit, 11 exponent bits, and 52 bits for the significand.
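The 1/11/52-bit split can be inspected directly; here is a small Python sketch (using the standard struct module) that extracts the three fields of a 64-bit float:

```python
import struct

def decode_float64(x):
    """Split an IEEE 754 double into its 1/11/52-bit fields."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    significand = bits & ((1 << 52) - 1)   # 52 fraction bits
    return sign, exponent, significand

# 1.5 = +1.1_2 * 2^0  ->  sign 0, biased exponent 1023, top fraction bit set
print(decode_float64(1.5))  # (0, 1023, 2251799813685248)
```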

A posit adds an additional category of bits, known as the regime. A posit has four parts
  1. sign bit
  2. regime
  3. exponent
  4. fraction
Unlike IEEE numbers, the exponent and fraction parts of a posit do not have fixed lengths. The sign and regime bits have first priority. Next, the remaining bits, if any, go into the exponent. If there are still bits left after the exponent, the rest go into the fraction.

Generic format

An example.
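To make the decoding order concrete, here is a minimal Python sketch that decodes a positive posit (illustrative only; an 8-bit width with es = 1 is assumed, and special cases such as zero and negative posits are skipped):

```python
def decode_posit(bits, n=8, es=1):
    """Decode a positive n-bit posit (illustrative sketch).

    Field order: sign, regime (run-length encoded), up to `es`
    exponent bits, then fraction. Missing exponent/fraction bits
    are treated as zero.
    """
    assert 0 < bits < (1 << (n - 1)), "positive, nonzero posits only"
    # Skip the (zero) sign bit; scan the regime run.
    pos = n - 2                      # index of the first regime bit
    first = (bits >> pos) & 1
    run = 0
    while pos >= 0 and ((bits >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first == 1 else -run
    pos -= 1                         # skip the regime terminator bit
    # Remaining bits: exponent (up to es bits), then fraction.
    exp = 0
    for _ in range(es):
        if pos >= 0:
            exp = (exp << 1) | ((bits >> pos) & 1)
            pos -= 1
        else:
            exp <<= 1                # missing exponent bits are zero
    frac_bits = pos + 1
    frac = bits & ((1 << frac_bits) - 1) if frac_bits > 0 else 0
    fraction = 1 + frac / (1 << frac_bits) if frac_bits > 0 else 1.0
    # value = useed^k * 2^exp * (1 + fraction), useed = 2^(2^es)
    return (2 ** (2 ** es)) ** k * 2 ** exp * fraction

print(decode_posit(0b01000000))  # 1.0
print(decode_posit(0b01101100))  # 12.0
```

In 0b01101100, the regime run "11" gives k = 1 (useed = 4), the exponent bit is 1, and the fraction "100" is 0.5, so the value is 4 * 2 * 1.5 = 12.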

Advantages of posit over IEEE 754:
  • Simpler, smaller, faster circuits
  • Superior accuracy, dynamic range, closure
  • Better answers with same number of bits or equally good answers with fewer bits
Who will be the first to produce a chip with posit arithmetic?

Embedded Deep Learning

posted Aug 5, 2018, 8:30 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 7, 2018, 5:21 AM ]

Market trends:

  • By 2019, 75% of enterprise and ISV development will include AI or ML (IDC)
  • By 2020, 5.6 billion IoT devices connected to an edge solution

Energy-efficient deep learning sits at the intersection between machine learning and computer architecture. New architectures can potentially revolutionize deep learning and deploy deep learning at scale.

State-of-the-art algorithms for applications like face recognition, object identification, and tracking rely on deep learning models for inference. Edge-based systems like security cameras and self-driving cars need deep learning to go beyond the minimum viable product. However, the core deciding factors for such edge-based systems are power, performance, and cost, as these devices have limited bandwidth, little tolerance for latency, and strict privacy constraints. The situation is further exacerbated by the fact that deep learning algorithms require on the order of teraops of computation for a single inference at test time, translating to a few seconds per inference for some of the more complex networks. Such high latencies are impractical for edge devices, which typically need real-time response. In addition, deep learning inference is extremely compute-intensive, often more than edge devices can afford.

Deep learning is necessary to bring intelligence and autonomy to the edge.

The first wave of embedded AI is marked by Apple Siri. It's not really embedded because Siri relies on the cloud to perform the full speech recognition process.

The second wave is marked by Apple's Face ID. The intelligence happens on the device, independent of the cloud.

IoT Scrapbook

posted Aug 4, 2018, 8:20 PM by MUHAMMAD MUN`IM AHMAD ZABIDI

The term "Internet of Things" was first used by Kevin Ashton in 1999.

S. Madakam, R. Ramaswamy, and S. Tripathi, "Internet of Things (IoT): A literature review," Journal of Computer and Communications, vol. 3, p. 164, 2015

The IoT concept was first introduced at the Massachusetts Institute of Technology (MIT), in a vision of a world where all our personal and working devices, including inanimate objects, have not only a digital identity but also processing ability, allowing a central computer system to organize and manage them.

F. Wortmann and K. Flüchter, "Internet of things," Business & Information Systems Engineering, vol. 57, pp. 221-224, 2015.

IoT architecture

LoRa vs NB-IoT

posted Jul 30, 2018, 9:33 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Aug 8, 2018, 6:20 AM ]

                        LoRa                          NB-IoT
Spectrum                Unlicensed                    Licensed LTE
Spectrum cost           Free                          > $500 million/MHz
Modulation              CSS                           QPSK
Bandwidth               125-500 kHz                   180 kHz
Peak data rate          290 bps - 50 kbps (DL/UL)     DL: 234.7 kbps; UL: 204.8 kbps
Link budget             154 dB                        150 dB
Max messages/day        Unlimited                     Unlimited
Duplex                  Full duplex                   Half duplex
Power efficiency        Very high                     Medium high
Spectrum efficiency     Chirp SS (better than FSK)    Improved by standalone, in-band, and guard-band operation
Area traffic capacity   Depends on gateway            40 devices per household; ~55k devices per cell
Interference immunity   Very high                     Low
Peak current            32 mA                         120-300 mA
Sleep current           1 μA                          5 μA
Standardization         De-facto standard             3GPP Rel. 13 (planned)

  • Sinha, R. S., Wei, Y., & Hwang, S.-H. (2017). A survey on LPWA technology: LoRa and NB-IoT. ICT Express, 3(1), 14–21.

Practical guide to text classification

posted Jul 23, 2018, 9:56 PM by MUHAMMAD MUN`IM AHMAD ZABIDI   [ updated Jul 23, 2018, 10:05 PM ]

Example of text classification. Another example is sentiment analysis.
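As a concrete illustration, here is a toy multinomial Naive Bayes sentiment classifier in pure Python (the example documents are hypothetical; a real system would use a proper library and far more training text):

```python
from collections import Counter
import math

def train(docs):
    """Train a tiny multinomial Naive Bayes text classifier.

    docs: list of (text, label) pairs. Returns per-class word
    counts, per-class totals, and class priors.
    """
    counts, totals, priors = {}, Counter(), Counter()
    for text, label in docs:
        priors[label] += 1
        for w in text.lower().split():
            counts.setdefault(label, Counter())[w] += 1
            totals[label] += 1
    return counts, totals, priors

def classify(text, model):
    """Pick the label with the highest log-posterior for `text`."""
    counts, totals, priors = model
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, -math.inf
    for label in priors:
        lp = math.log(priors[label] / sum(priors.values()))
        for w in text.lower().split():
            # Laplace (+1) smoothing over the shared vocabulary
            lp += math.log((counts[label][w] + 1) /
                           (totals[label] + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

docs = [("great movie loved it", "pos"),
        ("terrible plot hated it", "neg"),
        ("loved the acting", "pos"),
        ("hated the ending", "neg")]
model = train(docs)
print(classify("loved the movie", model))  # pos
```

Sentiment analysis, as mentioned above, is exactly this kind of classification with labels such as positive/negative.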
