
Exceptions of floating point normalization

Floating point normalization matters for accuracy: it gives each value a single canonical representation and keeps as many significant digits as possible in the mantissa. A floating point number consists of:

  1. Mantissa or significand.
  2. Exponent.

Say I have the number 123.75. It is a floating point number with the integer significand 12375 and the exponent -2.

So its arithmetic representation is 12375 x 10^-2.
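
As a rough illustration of this split, here is a small Python sketch that uses the standard `decimal` module to pull the integer significand and the base-10 exponent out of a decimal literal. The helper name `significand_and_exponent` is my own, chosen just for this example.

```python
from decimal import Decimal

def significand_and_exponent(text):
    """Split a decimal literal into (integer significand, base-10 exponent)."""
    sign, digits, exponent = Decimal(text).as_tuple()
    significand = int("".join(map(str, digits)))
    # as_tuple() reports the sign separately: 0 for positive, 1 for negative.
    return (-significand if sign else significand), exponent

significand, exponent = significand_and_exponent("123.75")
print(significand, exponent)  # 12375 -2
```

The literal is passed as a string so that the decimal digits are taken exactly as written, not after conversion to a binary double.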

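How do we normalize a floating point number?

- By shifting the mantissa until a single nonzero digit sits in the most significant (high-order, HO) position, and adjusting the exponent to compensate. Hence, the normalized representation of 123.75 is 1.2375 x 10^+2. In binary, the leading digit of a normalized number is always 1, so formats such as IEEE 754 single and double precision usually do not store it. This is the hidden bit.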
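
The same idea in binary can be sketched in Python. This is only an illustration, assuming the platform's `float` is an IEEE 754 double (true for CPython on common hardware); the helper names `normalize_binary` and `stored_fields` are hypothetical, made up for this example.

```python
import math
import struct

def normalize_binary(x):
    """Return (m, e) with x == m * 2**e and 1 <= |m| < 2, for finite nonzero x."""
    m, e = math.frexp(x)      # frexp yields 0.5 <= |m| < 1
    return m * 2, e - 1       # shift one place left so the leading digit is 1

def stored_fields(x):
    """Return (sign, biased exponent, 52-bit fraction field) of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

m, e = normalize_binary(123.75)
print(m, e)                   # 1.93359375 6

sign, biased, fraction = stored_fields(123.75)
print(biased - 1023)          # 6 (1023 is the exponent bias of a double)
print(1 + fraction / 2**52)   # 1.93359375 (the leading 1 is not stored)
```

Only the fraction after the binary point lives in the 52-bit field; the leading 1 has to be added back when decoding, which is exactly the hidden bit.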

Now the question: when can't we normalize a floating point number?

- There are two such situations:

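  1. We can't normalize zero (0). The mantissa of zero doesn't contain any 1 bit, so there is nothing to shift into the HO position. However, IEEE 754 has distinct representations for +0 and -0, which differ only in the sign bit.
  2. We also can't normalize a floating point number whose HO mantissa bits are zero when its biased exponent is already zero (the minimum), because shifting the mantissa left would require decreasing the exponent further. IEEE 754 calls such values denormalized (subnormal) numbers.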
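
Both situations can be spotted from the stored bit fields. The sketch below is a minimal illustration, again assuming Python's `float` is an IEEE 754 double; the function name `classify` and its labels are my own.

```python
import struct
import sys

def classify(x):
    """Classify a double by its stored biased exponent and fraction field."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    biased_exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if biased_exponent == 0:
        # No hidden 1 here: either zero, or a value too small to normalize.
        return "zero" if fraction == 0 else "subnormal"
    if biased_exponent == 0x7FF:
        return "inf or nan"
    return "normal"

print(classify(0.0), classify(-0.0))      # zero zero
print(classify(5e-324))                   # subnormal
print(classify(sys.float_info.min))       # normal
print(struct.pack(">d", 0.0).hex())       # 0000000000000000
print(struct.pack(">d", -0.0).hex())      # 8000000000000000
```

The last two lines show that +0 and -0 differ only in the sign bit, and `sys.float_info.min` is the smallest positive double that can still be normalized.
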
Reference: - Floating Point


Good read, however you could have explained a little more about the exceptional situations you have mentioned.
We are waiting for your new post!
