
What are floating point numbers?

Floating-point numbers are a way of representing real numbers in which a number is stored as a mantissa and an exponent (the English term is "floating point", since in English-speaking countries a point separates the fractional part). With this representation a number has a fixed relative precision but a varying absolute precision. The representation used most often is defined by the IEEE 754 standard. Arithmetic on floating-point numbers is implemented in computer systems both in hardware and in software.

Point or comma

The article "Decimal separator" lists the English-speaking countries in which the fractional part of a number is separated from the integer part by a dot; that is why the terminology of those countries speaks of a "floating point". In Russia the fractional part is traditionally separated by a comma, so the historically established Russian term translates literally as "floating comma". Nevertheless, today both variants are quite acceptable in technical documentation and in Russian-language literature.

The term "floating point" comes from the fact that in the positional representation of a number the radix point (decimal on paper, binary in a computer) can be placed anywhere within the string of digits. Floating-point representation can therefore be regarded as a computer implementation of exponential (scientific) notation. Its advantage over fixed-point and integer formats is that the range of representable values grows substantially while the relative precision remains unchanged.

Example

If the radix point in a number is fixed, the number can be written in only one format. Suppose eight digits are available: six for the integer part and two for the fractional part. Then the only possible form is 123456.78. The floating-point format gives far more freedom. Take the same eight digits for the mantissa and add a two-digit field for the exponent (base 10, ranging, say, from 0 to 16): ten digits in total (8 + 2) now cover a much wider range of values.

Some of the notations the floating-point format allows: 12345678000000000000; 0.0000012345678; 123.45678; 1.2345678, and so on. This format even has its own unit of speed measurement: FLOPS (floating-point operations per second), the number of floating-point operations a computer performs per second. FLOPS is the principal unit for measuring the performance of a computer system.
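A small sketch (with illustrative values) of the point made above: scientific notation keeps the same significant digits, and therefore the same relative precision, across wildly different magnitudes.

```python
# Sketch: the same 8 significant digits at very different magnitudes.
# Python floats are IEEE 754 doubles; the values here are illustrative.
values = [1.2345678e19, 1.2345678e2, 1.2345678e-6]
for v in values:
    # The :.7e format keeps 8 significant digits regardless of magnitude.
    print(f"{v:.7e}")
```

The absolute spacing between adjacent representable numbers changes with magnitude, but the number of significant digits does not.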

Structure

To write a number in floating-point format, the mandatory parts must follow a set order, since this is an exponential notation in which a real number is represented by a mantissa and an order (exponent). It exists to represent numbers that are too large or too small to read conveniently otherwise. The mandatory parts are: the number being written (N), the mantissa (M), and the order (p) together with its sign; the sign and the order form the characteristic of the number. Hence N = M · 10^p. This is how floating-point numbers are written; the examples vary.

1. Suppose we need to write one million without getting lost in the zeros. 1000000 is the ordinary arithmetic notation. In exponential form it looks like this: 1.0 · 10^6. That is, "ten to the sixth power": three characters stand in for six zeros. Here the difference between fixed-point and floating-point notation is immediately visible.

2. Even a number as unwieldy as 1 435 000 000 (one billion four hundred thirty-five million) can be written just as simply: 1.435 · 10^9, and that is all. Any number with a minus sign can be written the same way. This is where fixed-point and floating-point numbers part company.

But those are large numbers; what about small ones? They are just as easy.

3. For example, how do we denote one millionth? 0.000001 = 1.0 · 10^-6. This greatly simplifies both writing and reading the number.

4. Something harder? Five hundred forty-six billionths: 0.000000546 = 546 · 10^-9. The range of representation of floating-point numbers is very wide.
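The four examples above can be checked directly: Python's exponential format string renders each number in the scientific notation described.

```python
# Sketch: the examples above rendered in scientific (exponential) notation.
for x in [1_000_000, 1_435_000_000, 0.000001, 0.000000546]:
    print(f"{x:e}")
# Note that 546e-9 prints as its normalized equivalent, 5.460000e-07.
```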

The form

A number's form can be normal or normalized. In the normal form the mantissa, ignoring the sign, lies in the half-open interval from 0 to 1, so 0 ⩽ a < 1, and the number loses no accuracy. The drawback of the normal form is that many numbers can be written in several different ways, that is, ambiguously. An example of different records of the same number: 0.0001 = 0.000001 · 10^2 = 0.00001 · 10^1 = 0.0001 · 10^0 = 0.001 · 10^-1 = 0.01 · 10^-2, and so on at length. That is why computer science uses another, normalized form of record, in which the mantissa of a decimal number takes a value from one (inclusive) up to ten (exclusive), and likewise the mantissa of a binary number takes a value from one (inclusive) to two (exclusive).

Hence 1 ⩽ a < 10 in decimal (and 1 ⩽ a < 2 in binary). These are binary floating-point numbers, and this form of writing fixes any number except zero uniquely. But there is a drawback: zero cannot be expressed in this form, so computers reserve a special bit pattern for the number 0. Since the integer part (the highest digit) of the mantissa of a nonzero binary number in normalized form is always 1, it need not be stored at all: this is the implicit (hidden) one, and the IEEE 754 standard uses exactly this trick. Number systems whose base is greater than two (ternary, quaternary and others) do not have this property.
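A minimal sketch of binary normalization, using the standard-library function `math.frexp`; the helper name `normalized` is our own, not a library function.

```python
import math

# math.frexp returns (m, e) with x = m * 2**e and 0.5 <= |m| < 1.
# Doubling m and lowering e by 1 gives the normalized form 1 <= |m| < 2
# used by IEEE 754, where the leading 1 is the implicit (hidden) one.
def normalized(x):
    m, e = math.frexp(x)   # x = m * 2**e, with 0.5 <= |m| < 1
    return m * 2, e - 1    # x = (2m) * 2**(e-1), with 1 <= |2m| < 2

print(normalized(6.0))  # 6.0 = 1.5 * 2**2
```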

Real numbers

Floating point is not the only way to represent a real number, but it is a very convenient one: a compromise between range of values and precision, an analogue of exponential notation executed in the computer. A floating-point number is a set of bits divided into a sign, an order (exponent) and a mantissa. In the most common IEEE 754 formats, one group of bits encodes the mantissa, another group the exponent, and a single bit the sign: zero if the number is positive, one if it is negative. The exponent is written as an integer in a biased (shifted) code, and the mantissa is stored in normalized form, with its fractional part in binary.

The sign occupies exactly one bit. The mantissa and the exponent are integers; combined with the sign they make up the representation of a floating-point number. The exponent may also be called the order or the characteristic. Not all real numbers can be represented exactly in a computer; the rest are represented by approximate values. A much simpler alternative is to represent a real number with a fixed point, storing the integer and fractional parts separately, say X bits for the integer part and Y bits for the fractional part. But most processor architectures have no direct support for such a format, and therefore preference is given to floating point.
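The bit layout just described can be inspected directly. This sketch uses the standard-library `struct` module to pull the sign, exponent and mantissa fields out of an IEEE 754 double; the helper name `fields` is ours.

```python
import struct

# Extract the three fields of an IEEE 754 double:
# 1 sign bit, 11 exponent bits (biased by 1023), 52 mantissa bits.
def fields(x):
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign     = bits >> 63
    exponent = (bits >> 52) & 0x7FF    # biased exponent
    mantissa = bits & ((1 << 52) - 1)  # fractional part; leading 1 is implicit
    return sign, exponent, mantissa

s, e, m = fields(-6.0)
# -6.0 = -1.5 * 2**2: sign 1, unbiased exponent 2, mantissa 1.5
print(s, e - 1023, 1 + m / 2**52)
```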

Addition

Adding floating-point numbers is quite simple. An IEEE 754 single-precision number has a great many bits, so it is better to go straight to an example and to use the smallest possible floating-point representation. Take two numbers, X and Y.

Variable   Sign   Exponent   Mantissa
X          0      1001       110
Y          0      0111       000

The steps are:

A) The numbers must be brought to normalized form, which makes the hidden one explicit: X = 1.110 · 2^2 and Y = 1.000 · 2^0.

B) The addition can proceed only after the exponents are made equal, so the value of Y must be rewritten. The rewritten value will equal the normalized number, although it will itself be denormalized.

Calculate the difference of the exponents: 2 - 0 = 2. Now add 2 to the exponent of the second summand and compensate by shifting the point of its mantissa, hidden one included, two places to the left. This gives 0.0100 · 2^2, the equivalent of the previous value of Y, which we call Y'.

C) Now we add the mantissa of X to the adjusted mantissa of Y'.

1.110 + 0.010 = 10.000

The exponent is still the exponent of X, which equals 2.

D) The sum obtained in the previous step has carried past the normalized position, so the result must be renormalized. 10.0 has two bits to the left of the point; to normalize it, move the point one place to the left and increase the exponent by 1. The result is 1.000 · 2^3.

E) Finally, pack the floating-point result back into the one-byte format.

Sum        Sign   Exponent   Mantissa
X + Y      0      1010       000
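The toy format used in the worked example can be sketched in a few lines. The field widths match the example (1 sign bit, 4 exponent bits with an assumed bias of 7, 3 mantissa bits with a hidden one); the helper name `unpack` is ours, not a library function.

```python
BIAS = 7  # assumed exponent bias for the 4-bit exponent field

# Recover the real value of a toy 8-bit float:
# (-1)**sign * 1.mantissa * 2**(exponent - BIAS)
def unpack(sign, exp_bits, man_bits):
    mantissa = 1 + man_bits / 8  # hidden one plus 3 fraction bits
    return (-1) ** sign * mantissa * 2 ** (exp_bits - BIAS)

x = unpack(0, 0b1001, 0b110)  # X = 1.110b * 2^2 = 7.0
y = unpack(0, 0b0111, 0b000)  # Y = 1.000b * 2^0 = 1.0
s = unpack(0, 0b1010, 0b000)  # X + Y = 1.000b * 2^3 = 8.0
print(x, y, s, x + y == s)
```

Decoding all three rows confirms that the bit patterns in the tables really do satisfy X + Y = 7.0 + 1.0 = 8.0.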

Conclusion

As you can see, adding such numbers is not too difficult, even though the point floats. Unless, of course, we count aligning the number with the smaller exponent to the one with the larger (in the example above, Y to X), and then restoring the status quo afterwards by shifting the point of the mantissa back. Once the addition is done, another difficulty is quite possible: renormalization and truncation of bits when their number does not fit the format in which the result must be represented.

Multiplication

The binary system offers two ways to multiply floating-point numbers: multiplication can start either from the low-order digits of the multiplier or from the high-order ones. Both methods consist of a series of operations that successively accumulate partial products, and these additions are controlled by the bits of the multiplier. If a given digit of the multiplier is one, the multiplicand, shifted accordingly, is added to the sum of partial products; if the digit is zero, the multiplicand is not added.
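The low-order-first variant described above can be sketched for integer mantissas; the function name `shift_add_multiply` is ours.

```python
# Shift-and-add multiplication controlled by the bits of the multiplier,
# starting from the low-order digit, as described above.
def shift_add_multiply(multiplicand, multiplier):
    product = 0
    shift = 0
    while multiplier:
        if multiplier & 1:                    # a one in this digit:
            product += multiplicand << shift  # add the shifted multiplicand
        multiplier >>= 1                      # move on to the next digit
        shift += 1
    return product

print(shift_add_multiply(13, 11))  # 143
```

In a hardware floating-point multiplier the same loop runs on the mantissas, while the exponents are simply added.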

When just two numbers are multiplied, the product can contain up to twice as many digits as each factor, and for large numbers that is a great deal. When several numbers are multiplied, the product risks not fitting at all. Since the number of digits in any digital machine is finite, we are forced to limit ourselves to the width of the accumulators, and once the number of digits is limited, an error inevitably enters the result. When the volume of computation is large, these errors pile up, and the overall error grows considerably. The only way out is to round the results of multiplication, so that the error of the product alternates in sign. During multiplication it is possible to fall outside the digit grid, but only on the low-order side, because the restriction imposed on numbers represented in floating-point form is fixed.
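The accumulation of rounding error described above can be observed directly. This sketch compares repeated double-precision multiplication against exact rational arithmetic from the standard-library `fractions` module.

```python
from fractions import Fraction

# Each floating-point product is rounded to fit the 53-bit mantissa of
# a double; the small per-step errors accumulate over many steps.
x = 1.0
exact = Fraction(1)
for _ in range(60):
    x *= 0.1                  # rounded at every step
    exact *= Fraction(1, 10)  # exact rational arithmetic, no rounding

# Fraction(x) gives the exact value of the rounded float,
# so the difference below is the accumulated rounding error.
print(abs(Fraction(x) - exact) > 0)
```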

Some explanations

It is better to start at the beginning. The most common way to represent a number is as a string of digits, an integer with the point implied at the very end. Such a string can be of any length, with the point standing wherever it is needed, separating the integer part from the fractional part. A fixed-point format necessarily imposes conditions on where the point may stand. Exponential notation uses the standard normalized representation of numbers: a·q^n. Here a is the mantissa (as was said above, 1 ⩽ a < q in normalized form), n is an integer, the exponent, and q is also an integer, the base of the given number system (in writing it is usually 10). The mantissa keeps the point after the first nonzero digit, and the rest of the information about the actual value of the number is carried by the exponent.

The floating-point number is very similar to this standard notation for numbers, except that the exponent and the mantissa are stored separately. The mantissa is also kept in normalized format, with the point fixed after the first significant digit. The floating point is used mainly in the computer, that is, in an electronic representation where the system is binary rather than decimal, and there the mantissa may even be kept denormalized, with the point before the first digit rather than after it, so that it has no integer part at all. For example, our native decimal system hands its nine over to the binary system, where it becomes 1001; with the point before the mantissa it can be stored as the pair +0.1001 and exponent +100 (that is, 4 in binary). The decimal system on paper cannot match the calculations that the binary floating-point form makes possible.

Long arithmetic

Electronic computers have software packages in which the amount of memory allocated to the mantissa and the exponent is set programmatically, limited only by the size of the computer's memory. This is what long arithmetic looks like: the same simple operations on numbers - addition and subtraction, multiplication and division, elementary functions and root extraction - except that the numbers involved may have a bit width far exceeding the length of the machine word. Such operations are implemented in software rather than hardware, but the basic hardware operations on numbers of much lower orders are used heavily inside them. There is also arithmetic in which the length of numbers is limited solely by the amount of memory: arbitrary-precision arithmetic. Long arithmetic is used in many areas.

1. In code for low-bit processors and microcontrollers: with 8-bit registers and 10-bit analog-to-digital converters, processing the ADC data is clearly impossible without long arithmetic.

2. In cryptography, where the result of an exponentiation or multiplication must be exact for numbers on the order of 10^309. Integer arithmetic is carried out modulo m, a large natural number that is not necessarily prime.

3. Software for financiers and mathematicians also cannot do without long arithmetic, because only in this way can calculations done on paper be verified on a computer with high precision; such software can carry the floating point to any desired number of digits. Engineering calculations and the work of scientists, however, rarely require software long arithmetic, because entering the data without mistakes is very difficult, and the input errors are usually much larger than those from rounding.
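Python itself is a convenient illustration of long arithmetic: its integers grow without limit, and the standard-library `decimal` module lets the precision be chosen explicitly, as the cryptography and finance examples above require.

```python
from decimal import Decimal, getcontext

# Integers of arbitrary length: 2**1024 is far beyond any machine word.
big = 2 ** 1024
print(len(str(big)))  # 309 decimal digits, the order of magnitude cited above

# Decimal arithmetic with a programmatically chosen precision.
getcontext().prec = 50  # 50 significant digits
print(Decimal(1) / Decimal(7))
```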

Fighting errors

In operations on floating-point numbers it is very difficult to estimate the error of the results; no mathematical theory has yet been invented that would solve this problem in full. Errors in integer arithmetic, by contrast, are easy to estimate. One way to get rid of inaccuracy lies on the surface: use only fixed-point numbers. Financial software, for example, is built on this principle, and there it is even simpler: the required number of digits after the decimal point is known in advance.

Other applications cannot restrict themselves in this way, because they must work with both very small and very large numbers. Therefore it is always assumed in such work that inaccuracies are possible, and the results must be rounded on output. Moreover, automatic rounding is often inadequate, so the rounding is specified explicitly. The comparison operation is especially dangerous in this respect: here even estimating the size of future errors is extremely difficult.
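The danger of comparison can be shown in one line, along with the usual remedy: comparing within a tolerance instead of for exact equality, using the standard-library function `math.isclose`.

```python
import math

# Both sides of the comparison carry rounding error, so exact
# equality fails even though the values are "obviously" equal.
a = 0.1 + 0.2
print(a == 0.3)              # False
print(math.isclose(a, 0.3))  # True: compare within a relative tolerance
```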
