24.9.2017
In November 2016 I taught a course on advanced programming techniques in Qt. While preparing the materials, I studied a number of different parts of Qt. One of them is parallel processing.
I have run into parallel calculations a few times. The first time, I needed to statistically process a large amount of data from photovoltaic power plants. I came across the parallel tools in the Qt documentation, but having no experience with parallel computation, I did not understand how to use Qt's parallel computing library. So I simply spread the calculation across multiple threads and programmed the whole thing by hand.
The result is an application that can use 100% of the available computing capacity - my handwork has no negative effect on data processing efficiency. However, I had to manage the parallel computation myself, and the source code of that part of the application is neither clear nor easy to understand. The Qt library itself can manage parallel calculations in a standard way.
To understand parallel computing more deeply, I later worked with the CouchDB database and after that tried to use the graphics card for parallel computing.
In Qt, the QtConcurrent module is used for parallel calculations. As is usual for Qt, the whole module is well documented, but without a basic idea of how parallel computation works the documentation may be somewhat confusing.
The basis for parallel calculations is the set of input data. What does that mean? For statistical calculations on photovoltaic power plants, the smallest separately processed item is one day of one photovoltaic power plant. If I compare ten different power plants over a whole year, I get 3650 input values (365 days × 10 plants). Each of these input values can be handled separately and does not affect the calculations derived from other input values.
The calculation function converts the input data (365 days for 10 plants) into output data (3650 values, each holding the daily statistics of one power plant).
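The input set described above can be sketched in a few lines of standard C++. The names `PlantDay` and `makeInputSet` are hypothetical and only illustrate the idea that one work item is one plant on one day; they do not come from the original application.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical type: one work item is one power plant on one day.
struct PlantDay {
    int plant;  // index of the power plant, 0-based
    int day;    // day of the year, 0-based
};

// Build the input set: every combination of plant and day.
std::vector<PlantDay> makeInputSet(int plants, int days) {
    assert(plants >= 0 && days >= 0);
    std::vector<PlantDay> items;
    items.reserve(static_cast<std::size_t>(plants) * static_cast<std::size_t>(days));
    for (int p = 0; p < plants; ++p)
        for (int d = 0; d < days; ++d)
            items.push_back({p, d});
    return items;
}
```

Calling `makeInputSet(10, 365)` yields the 3650 independent items mentioned in the text.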
Calculating the statistics of one plant for one day can be a relatively complicated task. At the beginning of the calculation I connect to the database and read the measured data relevant to the power plant and the day. There may be anywhere from several tens of thousands to hundreds of thousands of measured values per plant. The data is then processed statistically, which yields several statistical parameters characterizing the power plant on the given day. These parameters can be used to compare photovoltaic power plants with each other.
What matters in this calculation is that it does not depend on any other ongoing calculation or on any other context of the application. While computing, you cannot look at another power plant or another day. If the calculation is constructed this way, it can be programmed quite easily.
It is also necessary to ensure that a calculation does not depend on other application context. What does that mean? For example, each calculation must always create its own new connection to retrieve data from the database. A single shared connection cannot handle multiple requests at the same time. Once the calculation is complete, the database connection must be closed. (A great performance advantage is, of course, a stand-alone computer hosting the database engine, with a sufficient number of processor cores and disks, at least for PostgreSQL.)
Because the computational function is independent of other threads and context of the application, it is possible to run several such computational functions at once - the calculation is performed in parallel.
A great advantage of a computation function in QtConcurrent over OpenCL is the range of functionality available and the amount of data the function can process. A computational kernel in OpenCL (on a graphics card) can usually have only a very limited size and a limited amount of memory. In addition, virtually all of C++ and the Qt libraries can be used when programming a computational function for QtConcurrent. The limitation may be the lower number of computing units and their lower performance - on my computer, I can use up to 8 AMD CPU cores (one of the most powerful processors at the time of purchase) or 14 computing cores in the graphics card (an average card at best; I did not want to spend more money on it).
The QtConcurrent module in Qt implements several basic algorithms for performing parallel calculations, among them map, filter and map-reduce.
The calculation described above (statistics of a photovoltaic power plant) is a typical use of the map algorithm. An operation (reading the measured values and calculating the statistics) is applied to every element of the input set (365 days times 10 power plants), and the result is a structure describing the behavior of a particular power plant on a particular day. The data from the input set has only been converted to another form. The output data can then be compared with each other, shown in a graph, and so on.
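The map step can be sketched as follows. This is a minimal illustration in standard C++ using `std::async`, not QtConcurrent code; in Qt, `QtConcurrent::mapped` plays this role and takes care of the thread management. The types `PlantDay` and `DayStats` and the dummy body of `computeDayStats` are assumptions made for the example; the real function would open its own database connection and compute real statistics.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

struct PlantDay { int plant; int day; };                    // hypothetical input item
struct DayStats { int plant; int day; double energyKWh; };  // hypothetical output item

// Stand-in for the real calculation. It uses only its own argument,
// so any number of calls can run at the same time.
DayStats computeDayStats(const PlantDay& item) {
    assert(item.plant >= 0 && item.day >= 0);
    const double energy = 10.0 * (item.plant + 1) + item.day % 7;  // dummy value
    return {item.plant, item.day, energy};
}

// Apply computeDayStats to every input item; output[i] corresponds to input[i].
std::vector<DayStats> parallelMap(const std::vector<PlantDay>& input) {
    const std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<DayStats> output(input.size());
    std::vector<std::future<void>> pending;
    for (std::size_t w = 0; w < workers; ++w) {
        // Worker w handles items w, w + workers, w + 2 * workers, ...
        // Each element of output is written by exactly one worker.
        pending.push_back(std::async(std::launch::async, [&input, &output, w, workers] {
            for (std::size_t i = w; i < input.size(); i += workers)
                output[i] = computeDayStats(input[i]);
        }));
    }
    for (auto& f : pending) f.get();  // wait for all workers
    return output;
}
```

The output has exactly as many elements as the input, in the same order, which is what distinguishes map from the other algorithms.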
Photovoltaic power plants typically operate in varying weather conditions. It is difficult to compare power plants with each other if we try to compare, for example, one power plant on a sunny day with another on a rainy day. The filter algorithm selects from the input data set only the data that meets the specified conditions. For example, you can select only sunny days and discard all other days. There are only a few such days in a year, and finding them manually can take a lot of work.
The function implementing such a calculation would not, in principle, differ much from the calculation of the daily statistics of one photovoltaic power plant. However, instead of complete statistics, its output would be a single value: cloudy / clear (true / false).
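A filter can be sketched in the same spirit. Again this is standard C++ rather than QtConcurrent (the Qt counterpart is `QtConcurrent::filtered`), and the predicate `isSunny` is a hypothetical placeholder: the real one would read the measured data of the day, whereas here every fifth day is simply declared sunny.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <thread>
#include <vector>

struct PlantDay { int plant; int day; };  // hypothetical input item

// Placeholder predicate: the result is only yes / no.
bool isSunny(const PlantDay& item) {
    assert(item.day >= 0);
    return item.day % 5 == 0;
}

// Keep only the items for which isSunny returns true, preserving their order.
std::vector<PlantDay> parallelFilter(const std::vector<PlantDay>& input) {
    const std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
    // char instead of bool: std::vector<bool> packs bits, so concurrent
    // writes to neighbouring elements would not be safe.
    std::vector<char> keep(input.size(), 0);
    std::vector<std::future<void>> pending;
    for (std::size_t w = 0; w < workers; ++w) {
        pending.push_back(std::async(std::launch::async, [&input, &keep, w, workers] {
            for (std::size_t i = w; i < input.size(); i += workers)
                keep[i] = isSunny(input[i]) ? 1 : 0;
        }));
    }
    for (auto& f : pending) f.get();

    // The expensive predicate ran in parallel; collecting the survivors is cheap.
    std::vector<PlantDay> output;
    for (std::size_t i = 0; i < input.size(); ++i)
        if (keep[i]) output.push_back(input[i]);
    return output;
}
```

Unlike map, the output is usually shorter than the input, and its elements are unchanged input items.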
Imagine you want to evaluate the work of a photovoltaic power plant over a whole year and summarize the results in a single number (or a structure with a few values). Such a number can represent a comparison of the total annual production with the projected production, or a comparison with irradiance data for the area obtained from satellite measurements.
The first step of such a calculation is to process the measured data for each day and power plant. In the parallel calculation, the input data (365 days × 10 power plants) is transformed into output statistics (again 365 days × 10 plants).
The output data of each power plant is then aggregated by the reduction function - all the daily values are converted into whole-year statistics. The result is ten output values, one for each power plant. Of course, an output value can be a structure containing dozens of different parameters.
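The two steps can be sketched together. This is a simplified illustration in standard C++, assuming hypothetical types and a dummy map function in which plant p produces p + 1 kWh every day; in Qt, `QtConcurrent::mappedReduced` combines both steps. Here the map step runs in parallel and the reduction runs afterwards in the calling thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <map>
#include <thread>
#include <vector>

struct PlantDay { int plant; int day; };           // hypothetical input item
struct DayStats { int plant; double energyKWh; };  // hypothetical daily result
struct YearStats { double totalKWh; int days; };   // hypothetical yearly result

// Map step (dummy): pretend plant p produces p + 1 kWh every day.
DayStats computeDayStats(const PlantDay& item) {
    return {item.plant, static_cast<double>(item.plant + 1)};
}

// Reduce step: fold one daily result into the yearly totals of its plant.
void reduceDay(std::map<int, YearStats>& totals, const DayStats& s) {
    assert(s.plant >= 0);
    YearStats& y = totals[s.plant];  // zero-initialized on first use
    y.totalKWh += s.energyKWh;
    y.days += 1;
}

std::map<int, YearStats> mapReduce(const std::vector<PlantDay>& input) {
    // Map in parallel: worker w handles items w, w + workers, ...
    const std::size_t workers = std::max(1u, std::thread::hardware_concurrency());
    std::vector<DayStats> mapped(input.size());
    std::vector<std::future<void>> pending;
    for (std::size_t w = 0; w < workers; ++w) {
        pending.push_back(std::async(std::launch::async, [&input, &mapped, w, workers] {
            for (std::size_t i = w; i < input.size(); i += workers)
                mapped[i] = computeDayStats(input[i]);
        }));
    }
    for (auto& f : pending) f.get();

    // Reduce in one thread, so the totals need no locking.
    std::map<int, YearStats> totals;
    for (const DayStats& s : mapped) reduceDay(totals, s);
    return totals;
}
```

For ten plants the returned map holds ten entries, one yearly structure per plant, regardless of how many days were processed.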
The map-reduce algorithm can calculate data for a whole year. However, there are days when the sun was shining and also days when it was cloudy. If we want to include only the data from sunny days in the calculation, we can narrow the set of input data using the parallel filter algorithm. Only the filtered data is then used to create the total statistics for all sunny days.
Note: the result of the filter function is only yes / no (sunny / cloudy). After filtering the input days, some calculations have to be repeated to compute the daily statistics. Such a naive approach leads to repeating time-consuming operations. Since I already know how much work such calculations mean for two powerful computers (the database server and the computing server), the need to repeat some of them frightens me.
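One possible way around the repetition, shown here only as a sketch and not as the approach taken in the application, is to let the map step store the sunny / cloudy flag together with the daily statistics. The filter then works on already computed results instead of on raw input days. The types `DayStats` and `SunnyStats` and the field `sunny` are assumptions made for this example.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical output of the map step: the flag is computed once,
// together with the rest of the daily statistics.
struct DayStats { int plant; double energyKWh; bool sunny; };
struct SunnyStats { double totalKWh; int days; };

// Filter and reduce over results that already exist, so no measured
// data has to be read from the database a second time.
std::map<int, SunnyStats> sunnyTotals(const std::vector<DayStats>& mapped) {
    std::map<int, SunnyStats> totals;
    for (const DayStats& s : mapped) {
        assert(s.energyKWh >= 0.0);
        if (!s.sunny) continue;           // filter: skip cloudy days
        SunnyStats& t = totals[s.plant];  // zero-initialized on first use
        t.totalKWh += s.energyKWh;        // reduce: accumulate sunny days only
        t.days += 1;
    }
    return totals;
}
```

The trade-off is that statistics are also computed for days that are later thrown away, so this only pays off when the flag cannot be determined much more cheaply than the full statistics.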
This article should only be an introduction to the Qt parallel programming series. Watch this site, watch our Twitter. Further parts will follow.