The method of statistical processing that uses a SQL query with analytic functions

The proposed method includes statistical treatment of the data within one or more different groups by using the SQL analytic functions. The group may denote products, types of orders, survey questions, etc.

As the data distribution is mostly unknown or can not be treated statistically, it is necessary to transform it into a normal distribution, which is achieved by sampling. The Central Limit Theorem says that as the number of samples increases, the distribution of mean values ​​from samples approaching normal distribution.

The uniqueness of the method is to use a single SQL query that performs sampling (necessary to convert to a normal distribution), the calculation of mean values ​​and standard deviations for each sample and, finally, for each group.

This method considerably simplifies the analytical treatment of the data, and is already used in practice for orders analytical reports, whose example is shown in the following table:

Order Type Average sent time (hours) 97% orders sent within (hours)
A 7 14
B 5 7
C 6,6 14
D 2,8 8
E 42 67

An example query is taken from orders analytics. The following information for each individual order have been identified from the event log table of business processes:

  • type is the type of order
  • time is the interval from leaving order from SAP into the BPM
    and sending that order to supplier.

 

SELECT type, avg(avg_sample), stddev(avg_sample)
FROM( SELECT distinct type,
                              avg(time) over (partition by type,sample) avg_sample, sample
FROM( SELECT type, time,
                                            (row_number() over (partition by type)-1)/30 sample
                            FROM data2
)
)
GROUP BY type
ORDER BY type

The blue color indicates the portion that generates samples of 30 orders, and the red color indicates taking the mean value for each sample.

The final PostgreSQL query is:

SELECT type,sum(cnt),count(*) cnt,avg(avg_sample),stddev(avg_sample)
FROM( SELECT distinct type,count(*) over w cnt,avg(diff) over w avg_sample
FROM ( SELECT type,(extract (‘epoch’ from end_time-start_time))/3600 diff,
(row_number() over (partition by type)-1)/30 sample
FROM data2
) foo1
WINDOW w AS (PARTITION BY type,sample)
) foo2
GROUP BY type
ORDER BY 1

 

Oglasi
Ovaj unos je objavljen u Nekategorizirano. Bookmarkirajte stalnu vezu.

Komentiraj

Popunite niže tražene podatke ili kliknite na neku od ikona za prijavu:

WordPress.com Logo

Ovaj komentar pišete koristeći vaš WordPress.com račun. Odjava / Izmijeni )

Twitter picture

Ovaj komentar pišete koristeći vaš Twitter račun. Odjava / Izmijeni )

Facebook slika

Ovaj komentar pišete koristeći vaš Facebook račun. Odjava / Izmijeni )

Google+ photo

Ovaj komentar pišete koristeći vaš Google+ račun. Odjava / Izmijeni )

Spajanje na %s