Predictive Analytics and R, Part 3

Forecasting

Forecasting is required in many situations: deciding whether to build another power generation plant in the next five years requires forecasts of future demand; scheduling staff in a call centre next week requires forecasts of call volume; stocking an inventory requires forecasts of stock requirements. Forecasts can be required several years in advance (in the case of capital investments), or only a few minutes beforehand (for telecommunication routing). Whatever the circumstances or time horizons involved, forecasting is an important aid in effective and efficient planning.

Before the 1920s, forecasting meant drawing lines through clouds of data values. In 1927, Yule introduced the autoregressive technique so he could predict the annual number of sunspots. This was a linear model: the basic approach was to assume a linear underlying process modified by noise. Models of this kind are still widely used in business forecasting (e.g., what will my sales of wheat be next month?).
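Yule's original application can be reproduced in a few lines of base R, which ships with the annual sunspot numbers as `sunspot.year`. This is only a sketch of the idea: `ar()` fits an autoregressive model (here with the order chosen by AIC, up to a maximum of 9) and `predict()` extrapolates it.

```r
# Yule-style autoregressive fit to the annual sunspot numbers
# (sunspot.year is a dataset built into base R)
fit <- ar(sunspot.year, order.max = 9)  # order selected by AIC
pred <- predict(fit, n.ahead = 5)       # forecast the next five years
print(fit$order)
print(pred$pred)
```

The fitted object also reports the innovation variance (`fit$var.pred`), i.e., the noise component sitting on top of the linear process.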

Forecast accuracy measures such as mean squared error (MSE) can be used for selecting a model for a given set of data, provided the errors are computed from data in a hold-out set and not from the same data as were used for model estimation. However, there are often too few out-of-sample errors to draw reliable conclusions. Consequently, a penalized method based on the in-sample fit is usually better.
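The hold-out idea can be sketched as follows, using the built-in `AirPassengers` series rather than the CSR data discussed later in this post: the model is estimated on a training window only, and the MSE is computed on the withheld observations.

```r
library(forecast)

# Split the series: train up to 1958, hold out 1959-1960
train <- window(AirPassengers, end = c(1958, 12))
test  <- window(AirPassengers, start = c(1959, 1))

fit <- ets(train)
fc  <- forecast(fit, h = length(test))

# MSE on the hold-out set, not on the fitted (in-sample) values
mse <- mean((as.numeric(test) - as.numeric(fc$mean))^2)
print(mse)
```

With only 24 hold-out errors here, the point made above applies: such an estimate is noisy, which is why penalized in-sample criteria are often preferred.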

Complex models tend to handle the training data correctly but fail to generalize, a phenomenon usually termed overfitting. The usual statistical approach to this situation is model selection, where different candidate models are evaluated according to a generalization estimate. Several complex estimators have been developed (e.g., bootstrapping), but they are computationally burdensome. A reasonable alternative is the use of simple statistics that add a penalty to the model as a function of its complexity, such as the Bayesian Information Criterion (BIC).
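As a small illustration of such a penalized statistic, the following base-R sketch fits two candidate autoregressive models of different complexity to the built-in `lh` series (a hormone measurement series chosen here purely for illustration) and compares their BIC values; the criterion charges the larger model for its extra parameters.

```r
# Two candidate AR orders for the same series; BIC = -2*logLik + k*log(n)
# penalizes the additional parameters of the larger model
fit1 <- arima(lh, order = c(1, 0, 0))
fit3 <- arima(lh, order = c(3, 0, 0))
print(c(AR1 = BIC(fit1), AR3 = BIC(fit3)))
```

Whichever model attains the lower BIC is preferred, without any need for a hold-out set.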

Another such approach, based on a penalized likelihood, is Akaike's Information Criterion (AIC). We select the model that minimizes the AIC amongst all of the models that are appropriate for the data. Because it is based on the likelihood rather than on one-step forecasts, the AIC also provides a method for selecting between additive and multiplicative error models. Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values. Therefore, multiplicative-error models will not be considered if the time series is not strictly positive; in that case only the six fully additive models will be applied.
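This selection behaviour can be observed directly with `ets()` from the forecast package, which fits the candidate exponential smoothing models and keeps the one minimizing the information criterion. In this sketch, a simulated series containing negative values (an illustrative stand-in for non-positive data) forces the additive-error forms:

```r
library(forecast)

# On strictly positive data, multiplicative errors may be selected
fit_pos <- ets(AirPassengers)
print(fit_pos$method)  # "ETS(error,trend,season)" label of the chosen model

# On data with negative values, only additive errors are considered
set.seed(1)
x <- ts(rnorm(60), frequency = 12)
fit_neg <- ets(x)
print(fit_neg$method)  # error letter (5th character) is "A"
```

The first letter inside `ETS(...)` denotes the error type; for the series containing negatives it is always "A" (additive).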

The following R code forecasts Customer Service Requests (CSRs) for the next 20 days:

library("xlsx")      # read.xlsx2
library("xts")       # xts time series
library("chron")     # is.weekend
library("tseries")   # irts (irregular time series)
library("forecast")  # ets, auto.arima, forecast

# Read the CSR id, priority and report date columns from the first sheet
file <- file.choose()
data1 <- read.xlsx2(file, 1, startRow = 1, colIndex = c(1, 3, 8),
                    as.data.frame = FALSE,
                    colClasses = c("integer", "character", "character"))
data1[3] <- lapply(data1[3], function(x) as.Date(substring(x, 2), format = "%d.%m.%Y"))
data2 <- data.frame(data1[[1]], data1[[2]], data1[[3]])
names(data2) <- c("CSR", "Prioritet", "Prijavljeno")

# Count requests per report date and priority
counts <- table(data2$Prijavljeno, data2$Prioritet, dnn = c("Prijavljeno", "Prioritet"))
data3 <- as.data.frame.table(counts, stringsAsFactors = FALSE)
data3$Prioritet <- as.factor(data3$Prioritet)
data3$Prijavljeno <- as.Date(data3$Prijavljeno)

startDate <- "21.02.2011"
endDate <- "31.12.2012"
data3 <- subset(data3, data3[, "Prijavljeno"] >= as.Date(startDate, format = "%d.%m.%Y"))

# Daily sequence of zero counts over the whole period (anchored at 03:00
# so that daylight-saving transitions do not shift the day boundary)
s <- seq(from = as.POSIXct(paste(startDate, "3"), format = "%d.%m.%Y %H"),
         to = as.POSIXct(paste(endDate, "3"), format = "%d.%m.%Y %H"), by = "day")
zeros <- rep(0, length(s))
irts <- irts(s, zeros)

# Croatian public holidays in the observed period
holidays <- as.Date(c("2011-01-06", "2011-04-25", "2011-06-22", "2011-06-23",
                      "2011-08-05", "2011-08-15", "2011-11-01", "2011-12-26", "2012-01-06",
                      "2012-04-09", "2012-05-01", "2012-06-07", "2012-06-22", "2012-06-25",
                      "2012-08-15", "2012-10-08", "2012-11-01", "2012-12-25", "2012-12-26"))

# Keep working days only: drop weekends and public holidays
wrkdys <- irts[!is.weekend(irts)]
wrkdys <- wrkdys[!(as.Date(wrkdys[[1]]) %in% holidays)]
xts1 <- xts(wrkdys[[2]], order.by = as.Date(wrkdys[[1]]))

calc <- function(df, ahead) {
  for (pr in levels(df$Prioritet)) {
    print(paste(pr, ", ahead=", ahead))
    df_pr <- subset(df, df[, "Prioritet"] == pr)
    xts_pr <- xts(df_pr$Freq, order.by = df_pr$Prijavljeno)
    # Align with the working-day calendar, filling missing days with zero
    xts_full <- merge(xts_pr, xts1, fill = 0)[, 1]

    # Box-Cox transformation parameter to stabilise the variance
    lmbd <- BoxCox.lambda(xts_full)

    # Exponential smoothing state space model (selected by AIC)
    fc <- forecast(ets(xts_full, lambda = lmbd), h = ahead,
                   simulate = TRUE, bootstrap = TRUE)
    print(summary(fc))
    print(paste("sum=", sum(fc$mean)))

    # ARIMA model found by exhaustive (non-stepwise) search
    fc <- forecast(auto.arima(ts(xts_full), stepwise = FALSE, approximation = FALSE,
                              lambda = lmbd), h = ahead,
                   simulate = TRUE, bootstrap = TRUE)
    print(summary(fc))
    print(paste("sum=", sum(fc$mean)))
  }
}

calc(data3, 20)
