#### Introduction

Predictive analytics comprises the data processing techniques that predict future outcomes by analyzing previously collected data.

Organizations are adopting predictive analytics more widely and applying it more broadly. Many now run dozens or even thousands of predictive analytic models, increasingly in real-time decision making and in operational production systems. Models are used to improve customer treatment by selecting the next best action to develop a customer, to make loan or credit pricing decisions that reflect the future risk of a transaction, to predict the likelihood of equipment failure and drive proactive maintenance decisions, or to detect potentially fraudulent transactions so they can be routed out of the system before they hit the bottom line. Applications like these deliver a high return on investment (ROI) from analytics.

**Predictive maintenance and quality**

Predictive maintenance is just what its name implies: Maintaining components or assets, large or small, according to fact-based expectations for when they will fail or require service.

These facts can include:

- Real-time device status: How is the part performing right now?
- Historical device data: How has the part performed in the past?
- Data for similar devices: How have other, similar parts performed?
- Maintenance records: When was the part last serviced or replaced?
- Maintenance schedules: What does the manufacturer recommend?

**Of course, all of this Big Data is meaningless without analysis. There are hidden patterns lurking within these facts and figures. Decoding these patterns is what powers predictive maintenance and separates it from more traditional, reactive approaches to equipment repair and replacement**.

#### Traditional approaches to maintenance

Predictive maintenance differs considerably from the traditional approaches to determining when to service or replace equipment. For years, companies have kept their production lines running through a combination of these maintenance methods:

- **Reactive**: Service or replace equipment after it fails
- **Preventive**: Service or replace equipment according to the manufacturer’s suggested schedule, or the amount of time it has been in service, or based on operational observations
- **Condition-based**: Service or replace equipment based on monitoring performed to assess its current condition

The problem with these old-school approaches is their high cost. Waiting until a component fails means lost production time and revenue. In-person inspections are expensive and can lead to replacing parts unnecessarily, based only on the inspector’s best guess. Following the manufacturer’s recommended maintenance schedule saves on inspection costs but often results in replacing parts that are still functioning well and could continue to do so.

One way to reduce operational cost and increase the availability of the manufacturing system (Figure 1) is to manage all maintenance activities continuously and keep degradation under control, which means moving to predictive maintenance.

*Figure 1: Decrease in failure rate through predictive maintenance*

Device events are supplied to the solution either in real time or as a batch and are transformed into the format required by the solution. The information in the events is recorded in the analytic data store along with aggregated key performance indicators (KPIs) and profiles. The KPIs are accumulated over time and show trends. The profiles indicate the current state of the device and can include statistical calculations of variation. For example, events containing the temperature and operating load of a transformer can be aggregated as a KPI of the average temperature and load per day. The operating load can also be aggregated as a profile to record the most recent load and the variability of the load over time.
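The transformer example can be sketched in a few lines of base R. This is only an illustration: the `events` data frame, its column names and the simulated readings are assumptions made up for the sketch, not part of any particular solution.

```r
# Hypothetical device events: four readings per day from one transformer.
set.seed(1)
events <- data.frame(
  day         = rep(as.Date("2024-01-01") + 0:2, each = 4),
  temperature = round(rnorm(12, mean = 65, sd = 3), 1),
  load        = round(runif(12, min = 0.4, max = 0.9), 2)
)

# KPI: average temperature and load per day.
kpi <- aggregate(cbind(temperature, load) ~ day, data = events, FUN = mean)

# Profile: most recent load and the variability of the load over time.
profile <- list(
  latest_load = tail(events$load, 1),
  load_sd     = sd(events$load)
)

print(kpi)
print(profile)
```

The KPI table grows by one row per day and can be plotted to show trends, while the profile is overwritten as new events arrive.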

The information in the analytic data store is used to perform predictive scoring, a process that uses a mathematical model to put a numerical value on the likelihood that a device or component failure will occur. We then use a predefined set of rules to determine the appropriate actions to take in response to various scores. For example, if a score indicates that the probability of a transformer failure is less than 0.7 (70%), the rules may call for no immediate action. If the score rises above 0.8 (80%), the rules may trigger a request to have a physical inspection performed.
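A rule set like this one could be written as a small R function. The 0.7 and 0.8 thresholds come from the example above; the text does not say what happens between them, so the "monitor closely" band, the function name and the sample scores are assumptions for illustration.

```r
# Map a failure probability to an action.
# Below 0.7: no action. Above 0.8: request an inspection.
# The band in between is an assumption, not stated in the text.
action_for_score <- function(p_fail) {
  ifelse(p_fail < 0.7, "no action",
         ifelse(p_fail > 0.8, "request inspection", "monitor closely"))
}

# Hypothetical scores for three transformers.
scores  <- c(T1 = 0.35, T2 = 0.75, T3 = 0.92)
actions <- action_for_score(scores)
print(actions)
```

Because `ifelse()` is vectorized, the same function scores one device or a whole fleet.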

Scoring is a key part of predictive maintenance and involves the use of predictive models that use historical data to determine the probability of certain future outcomes. For example, a model could be created based on historical data regarding transformer temperature, current load, and occurrences of failure.

#### R Statistical language for Predictive Analytics

Captured data is continuously scored using predictive analytics software. Predictive analytic models mine the data and correlate past failures using multivariate analysis. The models can mine all the variables and conditions that contributed to past failures in order to predict future failures. Incoming data are then run through the model and asset health scores are generated on a real time basis.

The processing cycle typically involves two phases of processing:

- Training phase: Learn a model from training data
- Predicting phase: Deploy the model to production and use that to predict the unknown or future outcome
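The two phases might look roughly like this in R. The sketch assumes a simple linear model on simulated data and uses a temporary `.rds` file to stand in for deployment; a real production setup would differ.

```r
# --- Training phase: learn a model from (simulated) training data ---
set.seed(42)
train <- data.frame(load = runif(50, min = 0.3, max = 1.0))
train$temperature <- 40 + 30 * train$load + rnorm(50, sd = 2)

model <- lm(temperature ~ load, data = train)

# Persist the fitted model so another process can load it.
model_path <- tempfile(fileext = ".rds")
saveRDS(model, model_path)

# --- Predicting phase: load the model and score new, unseen readings ---
deployed     <- readRDS(model_path)
new_readings <- data.frame(load = c(0.5, 0.9))
predicted    <- predict(deployed, newdata = new_readings)
print(predicted)
```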

R is an open source language and environment for statistical computing, released under the GNU General Public License (GPL 2).

Predictive analytics is tightly integrated with the algorithms and statistical libraries available in R. Oracle has its own version of R, called Oracle R Enterprise, for analytics on Oracle databases. SAS Institute added connectivity to R from its SAS/IML and JMP products some time ago. SPSS, the analytics software acquired by IBM, was one of the first packages to work with R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, …) and graphical techniques, and is highly extensible.

Using R for predictive analytics is a low-cost and flexible solution, but it does require a basic knowledge of statistics and mathematics. R is powerful for a number of reasons. Its main feature is vector processing: the ability to perform operations on entire arrays of numbers without explicitly writing iteration loops. Another important feature is that statisticians, engineers and scientists can improve the software’s code or write variations for specific tasks. Packages written for R add advanced algorithms, colored and textured graphs, and mining techniques to dig deeper into databases.
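A short illustration of vector processing, using made-up temperature readings (the variable names and the 65 degree limit are arbitrary):

```r
# Hypothetical temperature readings in degrees Celsius.
temps_c <- c(61.5, 64.0, 70.2, 68.9)

# Arithmetic applies to every element at once; no loop is written.
temps_f <- temps_c * 9 / 5 + 32

# Comparisons are vectorized too and can be used to filter.
over_limit <- temps_c > 65
mean_over  <- mean(temps_c[over_limit])

print(temps_f)
print(over_limit)
print(mean_over)
```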

#### Choose the right model

An important phase to consider is the actual analysis phase. When choosing a type of model, it is key to consider whether the main task is to provide a result that is as significant as possible or one that also needs to be presented to business users or engineers. In many cases, decision trees have proven to be a very good approach for classification problems. In particular, the option to build the trees manually, and so include domain know-how, is very powerful and well received by many customers. Also make sure that you create a holdout sample of your data to test the models for stability and predictive power on unknown data. Otherwise you might end up with a result that works for the current data but not for other data in the future. This phenomenon is called *overfitting*.
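A holdout check for a decision tree could look like the sketch below. It assumes the `rpart` package (shipped with standard R installations) and uses simulated data with an invented failure rule; the 70/30 split is a common but arbitrary choice.

```r
library(rpart)

# Simulated readings with an invented failure rule (assumption for the sketch).
set.seed(7)
n <- 400
d <- data.frame(temperature = rnorm(n, mean = 65, sd = 8),
                load        = runif(n, min = 0.3, max = 1.0))
d$failed <- factor(ifelse(d$temperature > 70 & d$load > 0.6, "yes", "no"))

# Set aside 30% of the rows as a holdout sample.
holdout_idx <- sample(n, size = round(0.3 * n))
train   <- d[-holdout_idx, ]
holdout <- d[holdout_idx, ]

# Fit a classification tree on the training rows only.
tree <- rpart(failed ~ temperature + load, data = train, method = "class")

# Compare accuracy on seen and unseen data.
accuracy <- function(data) {
  mean(predict(tree, newdata = data, type = "class") == data$failed)
}
train_acc   <- accuracy(train)
holdout_acc <- accuracy(holdout)
print(c(train = train_acc, holdout = holdout_acc))
```

A training accuracy far above the holdout accuracy is the usual warning sign of overfitting.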

Four predictive models are used:

- Health Score (binary logit model)
- Lifespan Analysis (Cox Regression model)
- Random Forests CART (Classification and Regression Trees)
- Time series models

##### Health Score (binary logit model)

The Health Score model is based on the logistic regression model and measures the likelihood that an asset or process will fail. The model uses historical defect data, operational information, and environmental data to determine the current operational health of an asset, and continuously monitors the asset to predict potential faults in real time. The resulting value, typically referred to simply as the Health Score, can also be used to predict the future health of the asset.

The Health Score is presented as a number between 0 and 1. The higher the number, the healthier the asset. The overall Health Score for an entire manufacturing site is the average of the scores of the individual assets at that site. If the input data model structure is modified, the health score model must be retrained on the new data.

A well-established statistical method for predicting binomial outcomes is required to predict the health score value, and the solution uses a binomial logistic algorithm for this purpose. In binomial (or binary) logistic regression, the outcome can take only two possible values (e.g. “Yes” or “No”, “Success” or “Failure”). Multinomial logistic regression refers to cases where the outcome can take three or more possible values (e.g. “good” vs. “very good” vs. “best”). The outcome is generally coded as “0” and “1” in binary logistic regression. This kind of algorithm is limited to models where the target field is a flag or binary field. The algorithm provides enhanced statistical output when compared to a multinomial algorithm and is less susceptible to problems when the number of table cells (unique combinations of predictor values) is large relative to the number of records.

R provides comprehensive support for regression modelling, including generalized linear models such as logistic regression.

To fit a logistic regression model in R, use the glm() function, which is similar to lm() (the “linear model” function) but takes additional parameters. The format is

    glm(Y ~ X1 + X2 + X3, family = binomial(link = "logit"), data = mydata)

Here, Y is the dependent variable and X1, X2 and X3 are the independent variables.
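As a minimal sketch of how such a model could produce a health score, the example below fits `glm()` on simulated data. The column names, the simulated failure mechanism and the choice to define the score as one minus the predicted failure probability are assumptions made for illustration.

```r
# Simulated history: temperature, load, and whether the asset failed (0/1).
set.seed(123)
n <- 500
mydata <- data.frame(temperature = rnorm(n, mean = 65, sd = 8),
                     load        = runif(n, min = 0.3, max = 1.0))
log_odds      <- -12 + 0.15 * mydata$temperature + 3 * mydata$load
mydata$failed <- rbinom(n, size = 1, prob = plogis(log_odds))

# Binary logit model of failure.
fit <- glm(failed ~ temperature + load,
           family = binomial(link = "logit"), data = mydata)

# Score two hypothetical assets: one cool and lightly loaded, one hot and busy.
new_assets <- data.frame(temperature = c(58, 80), load = c(0.4, 0.9))
p_fail     <- predict(fit, newdata = new_assets, type = "response")

# Assumed definition: health score = 1 - probability of failure.
health_score <- 1 - p_fail
print(round(health_score, 3))
```

With `type = "response"`, `predict()` returns probabilities rather than log-odds, so the score stays between 0 and 1.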

##### Lifespan Analysis (Cox Regression model)

The Lifespan Analysis model estimates a device’s remaining lifespan when functioning in a real-world scenario. Depending on the device, lifespan can be measured in hours, miles, stress cycles, or any other metric. Data on the functional condition of the device is collected from laboratory experiments.

The Lifespan Analysis model analyzes time-to-failure event data. Lifespan analysis is an offline, back-end process and can be performed at regular intervals or on demand.

The model is based on the **Cox regression model**. Cox regression is particularly well suited to cases where the quantity of interest is the time to a certain event, such as a failure. The technique was originally developed to predict the life expectancy of cancer patients, but it can be used just as well for technical analysis. Cox regression can also take potential influence factors into account and fine-tune its failure estimates accordingly.

The shape of the survival function and the regression coefficients for the predictors are estimated from observed subjects; the model can then be applied to new cases that have measurements for the predictor variables.

For Cox regression analysis we can use the R package **survival** (http://cran.r-project.org/web/packages/survival/).
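A small sketch with the survival package, on simulated time-to-failure data. The covariate `load`, the 1000-hour observation window and the simulated hazard are assumptions chosen for the example.

```r
library(survival)

# Simulated devices: a higher operating load shortens time to failure.
set.seed(99)
n <- 200
d <- data.frame(load = runif(n, min = 0.3, max = 1.0))
true_hours <- rexp(n, rate = 0.001 * exp(2 * d$load))

# Observation stops at 1000 hours; devices still running are censored.
d$failed <- as.integer(true_hours <= 1000)
d$hours  <- pmin(true_hours, 1000)

# Cox proportional hazards model with load as the influence factor.
fit <- coxph(Surv(hours, failed) ~ load, data = d)
print(exp(coef(fit)))  # hazard ratio per unit of load

# Estimated survival curves for two new devices.
curves  <- survfit(fit, newdata = data.frame(load = c(0.4, 0.9)))
surv500 <- summary(curves, times = 500)$surv
print(surv500)  # estimated probability of surviving past 500 hours
```

`Surv()` pairs each time with its event indicator, which is how censored devices contribute to the fit without being treated as failures.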

##### Random Forests CART (Classification and Regression Trees)

CART models offer an intuitive overview of a multivariate data set and are suitable for dealing with complex processes and nonlinear relationships. They are also able to recognize the parameters that are most important to a given regression problem.

However, they suffer from high prediction variance. Therefore, for prediction purposes we use **Random Forests**, a method that builds an ensemble of CART models. Aggregating the results of a large number of different single-tree models reduces variance, produces more stable models, and usually improves prediction accuracy. Furthermore, by the law of large numbers, adding more trees does not cause the method to overfit. The regression tree model is constructed using the binary recursive partitioning routines implemented in the R package **rpart** (http://cran.r-project.org/web/packages/rpart/index.html) and plotted using routines from the R package **partykit** (http://cran.r-project.org/web/packages/partykit/index.html).

The methodology allows a transition from time-based to condition-based maintenance, reduces problem complexity, and offers high predictive performance. As the Random Forest approach is free of parametric or distributional assumptions, the method can be applied to a wide range of predictive maintenance problems. This leads to a reduction of tool downtime, maintenance and manpower costs and improves competitiveness in the semiconductor industry.
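The idea of aggregating many trees can be sketched with rpart alone. Note that this is plain bagging (bootstrap aggregation) of regression trees, a simplification: a full Random Forest also samples a random subset of predictors at each split. The data, the number of trees and all variable names are assumptions for illustration.

```r
library(rpart)

# Simulated nonlinear regression problem.
set.seed(2024)
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- sin(2 * pi * d$x1) + d$x2 + rnorm(n, sd = 0.3)

test_idx <- sample(n, 100)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# A single regression tree for comparison.
single <- rpart(y ~ x1 + x2, data = train)

# Bagging: fit each tree on a bootstrap resample, then average predictions.
n_trees <- 50
preds <- sapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  predict(rpart(y ~ x1 + x2, data = boot), newdata = test)
})
bagged_pred <- rowMeans(preds)

rmse <- function(p) sqrt(mean((test$y - p)^2))
rmse_single <- rmse(predict(single, newdata = test))
rmse_bagged <- rmse(bagged_pred)
print(c(single = rmse_single, bagged = rmse_bagged))
```

Each column of `preds` holds one tree's predictions; averaging across columns smooths out the instability of individual trees.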

##### Time series models

Time series models are used for predicting or forecasting the future behavior of variables. These models account for the fact that data points taken over time may have an internal structure (such as autocorrelation, trend or seasonal variation). As a result, standard regression techniques cannot be applied directly to time series data, and dedicated methodology has been developed to decompose the trend, seasonal and cyclical components of the series. Modeling the dynamic path of a variable can improve forecasts, since the predictable component of the series can be projected into the future.

Time series models estimate difference equations containing stochastic components. Two commonly used forms of these models are autoregressive (AR) models and moving average (MA) models. The Box-Jenkins methodology (1976), developed by George Box and G.M. Jenkins, combines the AR and MA models to produce the ARMA (autoregressive moving average) model, which is the cornerstone of stationary time series analysis. ARIMA (autoregressive integrated moving average) models, on the other hand, are used to describe non-stationary time series.

In recent years time series models have become more sophisticated and attempt to model conditional heteroskedasticity with models such as ARCH (autoregressive conditional heteroskedasticity) and GARCH (generalized autoregressive conditional heteroskedasticity) models frequently used for financial time series. In addition time series models are also used to understand inter-relationships among economic variables represented by systems of equations using VAR (vector autoregression) and structural VAR models.

We are using the R **forecast** package (http://cran.r-project.org/web/packages/forecast/).
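As a minimal sketch of fitting and forecasting an autoregressive model, the example below uses `arima()` from base R's stats package rather than forecast, so that it runs without extra installs; the forecast package offers higher-level helpers such as `auto.arima()` for choosing the model order. The simulated series and the AR(1) order are assumptions for illustration.

```r
# Simulate a stationary AR(1) series around a level of 50.
set.seed(5)
x <- arima.sim(model = list(ar = 0.7), n = 200) + 50

# Fit an ARIMA(1, 0, 0) model, i.e. AR(1) with a mean term.
fit <- arima(x, order = c(1, 0, 0))
print(coef(fit))

# Forecast the next five points with their standard errors.
fc <- predict(fit, n.ahead = 5)
print(fc$pred)
print(fc$se)
```

The standard errors grow with the horizon, reflecting the fact that only the predictable component of the series can be projected forward.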
