A Technical chat: March 2013

Thursday, 28 March 2013

ITBAL ASSIGNMENT 9

Assignment 1:
Create 3 vectors x, y, z of equal length, filled with any random values
T<-cbind(x,y,z)
Create a 3D plot and colour-code it.

> datasample<- rnorm(76,52,34)
> datasample

[1] 45.5942618 19.9666328 57.2714050 74.3639975 40.3385894   0.6364362 74.7431400 115.0645893 25.1070366 43.4188552 63.1523647 23.6176277 67.2653427
[14] 26.6219515 49.4166552 28.8174814 57.3325996 97.7983760 26.1854705 78.6355984 73.4513727 -27.5435911 68.3513575 17.8148926 67.5888141 67.6574094
[27] -3.3454215 34.3924609 18.6275779 115.7888569 40.8016886 102.1727834 98.8360943 64.2131752 47.0419500 43.7890566 47.7691718 104.9063673 86.8813250
[40] 67.8444825 81.9504910 46.9187240 79.7412471 28.9433658 43.8613800 32.6041700 63.0511048 55.7572267 36.0072830 79.1773860 71.5129731 40.6598254
[53] 31.8630136 86.6888350 38.7001328 58.0645384 63.0741548 107.4190386 52.7611391   7.0894044 27.9309929 84.1651360 -42.5948954 39.2538138 -2.4216676
[66] 56.2117049 65.6911771 73.2045687 91.1742967 58.5928354 116.0099456 25.0099455 30.3843329 -1.5303707 50.1393531 95.2141598

> x<- sample(datasample,20)
> y<- sample(datasample,20)
> z<- sample(datasample,20)
> x

[1] 71.51297 79.17739 97.79838 28.81748 30.38433 58.06454 31.86301 67.26534 25.00995 43.86138 28.94337 56.21170 73.45137 95.21416 26.62195 84.16514
[17] 23.61763 104.90637 40.80169 116.00995
> y
[1] 91.17430 84.16514 73.20457 104.90637 67.84448 32.60417 107.41904 102.17278 18.62758 -27.54359 73.45137 47.04195 63.07415 115.06459 65.69118 28.94337
[17] 78.63560 56.21170 39.25381 19.96663
> z
[1] 46.9187240 67.5888141 43.7890566 116.0099456 38.7001328 25.1070366 67.8444825 107.4190386 45.5942618 -1.5303707 63.1523647 47.0419500 115.7888569
[14] 57.3325996 68.3513575 79.7412471   0.6364362 67.6574094 58.0645384 73.2045687
> T<-cbind(x,y,z)
> T
              x         y           z
 [1,]  71.51297  91.17430  46.9187240
 [2,]  79.17739  84.16514  67.5888141
 [3,]  97.79838  73.20457  43.7890566
 [4,]  28.81748 104.90637 116.0099456
 [5,]  30.38433  67.84448  38.7001328
 [6,]  58.06454  32.60417  25.1070366
 [7,]  31.86301 107.41904  67.8444825
 [8,]  67.26534 102.17278 107.4190386
 [9,]  25.00995  18.62758  45.5942618
[10,]  43.86138 -27.54359  -1.5303707
[11,]  28.94337  73.45137  63.1523647
[12,]  56.21170  47.04195  47.0419500
[13,]  73.45137  63.07415 115.7888569
[14,]  95.21416 115.06459  57.3325996
[15,]  26.62195  65.69118  68.3513575
[16,]  84.16514  28.94337  79.7412471
[17,]  23.61763  78.63560   0.6364362
[18,] 104.90637  56.21170  67.6574094
[19,]  40.80169  39.25381  58.0645384
[20,] 116.00995  19.96663  73.2045687

> library(rgl)
> plot3d(T)

> plot3d(T,col=rainbow(nrow(T)))

> plot3d(T,col=rainbow(nrow(T)),type='s')
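
The same steps can be collected into a short script. The sketch below is one possible way to do it, not the exact commands used above: it assumes the rgl package for plot3d, uses an arbitrary seed, and colours each point by the rank of its z value so that the colour carries information. The name pts is chosen here to avoid masking R's built-in T (shorthand for TRUE).

```r
set.seed(42)                                  # arbitrary seed, for reproducibility only
datasample <- rnorm(76, mean = 52, sd = 34)

# Three vectors of equal length drawn from the same pool of values
x <- sample(datasample, 20)
y <- sample(datasample, 20)
z <- sample(datasample, 20)
pts <- cbind(x, y, z)                         # 20 x 3 numeric matrix

# One colour per point: the point with the lowest z gets the first
# palette colour, the point with the highest z gets the last one
cols <- rainbow(nrow(pts))[rank(z, ties.method = "first")]

# Draw only in an interactive session where rgl is installed (assumption)
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(pts, col = cols, type = "s")
}
```

Using as many colours as there are points means every sphere gets a visibly different colour.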

Assignment 2:
Read the documentation of rnorm and pnorm
Create two random variables x and y which are normally distributed
Plot x vs y
Plot x vs y after introducing a categorical variable with 5 different categories, combined using cbind
Colour-code the points by category
Add a smooth curve

> ?rnorm
starting httpd help server ... done
http://127.0.0.1:31992/library/stats/html/Normal.html
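
As a quick illustration of what that help page describes: rnorm draws random values, pnorm gives the cumulative probability, and qnorm inverts pnorm. The numbers 90 and 10 below simply reuse the mean and standard deviation chosen for x in this assignment; the seed and the variable names are assumptions made for this sketch.

```r
set.seed(1)                                   # arbitrary seed
draws <- rnorm(200, mean = 90, sd = 10)       # 200 random values from N(90, 10^2)

# P(X <= 90): half of the distribution lies below the mean
p_mean <- pnorm(90, mean = 90, sd = 10)

# P(X <= 100), i.e. up to one standard deviation above the mean
p_one_sd <- pnorm(100, mean = 90, sd = 10)

# qnorm is the inverse of pnorm: it maps a probability back to a value
q_back <- qnorm(p_one_sd, mean = 90, sd = 10)

# The share of simulated draws below 100 should be close to p_one_sd
emp_share <- mean(draws <= 100)
```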

> x <- rnorm(200,90,10)
> y <- rnorm(200,50,10)
> z1<-sample(letters,5)
> z2<-sample(z1,200,replace=TRUE)
> z<-as.factor(z2)
> T<-cbind(x,y,z)
> library(ggplot2)
> qplot(x,y)

> qplot(x,z)

> qplot(x,z,alpha=I(1/10))

> qplot(x,y,geom=c("point","smooth"))

> qplot(x,y,colour=z)

> qplot(log(x),log(y),colour=z)
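
One thing worth noting about the session above: cbind() builds a numeric matrix, so the factor z is stored in T as integer codes 1 to 5 and the letters are lost. The qplot calls still colour by category because they use the standalone z, not T. A data frame keeps the factor intact. The sketch below shows the difference; the seed and the names m and df are assumptions, and the plot is drawn only if ggplot2 is installed.

```r
set.seed(7)                                   # arbitrary seed
x  <- rnorm(200, 90, 10)
y  <- rnorm(200, 50, 10)
z1 <- sample(letters, 5)                      # 5 category labels
z  <- factor(sample(z1, 200, replace = TRUE), levels = z1)

m  <- cbind(x, y, z)                          # numeric matrix: z becomes codes 1..5
df <- data.frame(x, y, z)                     # data frame: z stays a factor

# Coloured scatter plot with one smooth curve per category
if (interactive() && requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(df, ggplot2::aes(x, y, colour = z)) +
    ggplot2::geom_point() +
    ggplot2::geom_smooth(method = "loess", se = FALSE)
  print(p)
}
```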

Saturday, 23 March 2013

Google Refine

Data Cleansing

Data cleansing means identifying wrong or inaccurate records in a data set and correcting them. It involves finding incomplete, inaccurate, and incorrect parts of the data and then either replacing them with correct data or deleting them. The result is data that is consistent with other standard data and suitable for analysis. Errors can arise from data entry mistakes by the user, failures during data transmission, or improper data definitions.
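
To make these steps concrete, here is a minimal sketch in base R. The records, column names and rules are made up for illustration: spelling is made consistent, unreadable and impossible values are marked as missing, and incomplete or duplicate rows are deleted.

```r
# Hypothetical raw records with typical data entry errors
raw <- data.frame(
  city  = c(" Mumbai", "mumbai", "Delhi ", "DELHI", "Pune", "Pune"),
  sales = c("120", "120", "n/a", "95", "-1", "80"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$city  <- toupper(trimws(clean$city))                  # consistent spelling
clean$sales <- suppressWarnings(as.numeric(clean$sales))    # "n/a" becomes NA
clean$sales[!is.na(clean$sales) & clean$sales < 0] <- NA    # negative sales are invalid
clean <- clean[!is.na(clean$sales), ]                       # delete incomplete records
clean <- clean[!duplicated(clean), ]                        # delete exact duplicates
```

Of the six raw rows, three survive: one each for MUMBAI, DELHI and PUNE.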

Google Refine:

Google Refine is a web application but, unlike most web applications, it is intended to be run on one's own machine and used by oneself. The server side maintains the state of the data (undo/redo history, long-running processes, etc.), while the client side maintains the state of the user interface (facets and their selections, view pagination, etc.). The client side makes GET and POST Ajax calls to change the data and to fetch data and data-related state from the server side.

Google Refine is a powerful tool for cleansing data. Its main features are:

· Pulling data from various sources

· Cleaning the data using Transform/Clusters/Filters

· Linking to web URLs to fetch more useful data

· Connecting to various databases to reconcile the collected data

A few snapshots of the project:

It allows loading multiple files at the same time, from any source and in practically any format:

Data loaded in Google Refine:

One important aspect of Google Refine is faceting. Faceting is about seeing the big picture and filtering the rows on which a bulk update is to be performed.

We can use text facets, numeric facets, timeline facets and scatterplot facets. We can also design customized facets.
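
In R terms, a text facet is roughly a frequency table of one column plus a filter on the chosen value, and a numeric facet is a range filter. The sketch below uses made-up data to illustrate the idea; it is not how Google Refine implements facets internally.

```r
# Hypothetical records with inconsistent country spellings
records <- data.frame(
  country = c("India", "india", "USA", "India", "U.S.A.", "USA"),
  amount  = c(10, 12, 7, 9, 8, 11),
  stringsAsFactors = FALSE
)

# Text facet: every distinct value together with its row count
facet <- sort(table(records$country), decreasing = TRUE)

# Selecting one facet value narrows the rows touched by a bulk update
selected <- records$country == "india"
records$country[selected] <- "India"

# Numeric facet: keep only the rows whose amount lies in a chosen range
in_range <- records[records$amount >= 8 & records$amount <= 10, ]
```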

Any redundancy or duplicates can also be removed.

Clustering is used to merge choices that look similar.
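
A rough idea of how similar-looking choices can be grouped: the sketch below implements a simplified key-collision approach, in which each value is reduced to a normalised key (lower case, punctuation removed, words sorted) and values sharing a key form one cluster. This is an approximation written for illustration; the function name fingerprint and the sample values are assumptions, and Google Refine's own clustering methods do more than this.

```r
# Reduce a string to a normalised key so that spelling variants collide
fingerprint <- function(s) {
  s <- tolower(trimws(s))                     # ignore case and outer whitespace
  s <- gsub("[[:punct:]]", "", s)             # drop punctuation
  tokens <- strsplit(s, "[[:space:]]+")       # split into words
  vapply(tokens,
         function(tk) paste(sort(unique(tk)), collapse = " "),
         character(1))                        # sorted unique words, rejoined
}

# Hypothetical cell values
values <- c("Tata Motors", "tata motors", "Motors, Tata",
            "TATA  MOTORS.", "Infosys")

keys     <- fingerprint(values)
clusters <- split(values, keys)               # values with the same key group together
```

Here the four Tata spellings end up in one cluster and could then be merged into a single chosen value.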

Expressions can also be used:

Reconciliation goes a step beyond cleansing: it matches the data against a freely available online database (Freebase) to get more information about the records.

We can also enrich the data.

Some of the advantages of Google Refine are

• Ease of use

• Works in any browser

• Extensive functionality

• Undo/Redo is simply awesome

Thursday, 14 March 2013

Assignment 8 : Panel Data Analysis

Do a panel data analysis of the "Produc" data using three types of model:
      Pooled model
      Fixed effects model
      Random effects model

Determine which model is the best by using the functions:
       pFtest: for choosing between fixed and pooled
       plmtest: for choosing between pooled and random
       phtest: for choosing between random and fixed

> library(plm)

> data(Produc , package ="plm")

> head(Produc)

    state year     pcap     hwy   water    util       pc   gsp    emp unemp
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7
2 ALABAMA 1971 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2
3 ALABAMA 1972 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3   4.7
4 ALABAMA 1973 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9
5 ALABAMA 1974 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8   5.5
6 ALABAMA 1975 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4   7.7

Pooled Model

> pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))

> summary(pool)

Oneway (individual) effect Pooling Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
    model = ("pooling"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.04950 -0.01940 -0.00412  0.01150  0.08690

Coefficients :
              Estimate Std. Error  t-value  Pr(>|t|)
(Intercept)  0.7496721  0.0271054  27.6577 < 2.2e-16 ***
log(hwy)     0.5248704  0.0048326 108.6099 < 2.2e-16 ***
log(water)   0.1077579  0.0040454  26.6370 < 2.2e-16 ***
log(util)    0.4127255  0.0038337 107.6574 < 2.2e-16 ***
log(pc)     -0.0330829  0.0048219  -6.8610 1.361e-11 ***
log(gsp)     0.0758341  0.0108650   6.9797 6.170e-12 ***
log(emp)    -0.0891772  0.0076891 -11.5978 < 2.2e-16 ***
log(unemp)   0.0043878  0.0029465   1.4891    0.1368
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 724.14
Residual Sum of Squares: 0.56734
R-Squared : 0.99922
Adj. R-Squared : 0.98942
F-statistic: 147217 on 7 and 808 DF, p-value: < 2.22e-16

Fixed Model

> fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))

> summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
    model = ("within"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
     Min.   1st Qu.    Median   3rd Qu.      Max.
-0.069800 -0.005280 -0.000327  0.005360  0.061200

Coefficients :
             Estimate Std. Error t-value  Pr(>|t|)
log(hwy)    0.5418395  0.0109565 49.4536 < 2.2e-16 ***
log(water)  0.1215676  0.0053719 22.6304 < 2.2e-16 ***
log(util)   0.3909247  0.0065771 59.4368 < 2.2e-16 ***
log(pc)     0.0177190  0.0096372  1.8386 0.0663624 .
log(gsp)    0.0568433  0.0126569  4.4911 8.184e-06 ***
log(emp)   -0.0851515  0.0146508 -5.8121 9.073e-09 ***
log(unemp) -0.0092135  0.0024988 -3.6872 0.0002429 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 9.4468
Residual Sum of Squares: 0.12613
R-Squared : 0.98665
Adj. R-Squared : 0.92015
F-statistic: 8033.41 on 7 and 761 DF, p-value: < 2.22e-16
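
To make the "within" label concrete: the fixed effects estimator is ordinary least squares applied after subtracting each individual's own mean from every variable, which gives the same slope as including one dummy variable per individual. The sketch below checks this on simulated data using base R only; the data, seed and variable names are made up and unrelated to Produc.

```r
set.seed(123)                                 # arbitrary seed
n_id <- 6                                     # number of individuals
n_t  <- 10                                    # time periods per individual
id   <- rep(seq_len(n_id), each = n_t)

alpha <- rnorm(n_id, sd = 3)[id]              # individual effect, constant over time
x <- rnorm(n_id * n_t) + 0.5 * alpha          # regressor correlated with the effect
y <- alpha + 2 * x + rnorm(n_id * n_t, sd = 0.1)   # true slope is 2

# Within transformation: subtract each individual's mean
x_dm <- x - ave(x, id)
y_dm <- y - ave(y, id)
b_within <- unname(coef(lm(y_dm ~ x_dm - 1)))

# Least squares with one dummy per individual gives the same slope
b_lsdv <- unname(coef(lm(y ~ x + factor(id)))["x"])

# Pooled OLS ignores the individual effect, so its slope differs
b_pooled <- unname(coef(lm(y ~ x))["x"])
```

Comparing b_within, b_lsdv and b_pooled shows why the choice between the pooled and fixed models matters when the individual effect is correlated with the regressors.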

Random Model

> random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))

> summary(random)

Oneway (individual) effect Random Effect Model
   (Swamy-Arora's transformation)

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc,
    model = ("random"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Effects:
                    var   std.dev share
idiosyncratic 0.0001657 0.0128743 0.221
individual    0.0005848 0.0241825 0.779
theta: 0.8719

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max.
-0.06500 -0.00624 -0.00195  0.00454  0.06450

Coefficients :
              Estimate Std. Error t-value  Pr(>|t|)
(Intercept)  0.6625006  0.0530786 12.4815 < 2.2e-16 ***
log(hwy)     0.5021294  0.0074551 67.3537 < 2.2e-16 ***
log(water)   0.1191683  0.0049801 23.9289 < 2.2e-16 ***
log(util)    0.3944635  0.0060802 64.8768 < 2.2e-16 ***
log(pc)      0.0101901  0.0075870  1.3431    0.1796
log(gsp)     0.0599363  0.0122997  4.8730 1.323e-06 ***
log(emp)    -0.0767378  0.0125556 -6.1119 1.531e-09 ***
log(unemp)  -0.0034020  0.0022591 -1.5059    0.1325
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Total Sum of Squares: 21.167
Residual Sum of Squares: 0.13965
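
The theta reported above can be reproduced from the two variance components. For a balanced panel with T periods, the standard random effects formula is theta = 1 - sqrt(sigma_e^2 / (sigma_e^2 + T * sigma_u^2)), where sigma_e^2 is the idiosyncratic variance and sigma_u^2 the individual variance. The sketch below just plugs in the rounded numbers printed in the summary, so the result agrees only up to rounding; the variable names are assumptions.

```r
sigma2_idios <- 0.0001657    # idiosyncratic variance, from the summary
sigma2_indiv <- 0.0005848    # individual variance, from the summary
n_periods    <- 17           # T in "Balanced Panel: n=48, T=17, N=816"

# Share of each individual's mean that the random effects
# transformation removes from every variable
theta <- 1 - sqrt(sigma2_idios / (sigma2_idios + n_periods * sigma2_indiv))

# Share of the total error variance due to the individual effect
share_indiv <- sigma2_indiv / (sigma2_idios + sigma2_indiv)

print(theta)        # compare with the reported theta of 0.8719
print(share_indiv)  # compare with the reported share of 0.779
```

A theta close to 1 means the random effects estimates sit close to the fixed effects ones, while a theta of 0 would reduce to the pooled model.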

Pooled vs Fixed

Null Hypothesis: Pooled Model

Alternate Hypothesis : Fixed Model

> pFtest(fixed,pool)

F test for individual effects

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)

F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16

alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the Fixed Model is better than the Pooled Model.
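
The F statistic above can be checked by hand from the two summaries. The test compares the residual sum of squares of the pooled model (restricted) with that of the fixed model (unrestricted): F = ((RSS_pool - RSS_fixed) / df1) / (RSS_fixed / df2). The sketch below uses the rounded values printed earlier, so it matches the reported 56.6361 only approximately; the variable names are assumptions.

```r
rss_pool  <- 0.56734         # Residual Sum of Squares, pooled model
rss_fixed <- 0.12613         # Residual Sum of Squares, fixed model
df1 <- 47                    # 48 states - 1 extra intercepts
df2 <- 761                   # 816 observations - 48 intercepts - 7 slopes

# Gain in fit per extra parameter, relative to the remaining noise
f_stat <- ((rss_pool - rss_fixed) / df1) / (rss_fixed / df2)

# Upper tail probability of the F distribution
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(f_stat)   # compare with the reported F = 56.6361
```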

Pooled vs Random

Null Hypothesis: Pooled Model

Alternate Hypothesis: Random Model

> plmtest(pool)

Lagrange Multiplier Test - (Honda)

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)

normal = 57.1686, p-value < 2.2e-16

alternative hypothesis: significant effects

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the Random Model is better than the Pooled Model.

Random vs Fixed

Null Hypothesis: No correlation between the individual effects and the regressors (Random Model)

Alternate Hypothesis: Fixed Model

> phtest(fixed,random)

Hausman Test

data: log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp)

chisq = 93.546, df = 7, p-value < 2.2e-16

alternative hypothesis: one model is inconsistent

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the Fixed Model is preferred over the Random Model.

Conclusion:

So, after making all the comparisons, we conclude that the Fixed Model is best suited for the panel data analysis of the "Produc" data set.

Hence we conclude that each id, i.e. each "state", has its own individual effect, which does not vary over time within that state.
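
The three comparisons follow a simple decision path, which can be written down as a small helper. This is only a sketch of the reasoning used in this post: the function name choose_panel_model and the 0.05 threshold are assumptions, and the p-values passed in below are the upper bounds printed by the tests.

```r
# Returns the plm model name suggested by the three test p-values
choose_panel_model <- function(p_ftest, p_lmtest, p_hausman, alpha = 0.05) {
  fixed_beats_pooled  <- p_ftest  < alpha     # pFtest: fixed vs pooled
  random_beats_pooled <- p_lmtest < alpha     # plmtest: random vs pooled

  if (!fixed_beats_pooled && !random_beats_pooled) return("pooling")
  if (fixed_beats_pooled && !random_beats_pooled)  return("within")
  if (!fixed_beats_pooled && random_beats_pooled)  return("random")

  # Both beat the pooled model, so the Hausman test decides
  if (p_hausman < alpha) "within" else "random"
}

# All three tests above reported p-value < 2.2e-16
choice <- choose_panel_model(2.2e-16, 2.2e-16, 2.2e-16)
```

With the values from this post the helper returns "within", i.e. the Fixed Model.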