Thursday, 28 March 2013

ITBAL ASSIGNMENT 9

Assignment 1:
Create 3 vectors x, y, z of equal length, filled with random values.
T <- cbind(x, y, z)
Create a 3D plot and colour-code it.


> datasample<- rnorm(76,52,34)
> datasample


 [1]  45.5942618  19.9666328  57.2714050  74.3639975  40.3385894   0.6364362  74.7431400 115.0645893  25.1070366  43.4188552  63.1523647  23.6176277  67.2653427
[14]  26.6219515  49.4166552  28.8174814  57.3325996  97.7983760  26.1854705  78.6355984  73.4513727 -27.5435911  68.3513575  17.8148926  67.5888141  67.6574094
[27]  -3.3454215  34.3924609  18.6275779 115.7888569  40.8016886 102.1727834  98.8360943  64.2131752  47.0419500  43.7890566  47.7691718 104.9063673  86.8813250
[40]  67.8444825  81.9504910  46.9187240  79.7412471  28.9433658  43.8613800  32.6041700  63.0511048  55.7572267  36.0072830  79.1773860  71.5129731  40.6598254
[53]  31.8630136  86.6888350  38.7001328  58.0645384  63.0741548 107.4190386  52.7611391   7.0894044  27.9309929  84.1651360 -42.5948954  39.2538138  -2.4216676
[66]  56.2117049  65.6911771  73.2045687  91.1742967  58.5928354 116.0099456  25.0099455  30.3843329  -1.5303707  50.1393531  95.2141598

> x<- sample(datasample,20)
> y<- sample(datasample,20)
> z<- sample(datasample,20)
> x


 [1]  71.51297  79.17739  97.79838  28.81748  30.38433  58.06454  31.86301  67.26534  25.00995  43.86138  28.94337  56.21170  73.45137  95.21416  26.62195  84.16514
[17]  23.61763 104.90637  40.80169 116.00995
> y
 [1]  91.17430  84.16514  73.20457 104.90637  67.84448  32.60417 107.41904 102.17278  18.62758 -27.54359  73.45137  47.04195  63.07415 115.06459  65.69118  28.94337
[17]  78.63560  56.21170  39.25381  19.96663
> z
 [1]  46.9187240  67.5888141  43.7890566 116.0099456  38.7001328  25.1070366  67.8444825 107.4190386  45.5942618  -1.5303707  63.1523647  47.0419500 115.7888569
[14]  57.3325996  68.3513575  79.7412471   0.6364362  67.6574094  58.0645384  73.2045687
> T<-cbind(x,y,z)
> T

              x         y           z
 [1,]  71.51297  91.17430  46.9187240
 [2,]  79.17739  84.16514  67.5888141
 [3,]  97.79838  73.20457  43.7890566
 [4,]  28.81748 104.90637 116.0099456
 [5,]  30.38433  67.84448  38.7001328
 [6,]  58.06454  32.60417  25.1070366
 [7,]  31.86301 107.41904  67.8444825
 [8,]  67.26534 102.17278 107.4190386
 [9,]  25.00995  18.62758  45.5942618
[10,]  43.86138 -27.54359  -1.5303707
[11,]  28.94337  73.45137  63.1523647
[12,]  56.21170  47.04195  47.0419500
[13,]  73.45137  63.07415 115.7888569
[14,]  95.21416 115.06459  57.3325996
[15,]  26.62195  65.69118  68.3513575
[16,]  84.16514  28.94337  79.7412471
[17,]  23.61763  78.63560   0.6364362
[18,] 104.90637  56.21170  67.6574094
[19,]  40.80169  39.25381  58.0645384
[20,] 116.00995  19.96663  73.2045687


> library(rgl)
> plot3d(T)


> plot3d(T,col=rainbow(1000))


> plot3d(T,col=rainbow(1000),type='s')
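
A minimal sketch of one way to colour-code the points, assuming the colour should follow the z value (the assignment does not say which variable drives the colour). The names pts, n_bins and point_cols and the seed are illustrative, and the rgl call is skipped when the package is missing or the session is not interactive.

```r
set.seed(1)                          # illustrative seed so the sample is repeatable
datasample <- rnorm(76, mean = 52, sd = 34)
pts <- cbind(x = sample(datasample, 20),
             y = sample(datasample, 20),
             z = sample(datasample, 20))

# Bin z into n_bins equal-width intervals and give each bin its own colour
n_bins <- 5
bin <- cut(pts[, "z"], breaks = n_bins, labels = FALSE)
point_cols <- rainbow(n_bins)[bin]

# Draw only when rgl is available and a display can be opened
if (interactive() && requireNamespace("rgl", quietly = TRUE)) {
  rgl::plot3d(pts, col = point_cols, type = "s")
}
```

With one colour per point, points that share a colour lie in the same band of z values, which makes the colour coding easier to read than a long palette.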




Assignment 2:
Read the documentation of rnorm and pnorm.
Create two random variables x and y which are normally distributed.
Plot x vs y.
Plot x vs y after introducing a categorical variable with 5 different categories, combined using cbind.
Get the colour code.
Get the smooth curve.


> ?rnorm
starting httpd help server ... done

http://127.0.0.1:31992/library/stats/html/Normal.html
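
The help page covers the whole family of normal-distribution functions. A short sketch of how they relate, using the same mean and standard deviation as the x variable below; the seed and variable names are illustrative.

```r
set.seed(42)                                          # illustrative seed
draws <- rnorm(200, mean = 90, sd = 10)               # 200 random draws

p_below_100 <- pnorm(100, mean = 90, sd = 10)         # P(X <= 100)
q_back      <- qnorm(p_below_100, mean = 90, sd = 10) # inverse of pnorm
d_at_mean   <- dnorm(90, mean = 90, sd = 10)          # density at the mean

# Empirical share of draws at or below 100, to compare with p_below_100
emp_share <- mean(draws <= 100)
```

Here rnorm generates values, pnorm gives cumulative probabilities, qnorm gives quantiles and dnorm gives the density.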

> x <- rnorm(200,90,10)
> y <- rnorm(200,50,10)
> z1<-sample(letters,5)
> z2<-sample(z1,200,replace=TRUE)
> z<-as.factor(z2)
> T<-cbind(x,y,z)
> library(ggplot2)
> qplot(x,y)




> qplot(x,z)




> qplot(x,z,alpha=I(1/10))



> qplot(x,y,geom=c("point","smooth"))



> qplot(x,y,colour=z)




> qplot(log(x),log(y),colour=z)
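
Two side notes on the commands above, as a hedged sketch rather than a required change. First, cbind on two numeric vectors and a factor returns a numeric matrix, so the factor is stored as integer codes; a data frame keeps it as a factor. Second, qplot is deprecated in recent ggplot2 releases, and the same plot can be written with ggplot(). The seed and the names m, df and p are illustrative, and the plotting part is skipped if ggplot2 is not installed.

```r
set.seed(7)                            # illustrative seed
x <- rnorm(200, 90, 10)
y <- rnorm(200, 50, 10)
z <- factor(sample(sample(letters, 5), 200, replace = TRUE))

m  <- cbind(x, y, z)          # numeric matrix: z becomes integer codes
df <- data.frame(x, y, z)     # data frame: z stays a factor

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(df, aes(x, y)) +
    geom_point(aes(colour = z)) +                    # colour code by category
    geom_smooth(method = "loess", formula = y ~ x)   # one smooth curve over all points
  # print(p) draws the plot in an interactive session
}
```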






Saturday, 23 March 2013

Google Refine


Data Cleansing

Data cleansing means identifying wrong or inaccurate records in a data set and correcting them. It involves finding the incomplete, inaccurate and incorrect parts of the data and then either replacing them with correct data or deleting them. The result is data that is consistent with other standard data sets and is suitable for analysis. Errors in the data can come from data entry mistakes by the user, failures during transmission, or improper data definitions.
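
To make the idea concrete, here is a minimal sketch of the same steps in R on a made-up table. The data, the column names and the choice to delete (rather than repair) incomplete rows are all assumptions for illustration.

```r
raw <- data.frame(
  city  = c(" Mumbai", "mumbai ", "Delhi", "Delhi", "Pune"),
  sales = c("120", "120", "95", "95", "n/a"),
  stringsAsFactors = FALSE
)

clean <- raw
clean$city  <- tolower(trimws(clean$city))                # replace with a consistent spelling
clean$sales <- suppressWarnings(as.numeric(clean$sales))  # "n/a" becomes NA
clean <- clean[complete.cases(clean), ]                   # delete incomplete records
clean <- unique(clean)                                    # delete duplicate records
```

Five raw rows collapse to two clean ones: one for mumbai and one for delhi.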

Google Refine :

Google Refine is a web application, but unlike most web applications, it is intended to be run on one's own machine and used by oneself. The server-side maintains states of the data (undo/redo history, long-running processes, etc.) while the client-side maintains states of the user interface (facets and their selections, view pagination, etc.). The client-side makes GET and POST ajax calls to cause changes to the data and to fetch data and data-related states from the server-side.

Google Refine is a powerful tool for cleansing data effectively. Its main features are:
·         Pulling data from various sources
·         Cleaning the data using Transform/Clusters/Filters
·         Linking to web URLs to get more useful data
·         Connecting to various databases to reconcile the collected data

A few snapshots of the project:

It can load multiple files at the same time, from any source and in practically any format:


Data loaded in Google Refine :



One important aspect of Google Refine is faceting. Faceting is about seeing the big picture and filtering the rows on which a bulk update is to be performed.
We can apply text, numeric, timeline and scatterplot facets. We can also design customized facets.
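
A rough analogy in R for what a text facet does, on made-up values: count each distinct value, select one, and apply a bulk edit to the selected rows only. This is a sketch of the idea, not of how Google Refine implements it.

```r
status <- c("open", "closed", "Open", "open", "closed", "pending")

# Text facet: each distinct value with its row count
facet_counts <- sort(table(status), decreasing = TRUE)

# Selecting a facet value filters the rows; the bulk edit then touches only those
selected <- status == "Open"
status[selected] <- "open"
```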




Any redundancy or duplicates can also be removed.



Clustering is used to merge choices that look similar.
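
One way such clustering can work is key collision: reduce every value to a normalized key and group the values that share a key. The sketch below is written in the spirit of the "fingerprint" keying method; the function name, the exact normalization steps and the sample values are assumptions, not the tool's actual code.

```r
fingerprint <- function(s) {
  s <- tolower(trimws(s))
  s <- gsub("[[:punct:]]", "", s)            # drop punctuation
  tokens <- strsplit(s, "[[:space:]]+")      # split on runs of whitespace
  # sort the distinct tokens so that word order does not matter
  vapply(tokens, function(tk) paste(sort(unique(tk)), collapse = " "),
         character(1))
}

values   <- c("Acme Inc.", "acme inc", "Inc Acme", "Apex Ltd", "apex  ltd.")
clusters <- split(values, fingerprint(values))
```

Each element of clusters holds the spellings that would be offered for merging into a single value.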


Expressions can also be used:


Reconciliation goes a step beyond cleansing: it matches the data against a freely available online database (Freebase) to get more information about the records present.



We can also enrich the data



Some of the advantages of Google Refine are:
•         Ease of use
•         Works in any browser
•         Extensive functionality
•         Undo/Redo is simply awesome

Thursday, 14 March 2013

Assignment 8 : Panel Data Analysis

Do a panel data analysis of the "Produc" data using three types of model:
      Pooled model
      Fixed effects model
      Random effects model

Determine which model is best using the functions:
       pFtest: to choose between fixed and pooled
       plmtest: to choose between pooled and random
       phtest: to choose between random and fixed
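
The whole procedure can be sketched in a few lines. The helper choose_model is hypothetical and encodes a simplified decision rule (if neither effects test rejects, keep the pooled model; otherwise let the Hausman test decide). The plm part runs only if the package is installed.

```r
# Hypothetical helper: pick a model from the three test p-values
choose_model <- function(p_ftest, p_lm, p_hausman, alpha = 0.05) {
  if (p_ftest >= alpha && p_lm >= alpha) return("pooled")
  if (p_hausman < alpha) return("fixed")   # random effects rejected
  "random"
}

if (requireNamespace("plm", quietly = TRUE)) {
  library(plm)
  data("Produc", package = "plm")
  f <- log(pcap) ~ log(hwy) + log(water) + log(util) +
    log(pc) + log(gsp) + log(emp) + log(unemp)
  idx    <- c("state", "year")
  pool   <- plm(f, data = Produc, model = "pooling", index = idx)
  fixed  <- plm(f, data = Produc, model = "within",  index = idx)
  random <- plm(f, data = Produc, model = "random",  index = idx)
  best <- choose_model(pFtest(fixed, pool)$p.value,
                       plmtest(pool)$p.value,
                       phtest(fixed, random)$p.value)
}
```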


> library(plm)
> data(Produc , package ="plm")

>  head(Produc)
    state year     pcap     hwy   water    util       pc   gsp    emp unemp
1 ALABAMA 1970 15032.67 7325.80 1655.68 6051.20 35793.80 28418 1010.5   4.7
2 ALABAMA 1971 15501.94 7525.94 1721.02 6254.98 37299.91 29375 1021.9   5.2
3 ALABAMA 1972 15972.41 7765.42 1764.75 6442.23 38670.30 31303 1072.3   4.7
4 ALABAMA 1973 16406.26 7907.66 1742.41 6756.19 40084.01 33430 1135.5   3.9
5 ALABAMA 1974 16762.67 8025.52 1734.85 7002.29 42057.31 33749 1169.8   5.5
6 ALABAMA 1975 17316.26 8158.23 1752.27 7405.76 43971.71 33604 1155.4   7.7

Pooled Model

> pool <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("pooling"), index = c("state","year"))
> summary(pool)

Oneway (individual) effect Pooling Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) + 
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, 
    model = ("pooling"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-0.04950 -0.01940 -0.00412  0.01150  0.08690 

Coefficients :
              Estimate Std. Error  t-value  Pr(>|t|)    
(Intercept)  0.7496721  0.0271054  27.6577 < 2.2e-16 ***
log(hwy)     0.5248704  0.0048326 108.6099 < 2.2e-16 ***
log(water)   0.1077579  0.0040454  26.6370 < 2.2e-16 ***
log(util)    0.4127255  0.0038337 107.6574 < 2.2e-16 ***
log(pc)     -0.0330829  0.0048219  -6.8610 1.361e-11 ***
log(gsp)     0.0758341  0.0108650   6.9797 6.170e-12 ***
log(emp)    -0.0891772  0.0076891 -11.5978 < 2.2e-16 ***
log(unemp)   0.0043878  0.0029465   1.4891    0.1368    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Total Sum of Squares:    724.14
Residual Sum of Squares: 0.56734
R-Squared      :  0.99922 
      Adj. R-Squared :  0.98942 
F-statistic: 147217 on 7 and 808 DF, p-value: < 2.22e-16

Fixed Model


> fixed <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("within"), index = c("state","year"))
> summary(fixed)

Oneway (individual) effect Within Model

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) + 
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, 
    model = ("within"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Residuals :
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-0.069800 -0.005280 -0.000327  0.005360  0.061200 

Coefficients :
             Estimate Std. Error t-value  Pr(>|t|)    
log(hwy)    0.5418395  0.0109565 49.4536 < 2.2e-16 ***
log(water)  0.1215676  0.0053719 22.6304 < 2.2e-16 ***
log(util)   0.3909247  0.0065771 59.4368 < 2.2e-16 ***
log(pc)     0.0177190  0.0096372  1.8386 0.0663624 .  
log(gsp)    0.0568433  0.0126569  4.4911 8.184e-06 ***
log(emp)   -0.0851515  0.0146508 -5.8121 9.073e-09 ***
log(unemp) -0.0092135  0.0024988 -3.6872 0.0002429 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Total Sum of Squares:    9.4468
Residual Sum of Squares: 0.12613
R-Squared      :  0.98665 
      Adj. R-Squared :  0.92015 
F-statistic: 8033.41 on 7 and 761 DF, p-value: < 2.22e-16
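
The "within" model has no intercept in its coefficient table because each state's own mean is subtracted from every variable before estimation, which removes anything that is constant within a state. A small base R sketch on simulated data (all names and numbers are illustrative) shows that this gives the same slope as ordinary least squares with one dummy variable per unit.

```r
set.seed(3)                               # illustrative seed
n_id <- 6; n_t <- 10
id    <- rep(seq_len(n_id), each = n_t)
alpha <- rnorm(n_id, sd = 2)[id]          # unit-specific, time-invariant effect
x     <- rnorm(n_id * n_t) + alpha        # regressor correlated with the effect
y     <- 1.5 * x + alpha + rnorm(n_id * n_t, sd = 0.1)

# Within transformation: subtract each unit's own mean
x_w <- x - ave(x, id)
y_w <- y - ave(y, id)
b_within <- coef(lm(y_w ~ x_w - 1))[["x_w"]]

# Same slope from OLS with one dummy per unit
b_lsdv <- coef(lm(y ~ x + factor(id)))[["x"]]
```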

Random Model

> random <- plm(log(pcap)~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) + log(emp) + log(unemp) , data =Produc, model=("random"), index = c("state","year"))
> summary(random)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = log(pcap) ~ log(hwy) + log(water) + log(util) + 
    log(pc) + log(gsp) + log(emp) + log(unemp), data = Produc, 
    model = ("random"), index = c("state", "year"))

Balanced Panel: n=48, T=17, N=816

Effects:
                    var   std.dev share
idiosyncratic 0.0001657 0.0128743 0.221
individual    0.0005848 0.0241825 0.779
theta:  0.8719  

Residuals :
    Min.  1st Qu.   Median  3rd Qu.     Max. 
-0.06500 -0.00624 -0.00195  0.00454  0.06450 

Coefficients :
              Estimate Std. Error t-value  Pr(>|t|)    
(Intercept)  0.6625006  0.0530786 12.4815 < 2.2e-16 ***
log(hwy)     0.5021294  0.0074551 67.3537 < 2.2e-16 ***
log(water)   0.1191683  0.0049801 23.9289 < 2.2e-16 ***
log(util)    0.3944635  0.0060802 64.8768 < 2.2e-16 ***
log(pc)      0.0101901  0.0075870  1.3431    0.1796    
log(gsp)     0.0599363  0.0122997  4.8730 1.323e-06 ***
log(emp)    -0.0767378  0.0125556 -6.1119 1.531e-09 ***
log(unemp)  -0.0034020  0.0022591 -1.5059    0.1325    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Total Sum of Squares:    21.167
Residual Sum of Squares: 0.13965
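
The theta value in the Effects block can be reproduced from the two variance components with the usual formula for a balanced panel, theta = 1 - sqrt(var_idiosyncratic / (T * var_individual + var_idiosyncratic)). The check below uses the rounded figures printed above, so it agrees with the reported value only to about three decimals.

```r
sigma2_e <- 0.0001657   # idiosyncratic variance from the summary
sigma2_u <- 0.0005848   # individual (state) variance from the summary
T_len    <- 17          # years per state in the balanced panel

theta <- 1 - sqrt(sigma2_e / (T_len * sigma2_u + sigma2_e))
share_individual <- sigma2_u / (sigma2_u + sigma2_e)

round(theta, 3)              # close to the reported 0.8719
round(share_individual, 3)   # the reported share is 0.779
```

A theta close to 1 means the random effects estimates sit close to the within (fixed) estimates; a theta close to 0 would put them close to the pooled estimates.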

Pooled vs Fixed 

Null Hypothesis: Pooled Model
Alternate Hypothesis : Fixed Model

>  pFtest(fixed,pool)

        F test for individual effects

data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp) 
F = 56.6361, df1 = 47, df2 = 761, p-value < 2.2e-16
alternative hypothesis: significant effects 

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the Fixed Model is better than the Pooled Model.
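
The F statistic reported by pFtest can be checked by hand from the residual sums of squares in the two summaries above. The figures below are the rounded ones printed there, so the result matches only approximately.

```r
rss_pool  <- 0.56734   # residual sum of squares, pooled model
rss_fixed <- 0.12613   # residual sum of squares, within model
df1 <- 48 - 1          # restrictions: 48 state intercepts replaced by one
df2 <- 816 - 48 - 7    # observations minus state effects minus slopes

f_stat  <- ((rss_pool - rss_fixed) / df1) / (rss_fixed / df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)
```

The degrees of freedom agree with the df1 = 47 and df2 = 761 in the test output.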

Pooled vs Random 

Null Hypothesis: Pooled Model
Alternate Hypothesis: Random Model

>  plmtest(pool)

        Lagrange Multiplier Test - (Honda)

data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp) 
normal = 57.1686, p-value < 2.2e-16
alternative hypothesis: significant effects 

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the Random Model is better than the Pooled Model.

Random vs Fixed 

Null Hypothesis: Random Model (the individual effects are not correlated with the regressors)
Alternate Hypothesis: Fixed Model

> phtest(fixed,random)

        Hausman Test

data:  log(pcap) ~ log(hwy) + log(water) + log(util) + log(pc) + log(gsp) +      log(emp) + log(unemp) 
chisq = 93.546, df = 7, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent 

Since the p-value is negligible, we reject the null hypothesis and accept the alternative: the Fixed Model is preferred over the Random Model.

Conclusion: 

After making all the comparisons, we conclude that the Fixed Model is best suited for the panel data analysis of the "Produc" data set.

Hence we conclude that the unobserved effect is specific to each "state": it differs across states but does not vary within a state over time.