Saturday, 23 March 2013

Google Refine


Data Cleansing

Data cleansing means identifying wrong or inaccurate records in a data set and correcting them. It involves finding incomplete, inaccurate, or incorrect parts of the data and then either replacing them with correct data or deleting them. The result is data that is consistent with other standard data sets and suitable for analysis. Errors in the data can come from data entry mistakes by the user, failures during transmission, or improper data definitions.
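To make the idea concrete, here is a minimal Python sketch of the replace-or-delete decision described above. The field names (`name`, `age`), the age range, and the `cleanse` helper are assumptions made up for illustration; they are not part of Google Refine.

```python
def cleanse(records):
    """Split raw records into cleaned rows and rejected rows."""
    clean, rejected = [], []
    for rec in records:
        name = (rec.get("name") or "").strip()
        age_raw = str(rec.get("age", "")).strip()

        # Incomplete record: without a name there is nothing to correct, so drop it.
        if not name:
            rejected.append(rec)
            continue

        # Inaccurate value: keep the record but blank out an implausible age.
        if age_raw.isdigit() and 0 < int(age_raw) < 120:
            age = int(age_raw)
        else:
            age = None

        clean.append({"name": name, "age": age})
    return clean, rejected


records = [
    {"name": " Alice ", "age": "34"},
    {"name": "Bob", "age": "-5"},
    {"name": "", "age": "40"},
]
clean, rejected = cleanse(records)
print(clean)
print(rejected)
```

Here the first record is trimmed, the second keeps its name but loses the impossible age, and the third is rejected because it is incomplete.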

Google Refine:

Google Refine is a web application but, unlike most web applications, it is intended to run on your own machine and be used by you alone. The server side maintains the state of the data (undo/redo history, long-running processes, etc.), while the client side maintains the state of the user interface (facets and their selections, view pagination, etc.). The client side makes GET and POST AJAX calls to change the data and to fetch data and data-related state from the server side.

Google Refine is a powerful tool for cleansing data effectively from the browser. Its main features are:
·         Pulling data from various sources
·         Cleaning the data using Transform/Clusters/Filters
·         Fetching more useful data from web URLs
·         Connecting to various databases to reconcile the collected data

A few snapshots of the project:

It can load multiple files at the same time, from almost any source and in practically any format:


Data loaded in Google Refine:



One important aspect of Google Refine is faceting. Faceting is about seeing the big picture and then filtering down to the rows on which a bulk update is to be performed.
We can apply text, numeric, timeline, and scatterplot facets, and we can also design custom facets.
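A text facet boils down to counting the distinct values in a column and then narrowing the rows to the values you select. The sketch below imitates that in plain Python; `text_facet`, `filter_by_facet`, and the sample rows are hypothetical names and data for illustration, not how Refine implements facets.

```python
from collections import Counter


def text_facet(rows, column):
    """Count how often each distinct value appears in a column."""
    return Counter(row[column] for row in rows)


def filter_by_facet(rows, column, selected):
    """Keep only the rows whose value in `column` is one of the selected choices."""
    return [row for row in rows if row[column] in selected]


rows = [
    {"city": "Pune", "sales": 10},
    {"city": "Mumbai", "sales": 7},
    {"city": "pune", "sales": 3},
    {"city": "Pune", "sales": 5},
]

# The big picture: which values exist, and how many rows carry each one.
print(text_facet(rows, "city").most_common())

# Select the odd-looking choice and bulk-update only those rows.
for row in filter_by_facet(rows, "city", {"pune"}):
    row["city"] = "Pune"

print(text_facet(rows, "city").most_common())
```

The counts make the inconsistent spelling stand out, and the filter restricts the update to exactly the rows that need it.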




Redundant or duplicate records can also be removed.
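As a rough illustration of duplicate removal, the following Python sketch keeps the first row for each combination of key columns and drops later repeats. The `remove_duplicates` helper and the column names are assumptions for this example only.

```python
def remove_duplicates(rows, key_columns):
    """Keep the first row seen for each combination of key column values."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "a@example.com"},  # same email as the first row
]
print(remove_duplicates(rows, ["email"]))
```

Which columns define a duplicate is a judgment call that depends on the data set; here it is the email address alone.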



Clustering is used to find and merge values that look similar.
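One way to find such look-alike values is to reduce each value to a normalized key and group the values that share a key. The Python sketch below follows the spirit of a key-collision "fingerprint" approach; it is a simplified illustration with made-up helper names, not Refine's actual implementation (for instance, it does not normalize accented characters).

```python
import string
from collections import defaultdict


def fingerprint(value):
    """Normalize a value: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group distinct values that share a fingerprint; skip singletons."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


values = ["Tata Motors", "tata motors ", "Motors, Tata", "Infosys"]
for group in cluster(values):
    print(group)
```

Each printed group is a set of candidates that a person would then review and merge into a single spelling.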


Expressions can also be used to transform the values in a column.
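Refine has its own expression language, in which the current cell is available as `value`. The Python sketch below only mimics the idea of applying one expression to every cell of a column; `transform_column` and the sample data are hypothetical and stand in for what the tool does internally.

```python
def transform_column(rows, column, expression):
    """Return new rows with `expression` applied to every cell of `column`."""
    return [{**row, column: expression(row[column])} for row in rows]


rows = [{"name": "  alice SMITH "}, {"name": "bob  jones"}]

# Collapse runs of whitespace, then convert to title case.
cleaned = transform_column(
    rows, "name", lambda value: " ".join(value.split()).title()
)
print(cleaned)
```

The original rows are left untouched, which makes it easy to compare before and after.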


Reconciliation goes a step beyond cleansing: it matches the data against a freely available online database (Freebase) to get more information about the data.



We can also enrich the data



Some of the advantages of Google Refine are:
•         Ease of use
•         Works in any browser
•         Extensive functionality
•         Undo/Redo is simply awesome
