Big Data and the R programming language are related through pbdR, a series of R packages for statistical computing with Big Data using high-performance computation. We discussed the basics of Big Data in our previously published article, What is Big Data, and one practical field of usage in the article Big Data in the Health Sector.
R is an interpreted language; users typically access it through a command-line interpreter. We talked about high-level programming languages earlier and provided an example with the Lisp programming language. R is derived from S, a statistical programming language whose two modern implementations are R and S-PLUS. The R implementation, part of the GNU free software project, is the top statistical language in the TIOBE index (position 18 in September 2013), 3 spots above the commercial software SAS.

So, R is a free programming language for statistical computing and statistical graphics, and because it is based on S it is largely compatible with it. Although R and S are programming languages, they are not like the general-purpose languages we commonly use, such as Perl or PHP; they are built around computational statistics. Advanced users can write C, C++ or Java code to manipulate R objects directly, which is what software developers usually do.

If a user types “6+4” at the R command prompt and presses enter, the computer replies with “10”, as shown below:
    > 6+4
    [1] 10
With this minimal background, we will start our topic, Big Data and R Programming Language, divided into two sections.
---
Big Data and R Programming Language : What R Provides
R is part of the GNU project and is available on many platforms. It is increasingly seen as the default language for statistical problems in both the commercial and the scientific field, although SAS is also very popular, mainly in the commercial sector.
R's features can be extended through a variety of packages and adapted to specific statistical problems. Many packages can be selected directly from a list retrievable in the R console and installed automatically. The central archive for these packages is the Comprehensive R Archive Network (CRAN). Bioconductor, which is based on R, provides extensions for bioinformatics, in particular for the analysis of gene expression data. As of January 2014 there are over 4200 packages on CRAN and more than 600 packages on Bioconductor.
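Installing a package from CRAN can be sketched roughly as follows. The helper name `ensure_package` is hypothetical (it is not part of R), and the mirror URL is only one possible choice; `requireNamespace()` and `install.packages()` are the standard R functions doing the work.

```r
# Hypothetical helper (not part of R): install a package from CRAN only
# if it is not already available, then report whether it can be loaded.
ensure_package <- function(pkg, repos = "https://cloud.r-project.org") {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = repos)
  }
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with every R installation, so no download is triggered here.
has_stats <- ensure_package("stats")
print(has_stats)

# For a package hosted on CRAN, such as ff, the call would be:
# ensure_package("ff")
```

Checking first and installing only when needed keeps the script safe to run repeatedly.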
With PL/R, R can also be used as an extension of PostgreSQL for server-side programming. R’s data structures include scalars, vectors, matrices, data frames (similar to tables in a relational database) and lists. R’s extensible object system includes objects for, among others, regression models, time series and geo-spatial coordinates.
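The data structures and objects listed above can be illustrated with a minimal base R sketch. All variable names and values are made up for illustration; `cars` is a small example dataset that ships with R.

```r
# Scalar: in R this is simply a vector of length one
x <- 10

# Vector: elements of a single type
v <- c(2, 4, 6, 8)

# Matrix: a vector with two dimensions, filled column by column
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns of equal length, possibly of different types
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 score = c(90.5, 72.0, 88.25))

# List: elements of arbitrary type and length
l <- list(number = x, values = v, table = df)

# Objects: a fitted regression model and a time series
fit <- lm(dist ~ speed, data = cars)
series <- ts(c(5, 7, 9, 11), start = 2010, frequency = 1)

print(dim(m))
print(class(fit))
print(class(series))
```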
Reproducible research and automated report generation can be accomplished with packages that support execution of R code embedded within LaTeX, OpenDocument format and other markups.
Big Data and R Programming Language : More on R
The package jit provides JIT-compilation (Just-in-time compilation) and the package compiler offers a byte-code compiler for R. The packages snow, multicore, and parallel provide parallelism for R. The package ff saves memory by storing data on disk. The data structures behave as if they were in RAM. The package ffbase provides basic statistical functions for ‘ff’.
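As a rough illustration of two of these packages, the sketch below byte-compiles a function with `compiler` and runs it on several inputs with `parallel`. Both packages ship with R. The function `sum_squares` and the choice of two worker processes are assumptions made for the example.

```r
library(compiler)  # byte-code compiler, ships with R
library(parallel)  # parallel execution, ships with R

# Illustrative function: sum of squares with an explicit loop
sum_squares <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

# Byte-compile the function; the result is called like the original
sum_squares_bc <- cmpfun(sum_squares)

# Run the function on several inputs using two worker processes
cl <- makeCluster(2)
results <- parSapply(cl, c(10, 100, 1000), sum_squares)
stopCluster(cl)

print(sum_squares_bc(100))
print(results)
```

Recent R versions byte-compile functions automatically, so the explicit `cmpfun()` call may make little difference there.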
Programming with Big Data in R (pbdR) is a series of R packages and an environment for statistical computing with Big Data by utilizing high-performance statistical computation.
pbdR uses the same programming language as R, with the S3/S4 classes and methods that statisticians and data miners use for developing statistical software. The significant difference between pbdR and R code is that pbdR mainly targets distributed-memory systems, where data are distributed across several processors and analyzed in batch mode, and where communication between processors is based on MPI, which is readily available on large high-performance computing (HPC) systems. R itself mainly targets single multi-core machines, with data analysis done interactively, for example through a GUI.
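The distributed pattern can be imitated in plain R to show the idea. This is a single-process sketch only, not pbdR code: no MPI is involved, and the number of "ranks" and the data are made up. In a real pbdR program each processor would hold only its own chunk, and the combining step would be an MPI communication call.

```r
# Single-process imitation of the distributed pattern: each "rank" owns one
# chunk of the data, computes a local summary, and the summaries are combined.
set.seed(1)
data <- rnorm(1000)
n_ranks <- 4  # hypothetical number of processors

# Distribute: assign each observation to one rank
chunks <- split(data, rep(seq_len(n_ranks), length.out = length(data)))

# Local step: every rank summarises only its own chunk
local_sums   <- vapply(chunks, sum, numeric(1))
local_counts <- vapply(chunks, length, integer(1))

# Communication step: combine the partial results (a reduction in MPI terms)
global_mean <- sum(local_sums) / sum(local_counts)

print(global_mean)
print(all.equal(global_mean, mean(data)))
```

Only the small per-rank summaries are combined, never the full data, which is why this pattern scales to data that does not fit on one machine.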
Sensing a growing interest in big data-style analysis, software provider Revolution Analytics has updated its flagship package of R statistical functions so it can be run with the Hadoop data processing platform.
We use GeSHi for syntax highlighting, so a snippet of R pasted here looks like this:
    library(caTools)  # external package providing write.gif function
    jet.colors
The Inside-R site has a modified syntax highlighter with hyperlinked syntax:

    http://www.inside-r.org/pretty-r/tool
There are projects available on GitHub:

    https://github.com/nachocab/clickme
    https://github.com/alexgutteridge/rsruby
There is also the Enhanced-R package for Sublime Text, iTerm2 and other commonly used Mac software:

    https://github.com/randy3k/Enhanced-R