7 Useful R Packages for analysis

Shaivya Kodan
8 min read · Sep 30, 2021

R is one of the most popular programming languages. It is mainly used for statistics and data analysis. R’s growth has been tremendous over the past years, and its contribution to the data science space is promising too.

R is heavily used in academia and the healthcare sector, followed by the consulting and insurance industries. Let’s dive into some of the useful packages in R. These packages are chosen for their functionality and popularity. You may find that some of them are not widely used or do not make the list of the most downloaded packages. The aim is to get to know the packages that help in conducting an end-to-end analysis.

Let’s look into their usage and functionality with examples ✌🏼

1️⃣ dplyr

The data frame is a key data structure in R, so it’s important that we have good tools for dealing with it. dplyr is one of the most powerful and widely used R packages for transforming and summarising data. It contains a set of functions that perform common data manipulation operations such as filtering rows, selecting columns, reordering rows, adding new columns and summarising data. It is often used for exploratory analysis. It was designed by Hadley Wickham, chief scientist at RStudio. The most distinctive thing about this package is that it provides its functions as verbs, such as select, arrange, rename and mutate. Every verb follows the same pattern: the first argument is a data frame, the subsequent arguments describe what to do with it, and the return value is a new data frame.

Let’s go over a code snippet to see how these functions work and what the code looks like.
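A minimal sketch of such a pipeline, assuming a small made-up data frame named air with pm25 and o3 columns (hypothetical stand-ins for real air-quality data; cut() and quantile() come from base R):

```r
library(dplyr)

set.seed(42)

# Hypothetical stand-in data: 100 simulated daily readings
air <- data.frame(
  pm25 = runif(100, min = 5, max = 60),
  o3   = runif(100, min = 10, max = 50)
)

# Cut points that split pm25 into five equally sized bins
qq <- quantile(air$pm25, probs = seq(0, 1, 0.2), na.rm = TRUE)

result <- air %>%
  mutate(pm25.quint = cut(pm25, qq, include.lowest = TRUE)) %>%  # new variable
  group_by(pm25.quint) %>%                                       # group by it
  summarize(o3 = mean(o3, na.rm = TRUE))                         # mean per group

print(result)
```

The result has one row per pm25 bin, with the mean o3 value for that bin.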

This example looks complicated, but let’s break it down. The snippet uses mutate(), group_by() and summarize(), all part of the dplyr package. mutate() creates a new variable, pm25.quint here, that holds the quantile bin of each observation. group_by() groups the data by that new variable, and summarize() then computes the mean within each group. We also notice the pipe operator %>%, which dplyr re-exports from the magrittr package and which strings multiple functions together in sequence. This helps avoid long nested lines of code and temporary variables along the way. The package is widely used and very important, and it can express most SQL-style queries.

You can find the dplyr documentation here : https://www.rdocumentation.org/packages/dplyr/versions/0.7.8

dplyr Cheat sheet : https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

2️⃣ lubridate

We can all agree that working with date/time data is quite tricky. The challenge lies in parsing it into the correct or required format and in extracting information from it. Along with this comes the complexity of dealing with time zones and intervals. The lubridate package simplifies all of this. Let’s look at a few examples to see how this package does wonders with dates and times. The package was written by Garrett Grolemund and Hadley Wickham.

Remember there are three types of date/time data: a date, a time and a date-time.

lubridate provides a few functions, such as ymd(), mdy() and dmy(), to parse a date into the required format.
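As an illustration of parsing, here is a small sketch. The input strings are made up for the example; each function is named after the order in which the year, month and day appear in the input.

```r
library(lubridate)

# The same calendar date written in three different orders
d1 <- ymd("2021-09-30")
d2 <- mdy("09/30/2021")
d3 <- dmy("30.09.2021")

# A date-time, parsed with an explicit time zone
dt <- ymd_hms("2021-09-30 14:05:00", tz = "UTC")

print(c(d1, d2, d3))
print(dt)
```

All three calls return the same Date value, whatever separator or order the input used.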

lubridate functions can also pull out the individual parts of a date, such as the year, the month, the day of the month and even the numeric weekday.
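A short sketch of these accessor functions, using an arbitrary example date. The week_start argument is set explicitly here so that the weekday number does not depend on local options.

```r
library(lubridate)

x <- ymd("2021-09-30")

yr  <- year(x)                 # calendar year
mon <- month(x)                # month as a number
dom <- mday(x)                 # day of the month
dow <- wday(x, week_start = 7) # weekday as a number, with Sunday = 1

print(c(year = yr, month = mon, day = dom, weekday = dow))
```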

The package also lets us work with intervals of time. This is quite useful when dealing with date-time data: whatever date and time formats you pass to the interval function, it outputs a consistent interval. Other functions work with intervals too, such as setdiff, which finds the part of one interval that is not in the other, and int_overlaps, which returns TRUE or FALSE, and many more.
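The interval functions mentioned here can be sketched as follows. The dates are invented for illustration, and the two intervals are deliberately built from differently formatted inputs.

```r
library(lubridate)

# First interval built from two dates
i1 <- interval(ymd("2021-01-01"), ymd("2021-06-30"))

# Second interval built from a date-time and a differently formatted date
i2 <- interval(ymd_hms("2021-03-01 00:00:00"), mdy("12/31/2021"))

# Do the two intervals share any time?
overlap <- int_overlaps(i1, i2)

# The part of i1 that is not covered by i2
only_in_first <- setdiff(i1, i2)

print(overlap)
print(only_in_first)
```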

Note : lubridate provides two kinds of time spans, Durations and Periods, which can be used to do arithmetic with date-times.
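To make the difference concrete, here is a hedged sketch. The date is picked (as an assumption for this example) to sit just before the US daylight saving change on 2021-03-14, where a Period of one day and a Duration of one day land on different clock times.

```r
library(lubridate)

t0 <- ymd_hms("2021-03-13 12:00:00", tz = "America/New_York")

# Period: "one calendar day later", keeps the clock time
next_day_period <- t0 + days(1)

# Duration: exactly 86400 seconds later, ignores the clock
next_day_duration <- t0 + ddays(1)

gap_hours <- as.numeric(difftime(next_day_duration, next_day_period,
                                 units = "hours"))

print(next_day_period)
print(next_day_duration)
print(gap_hours)
```

Periods follow the calendar, while Durations count exact seconds, so the choice depends on whether clock time or elapsed time matters for the analysis.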

You can find the lubridate documentation here : https://www.rdocumentation.org/packages/lubridate/versions/1.7.10

Lubridate Cheat Sheet : https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_lubridate.pdf

3️⃣ purrr

We usually work with vectors, data frames or lists, and we know how important it is to express complex operations on these data types. purrr is all about iteration. It is a package that fills in the missing pieces in R’s functional programming tools. It lets us perform operations by combining simple pieces in a standard way. For instance, if a problem requires repeated looping, we can use the map function in purrr instead. This package is written by Lionel Henry and Hadley Wickham.

Let’s explore some examples to see how useful this package is:
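A minimal sketch of map in action, assuming a small made-up list of numeric vectors:

```r
library(purrr)

# Hypothetical input: a named list of numeric vectors
nums <- list(a = 1:5, b = c(2.5, 3.5), c = 10)

# map() applies the function to each element and returns a list
means_list <- map(nums, mean)

# map_dbl() does the same but returns a named numeric vector
means_dbl <- map_dbl(nums, mean)

# A formula can be used as a shorthand for an anonymous function
squares <- map_dbl(1:3, ~ .x^2)

print(means_dbl)
print(squares)
```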

We observe how easy it is to apply a function to every element of a list. Similarly, we can use the various map functions to get the desired results. Check the cheat sheet below for more functions.

The purrr functions can be used in a very interesting context as well. Take the gapminder dataset: let’s say we need to identify the type of each column by applying the class function. Here the map_chr function does the job.
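This can be sketched as below, assuming the gapminder package (which ships the dataset) is installed. A data frame is a list of columns, so map_chr visits each column in turn.

```r
library(purrr)
library(gapminder)  # provides the gapminder data frame

# Apply class() to every column and collect the results as a character vector
col_types <- map_chr(gapminder, class)

print(col_types)
```

The result is a named character vector with one entry per column.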

You can find the purrr documentation here : https://www.rdocumentation.org/packages/purrr/versions/0.2.5

purrr Cheat sheet : https://evoldyn.gitlab.io/evomics-2018/ref-sheets/R_purrr.pdf

4️⃣ ggplot2

An analysis is complete only with a good visualisation. The ggplot2 package offers a powerful graphics language for creating elegant and complex plots. It allows you to create graphs that represent both univariate and multivariate numerical and categorical data. There is a lot we can do with the various functions in this package. It can build almost any type of chart, which is why it is one of the most loved packages among R users.

Let’s see some examples of this package. I am using the standard mpg data that ships with ggplot2.
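A basic starting point might look like the sketch below. The choice of displ (engine displacement) against hwy (highway mileage) is an assumption based on the variables mentioned later in this section.

```r
library(ggplot2)

# Scatter plot of engine displacement against highway mileage
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point()

print(p)
```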

We can improve this graph by mapping the aesthetics of the plot to variables in the dataset, here by using colours.
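As a sketch of aesthetic mapping, the colour of each point can be tied to a column. The class column is used here as an assumption; any categorical column of mpg would work the same way.

```r
library(ggplot2)

# Map the colour aesthetic to the car class so each class gets its own colour
p_colour <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +
  geom_point()

print(p_colour)
```

ggplot2 adds the legend automatically once a variable is mapped to colour.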

We can keep improving our graphs by pointing out outliers, assigning different shapes to data points, or adding a trend line between the ‘displ’ and ‘hwy’ variables to see their relationship.
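Two of these refinements can be sketched as follows. Mapping shape to drv (drive type) and using a linear trend line are illustrative choices, not necessarily the ones in the original plot.

```r
library(ggplot2)

p_trend <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(shape = drv)) +            # a different shape per drive type
  geom_smooth(method = "lm", se = FALSE)    # straight trend line, no error band

print(p_trend)
```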

Another interesting example is using facet graphs with ggplot2.

Facets are useful for categorical variables: they split your plot into subplots that each display one subset of the data.
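A faceted version of the scatter plot could look like this sketch, assuming we split by the class column and arrange the panels in two rows:

```r
library(ggplot2)

# One panel per car class, arranged in two rows
p_facet <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

print(p_facet)
```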

We can create density plots, histograms, pie charts, bar charts, box plots and advanced geographic polygons. There is a lot we can do with ggplot2.

You can find the ggplot2 documentation here : https://www.rdocumentation.org/packages/ggplot2/versions/3.3.5

ggplot2 Cheatsheet : https://www.maths.usyd.edu.au/u/UG/SM/STAT3022/r/current/Misc/data-visualization-2.1.pdf

5️⃣ knitr

Reporting is a crucial step when we analyse data. Cutting and pasting results is messy and can introduce unwanted errors into our reports. knitr is the package that integrates computing and reporting. How does it do that? It incorporates code into text documents, keeping the analysis, results and discussion all in one place. Files can then be processed into a diverse array of document formats: PDF, Word documents, web pages and even slides. Yihui Xie wrote this package.

Let’s use the mpg data to see how this package works.
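The sketch below is one way to try this without RStudio. It writes a tiny R Markdown document to a temporary file and runs it through knitr::knit(), which executes the code chunk and weaves its output into a Markdown file. The document title and contents are invented for illustration, and producing the PDF named in the header would additionally need rmarkdown and a LaTeX installation, which this sketch does not assume.

```r
library(knitr)

fence <- strrep("`", 3)  # the three backticks that delimit a code chunk

# A tiny, hypothetical R Markdown document
rmd <- c(
  "---",
  "title: \"Fuel economy report\"",
  "output: pdf_document",
  "---",
  "",
  "Summary of highway mileage in the mpg data:",
  "",
  paste0(fence, "{r}"),
  "summary(ggplot2::mpg$hwy)",
  fence
)

rmd_file <- tempfile(fileext = ".Rmd")
md_file  <- tempfile(fileext = ".md")
writeLines(rmd, rmd_file)

# Run the chunk and write text plus results to the Markdown file
knit(rmd_file, output = md_file, quiet = TRUE)

cat(readLines(md_file), sep = "\n")
```

The header block between the --- lines is where the output type is declared, and the chunk results appear right below the code that produced them.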

At the top of the document you specify the output type. We can hit the Knit button in the RStudio window, located on the left side next to the settings icon. The result is a PDF document that opens in another window.

There is a lot we can do with this, from writing summary sections to controlling the width and display of the document. Since the results are generated by code that is tied to the document itself, it makes life easier when it comes time to make small updates to an analysis.

6️⃣ zoo

Sometimes we encounter data that is collected over time, which we refer to as time series data. At times we need to examine how changes in one data point shift other variables over the same time period. The zoo package in R is helpful for such cases because it is designed for calculations on regular and irregular time series of numeric vectors, matrices and factors. So for time series datasets, we can use the zoo package to do the analysis.

Let’s see an example of how powerful this package is.
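A minimal sketch, assuming a handful of made-up prices observed on irregularly spaced dates:

```r
library(zoo)

# Hypothetical observation dates with gaps (weekend, missing day)
dates <- as.Date(c("2021-09-01", "2021-09-02", "2021-09-06",
                   "2021-09-07", "2021-09-09"))

# A zoo series: values ordered by their date index
prices <- zoo(c(101.2, 102.5, 99.8, 100.4, 103.1), order.by = dates)

print(prices)

# Day-to-day change and a three-observation rolling mean
changes <- diff(prices)
smooth  <- rollmean(prices, k = 3)

print(changes)
print(smooth)
```

index(prices) returns the dates and coredata(prices) returns the plain values, which is handy when passing the series to other tools.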

Note that the index of the prices series is the dates. Once we have a time series object (an S3 class of indexed observations) created with zoo, we can do data manipulation and time series analysis on it. We can also plot this data using ggplot2. 👍🏼

You can find the zoo documentation here : https://www.rdocumentation.org/packages/zoo/versions/1.8-9/topics/zoo

7️⃣ assertive

Code integrity is very important no matter what kind of analysis we are doing. When we write functions, we often need to check the state of variables, making assertions and throwing error messages if they are not in the right form. The assertive package comes to the rescue and enables the user to write robust code. This package is written by Richard Cotton.

Let’s see an example of how we can leverage the assertive package.
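A hedged sketch of such a function is shown here. The function name mean_of_whole_numbers is hypothetical; assert_is_numeric() and assert_all_are_whole_numbers() are checks provided by assertive, and the package must be installed for this to run.

```r
library(assertive)

# Hypothetical helper: mean of a vector that must contain only whole numbers
mean_of_whole_numbers <- function(x) {
  assert_is_numeric(x)             # stop if x is not numeric
  assert_all_are_whole_numbers(x)  # stop if any value has a fractional part
  mean(x)
}

# Valid input: every value is a whole number
ok <- mean_of_whole_numbers(c(1, 2, 3, 4))
print(ok)

# Invalid input: 3.5 is fractional, so the assertion throws an error
msg <- tryCatch(
  mean_of_whole_numbers(c(1, 2, 3.5)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

The error is caught with tryCatch() here only so that the script keeps running and the message can be inspected.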

As we can see in the code snippet, a function that takes the mean of whole numbers is created. When it is passed a set of numbers that contains one fractional number, it throws an error message because an assertion is used inside the function.

You can find the assertive documentation here : https://www.rdocumentation.org/packages/assertive/versions/0.3-6

⚠️ You can install packages using the Install button in the Packages tab in RStudio, or download them from CRAN by running install.packages() in the console.

🔮 Advanced data analytics and ML packages in R ➡️ mlr3, xgboost and caret

🔮 Packages that caught my attention lately ➡️ DataExplorer and dtplyr

🌀 Don’t forget to use some of these packages in your next analysis

😛 For fun, explore how to insert emoji in R Markdown

References : The whole online R community

👋🏼 👋🏼 👋🏼 Until next time

🔗 Let’s connect Linkedin or Instagram 👩🏻‍💻
