Last updated July 26, 2016
Amazon Web Services EC2 is an extremely powerful cloud computing technology. In particular, the R3.8xlarge instance is recommended by Amazon for “high performance databases, distributed memory caches, in-memory analytics, genome assembly and analysis, Microsoft SharePoint, and other enterprise applications.” It offers 244 GiB of RAM, roughly 30 times more than a standard 13-inch MacBook Pro. With these kinds of specs, I’m interested in how well the R3.8xlarge instance performs as a data science tool.
To test the capabilities of the R3.8xlarge instance, I’ll be using a large dataset with the Python library pandas. Because pandas holds data frames in memory, running scripts on data frames larger than a gigabyte on a local machine can be very time consuming or trigger out-of-memory errors, so the opportunity to test it in the cloud seemed like a great option.
My test plan was simple: set up an R3.8xlarge instance, upload a dataset of my choosing onto the virtual machine, run and time a number of pandas scripts on the machine, and record my findings.
Getting Started
Setting up an instance is pretty straightforward; here’s how I set up mine.
- Log into your AWS account (create one if you don’t already have one) and make sure the region option in the top right corner matches your location.
- Click on EC2 in the top left area; that will bring you to the EC2 dashboard. Click on “Launch Instance.”
- In Step 1, choose the Ubuntu Amazon Machine Image (AMI), which is the software configuration used to launch the instance.
- In Step 2, select R3.8xlarge from the list of instance types.
- After selecting my instance, I opted for the default instance details and the default storage in the following steps.
- Name your instance and create a new security group. You can open additional ports if you’d like, but I declined.
- Before you launch, you must create a new key pair if you don’t already have one, or if you prefer not to reuse an existing one. The key pair is what allows you to securely access your cloud computer from your local machine.
- After creating the key pair, download its .pem file to your computer and place it in a folder; I recommend ~/.ssh/. Then make the .pem file read-only with the command chmod 400 [filename].
- Review your instance details and launch your instance.
Now that the instance is set up and running, here’s how you access and set up your virtual machine.
- Retrieve the IP address of your cloud computer from the EC2 dashboard and connect to it from your terminal with ssh -i ~/.ssh/name_pem_file.pem ubuntu@[ip address].
- Now that you’re logged in, it’s time to install the right packages for your analysis. We’re going to use apt-get, a package management tool, to install the Python libraries; I used it to install IPython and pandas.
- Run sudo apt-get update to refresh the package lists before installing.
- Here is the command I used to install the libraries for this article:
sudo apt-get install ipython python-pandas
Testing pandas
Now that I have all the tools I need, it’s time to upload a dataset to my virtual machine. I chose the dataset from Kaggle’s Lending Club Loan Data competition, which contains data on every loan issued by Lending Club from 2007 to 2015. It’s a logical dataset with straightforward features.
I downloaded the loan data’s CSV file directly from my Dropbox account:
wget -O loan.csv "https://www.dropbox.com/s/bybv82xmjehm9hj/loan.csv?dl=1"
Our machine is up and running, and our dataset is loaded onto the virtual machine. However, it is important to keep track of how long the instance spends running, because Amazon charges for instance use by the hour, $2.80 per hour in this case.
I start up an ipython shell, import pandas, and load in the loan dataset.
ipython
import pandas as pd
df = pd.read_csv("loan.csv")
The loan dataframe only takes up 508MB, which is far too small to stress an instance like this, and obviously a waste of money. You can find out how much memory a pandas dataframe is using with:
df.index.nbytes + df.columns.nbytes + df.values.nbytes
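Note that this nbytes sum undercounts string (object) columns, which are stored as pointers to Python objects. As a rough alternative sketch, assuming a recent enough pandas (0.17.1 or later), you can ask pandas to include the string contents as well:
df.memory_usage(deep=True).sum()  # total bytes including string contents; slower, but closer to the true footprint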
In order to significantly increase the size of the dataframe, we can concatenate it with itself many times.
df = pd.concat([df for i in range(100)])
This command gives us a dataset of 49.6GB.
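As a quick sanity check, the same nbytes calculation from above can be repeated on the concatenated frame; this is a sketch of the arithmetic, which roughly matches 100 copies of the original 508MB:
size_in_bytes = df.index.nbytes + df.columns.nbytes + df.values.nbytes  # same formula as before
print(size_in_bytes / 1024**3)  # roughly 49.6 GiB, i.e. about 100 x 508MB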
To time a command in ipython, all you need to do is put the IPython magic function %timeit before your command. %timeit is a lot less cumbersome and time-consuming than using the time library or any other timing module.
Testing with 50GB:
%timeit df[df["column"] < integer/float]
N loops, best of 3: [time] per loop
The %timeit output reports the number of loops it ran and the shortest time it recorded.
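If you are not in an IPython shell, a cruder alternative is to wrap the command with the standard time module yourself. Here is a minimal sketch around the same kind of filter that appears in the table below (nothing new is measured; the column and threshold simply mirror the tests):
import time

start = time.time()
result = df[df["loan_amnt"] > 14000]  # same filter that is timed in the results table
print("elapsed: %.1f s" % (time.time() - start))  # a single run, with no best-of-3 bookkeeping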
Here are the results from my tests of the 49.6GB (88,737,900 rows x 74 columns) dataframe, along with the commands I used.
Command | Description | Time |
---|---|---|
df = pd.concat([df for i in range(100)]) | Increases the size of the original loan dataset by 100x | 1 loops, best of 3: 23.4 s per loop |
df.head() | View the first five rows of the dataframe | 10000 loops, best of 3: 81.9 µs per loop |
df.describe() | View statistics such as mean, median, percentiles, standard deviation, and more for each numerical column | 1 loops, best of 3: 12min 20s per loop |
df.corr() | Outputs a table of pairwise correlations between the numerical (integer or float) columns | 1 loops, best of 3: 7min 14s per loop |
df.columns | View the dataframe's column labels | 10000000 loops, best of 3: 103 ns per loop |
df.dtypes | View the type of each column in the dataframe | 10000 loops, best of 3: 120 µs per loop |
df.shape | Returns the number of rows and columns of the dataframe | 1000000 loops, best of 3: 479 ns per loop |
df.zip_code.value_counts() | Outputs the frequency of each unique value in the zip_code column | 1 loops, best of 3: 6.31 s per loop |
df.grade.apply(lambda x: x.lower()) | Returns every string value in the grade column in lower case | 1 loops, best of 3: 18.7 s per loop |
grades = {'A': 'great', 'B': 'good', 'C': 'ok', 'D': 'bad', 'E': 'horrible', 'F': 'disaster'}; df.grade.map(grades) | Maps the grade column through a dictionary whose keys are the column's unique values, returning the corresponding labels | 1 loops, best of 3: 4.78 s per loop |
df.iloc[1000000:4000000, 20:50] | Slices the dataframe to 3,000,000 rows by 30 columns | 1 loops, best of 3: 4.35 s per loop |
df.annual_inc.apply(rich_or_poor) | Applies a user-defined function rich_or_poor, which labels each annual income "poor" if it is at most 65000 and "rich" otherwise | 1 loops, best of 3: 1min 9s per loop |
df.groupby("grade").mean() | Groups the data by the unique values of the grade column and returns the mean of each numerical column for every group. For example, one column shows the average income of the loans for each grade | 1 loops, best of 3: 2min 35s per loop |
df.rename(columns={"annual_inc": "income"}) | Changes the name of a column | 1 loops, best of 3: 1min 2s per loop |
df.id.astype(float) | Changes the type of a column from int to float | 1 loops, best of 3: 236 ms per loop |
df.int_rate*100 | Returns the column multiplied by 100 | 10 loops, best of 3: 44.9 ms per loop |
df.purpose.str.strip("_") | Strips leading and trailing "_" characters from values in the purpose column | 1 loops, best of 3: 23.3 s per loop |
df.loan_amnt.copy().sort() | Sorts the loan amount column from lowest to greatest | 1 loops, best of 3: 27.4 s per loop |
df["purpose"] | Selects and returns the purpose column | 100000 loops, best of 3: 1.95 µs per loop |
df[df["loan_amnt"] > 14000] | Returns a filtered dataframe where values in the loan_amnt column are greater than 14000 | 1 loops, best of 3: 3min 34s per loop |
The biggest positive takeaway from these results is the absence of any memory errors. Overall, based on the tests I ran, pandas can easily handle a 50GB dataframe on the R3.8xlarge instance. In other words, if you are analyzing a dataset with about 100 bytes per event, you can crunch 500 million rows of data in pandas. Simpler tasks such as retrieving the columns or the dimensions of the dataframe completed almost instantaneously, and the majority of the operations were measured in seconds or less. More complex operations, such as applying a function or the groupby command, took a relatively short time when you consider their demands. Unless time is extremely critical, I do not recommend using a more powerful and expensive EC2 instance than the R3.8xlarge when working with datasets of this size in pandas.
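That 500-million-row figure is just arithmetic on the dataframe size; here is a back-of-the-envelope sketch, assuming an average of 100 bytes per row:
bytes_in_dataframe = 50 * 10**9  # the ~50GB dataframe tested above
bytes_per_row = 100              # assumed average event size
print(bytes_in_dataframe // bytes_per_row)  # 500,000,000 rows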
The results from this initial test clearly indicate that we need to go bigger in order to find the ceiling of pandas on the R3.8xlarge.
Testing with 100GB:
For the second round, I doubled the size of my dataframe with the same concatenation script, this time making 200 copies of the original instead of 100 (177,475,800 rows x 74 columns). For continuity, I ran the same scripts on the bigger dataframe that I used on the smaller one.
This time I did hit some limits: several of the scripts produced memory errors.
Command | Description | Time |
---|---|---|
df = pd.concat([df for i in range(200)]) | Increases the size of the original loan dataset by 200x | 1 loops, best of 3: 55.9 s per loop |
df.head() | View the first five rows of the dataframe | 10000 loops, best of 3: 82.9 µs per loop |
df.describe() | View statistics such as mean, median, percentiles, standard deviation, and more for each numerical column | 1 loops, best of 3: 38min 2s per loop |
df.corr() | Outputs a table of pairwise correlations between the numerical (integer or float) columns | |
df.columns | View the dataframe's column labels | 10000000 loops, best of 3: 103 ns per loop |
df.dtypes | View the type of each column in the dataframe | 10000 loops, best of 3: 120 µs per loop |
df.shape | Returns the number of rows and columns of the dataframe | 1000000 loops, best of 3: 485 ns per loop |
df.zip_code.value_counts() | Outputs the frequency of each unique value in the zip_code column | 1 loops, best of 3: 12.9 s per loop |
df.grade.apply(lambda x: x.lower()) | Returns every string value in the grade column in lower case | 1 loops, best of 3: 37.3 s per loop |
grades = {'A': 'great', 'B': 'good', 'C': 'ok', 'D': 'bad', 'E': 'horrible', 'F': 'disaster'}; df.grade.map(grades) | Maps the grade column through a dictionary whose keys are the column's unique values, returning the corresponding labels | 1 loops, best of 3: 9.55 s per loop |
df.iloc[20000000:80000000, 20:50] | Slices the dataframe to 60,000,000 rows by 30 columns | 1 loops, best of 3: 8.74 s per loop |
df.annual_inc.apply(rich_or_poor) | Applies a user-defined function rich_or_poor, which labels each annual income "poor" if it is at most 65000 and "rich" otherwise | 1 loops, best of 3: 2min 18s per loop |
df.groupby("grade").mean() | Groups the data by the unique values of the grade column and returns the mean of each numerical column for every group | Segmentation fault (core dumped) |
df.rename(columns={"annual_inc": "income"}) | Changes the name of a column | Memory Error |
df.id.astype(float) | Changes the type of a column from int to float | 10 loops, best of 3: 101 ms per loop |
df.int_rate*100 | Returns the column multiplied by 100 | 10 loops, best of 3: 101 ms per loop |
df.purpose.str.strip("_") | Strips leading and trailing "_" characters from values in the purpose column | 1 loops, best of 3: 48.1 s per loop |
df.loan_amnt.copy().sort() | Sorts the loan amount column from lowest to greatest | 1 loops, best of 3: 1min 12s per loop |
df["purpose"] | Selects and returns the purpose column | 100000 loops, best of 3: 2.09 µs per loop |
df[df["loan_amnt"] > 14000] | Returns a filtered dataframe where values in the loan_amnt column are greater than 14000 | |
In the end, for most operations, pandas does a decent job of handling a 100GB dataframe. However, a more powerful virtual machine is needed for some functionality like certain ‘groupby’ and ‘corr’ calls. For these, I’m looking forward to running the same test again on the upcoming X1 instance, once it becomes publicly available. Stay tuned!
Takeaways
Here are my key takeaways from the 100GB test:
- 100 GB is the upper limit on dataset size when using this particular instance, due to the degraded performance of key pandas operations such as describe, corr, and groupby.
- One possible solution for working with extremely large datasets in pandas is the new X1 instance, which is equipped with 1,952 GiB of RAM, eight times as much as the R3.8xlarge. However, at the time of this writing, the new instance is in early access and not generally available.
- Another possible approach is to use a more scalable service with pandas connectivity, such as Treasure Data, to pre-process raw data and pull a smaller subset into pandas for further analysis (see the sketch below).
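To illustrate that last point, here is a minimal sketch of pulling only a subset into pandas, assuming the pre-processed extract is still a CSV and that only a handful of columns are needed (the column names come from the loan dataset used above):
import pandas as pd

# read only the columns needed for the analysis, with explicit dtypes,
# instead of loading the full 74-column table into memory
cols = ["loan_amnt", "annual_inc", "int_rate", "grade"]
df_small = pd.read_csv("loan.csv", usecols=cols,
                       dtype={"loan_amnt": float, "annual_inc": float,
                              "int_rate": float, "grade": object})
print(df_small.shape)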
Disclosure: This research was sponsored by Treasure Data