The best way to learn R is by using it. There are many ways to get started with R - the easiest being in your browser. Online sites such as Try R or DataCamp let you learn and code in R interactively without having to download anything. Once you want to get more familiar with it, the best way to set yourself up is to install R on your local machine. You can download a copy of R from CRAN - the Comprehensive R Archive Network. From there you can work with R either in an IDE (Integrated Development Environment) such as RStudio, or through the terminal using the R console. Whichever your preferred method, just get coding! Links to learning resources are included along the way and at the end of this tutorial.
CRAN is also where you get all the packages you need for analysis, visualization, machine learning, data munging, etc. R comes with some basic functionality, but to extend this you need to learn how to install and use the packages. A simple internet search for the functionality you need will show the required package. For example, let's install ggplot2, one of the most popular packages for visualization.
In your IDE or in the console, type the command:

install.packages("ggplot2")

then to load the package:

library(ggplot2)
You can also install packages that other people have written and published on GitHub or Bitbucket. You need the devtools package to install these from the web. A helpful list of popular packages can be found here.
The way I learned a lot of R is through the swirl package. Install it just like any other package; then, when you call the swirl() function, you are taken through a comprehensive, interactive tutorial covering the basics of R - right in the console or your IDE! It corrects you when you are wrong and offers encouragement when you are doing well. I highly recommend it!
The main steps in any data analysis using R are as follows:
You can import data from almost any source into R, with each source requiring a different approach. Usually you will be working with 'flat files' - text files that contain tabular data. Functions such as read.table() and read.csv() from the utils package are commonly used. For example, let's load a CSV file called 'data.csv' and store it in a variable called myData:
myData <- read.csv(file="data.csv", header=TRUE, sep=",")
The readxl, gdata and XLConnect packages all contain functions for working with Excel files. There are also packages that let you work directly with databases, and others for harvesting web data via APIs. This is a great introduction to everything you need to know about importing data into R.
It's been said that 80% of data analysis is spent on cleaning and preparing the data. Every budding data scientist should therefore learn and appreciate the methodologies for cleaning data and the tools available to carry them out. A great paper to read is the study titled 'Tidy Data' by Hadley Wickham, which illustrates the importance of this step (you can also read a more code-heavy version of the paper here). Two of my favourite tools in R are the tidyr and dplyr packages: tidyr for all your tidying needs, dplyr for manipulating data-frame-like objects. The data.table package is not as intuitive to use, but it is very powerful and fast - check out this introduction. lubridate is the go-to for date and time manipulation, and time series data is easily handled with packages like zoo and xts.
There are lots of packages which contain built-in algorithms for machine learning and analysis. For example, use the randomForest package to build a random forest model with just one line of code.
output.forest <- randomForest(person ~ age + shoeSize, data = myData)
This will predict person based on age and shoe size from a dataset called myData. The randomForest package also contains many functions to determine, from the forest built, which variables were most important in the prediction, the error rates, and so on. There are many other analysis packages, with igraph (graph data), rpart (regression trees), nnet (neural networks) and caret (classification and regression) among the most popular.
Visualizations in R are easy and highly customisable. Start with the most popular visualization package, ggplot2. Not only can you do simple histograms, but also complicated map and graph visualizations. There are lots of extensions to ggplot2 that add different themes and functionality; a list can be found at www.ggplot2-exts.org. DataCamp has a great tutorial on getting started with R visualizations - from axis and point labels to adding colour and background graphics. A very handy tool for creating interactive dashboards and powerful visualizations is Shiny. You can find a comprehensive introduction on its site.
If the R visualizations don't do it for you, you can export your results from your analysis and visualize them in the tool of your choice. You can find more details on other types of visualization tools in the visualization section of these tutorials.
Although it seems like a lot of information to take in at once, after using R a couple of times the package names will become more familiar, the syntax will seem easier and the workflow will feel intuitive. Start simple: use swirl, practice loading data, clean it with tidyr, explore it with simple functions like str(), dim(), nrow() and ncol(), and analyse it with kmeans() or lm().
Python is a very high-level, dynamic programming language that emphasizes code readability and ease of use. It allows you to program in several different styles depending on your preferences, while trying to stick to the principle that there should be one - and preferably only one - obvious way to do any particular thing.
As a result of these features, Python has been widely adopted in the data science community. Some really fantastic data science libraries have been written for Python (the Anaconda Python distribution has all of the most common data science libraries packaged), so you'll rarely find yourself lacking the appropriate functionality! I recommend looking at Pandas in particular.
Having said that, Python is more than capable of doing some pretty impressive data analysis in a very small amount of code using only the core libraries. We'll try creating a program that counts the frequencies of each unique word in a file to showcase this. Note that anything after a '#' is a comment and will be ignored when running the program.
""" Save this code as HelloDataScience.py (or something else like a.py if you hate typing), then run `python HelloDataScience.py | more` in the command prompt. The `more` bit ensures the output doesn't flood the terminal screen immediately. """ import re # For regular expressions from collections import Counter # For counting hashable types (like strings) # Open the file for reading and call it 'f'. The file will only be open in this `with` block, # and will be automatically closed at the end of the block, so you don't have to worry about # closing it manually later. with open('path/to/file') as f: # This regex, when used, will match all characters which are not alphanumeric except for the # single-quote character. Remember, regular expressions are your friend, especially when # dealing with messy data! Make sure you know how to use them effectively. regex = re.compile(r"[^\w']") # Read the file as a string, split by whitespace (including tabs and newlines), # then for each word in that list, strip unwanted characters using the regex above. words = [regex.sub("", word) for word in f.read().split()] # Count each value in the list word_counts = Counter(words) # Print them in order of frequency, most common first print("\nWord counts:\n") for word_count in word_counts.most_common(): print(" " + str(word_count))
Remove all the comments to get a feel for just how short the code really is. There are only nine significant lines!
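If you want to see the counting step in isolation, here is a minimal sketch of how `collections.Counter` behaves on a small list (the sample words are made up for illustration):

```python
from collections import Counter

# Count occurrences of each word in a small sample list
words = ["data", "science", "data", "python", "data", "science"]
word_counts = Counter(words)

# most_common() returns (word, count) pairs, most frequent first
print(word_counts.most_common())
# [('data', 3), ('science', 2), ('python', 1)]
```

The full program above does exactly this, just with the word list read from a file and cleaned with a regex first.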
Let's read in a CSV file with Pandas so we can test something. If you want to try this yourself, download the sample real estate transactions data from here (second one down) and open the Anaconda IPython interpreter in the command prompt at the directory you saved the file in.
First, we'll read in the CSV file to a pandas DataFrame. The DataFrame is the main unit of functionality in pandas:
import pandas as pd

df = pd.read_csv("Sacramentorealestatetransactions.csv")
Done. That was easy! To see the DataFrame, just run df. By the way, pandas can also read other formats with functions like read_json, among others.
It's always a good idea to have pandas describe your DataFrame if you just want a quick look at some basic statistical measures.
From this you can find things like the 'centerpoint' of all the houses by looking at the average latitude and longitude (although it's not really the center, since the Earth is ellipsoidal), or the minimum house price overall ($1551!!)
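As a minimal sketch of what describe gives you (using a tiny made-up DataFrame rather than the real estate data), assuming pandas is installed:

```python
import pandas as pd

# A tiny, made-up dataset for illustration
df = pd.DataFrame({"price": [59222, 68212, 68880, 90895],
                   "sq__ft": [836, 1167, 796, 852]})

# describe() reports count, mean, std, min, quartiles and max per numeric column
summary = df.describe()
print(summary)

# Individual measures are easy to pull out of the summary
print(summary.loc["min", "price"])  # the cheapest price in this toy data
```

The row labels of the summary ("count", "mean", "min", and so on) can be indexed with .loc just like any other DataFrame.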
OK, now let's say we want to find the cheapest house listed in Sacramento with more than 3,500 sqft of land:
house = df.loc[df[(df.sq__ft > 3500) & (df.city == 'SACRAMENTO')]['price'].idxmin()]
Let's figure out what this means from the inside out:
# Get the rows for which sq__ft > 3500 and city == 'SACRAMENTO'
candidates = df[(df.sq__ft > 3500) & (df.city == 'SACRAMENTO')]

# Get the prices for those rows
prices = candidates['price']

# Get the minimum of those prices as a row index
lowest_priced_house_index = prices.idxmin()

# Get the row at that index
house = df.loc[lowest_priced_house_index]
Now we have our house selected. Let's see where it is! We can use the latitude, longitude pair or the address.
import subprocess as sbp

# By latitude, longitude pair
lat, lng = house['latitude'], house['longitude']
sbp.run(["start", "chrome", "https://www.google.ie/maps/place/" + str(lat) + "," + str(lng)], shell=True)

# By address
address = house['street'] + ' ' + house['city']
sbp.run(["start", "chrome", "https://www.google.ie/maps/place/" + address], shell=True)
The address will probably give you better information.
If you're on Linux, substitute "start", "chrome" with your browser's launch command.
I would encourage you to take a look at the 10 Minutes to pandas section on the pandas website; it gives you a very quick tour of some of the most common things you might want to use it for.
I know this is an introduction to Python for Data Science, but I feel I need to emphasize a more language-agnostic point: if you don't know how to use regular expressions yet, it is absolutely imperative that you learn them! Dealing with textual data goes from being a massive headache to a breeze, even after learning just the basics. Here's a great Stack Overflow answer that covers a lot of the basics of regex, with lots of links included. Once you have a rudimentary grasp, try solving the problems on this website as practice - I can't recommend it enough.
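As a quick taste of why this matters, here is a minimal sketch of the kind of cleanup regular expressions make easy (the messy sample string is made up for illustration):

```python
import re

messy = "Totals:  1,204 units @ $59.99   (approx.)"

# Pull out every number, allowing comma or decimal separators inside them
numbers = re.findall(r"\d+(?:[.,]\d+)*", messy)
print(numbers)  # ['1,204', '59.99']

# Strip everything that isn't a letter, digit, underscore or whitespace
clean = re.sub(r"[^\w\s]", "", messy)
print(clean)
```

Doing the same with plain string methods would take noticeably more code, and it only gets worse as the data gets messier.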
SQL, which stands for Structured Query Language, is a computer language designed for the retrieval and management of data in relational databases. Let's face it: without SQL skills these days, getting a job on a data team would be a daunting challenge. And why is that, you may wonder? Relational databases changed the way we think about data - how we store it and how we retrieve it. That happened a long time ago, but to this day RDBMS (Relational Database Management Systems) are the most popular storage solutions in pretty much every company on earth. Yes, there is NoSQL; yes, there is Hadoop; but you would be surprised to know how many are still doing it the old-school way.
So what is it that is so appealing about the Structured Query Language? I would say the first thing is its gentle learning curve: most engineers go from no SQL skills to proficient in a short amount of time. If you are a novice in this language, imagine how you would ask a table in a database to give you data - something like "select these things from that table". And that is almost exactly how you would code it in SQL.
/* The most important thing to know in SQL is the 3 main keywords:
   SELECT, FROM and WHERE.
   Every SQL query you write will have at least the first 2. */

--Show everything in the table with *
SELECT * FROM table_name WHERE [condition];

--Or select specific columns
SELECT ID, NAME, AGE FROM CUSTOMERS WHERE SALARY > 2000;

--Aggregate on columns with functions like SUM, COUNT or AVG
--The % sign in the WHERE clause can stand for zero, one or multiple characters
SELECT COUNT(ID) FROM CUSTOMERS WHERE NAME LIKE '%Cottica';
Another popular aspect of SQL is that, across a multitude of different RDBMS technologies, it maintains common querying standards. Moving from one platform to another is therefore not that difficult, bar some subtle differences that are proprietary to each technology. If you are curious about the RDBMS flavours out there, let me throw a few names at you: SQL Server, Oracle, Teradata, Vertica, Hive (via its SQL-like HiveQL) ... they all share core SQL coding standards.
Furthermore, you will find that when you develop an application in C, Java, R or PHP, if it interfaces at any level with a database, you will have to write your queries in SQL.
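As an illustration of that point, here is a minimal sketch in Python using the standard library's sqlite3 module; the CUSTOMERS table and its rows are made up for the example, echoing the queries above:

```python
import sqlite3

# An in-memory database keeps the example self-contained
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create and populate a made-up CUSTOMERS table
cur.execute("CREATE TABLE CUSTOMERS (ID INTEGER, NAME TEXT, AGE INTEGER, SALARY REAL)")
cur.executemany("INSERT INTO CUSTOMERS VALUES (?, ?, ?, ?)",
                [(1, "Alyssa", 34, 2500.0),
                 (2, "Ben", 28, 1800.0),
                 (3, "Cy", 45, 3100.0)])

# The SQL query is embedded in the application code as a plain string
cur.execute("SELECT NAME, SALARY FROM CUSTOMERS WHERE SALARY > 2000 ORDER BY SALARY")
print(cur.fetchall())  # [('Alyssa', 2500.0), ('Cy', 3100.0)]

conn.close()
```

Whatever the host language, the pattern is the same: the application passes SQL text to the database driver and gets rows back.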
Below you can find some links to get you started in understanding the background and life cycle of the most popular querying language on the planet:
Visualizing your data is an important early step in the analysis of any dataset. It allows you to get a better understanding of the data you are working with and to identify features of your dataset such as the distribution of data or the presence of any outliers.
Data visualization is also an excellent way of conveying key messages in the presentation of findings that come about as a result of analysis. The visualizations can allow decision makers to understand concepts and spot trends in the data that may not be apparent from looking at statistics or at the data itself.
Some useful tools for data visualization include:
Tableau's Desktop edition is a very useful platform that allows for the building of data visualizations in a simple and intuitive manner. An annual student subscription can be found here.
Shiny is a web application framework for R. It allows you to create fully interactive visualizations through a number of different plugins, including Leaflet, Dygraphs and Highcharts. These applications can subsequently be hosted on the web for users to interact with.