Welcome to the extra info page. Here you'll find some high-level information on App Development, Hackathon Tips, SQL, Python, R and Data Visualisation that we think may be of interest.


You can use any tools, languages or systems that you like during the challenges; these are just some suggestions! If you have any queries or questions, just give us a shout.

(Note that AIB is not responsible for the content or actions of any external websites that are linked to from this site.)

App Development

Never built an app before? Don't worry, it's not as difficult as you may think, and if you can code already, you're halfway there. Here are some pointers to get you started:

1. App development is free. Yes, you need a computer, and yes, you need to pay Apple ($99) or Google ($25) if you want to put your app into their App Stores for others to download — but you can still build an app and put it on your phone or friends’ phones for free.

2. To make Apple (iOS) apps you use a software package called Xcode. It’s free to download from the App Store on your Mac (you can only build iOS apps on a Mac, not a Windows machine). For Android apps, you use Android Studio — again a free download, from Google. You can use Android Studio on a Mac or Windows machine.

3. Xcode and Android Studio are a bit like Microsoft Word or Excel — they’re software applications designed to help you do something. Excel helps you make spreadsheets, Xcode helps you make apps. Both are designed to be easy to use, with loads of useful tools and features.

4. You can learn to code for free. There are plenty of fantastic sites and resources that you can buy / subscribe to, but there are just as many free ones. All you really need is time — time to learn coding, how it all works, to start off with the basics, to practice and do exercises and gradually build up your skills and experience.

5. “Coding” apps basically means using a programming language to specify what the app does. iOS apps use a language called Swift; Android apps use one called Java. There are lots of alternative platforms for making apps (Xamarin, React Native, PhoneGap, Titanium etc.) but for beginners I’d suggest learning the base “native” ways — Swift or Java.


So where do you start? If you have a Mac and an iPhone, I’d recommend learning how to make iOS apps using the Swift programming language, with Xcode. From my experience teaching both iOS and Android to students with no previous coding, iOS is easier to learn.

If you only have a Windows machine / only an Android phone, learn how to make Android apps using the Java language with Android Studio.

Where can you start to learn? RayWenderlich.com is one of the best sites around, with lots of free tutorials along with ones you can buy as well. Google’s official Android tutorials are also great for beginners. A small warning: both Xcode and Android Studio are quite large downloads, so be careful where you download them!

We'll have our engineers on hand all day at the DataHack to help out; they don't get out much, so they're quite looking forward to it. I'll also be available if anyone wants to try to beat me at Street Fighter 2 on our Sega Megadrive. #notgoingtohappen #hadouken


HACKATHON ADVICE

I've participated in and judged loads of hackathons, big and small. Here are some hopefully useful tips I've learned along the way.

1. It's about having fun and learning, not just winning. If you win, brilliant, but I love hackathons as they're a chance to build something new and learn new technologies or skills.

2. The best ideas are usually about solving problems or meeting needs. You can build the most beautiful and amazing app ever seen, but if users can't actually see a reason to use it, there's no point. Unless it's a pure entertainment app, which is itself a need.

3. Keep focused on your main objective. It's easy to get side-tracked, go off on tangents, or spend too long on individual features - and you may end up with an unfinished product. Remember what the goal is and use that as your compass.

4. Leave yourself time to make it look good and easy to use. Not everyone is a designer but it's worth spending time trying to make your solution beautiful and have a great user experience. I've seen plenty of great ideas and well coded apps let themselves down at demo time because of bad UI/UX.

5. Also leave yourself time to practice your pitch. You'll only have a few minutes to tell the judges all about your idea, the problem it solves and so on, so make sure that when you talk to them it's not the first time you're saying it!

6. Talk to the other teams. While it's a competition (with some awesome prizes!), one of the best things about hackathons is meeting people and learning about more than just what you're doing. It's highly unlikely someone's going to steal your idea and you'll gain more by networking and chatting than you will by keeping to yourself.

7. Ask if you need help. We're going to have engineers there all day to help out, so don't be afraid to ask us anything. We want students to learn from the DataHack so if you want to ask about a coding issue, or your app's design, the best way to pitch it or anything else, we'll be more than happy to help.

8. Seriously, have fun!

(by Andy O'Sullivan, Chief Thought Architect)

SQL

SQL (Structured Query Language) is a database computer language designed for the retrieval and management of data in a relational database. Let’s face it: if you don't have SQL skills these days, getting a job in a data team is a daunting challenge. And why is that, you may wonder? Relational databases changed the way we think about data, how we store it and how we retrieve it. That happened a long time ago, but to this day RDBMS (Relational Database Management Systems) are the most popular storage solutions in pretty much every company on earth. Yes, there is NoSQL and yes, there is Hadoop, but you would be surprised to know how many are still doing it the old-school way.


So what is it that is so appealing about SQL? I would say the first thing is its relatively easy learning curve. Most engineers go from no SQL skills to proficient in a short amount of time. If you are a novice in this language, imagine how you would ask a table in a database to give you data: something like "select these things from that table". And that is almost exactly how you would code it in SQL.

The most important things to know in SQL are its 3 main keywords:

SELECT, FROM and WHERE

Every SQL query you write will have at least the first 2. Show everything in a table with *:

SELECT * FROM table_name WHERE [condition];

Or select specific columns:

SELECT ID, NAME, AGE FROM CUSTOMERS WHERE SALARY > 2000;

Aggregate on columns with functions like SUM, COUNT or AVG. In a LIKE pattern, the % sign stands for zero, one or multiple characters:

SELECT COUNT(ID) FROM CUSTOMERS WHERE NAME LIKE '%Cottica';
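
For example, here are a couple more sketches against the same hypothetical CUSTOMERS table:

-- Average salary of customers over 25
SELECT AVG(SALARY) FROM CUSTOMERS WHERE AGE > 25;

-- Count customers per age (GROUP BY pairs naturally with aggregate functions)
SELECT AGE, COUNT(ID) FROM CUSTOMERS GROUP BY AGE;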

Another popular aspect of the SQL language is the fact that, across a multitude of different RDBMS technologies, SQL maintains common querying standards. Moving from one platform to another is therefore not that difficult, bar some subtle differences that are proprietary to each technology.

If you are curious about the RDBMS flavours out there, let me throw a few names at you: SQL Server, Oracle, Teradata, Vertica, HBase, Hive … they all share SQL coding standards (or close SQL-like dialects). Furthermore, you will find that when you develop an application in C, Java, R or PHP, if you interface at any level with a database, you will have to code your queries using SQL.
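
To give a flavour of that, here's a minimal sketch using Python's built-in sqlite3 module; the table and the sample rows are invented for illustration. The host language changes, but the queries are still plain SQL:

import sqlite3

# An in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CUSTOMERS (ID INTEGER, NAME TEXT, AGE INTEGER, SALARY REAL)")
conn.execute("INSERT INTO CUSTOMERS VALUES (1, 'Max Cottica', 35, 2500.0)")
conn.execute("INSERT INTO CUSTOMERS VALUES (2, 'Jane Doe', 28, 1800.0)")

# The embedded query is exactly the SQL from above
for row in conn.execute("SELECT ID, NAME, AGE FROM CUSTOMERS WHERE SALARY > 2000"):
    print(row)  # (1, 'Max Cottica', 35)

conn.close()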

Below you can find some links to get you started in understanding the background and life cycle of the most popular querying language on the planet:

http://www.w3schools.com/sql/
http://www.sqlcourse.com/
https://en.wikibooks.org/wiki/Structured_Query_Language

(by Max Cottica, Head of Data Science and Big Data Solutions)

PYTHON

Python is a very high level, dynamic programming language, emphasizing code readability and ease of use. It allows you to program in multiple different styles depending on your preferences, but tries to stick to the principle that there should be one obvious way to do any particular thing.

As a result of these features, Python has been widely adopted in the data science community. Some really fantastic data science libraries have been written for Python - the Anaconda Python distribution has all of the most common data science libraries packaged, so you'll rarely find yourself lacking the appropriate functionality! I recommend looking at Pandas in particular.

Having said that, Python is more than capable of doing some pretty impressive data analysis in a very small amount of code using only the core libraries. We'll try creating a program that counts the frequencies of each unique word in a file to showcase this. Note that anything after a '#' is a comment and will be ignored when running the program.

Save this code as HelloDataScience.py (or something else like a.py if you hate typing), then in the command prompt run:

python HelloDataScience.py | more

The `more` bit ensures the output doesn't flood the terminal screen immediately.

import re  # For regular expressions
from collections import Counter  # For counting hashable types (like strings)

# Open the file for reading and call it 'f'. The file will only be open in
# this `with` block, and will be automatically closed at the end of the
# block, so you don't have to worry about closing it manually later.
with open('path/to/file') as f:

    # This regex, when used, will match all characters which are not word
    # characters (letters, digits or underscore), except for the
    # single-quote character. Remember, regular expressions are your
    # friend, especially when dealing with messy data! Make sure you know
    # how to use them effectively.
    regex = re.compile(r"[^\w']")

    # Read the file as a string, split by whitespace (including tabs and
    # newlines), then for each word in that list, strip unwanted
    # characters using the regex above.
    words = [regex.sub("", word) for word in f.read().split()]

# Count each value in the list
word_counts = Counter(words)

# Print them in order of frequency, most common first
print("\nWord counts:\n")
for word_count in word_counts.most_common():
    print(" " + str(word_count))


And done!
Now a short example using Pandas. Let's read in a CSV file with Pandas so we can test something. If you want to try this yourself, download the sample real estate transactions data from here (second one down) and open the Anaconda IPython interpreter in the command prompt at the directory you saved the file in.

First, we'll read in the CSV file to a pandas DataFrame. The DataFrame is the main unit of functionality in pandas:

import pandas as pd
df = pd.read_csv("Sacramentorealestatetransactions.csv")


Done. That was easy! To see the DataFrame, just run `df`. By the way, you can also do other awesome things like `read_excel`, `read_json`, and even `read_clipboard`! It's always a good idea to have pandas describe your DataFrame if you just want a quick look at some basic statistical measures.

df.describe()

From this you can find things like the 'centerpoint' of all the houses by looking at the average latitude and longitude (although it's not *really* the center, since the Earth is ellipsoidal), or the minimum house price overall ($1551!). OK, now let's say we want to find the cheapest house listed in Sacramento with more than 3,500 sq ft:

house = df.loc[df[(df.sq__ft > 3500) & (df.city == 'SACRAMENTO')]['price'].idxmin()]

Let's figure out what this means from the inside out:

# Get the rows for which sq__ft > 3500 and city == 'SACRAMENTO'
candidates = df[(df.sq__ft > 3500) & (df.city == 'SACRAMENTO')]

# Get the prices for those rows
prices = candidates['price']

# Get the minimum of those prices as a row index
lowest_priced_house_index = prices.idxmin()

# Get the row at that index
house = df.loc[lowest_priced_house_index]

Now we have our house selected. Let's see where it is! We can use the latitude, longitude pair or the address.

import subprocess as sbp

# lat, lng (convert the numbers to strings before building the URL)
lat, lng = house['latitude'], house['longitude']
sbp.run(["start", "chrome", "https://www.google.ie/maps/place/" + str(lat) + "," + str(lng)], shell=True)

# address (replace spaces so the URL survives the shell)
address = (house['street'] + ' ' + house['city']).replace(' ', '+')
sbp.run(["start", "chrome", "https://www.google.ie/maps/place/" + address], shell=True)


The address will probably give you better information. If you're on Linux, substitute `"start", "chrome"` with just `"firefox"` and drop the `shell=True`.

I would encourage you to take a look at the 10 Minutes to Pandas section on the pandas website; it gives you a very quick tour of some of the most common things you might want to use it for.


I know this is an introduction to Python for Data Science, but I feel I need to emphasize a more language agnostic point: if you don't know how to use regular expressions yet, it is absolutely imperative that you learn how to use them! Dealing with textual data goes from being a massive headache to a breeze, even after just learning the basics. There is a great Stack Overflow answer that covers a lot of the basics of regex, with lots of links included. Once you have a rudimentary grasp, try solving the problems at http://regex.alf.nu/ as practice; I can't recommend it enough.
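
As a taster, here's a tiny sketch (the text and the pattern are made up for illustration) of how little code it takes to pull structured bits out of messy text:

import re

# Pull ISO-style dates like "2016-03-12" out of free text
text = "Logged 2016-03-12, checked again on 2016-04-01, done."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(dates)  # ['2016-03-12', '2016-04-01']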

(by Conor Reynolds, former intern in AIB Data Science.)


Data Visualisation Tools

Visualizing your data is an important early step in the analysis of any dataset. It allows you to get a better understanding of the data you are working with and to identify features of your dataset such as the distribution of data or the presence of any outliers.

Data visualization is also an excellent way of conveying key messages in the presentation of findings that come about as a result of analysis. The visualizations can allow decision makers to understand concepts and spot trends in the data that may not be apparent from looking at statistics or at the data itself.

Some useful tools for data visualization include:

1. Tableau
Tableau’s Desktop edition is a very useful platform that lets you build data visualizations in a simple and intuitive manner. An annual student subscription can be found here.

2. D3.js
D3 is a JavaScript library that can be used to create a variety of charts. Due to its open source nature it is highly customisable and very accessible. See https://d3js.org/ for more info.

3. Leaflet
Leaflet is an open source JavaScript library that can be used for plotting data on an interactive map. It is highly customisable and feature rich, making it a great tool for geospatial analysis. Here's one we put together:

[Image: an interactive Leaflet map built by the team]
See http://leafletjs.com/ for more info.

4. Shiny by RStudio
Shiny is a web application framework for R. It allows you to create fully interactive visualizations through the use of a number of different plugins, including Leaflet, Dygraphs and Highcharts. These applications can subsequently be hosted on the web for users to interact with.
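
Here's a minimal sketch of what a Shiny app looks like; the slider-driven histogram and the built-in faithful dataset are just for illustration:

library(shiny)

# A slider that controls the number of bins in a histogram
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20),
  plotOutput("hist")
)

server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption durations", xlab = "Minutes")
  })
}

shinyApp(ui = ui, server = server)  # launches the app in your browser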




(By Killian Watchorn, Data Scientist with AIB)


R

The best way to learn R is by using it. There are many ways to get started, the easiest being in your browser: online sites such as Try R or DataCamp allow you to interactively learn and code in R without having to download anything.

If you want to get more familiar with it, the best way to set yourself up is by installing R on your local machine. You can download a copy from CRAN – the Comprehensive R Archive Network. From there you can start working with R either in an IDE (Integrated Development Environment) such as RStudio, or through the terminal using the R console. Whichever your preferred method is, just get coding! Links to learning resources are included along the way and at the end of this tutorial.

CRAN is also where you get all the packages you need for analysis, visualization, machine learning, data munging, etc. R comes with some basic functionality, but to extend this you need to learn how to install and use the packages. A simple internet search for the functionality you need will show the required package.

For example, let’s install ggplot2, one of the most popular packages for visualization:

In your IDE or in the console, type the command:

install.packages("ggplot2")

then to load the package:

library(ggplot2)

You can also install packages that other people have written on GitHub and Bitbucket; you'll need the devtools package to install these from the web. A helpful list of popular packages can be found here.
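
For instance, a quick sketch (the repository name here is hypothetical):

install.packages("devtools")
devtools::install_github("someuser/somepackage")  # hypothetical GitHub repo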

The way I learned a lot of R is through the swirl package. Install it just like you would any other package, then call the swirl() function and you get brought through a comprehensive, interactive tutorial on the basics of R – right in the console or your IDE! It corrects you when you are wrong and gives encouragement when you are doing well. I highly recommend it!
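
Getting started with swirl looks like this:

install.packages("swirl")
library(swirl)
swirl()  # starts the interactive tutorial in your console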

The main steps in any data analysis using R are as follows:
1. Import Data
2. Prepare, Explore and Clean Data
3. Statistical Modelling & Analysis
4. Export Data & Visualizations

1. Importing the Data
You can import data from almost any source into R, with each requiring a different approach. Usually you will be working with ‘flat files’ – text files that contain tabular data. Functions such as read.table() and read.csv() from the utils package are commonly used.

For example, let's load a csv called 'data.csv' and store it in a variable called myData:

myData <- read.csv(file="data.csv", header=TRUE, sep=",")

The readxl, gdata and XLConnect packages all contain functions for working with Excel files. There are also packages that allow you to work directly with a database, or to harvest web data using APIs. This is a great introduction to everything you need to know about importing data into R.

2. Preparing, Exploring and Cleaning the Data
It’s been said that 80% of data analysis is spent on cleaning and preparing the data. Every budding data scientist should therefore learn and appreciate the methodologies for cleaning data and the tools available to carry them out. A great paper to read is the study titled ‘Tidy Data’ by Hadley Wickham, which illustrates the importance of this step (you can also read a more code-heavy version of the paper here).

Two of my favourite tools in R are the tidyr and dplyr packages: tidyr for all your tidying needs, dplyr for manipulating data-frame-like objects. The data.table package is not as intuitive to use, but is very powerful and fast.
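
To give a feel for dplyr, here's a small sketch (the data frame is invented for illustration):

library(dplyr)

# A made-up data frame of transactions
transactions <- data.frame(
  city  = c("Dublin", "Cork", "Dublin", "Galway"),
  price = c(100, 250, 175, 90)
)

# dplyr verbs chain together with the %>% pipe
transactions %>%
  filter(price > 95) %>%              # keep rows where price > 95
  group_by(city) %>%                  # group the remaining rows by city
  summarise(avg_price = mean(price))  # average price per city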

Check out this introduction. Lubridate is the go-to for date and time manipulations, and time series data is easily navigated with packages like zoo and xts.

3. Statistical Modelling & Analysis
There are lots of packages which contain built-in algorithms for machine learning and analysis. For example, use the randomForest package to build a forest of decision trees with just one line of code.

library(randomForest)  # install.packages("randomForest") first if needed
output.forest <- randomForest(person ~ age + shoeSize, data = myData)

This will predict person based on age and shoe size in a dataset called myData. There are many functions in the randomForest package to determine, based on the forest built, which factors are important in determining a variable, error rates, etc.

There are many other packages for analysis, igraph (for graph data), rpart (regression trees), nnet (neural networks) and caret (classification and regression) being the most popular.

4. Export Data & Visualizations
Visualizations in R are easy and highly customisable. Start with the most popular visualization package, ggplot2. Not only can you do simple histograms, but also complicated map and graph visualizations. There are lots of extensions to ggplot2 to add different themes and functionality, a list of which can be found at www.ggplot2-exts.org. DataCamp has a great tutorial on getting started with R visualizations – from axis and point labels, to adding colour and background graphics.

A very handy tool for creating interactive dashboards and powerful visualizations is RShiny. You can find a comprehensive introduction on the site.

If the R visualizations don’t do it for you, you can export your results from your analysis and visualize them in the tool of your choice. You can find more details on other types of visualization tools in the visualization section of these tutorials.

Although it seems like a lot of information to take in at once, after using R a couple of times the package names will start to become more familiar, the syntax will seem easier and the workflow will feel intuitive. Start simple: use swirl, practice loading data, clean it with tidyr, explore it with simple functions like str(), dim(), nrow() and ncol(), and analyse the data with kmeans() or lm().

[Fun extra: Using R to analyse Pokemon data!]

More courses and resources here:
- https://cran.r-project.org/doc/manuals/R-intro.pdf
- https://www.coursera.org/learn/r-programming
- https://www.udacity.com/course/data-analysis-with-r--ud651
- http://online.stanford.edu/course/statistical-learning-winter-2014
- https://www.kaggle.com/wiki/Tutorials
- http://rattle.togaware.com/
(By Cailla Rose O'Shea, AIB Data Scientist)


Copyright © 2016 AIB. Allied Irish Banks, p.l.c. is regulated by the Central Bank of Ireland.