Owen Rumney


Software Engineer


I’m hoping this will be a reasonably accurate account of my play with the TfL Cycling DataSets.

I’m still forming my plan, but loosely, I want to end up with a visualisation in which the bike points are highlighted over a time series as bikes are taken and returned.

Initially, I’m working on my Mac, but I have a Databricks community cluster that I’ve migrated some of the parts to.

Preparing my Local Env

As I said, I’m using my MacBook, so I’m going to install a couple of things.

Install Spark

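To install Spark, I use Homebrew. The formula is called apache-spark; the formula called plain spark is an unrelated sparklines utility.

brew install apache-spark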

Install Jupyter

Jupyter notebooks are installed with pip

pip install jupyter
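
Before moving on, it can be worth confirming that both installs actually put their launchers on the PATH. The following is a minimal sketch using only the Python standard library. The helper name missing_tools is made up for illustration, and it assumes the two installs expose commands called pyspark and jupyter.

```python
import shutil


def missing_tools(names):
    """Return the command names from `names` that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed launcher names for the two installs above.
    missing = missing_tools(["pyspark", "jupyter"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("pyspark and jupyter are both on the PATH")
```

If anything is reported as missing, opening a fresh terminal so the PATH is reloaded is usually the first thing to try.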

Getting some data

I took a single file from the S3 bucket to play with locally. For no particular reason, I went with 01aJourneyDataExtract10Jan16-23Jan16.csv

aws s3 cp s3://cycling.data.tfl.gov.uk/usage-stats/01aJourneyDataExtract10Jan16-23Jan16.csv ~/datasets/cycling/.
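
Once the file is on disk, a quick look at the header and the first few rows tells you what columns you are dealing with before Spark gets involved. This is a small sketch with the standard csv module. The function name peek_csv is hypothetical, and the utf-8-sig encoding is an assumption about the file (it reads plain UTF-8 and also tolerates a byte order mark).

```python
import csv
from itertools import islice


def peek_csv(path, rows=3):
    """Return (header, first `rows` data rows) from a CSV file."""
    # Assumption: the file is UTF-8, possibly with a byte order mark.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])  # empty list if the file has no lines
        return header, list(islice(reader, rows))
```

Calling peek_csv(os.path.expanduser('~/datasets/cycling/01aJourneyDataExtract10Jan16-23Jan16.csv')) after an import os returns the column names and three sample rows as lists of strings.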

Starting Up

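Run the following commands to get your Jupyter Notebook up and running. The two variables tell pyspark to use Jupyter as its driver front end, so running pyspark opens a notebook server instead of the usual shell.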

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
pyspark
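
The exports above stay set for the rest of that shell session, so a later plain pyspark in the same terminal will also open a notebook. If that is not what you want, one alternative is to set the variables only for the child process. This is a sketch, not something the setup requires. The function names notebook_env and launch_notebook are made up for illustration, and it assumes the pyspark launcher is on the PATH.

```python
import os
import subprocess


def notebook_env(base=None):
    """Copy an environment and add the two variables that make pyspark open Jupyter."""
    env = dict(os.environ if base is None else base)
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def launch_notebook():
    """Start pyspark with the notebook settings; blocks until the server exits."""
    # Assumption: the pyspark launcher is on the PATH.
    return subprocess.run(["pyspark"], env=notebook_env()).returncode
```

Running launch_notebook() from a small script has the same effect as the three shell lines, without changing the environment of the terminal you started it from.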

Quick Test

Finally, a quick test to see how it looks. In the Jupyter notebook I can run

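import os
# Spark does not expand '~', so resolve the home directory in Python first
data = spark.read.csv(os.path.expanduser('~/datasets/cycling/01aJourneyDataExtract10Jan16-23Jan16.csv'), header=True, inferSchema=True)
data.show()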

This should show the first 20 rows of the data set, and we’re off.