Self-service analytics roadtest: Watson Analytics vs Tableau vs Popily

If you teach someone how to fish…

The world of analytics has exploded with a vast array of new technologies, tools, systems, training, opportunities and business models. Most people understand that analytics is powerful and have heard stories about how companies like Amazon and Google use it to drive innovation and grow their organisations. However, when it comes to your own life, it can be difficult to understand exactly how you can use it. For some, analytics feels like something akin to magic wielded by ‘data scientists’ with PhDs and decades of experience.

The reality is that analytics is being democratised by the very same technology that’s made it valuable. This has given rise to self-service analytics. After years of investment in centralising data, maturing data governance and building user-friendly software, there is now a range of options for anyone to answer their own questions using sophisticated analytical techniques.

There are a lot of tools available for anyone to do their own analytics. Some are ‘one off’ tools, like Google’s Ngram Viewer, which lets you investigate how frequently specific words have been used in books, or Twitter Analytics, which lets you look over the stats for your own account. Then there are broader tools that allow you to investigate a range of different data sources. While there are many examples, I want to focus on three across the broad spectrum of options: Watson Analytics, Tableau and Popily.

Who’s who

TL;DR

  • Watson Analytics is cloud-based, lets you explore your own data by typing natural language questions, and is available with tiered payment options starting from free.
  • Tableau has desktop, cloud and server-based options, is optimised for enterprise data sources, and has free and paid options.
  • Popily is a brand new offering that will continue to mature through new releases. It’s cloud-based and free, but currently only uses publicly available data.

Watson Analytics

You may recognise the name ‘Watson’ as the artificial intelligence developed by IBM that won the quiz show Jeopardy in 2011. Watson was able to listen and respond to natural language questions, beating two previous champions. Today, Watson is able to analyse large corpora of unstructured data, allowing it to manage decisions in lung cancer treatment, find new food combinations for recipes and make music recommendations.

The Watson AI that is able to do all this is not necessarily the same ‘Watson’ you have access to as part of IBM’s cloud-based Watson Analytics offering. Watson Analytics allows you to ‘ask’ questions about your data sets by typing them in natural language. Watson Analytics responds with options and graphs that it’s determined will best answer your question.

While there appears to be no move to provide a desktop version of Watson Analytics, IBM’s enterprise-grade business intelligence offering, Cognos, is inheriting some of Watson Analytics’ natural language processing and visualisation aesthetics. For a great overview of the product, check out this video.

Tableau

Tableau is best known as a visualisation tool. Its adoption within the business community continues to grow year on year. Tableau is a mature offering and recently released version 9. It can be deployed on your local machine, your server or from the cloud. It allows you to create beautiful, interactive graphs to quickly and intuitively tell a story or to provide insight into previously unintelligible data. To get a sense of the look and feel of Tableau’s visualisations, check out their gallery.

Popily

Popily is a brand new offering released by the same team that is responsible for the analytics-themed podcast Partially Derivative and that developed CrisisNET. Popily gives non-technical people the ability to explore data without needing to know code or statistics. As a brand new offering, the cloud-based Popily can only be used to explore publicly available data sets added to their platform. I believe the release of Popily is the start of a wave of new start-ups with a focus on self-service analytics, leveraging the rise of technologies like software-as-a-service, machine learning and scalable analytics.

Let’s test them

I’ve reviewed these offerings across the following areas:

  1. Signing up
  2. Loading data
  3. Finding insights

The data we’re looking at has been limited to what’s currently available through Popily’s public library of data sources. We’ll use Airbnb’s data set because they share their listing information through a Creative Commons license. In fact, you can explore the data through their own visualisations here (created using Leaflet and Mapbox).

Signing up

All three offerings have a free option, so feel free to jump in yourself and have a play (Watson Analytics, Tableau Public and Popily). Creating accounts for all options is straightforward, although you’ll need to download software for Tableau.

For Watson Analytics, if you pay you’ll be able to analyse more data (more rows and columns), and there’s an enterprise version where you can allocate access across a tenancy. Actual prices and packages are constantly changing (at least at the time of writing), so check out the site for the latest prices.

Tableau has paid options designed for enterprises that are structured around the number of licensed users. For companies this means you’ll be paying for both desktop versions and a server license so that you can privately share your visualisations. Specifying users can be a bit limiting if you’re an organisation that prefers to have flexibility or plans on managing security access through Tableau Server.

Loading data

Watson Analytics allows you to upload your own data and, if you upgrade, you can also connect automatically to the Twitter API (they’ll grab a 10% sample of tweets for the last 6 months based on keywords). Adding data is as simple as clicking the add button from the login dashboard. The free account is limited to 50,000 rows and 40 fields. Adding an abridged version of the Airbnb data set took about 6 minutes over a medium speed NBN connection. Once it has uploaded, the first thing you’ll notice is that Watson Analytics has assessed the quality of your data. When you first click on your data set you’ll get a dialog box with a series of prompt questions.

self-service analytics - watson analytics - prompt questions

Tableau is optimised to analyse large data sets. Tableau Public can connect to Microsoft Excel, Microsoft Access and text files. While you are limited to 1 million rows of data, this is only a limit per connection. There is a file size limit of 1 gigabyte to save to the cloud. Adding data connections is easy: you can select by source type (e.g. Excel file, database, etc.), view the data once connected, and select how you want to import the fields.
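If you have the data in R already, the row and field limits quoted above can be checked before you start an upload. This is only a sketch: the numbers are the ones mentioned in this post and may have changed, the `listings` data frame is made up, and I have assumed no column limit for Tableau Public because none is mentioned here.

```r
# Row/column limits quoted in this post; they may have changed since.
limits <- list(
  watson_free    = list(max_rows = 50000,   max_cols = 40),
  tableau_public = list(max_rows = 1000000, max_cols = Inf)  # column limit assumed absent
)

# Returns TRUE when the data frame fits within the named tool's limits.
fits_limits <- function(df, tool) {
  lim <- limits[[tool]]
  nrow(df) <= lim$max_rows && ncol(df) <= lim$max_cols
}

# Made-up example: 60,000 rows and 3 columns.
listings <- data.frame(id = seq_len(60000), bedrooms = 2, price = 100)

fits_limits(listings, "watson_free")     # too many rows for the free tier
fits_limits(listings, "tableau_public")  # within the per-connection limit
```

A check like this is why I used an abridged version of the Airbnb data for Watson Analytics.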

There is currently no ability to load your own data sets into Popily. This is why we’re using the Airbnb public data set already added to Popily. They are extending invitations to companies to add their data now.

Finding insights

The focus of this section is looking for relationships between the price of accommodation and the number of rooms.

As we saw when we first loaded our data set, Watson Analytics is already suggesting areas that we might want to investigate. If you select the Explore option you’ll be able to ask your natural language questions. In this instance I’ve asked ‘what is the relationship between bedrooms and weekly_price?’.

self-service analytics - watson analytics - search by room and price

Exploring these options, I found that the visualisations are not all that useful initially. Watson Analytics likes to aggregate by average, and that hides a lot of the information you want to see. However, clicking on the column function on the right allows you to select exactly which fields you want and how to graph them. Using this I created the following graph.

self-service analytics - watson analytics - price by bedroom by property_type

This graph is more meaningful. I can see the relationship I’d expect to see between price and the number of rooms. But now I can also see which properties attract a higher premium per room (in this instance it’s trains and boats). You can also quickly click on the property_type field and select other relevant fields to investigate, like Country and Neighborhood. Another powerful option available through Watson Analytics is its prediction engine. To see more about this feature check out some guides here and here.

self-service analytics - watson analytics - prediction dialog
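For anyone who wants to sanity-check the ‘premium per room’ reading of the Watson Analytics graph in code, here is a rough R sketch of the same aggregation. The listings and prices are invented purely for illustration; only the column names bedrooms, weekly_price and property_type come from the data set discussed above.

```r
# Toy listings; the values are invented for illustration only.
listings <- data.frame(
  property_type = c("Apartment", "Apartment", "Boat", "Boat", "House", "House"),
  bedrooms      = c(1, 2, 1, 2, 2, 4),
  weekly_price  = c(500, 900, 1200, 2000, 1000, 1600)
)

# Price per bedroom for each listing.
listings$price_per_room <- listings$weekly_price / listings$bedrooms

# Average price per bedroom by property type, highest premium first.
avg_by_type <- aggregate(price_per_room ~ property_type, data = listings, FUN = mean)
avg_by_type <- avg_by_type[order(-avg_by_type$price_per_room), ]
avg_by_type
```

Averaging by group like this is what the tools do behind the scenes, which is also why a single average can hide the spread within each property type.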

Tableau is much more hands-on than Watson Analytics or Popily. This means that when you first add your data set, you’re not going to get any automatic recommendations. However, Tableau has done a lot behind the scenes. It’s categorised each of the Airbnb fields and determined whether they are dimensions or measures. This works in your favour when deciding how to visualise your information.

self-service analytics - tableau - first screen

From this starting screen you can start to explore your data. To explore the relationship between beds and price, you grab the fields from the lists on the left and drag them across to the row and column shelves. Tableau will automatically select the scatter plot chart, which for this investigation is exactly what we want. We can now decide which detail we want to split the plot by. Dragging across the property type field, and aggregating by average values, we can create a graph similar to the one we created in Watson Analytics.

self-service analytics - tableau - second screen

From here there’s a lot of flexibility in what you can do with this information. You can add dimensions to change size, shape and colour. You can also quickly add filters and trendlines, forecast if you have time series data, or plot data on a map.

self-service analytics - tableau - third screen

When you first log in to Popily you’ll see a list of recent public data sources on the right. Click on Airbnb listings and you’ll immediately be presented with a set of charts. If you scroll to the bottom you’ll see that the data source has been prepopulated with 2,421 pages of charts. You can go through and explore these pages, but it makes more sense to start limiting your search to the fields that you are interested in.

self-service analytics - popily - first log on

Let’s start our search with the relationship between cost and the number of rooms. You can search by fields within the yellow bordered search dialog at the top of the screen. Select monthly price and number of beds. You’ll see the number of pages has been limited to 5 and you can start exploring charts more relevant to your investigation. You’ll be presented with a chart called Average monthly price by number of beds over date cost started on AirBnB. Once again, not particularly insightful. If you scroll down you’ll see Average monthly price of number of beds.

self-service analytics - popily - search by average better result

This graph is a little more useful as we can start to see the relationship: more beds, more expensive. However, from the example picture above you’ll notice an immediate limitation of Popily’s visualisation. There are no axis headings, no legend and no labels. In fact, other than the heading, the only way to know what you are looking at is to mouse over the graph elements. Even more annoying, if you have multiple elements on a line graph it won’t label the values (you need to guess), and you need to be very precise with how you position your mouse to get the values.

Conclusion

I like Tableau because it provides the most control over how you load, model and visualise insights. However, the value of self-service analytics is giving anyone the power to do meaningful analytics. From the perspective of a non-technical user I’d recommend Watson Analytics. It’s a more mature offering than Popily and doesn’t present you with the learning curve required for Tableau. I’m looking forward to seeing how these offerings continue to grow and evolve. If you agree or disagree, let me know below.


UPDATED: Sentiment Analysis with “sentiment”

I was looking for a quick way to do sentiment analysis for comments from an employee survey. I came across this post here by Gaston Sanchez.

The guide is a little dated now (the “sentiment” package needs to be manually downloaded, ggplot2 has been updated, setting up a Twitter API has changed, etc). Since I found Gaston’s guide useful, I’ve included some updated steps to effectively get the same output that they provided previously.

This example looks for the sentiment of tweets about the #UCLfinal.

NOTE: Tested with R version 3.1.2 through RStudio

Step 1 – Install packages

You will only be required to install these packages the first time.

# Required packages for the plots
install.packages(c("plyr","ggplot2","wordcloud","RColorBrewer","httr","slam","mime","R6","Rcpp"))

# Required packages to connect to your Twitter API
install.packages(c("twitteR","bit","bit64","rjson","DBI"))

# Required packages for sentiment
install.packages(c("NLP","tm","Rstem"))
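Since the packages only need installing once, it can be handy to check what is already there before calling install.packages() again. The helper below is a hypothetical convenience of my own (`missing_packages` is not part of any package), using only base R; the actual install call is left commented out so the sketch is safe to run.

```r
# Return the packages from `pkgs` that are not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("plyr", "ggplot2", "wordcloud", "RColorBrewer")
to_install <- missing_packages(wanted)

# Only call install.packages() when something is actually missing.
if (length(to_install) > 0) {
  message("Would install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)
}
```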

Step 2 – Install ‘sentiment’

The sentiment package is not available from all the CRAN servers, so you may need to install it manually. Download “sentiment_0.2.tar.gz” from http://cran.r-project.org/src/contrib/Archive/sentiment/

# Update [directory] with the location where you have saved "sentiment_0.2.tar.gz"
# (install.packages needs the path to the file itself, not just the folder)
install.packages("[directory]/sentiment_0.2.tar.gz", repos = NULL, type = "source")

Step 3 – Load all your packages

You will need to load these packages for each new session.

library(plyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(httr)
library(slam)
library(mime)
library(R6)
library(twitteR)
library(bit)
library(bit64)
library(rjson)
library(DBI)
library(tm)
library(Rstem)
library(NLP)
library(sentiment)
library(Rcpp)

Step 4 – Set up your API with Twitter

Go to https://apps.twitter.com/ and sign in (you’ll need to create a Twitter account if you haven’t already).

Click on ‘Create New App’

Complete the compulsory fields, accept the Developer Agreement (note you can enter a placeholder Website if you don’t have one) and click ‘Create your Twitter Application’.

After the application management page loads click ‘Keys and Access Tokens’ and note your consumer key and secret.

Click ‘Create my access token’ and note your access token and token secret.

Step 5 – Connect to Twitter

Enter the authentication details below

# Authenticate with Twitter

api_key <- "[your key]"
api_secret <- "[your secret]"
token <- "[your token]"
token_secret <- "[your token secret]"
setup_twitter_oauth(api_key,api_secret,token,token_secret)

If you get the following prompt:

[1] "Using direct authentication"
Use a local file to cache OAuth access credentials between R sessions?
1: Yes
2: No

Press 1 and execute to save a local copy of the OAuth access credentials.
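If you would rather not paste your keys into a script, one option is to read them from environment variables instead. This is a sketch only: the variable names below are ones I made up (set them in your shell or ~/.Renviron), and `read_twitter_credentials` is a hypothetical helper, not part of twitteR.

```r
# Hypothetical variable names; set them in ~/.Renviron or your shell.
cred_vars <- c("TWITTER_API_KEY", "TWITTER_API_SECRET",
               "TWITTER_TOKEN", "TWITTER_TOKEN_SECRET")

# Read the four credentials, stopping early if any of them is unset.
read_twitter_credentials <- function(vars = cred_vars) {
  values <- Sys.getenv(vars, unset = NA)
  if (anyNA(values)) {
    stop("Missing environment variables: ",
         paste(vars[is.na(values)], collapse = ", "))
  }
  values
}

# Usage (not run here, since it needs real keys):
# creds <- read_twitter_credentials()
# setup_twitter_oauth(creds[[1]], creds[[2]], creds[[3]], creds[[4]])
```

Keeping the keys out of the script means you can share the code without sharing your credentials.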

Step 6 – Harvest tweets

Now it’s time to harvest the tweets for analysis. Note, if you’re sitting behind a firewall this may not work. If so, tweak your firewall settings. Additionally, it might take a minute to harvest the tweets.

# harvest some tweets
some_tweets = searchTwitter("uclfinal", n=1500, lang="en")

# get the text
some_txt = sapply(some_tweets, function(x) x$getText())

Step 7 – Prepare text for sentiment analysis

# remove retweet entities
some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)

# remove at people
some_txt = gsub("@\\w+", "", some_txt)

# remove punctuation
some_txt = gsub("[[:punct:]]", "", some_txt)

# remove numbers
some_txt = gsub("[[:digit:]]", "", some_txt)

# remove html links
some_txt = gsub("http\\w+", "", some_txt)

# remove unnecessary spaces
some_txt = gsub("[ \t]{2,}", " ", some_txt)
some_txt = gsub("^\\s+|\\s+$", "", some_txt)

# define "tolower error handling" function 
try.error = function(x)
{
   # create missing value
   y = NA
   # tryCatch error
   try_error = tryCatch(tolower(x), error=function(e) e)
   # if not an error
   if (!inherits(try_error, "error"))
   y = tolower(x)
   # result
   return(y)
}

# lower case using try.error with sapply 
some_txt = sapply(some_txt, try.error)

# remove NAs in some_txt
some_txt = some_txt[!is.na(some_txt)]
names(some_txt) = NULL
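To see what the cleaning steps do, it can help to run them on a single made-up tweet. The sketch below wraps the same gsub patterns in a hypothetical `clean_tweet` function; it collapses runs of spaces to a single space so words do not fuse together, and it skips the tryCatch wrapper around tolower for brevity.

```r
# Apply the cleaning steps from this section to a character vector.
clean_tweet <- function(txt) {
  txt <- gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", txt)  # retweet entities
  txt <- gsub("@\\w+", "", txt)                        # @mentions
  txt <- gsub("[[:punct:]]", "", txt)                  # punctuation
  txt <- gsub("[[:digit:]]", "", txt)                  # numbers
  txt <- gsub("http\\w+", "", txt)                     # links (already stripped of punctuation)
  txt <- gsub("[ \t]{2,}", " ", txt)                   # collapse repeated spaces
  txt <- gsub("^\\s+|\\s+$", "", txt)                  # trim both ends
  tolower(txt)
}

# Invented tweet, for illustration only.
clean_tweet("RT @fan123: What a GOAL!!! http://t.co/abc 2015 #UCLfinal")
# returns "what a goal uclfinal"
```

Note that the order matters: punctuation is removed before links, so by the time the link pattern runs the URL has been squashed into one word starting with ‘http’.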

Step 8 – Perform sentiment analysis

Please note that classifying the polarity and emotion of the tweets may take a few minutes.

# classify emotion
class_emo = classify_emotion(some_txt, algorithm="bayes", prior=1.0)

# get emotion best fit
emotion = class_emo[,7]

# substitute NA's by "unknown"
emotion[is.na(emotion)] = "unknown"

# classify polarity
class_pol = classify_polarity(some_txt, algorithm="bayes")

# get polarity best fit
polarity = class_pol[,4]
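The column numbers 7 and 4 pick out the ‘best fit’ label from each result. As a sketch, assuming the column layout I recall from the sentiment package (six emotion scores followed by BEST_FIT), here is a mock result with invented scores that shows the same selection done by column name.

```r
# Mock output shaped like classify_emotion(): six score columns plus BEST_FIT.
# The numbers are invented; only the assumed column layout matters here.
class_emo <- matrix(
  c("1.47", "3.09", "2.07", "7.34", "1.73", "2.79", "joy",
    "1.47", "3.09", "2.07", "1.03", "1.73", "2.79", NA),
  nrow = 2, byrow = TRUE,
  dimnames = list(NULL, c("ANGER", "DISGUST", "FEAR", "JOY",
                          "SADNESS", "SURPRISE", "BEST_FIT"))
)

# Selecting by name gives the same column as class_emo[, 7].
emotion <- class_emo[, "BEST_FIT"]

# Tweets with no clear best fit come back as NA, so label them "unknown".
emotion[is.na(emotion)] <- "unknown"
emotion
```

It is worth running colnames() on your own class_emo and class_pol to confirm the layout before relying on either the names or the numbers.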

Step 9 – Create a data frame in order to plot the results

# data frame with results
sent_df = data.frame(text=some_txt, emotion=emotion,
polarity=polarity, stringsAsFactors=FALSE)

# sort data frame
sent_df = within(sent_df,
   emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

This is what the first 5 rows of data may look like for sent_df

sentiment analysis R - first 5 rows
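The sorting step is there so the bars in the plots run from the most to the least common emotion. A small sketch on an invented vector shows the idea: count the values, sort the counts, and use that order for the factor levels.

```r
# Invented emotion labels, for illustration only.
emotion <- c("joy", "unknown", "joy", "anger", "joy", "unknown")

# Count each label and sort from most to least frequent.
counts <- sort(table(emotion), decreasing = TRUE)

# Use the sorted names as factor levels so plots follow that order.
emotion_f <- factor(emotion, levels = names(counts))
levels(emotion_f)
```

Without this step the levels default to alphabetical order, which is rarely the order you want on a bar chart.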

Step 10 – plot the emotions and polarity of the tweets

# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
geom_bar(aes(y=..count.., fill=emotion)) +
scale_fill_brewer(palette="Dark2") +
labs(x="emotion categories", y="number of tweets") +
labs(title = "Sentiment Analysis of Tweets about UCL Final\n(classification by emotion)") +
theme(plot.title = element_text(size=12))

Sentiment analysis in R - emotionality

# plot distribution of polarity

ggplot(sent_df, aes(x=polarity)) +
geom_bar(aes(y=..count.., fill=polarity)) +
scale_fill_brewer(palette="RdGy") +
labs(x="polarity categories", y="number of tweets") +
labs(title = "Sentiment Analysis of Tweets about UCL Final\n(classification by polarity)") +
theme(plot.title = element_text(size=12))

sentiment analysis of tweets about UCL final - polarity

# separating text by emotion

emos = levels(factor(sent_df$emotion))
nemo = length(emos)
emo.docs = rep("", nemo)

for (i in 1:nemo)
{
   tmp = some_txt[emotion == emos[i]]
   emo.docs[i] = paste(tmp, collapse=" ")
}

# remove stopwords
emo.docs = removeWords(emo.docs, stopwords("english"))

# create corpus
corpus = Corpus(VectorSource(emo.docs))
tdm = TermDocumentMatrix(corpus)
tdm = as.matrix(tdm)
colnames(tdm) = emos

# comparison word cloud
comparison.cloud(tdm, colors = brewer.pal(nemo, "Dark2"), scale = c(3,.5), random.order = FALSE, title.size = 1.5)

Sentiment analysis in R - word cloud
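If you are wondering what the tdm matrix passed to comparison.cloud actually holds, it is a table of word counts with one row per term and one column per emotion. The base R sketch below builds the same kind of matrix by hand from two invented documents; it is not how the tm package does it internally, just an illustration of the shape.

```r
# Two invented "documents", one per emotion.
docs <- c(joy = "great goal great game", anger = "terrible referee terrible game")

# Split each document into words and collect the vocabulary.
tokens <- strsplit(docs, " +")
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary term appears in each document.
tdm <- vapply(
  tokens,
  function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
  integer(length(vocab))
)
rownames(tdm) <- vocab
tdm
```

The comparison cloud then sizes each word by how much more often it appears under one emotion than the others.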