import blogpost
Β def prepandpython():

As it turns out, consistent blogging isn’t a strong suit of mine. I’ll make up for that with valuable (ish) blogging instead. Playing around with the beta version of Tableau 2019.3, I spotted a feature that I’ve been waiting for [demanding] for some time now: integration with Python. In the short time I’ve used it, I’m more than happy to say that it’s awesome and I’m fully excited and ready to embrace Tableau Prep next step in its development journey (…flow?).

If that excitement also applies to you, get over to the pre-release page and download the latest beta version – then use this post as a very loose getting started guide.

So… How does this all work then?

Step 1 – Set up TabPy

Tableau Prep uses TabPy as a means of passing through the data into Python to be run / ingested. Installing and setting up TabPy is actually incredibly straightforward – if you’re new to using it, follow the steps on GitHub here.

Once you’ve got that set up and configured, start the service and you’re good to go. For the most part, I typically run TabPy locally, but you could host your TabPy service on a dedicated server – even in the cloud.

Note: Script processing speed is fully dependent on the machine running TabPy, the pass through of data, and the script itself.

Note #2: Remember to remove all unnecessary print() statements when moving to production. Print() will output in to the command terminal of TabPy – and can slow things down if you accidentally leave it in.

Step 2 – Adding and Configuring a Script Node

In the 2019.3 version of Prep, you’ll find a new option within your flow, “add script”.

Within the script step, you first add the location of the TabPy service you previously configured and select your script file. Script file you say? Let’s cover the basics of what that means and how it should look.

Step 3 – The ScriptΒ 

Using Python, Prep will start you off with a pandas dataframe (df) – with pandas loaded as pd (of course). If you’re unfamiliar with pandas, I highly highly recommend reading up and doing some further learning. It’s a fantastically simple way of manipulating sets of data in a way that’s familiar to most data-literate people.

Now, what happens between that initial dataframe and the end of the script is totally up to you. Us numpy to calculate a forecast? Go ahead. Scrape the web for your favourite stock data? Absolutely. Feed the data from Prep in to a database? Sure! Download a list of cat memes? I mean, you’re pushing it, but why not? You can import the libraries you need and do – technically – whatever you want. The only* requirement from a Prep point of view is that the end of the script returns a new dataframe for Prep to ingest.

Well, I said “only”, but there’s one more bit to consider: the structure of the final dataframe.

Step 4 – Defining the finish

This is hugely important – if your output structure differs from the original input, you’ll have to include a function in your script called “get_output_schema()”. This function is Prep’s way of understanding what’s coming its way as part of the flow and should be structured like so:

Replacing ‘Col name’ with the actual names of the columns within the dataframe, matched with the correct datatype on the other side. You don’t actually need to include all the fields from your final dataframe, just the ones you’re interested in using in Prep. Each column should match up to a data type defined in that list above.

As example, here’s my setup from my stock script (full script below):

The get_output_schema() function doesn’t have to be called directly. Prep looks for it regardless, so all you’ll have to call is your main function instead. So, with that, you’re all set to start using your script as part of your Prep flow.

Step 5 – Debugging the inevitable bugs

Before we finish, we need to consider a hugely important element of using Python, debugging. My personal favourite reason for using Python above other programming languages is that *most* errors are easily searchable – and likely there’s somebody on Stack Overflow with a similar issue.

If there’s an error as part of your script, Prep will actually surface that error in the alert window in the top right within the Prep window. If TabPy is accessible, you can also check the TabPy cmd window to see the result of the script as well as the errors that appear as a result.

I generally recommend constructing your script in an environment like PyCharm first before directly running in Prep to avoid any obvious errors.

Bonus – Example Script

As a nice bonus, here’s the script I used to pull through two years of Tableau data in from Yahoo Finance. Before the first step, I connected to an Excel file with literally the number “1” in and nothing. else. The very first part of my Python code effectively disregards it immediately. This should give you a good overview of a few different pandas functions combined with how to surface the result correctly in Prep.

I hope this helps you on your journey to using Prep + Python… I’m excited to see where the community takes this next!

*Updated to avoid formatting issues*

#################################
# Tableau Prep | TabPy          #
# Scenario 2: Bring in web data #
#################################

import time
import pandas_datareader as dr
from datetime import datetime, timedelta

def getstock(df):

    stockname='DATA'

    yesterday = (datetime.now() - timedelta(1)).strftime('%Y-%m-%d')
    df = dr.DataReader(stockname, 'yahoo', '2016-12-01', yesterday)

    del df['Close']
    # rename Adjusted Close to Close
    df.rename(columns={'Adj Close': 'Close'}, inplace=True)

    start_date = df.index.min()
    end_date = df.index.max()
    all_weekdays = pd.date_range(start=start_date, end=end_date, freq='B')

    # Filling missing weekdays
    df = df.reindex(all_weekdays)
    df = df.fillna(method='ffill')

    df['% Change'] = df['Close'].pct_change(1)
    df['Change'] = df['Close'].diff()
    df['High-Low Diff'] = df['High'] - df['Low']
    df['14ATR'] = df['High-Low Diff'].abs().rolling(window=14).mean()
    df['20ATR'] = df['High-Low Diff'].abs().rolling(window=20).mean()

    df['Name'] = stockname

    df = df.reset_index()
    df.rename(columns={'index': 'Date'}, inplace=True)
    df['Date'] = df['Date'].astype(str)

    return df

def get_output_schema():
    return pd.DataFrame({
        'Name' : prep_string(),
        'Date' : prep_string(),
        'High' : prep_decimal(),
        'Low' : prep_decimal(),
        'Open': prep_decimal(),
        'Volume': prep_decimal(),
        'Close': prep_decimal(),
        '% Change': prep_decimal(),
        'Change' : prep_decimal(),
        'High-Low Diff': prep_decimal(),
        '14ATR': prep_decimal(),
        '20ATR': prep_decimal(),
        })


1 Comment

Bashii - Tomohiro Iwahashi · 12 July 2019 at 4:55 pm

Hi, Tom, Thank you for letting me translate your blog to Japanese : I posted article to my blog ” Start Data Science From Tableau ” http://lovedata.main.jp/2019/07/13/tableau-prep-builder-tabpy/ We have Google Translate button (Orange One) on the left bottom . I’m not quite sure how accurate the translation works πŸ™‚ I hope Japanese customer will jump to your site from mine !

Leave a Reply

Your email address will not be published. Required fields are marked *