How we won the Hackathon at PCM17

This article is about how we won this year’s Hackathon at the Pentaho Community Meeting. In addition to an explanation of how we came up with the idea and finally put it into practice, you can also download the entire project.

November 10, 2017 - 7 minute read -
web pentaho

Make Donald tweet again!

Hackathon

This year’s Pentaho Community Meeting started with a hackathon on Friday evening. Altogether 6 teams took part (or as Pedro put it so nicely: 5,5 teams 😜). The task this year was to make “something” out of a dataset with tweets from Trump. The dataset is public and can be downloaded here.

Idea

First we thought about which fun things could be done with the dataset. In the end, we agreed to generate our own tweets based on Trump's tweets. I came up with the idea when I visited the CCC last year. David Kriesel presented his project SpiegelMining there - over a period of several years he collected every article published on Spiegel Online. At the end of his presentation, the term Markov chain was mentioned. So why not run a Markov chain algorithm over the dataset with Trump's tweets?
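For readers who have not met Markov chains before, here is a minimal sketch of the idea in plain Python. It is not what we ran at the hackathon and not how any particular library works internally; the function names and the sample text are made up for illustration. Each word is mapped to the words that were seen directly after it, and a new text is produced by repeatedly picking one of those successors at random.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain, start, max_words=10, rng=None):
    """Walk the chain from `start`, picking a random successor each step."""
    rng = rng or random.Random()
    word = start
    output = [word]
    # Stop at the length limit or when the current word has no known successor.
    while len(output) < max_words and word in chain:
        word = rng.choice(chain[word])
        output.append(word)
    return " ".join(output)


# Made-up sample text, not taken from the dataset.
sample = "we will win and we will build and we will win again"
chain = build_chain(sample)
print(generate(chain, "we", rng=random.Random(0)))
```

Because successors are stored with repetition, words that follow each other often in the source are also picked more often, which is what makes the output sound like the original author.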

Architecture

Even with a small project like this one, it can’t hurt to think briefly about architecture in advance. Once we had decided to create our own tweets with a Markov chain, the following questions had to be answered:

  • Is there an implementation of the algorithm in Pentaho Data Integration or do we use Python?
  • What data do we need for the algorithm?
  • Where to put the generated tweet?
  • How do we present the generated tweet?

Python

Unfortunately, there is no step in Pentaho Data Integration that would have created a model for generating tweets for us. Although a model could have been built with the existing steps, we decided to use Python (also because of the limited time). As so often with Python, there is a suitable module for this purpose that is just waiting to be installed and used.

Markovify is a simple, extensible Markov chain generator that can be installed with:

pip install markovify

To create a Markov chain model, all we need is a single file containing all of Trump’s tweets. We then read this file, build our model and have it generate a sentence of at most 140 characters (remember: we want to generate tweets).

import markovify

# Get raw text as string.
with open("trump_raw.txt") as f:
    text = f.read()

# Build the model.
text_model = markovify.Text(text)

# Generate a tweet of no more than 140 characters.
# make_short_sentence returns None if no suitable sentence is found.
tweet = text_model.make_short_sentence(140)

# Write the tweet to a file (the file stays empty if generation failed).
with open("tweet.txt", "w") as f:
    f.write(tweet or "")

First problem solved: We can generate a tweet that could have been from Trump. 🤓 All you need to get this Python script up and running is a text file that contains all the tweets. To create this text file, a simple transformation is sufficient.

Pentaho Data Integration

Transformation (Raw Tweets)

First, we read in the CSV file and then keep only the column that contains the content of the tweet. The filter ensures that only tweets with non-empty content end up in the text file (the content is sometimes empty when only a picture or video was tweeted). While this transformation only needs to be executed once, the Python script has to be executed each time to generate a new tweet.
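If you do not have Pentaho Data Integration at hand, the same preparation step can be approximated in a few lines of Python. This is only a sketch, not the transformation itself: the column name `text`, the function name and the one-tweet-per-line output are assumptions, so adjust them to the actual CSV file.

```python
import csv


def extract_tweets(csv_path, out_path, column="text"):
    """Write every non-empty value of `column` to out_path, one per line.

    `column` is an assumed header name; returns the number of tweets kept.
    """
    kept = 0
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for row in csv.DictReader(src):
            content = (row.get(column) or "").strip()
            if content:  # skip tweets without text (picture or video only)
                # Flatten multi-line tweets so each tweet stays on one line.
                dst.write(content.replace("\n", " ") + "\n")
                kept += 1
    return kept
```

A call such as `extract_tweets("tweets.csv", "trump_raw.txt")` would then produce the input file for the Python script, where `tweets.csv` is a placeholder for the downloaded dataset.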

Before we take care of how to call the Python script from Pentaho Data Integration, we need to clarify what to do with the generated tweet. One simple way is to write the tweet to a database table. We enabled Truncate table to ensure that only the current tweet is in the database table.
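The effect of the Truncate table option can be illustrated with a small Python sketch. It uses an in-memory SQLite database purely for demonstration; the database we used at the hackathon, the column name `tweet` and the function name are assumptions here, and `DELETE FROM` stands in for a real truncate.

```python
import sqlite3


def store_tweet(conn, tweet):
    """Replace the table contents with the single current tweet."""
    with conn:  # commit on success, roll back on error
        conn.execute("CREATE TABLE IF NOT EXISTS tweets (tweet TEXT)")
        conn.execute("DELETE FROM tweets")  # stands in for "Truncate table"
        conn.execute("INSERT INTO tweets (tweet) VALUES (?)", (tweet,))


conn = sqlite3.connect(":memory:")
store_tweet(conn, "first generated tweet")
store_tweet(conn, "second generated tweet")

# Only the most recent tweet is left in the table.
print(conn.execute("SELECT * FROM tweets").fetchall())
# prints [('second generated tweet',)]
```

Keeping exactly one row means the dashboard query never has to sort or filter to find the current tweet.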

Transformation (Tweet)

Let’s briefly summarize what we have so far:

  • A transformation which creates a text file that contains all of Trump’s tweets
  • A python script that can generate a tweet for us (based on the text file)
  • A transformation that writes the generated tweet to a database table

The next task was to automate this process. We therefore created a job that executes the individual steps in the following sequence:

  • Execute Python script, which writes the generated tweet to a new text file
  • Read this file and write the tweet to a database table

Job (Tweet)

The shell script starts our Python script (you might even be able to execute the Python script directly). I work on macOS, so the paths have to be adjusted accordingly on Windows.

#!/bin/bash
python /Applications/Pentaho-7/server/pentaho-server/pentaho-solutions/system/MakeDonaldTweetAgain/endpoints/kettle/tweet.py
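If the hard-coded path is a problem, for example on Windows, a small Python launcher is one possible alternative to the shell script. This is a sketch and not part of our project: the function name is made up, and running the script from its own folder assumes that it opens its input and output files with relative names.

```python
import subprocess
import sys
from pathlib import Path


def run_script(script_path):
    """Run a Python script with the current interpreter; raise if it fails."""
    script = Path(script_path).resolve()
    result = subprocess.run(
        [sys.executable, str(script)],
        cwd=script.parent,    # relative file names resolve next to the script
        capture_output=True,  # collect stdout and stderr instead of printing
        text=True,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling `run_script` with the path to `tweet.py` would then do the same work as the shell script, with the path handling left to `pathlib`.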

To create a new tweet, you only have to run this job. Now we’ll take care of the dashboard, which will show our generated tweet and create a new tweet on button press.

Dashboard

We use the Application Builder to combine a dashboard with the power of Pentaho Data Integration.

Application Builder

After you have created a new app using the Application Builder, you will need to restart the Pentaho Server. Afterwards all functions are available to you. Under macOS you will find a new folder with the name of the app (I called the app MakeDonaldTweetAgain) under the following path: /Applications/Pentaho-7/server/pentaho-server/pentaho-solutions/system/MakeDonaldTweetAgain/. We now copy our transformations, the job, the Python script and the text file with Trump’s tweets into the folder endpoints/kettle. Then we add a dashboard by clicking Add New Element on the Elements tab.

Application Builder (Elements)

Our app now consists of the following elements:

  • dashboarddonald (CDE Dashboard as Frontend)
  • kettledonald (Job, that generates a new tweet and writes it to a database table)
  • 0_import (Transformation, that generates a text file with all tweets of Trump - can be ignored)
  • 1_tweet (Transformation, that generates a new tweet - can be ignored, is part of the Job)

Let’s start by defining the layout for our dashboard. As we don’t need many elements, this is quickly created. All we need is two components, a button that launches our job, and a table that displays the generated tweet.

CDE Dashboard Layout

Now we add the two needed components (button and table) to our dashboard. In order for the table to refresh itself after running the job and display the new tweet, we need to write a callback function that updates our dashboard. To do this we need to write the following code snippet into the Success Callback function.

function f() {
    Dashboards.initEngine();
}

The entire thing looks like this:

CDE Dashboard Components

The Action Datasource is the Kettle job that is executed when you click on the button. We add this DataSource in the Datasources tab. In our case, the name of the output step is OUT and the type is JSON.

CDE Dashboard DataSources

We retrieve the generated tweet that we want to display in our table component from the database - a simple SQL query is sufficient.

SELECT *
FROM tweets;

That was it in principle - we can now create new tweets from the dashboard with a single click of a button and display them directly.

Posting to Twitter (Bonus)

To make the whole thing even more fun, we created a fake Twitter account called @faketrumpagain (read: fake Trump again). Every tweet we create with our job should now be published automatically on the Twitter profile. With Python, this is not very complex either. To do this we first had to create the Twitter profile and then an app for it. Use this link to create an app - make sure you set the access level to Read and write.

Twitter App

There is also a suitable module for Twitter available. We can install this for Python as follows:

pip install python-twitter

To be able to post the tweet directly to your Twitter profile, all you have to do is integrate the module and specify your newly created keys from the Twitter App.

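import twitter

# Authenticate with the keys of the newly created Twitter app.
api = twitter.Api(consumer_key='key here',
                  consumer_secret='key here',
                  access_token_key='key here',
                  access_token_secret='key here')

# Post the generated tweet (the `tweet` variable from the Markov script).
api.PostUpdate(tweet)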
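Since `make_short_sentence` can return `None` when it finds no suitable sentence, it may be worth guarding the call to `PostUpdate`. The helper below is a sketch with made-up names, not part of our project. It accepts any object with a `PostUpdate` method, and `RecordingApi` is only a stand-in so that the example runs without Twitter credentials.

```python
def post_if_valid(api, tweet, limit=140):
    """Post `tweet` via api.PostUpdate unless it is empty or too long."""
    if not tweet or len(tweet) > limit:
        return False
    api.PostUpdate(tweet)
    return True


class RecordingApi:
    """Stand-in for twitter.Api that only records what would be posted."""

    def __init__(self):
        self.posted = []

    def PostUpdate(self, status):
        self.posted.append(status)


api = RecordingApi()
print(post_if_valid(api, "Make Donald tweet again!"))  # True
print(post_if_valid(api, None))                        # False
print(post_if_valid(api, "x" * 141))                   # False
```

In the real script you would pass the `twitter.Api` object from above instead of `RecordingApi`.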

We did it! I hope you had as much fun at the hackathon as I did! I will clean up the project in the next few days and publish it here. 😋