This year’s Pentaho Community Meeting started with a hackathon on Friday evening. Altogether 6 teams took part (or as Pedro put it so nicely: 5,5 teams 😜). The task this year was to make “something” out of a dataset with tweets from Trump. The dataset is public and can be downloaded here.
First we thought about what fun things could be done with the dataset. In the end, we agreed to generate our own tweets based on Trump’s tweets. I came up with the idea when I visited the CCC last year. David Kriesel presented his project SpiegelMining there: over a period of several years, he collected every article published on Spiegel Online. At the end of his presentation, the term Markov Chain was mentioned. So why not run a Markov Chain algorithm over the dataset with the tweets of Trump?
Even with a small project like this one, it can’t hurt to think briefly about architecture in advance. Now that we had decided to create our own tweets based on the Markov Chain algorithm, the following questions had to be answered:
- Is there an implementation of the algorithm in Pentaho Data Integration or do we use Python?
- What data do we need for the algorithm?
- Where to put the generated tweet?
- How do we present the generated tweet?
Unfortunately, there is no step in Pentaho Data Integration that could have taken the work of creating a model for generating tweets off our hands. Although a model could have been built with the existing steps, we decided to use Python (also because of the limited time). As so often with Python, there is a suitable module for this purpose that is just waiting to be installed and used.
Markovify is a simple, extensible Markov chain generator that can be installed with pip: pip install markovify
To create a Markov chain model, we only need all of Trump’s tweets in one file. We then read this file in, build our model and have it generate a sentence that is at most 140 characters long (remember: we want to generate tweets).
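To illustrate what such a model does under the hood, here is a minimal word-level Markov chain that uses only the standard library. This is a sketch of the general technique, not the Markovify-based script from the hackathon; the function names build_model and make_tweet and the sample tweets are made up for illustration.

```python
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # sentinel tokens marking tweet boundaries


def build_model(tweets):
    """Map each word to the list of words observed directly after it."""
    model = defaultdict(list)
    for tweet in tweets:
        words = [START] + tweet.split() + [END]
        for current, following in zip(words, words[1:]):
            model[current].append(following)
    return model


def make_tweet(model, max_chars=140, tries=100, rng=random):
    """Random-walk the chain; return a text of at most max_chars, or None."""
    if not model.get(START):
        return None  # the model was built from no tweets at all
    for _ in range(tries):
        words = []
        word = rng.choice(model[START])
        # More than max_chars words can never fit, so stop the walk there
        while word != END and len(words) <= max_chars:
            words.append(word)
            word = rng.choice(model[word])
        text = " ".join(words)
        if word == END and 0 < len(text) <= max_chars:
            return text
    return None


# Hypothetical sample input; the real input would be the text file of tweets
sample = ["we will win big", "we will build it", "they will lose"]
print(make_tweet(build_model(sample)))
```

Because duplicates are kept in the successor lists, frequent word pairs are picked more often, which is what makes the output sound like its source. A library such as Markovify adds refinements on top of this idea, so treat the sketch as an explanation rather than a replacement.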
First problem solved: We can generate a tweet that could have been from Trump. 🤓 All you need to get this Python script up and running is a text file containing all the tweets. To create this text file, a simple transformation is sufficient.
Pentaho Data Integration
First, we read in the CSV file and then keep only the column that contains the content of the tweet. The filter ensures that only lines with non-empty tweet content end up in the text file (empty content sometimes occurs when only a picture or video is tweeted). While this transformation only needs to be executed once, the Python script has to be executed each time to generate a new tweet.
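For readers without Pentaho Data Integration at hand, the same logic can be sketched in a few lines of Python. This is only an illustration of what the transformation does, not part of the original project; the column name "text" and the function name export_tweets are assumptions, so adjust them to the actual CSV header.

```python
import csv
import io


def export_tweets(csv_file, out_file, column="text"):
    """Copy the non-empty values of one CSV column to out_file, one per line."""
    written = 0
    for row in csv.DictReader(csv_file):
        # Collapse whitespace (including line breaks) so each tweet stays on one line
        tweet = " ".join((row.get(column) or "").split())
        if tweet:  # skip tweets that only contained a picture or video
            out_file.write(tweet + "\n")
            written += 1
    return written


# Hypothetical in-memory data standing in for the downloaded CSV file
source = io.StringIO('source,text\nweb,"Great day!"\nweb,\nphone,"Big\nnews"\n')
target = io.StringIO()
export_tweets(source, target)
print(target.getvalue())
```

With real files, you would pass open file handles instead of the StringIO objects (open the CSV with newline="" as the csv module recommends).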
Before we take care of how to call the Python script from Pentaho Data Integration, we need to clarify what to do with the generated tweet. One simple way is to write the tweet to a database table. We enabled Truncate table to ensure that only the current tweet is in the database table.
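The effect of the Truncate table option can be sketched with Python’s built-in sqlite3 module: empty the table, then insert the new tweet, so a plain SELECT always returns exactly one row. The post does not say which database was used, so SQLite, the table name generated_tweet and both function names are assumptions made for this example.

```python
import sqlite3


def store_tweet(conn, tweet, table="generated_tweet"):
    """Replace the table's content with the single, newest tweet."""
    # The table name is interpolated, so it must come from trusted code
    with conn:  # commit on success, roll back on error
        conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (tweet TEXT NOT NULL)")
        conn.execute(f"DELETE FROM {table}")  # SQLite has no TRUNCATE statement
        conn.execute(f"INSERT INTO {table} (tweet) VALUES (?)", (tweet,))


def current_tweet(conn, table="generated_tweet"):
    """Return the stored tweet, or None if the table is empty."""
    row = conn.execute(f"SELECT tweet FROM {table}").fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")  # throwaway database for the demo
store_tweet(conn, "first generated tweet")
store_tweet(conn, "second generated tweet")
print(current_tweet(conn))
```

Keeping only one row means the dashboard query needs no ordering or filtering; the trade-off is that earlier tweets are lost.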
Let’s briefly summarize what we have so far:
- A transformation which creates a text file that contains all of Trump’s tweets
- A Python script that can generate a tweet for us (based on the text file)
- A transformation that writes the generated tweet to a database table
The next task was to automate this process. We have therefore created a job that executes the individual steps in the following sequence:
- Execute Python script, which writes the generated tweet to a new text file
- Read this file and write the tweet to a database table
The shell script starts our Python script (you might even be able to execute the Python script directly). Since my computer runs macOS, the paths have to be adjusted for Windows.
To create a new tweet, you only have to run this job. Now we’ll take care of the dashboard, which will show our generated tweet and create a new tweet on button press.
We use the Application Builder to combine a dashboard with the power of Pentaho Data Integration.
After you have created a new app using the Application Builder, you will need to restart the Pentaho Server. Afterwards all functions are available to you. Under macOS you will find a new folder with the name of the app (I called the app MakeDonaldTweetAgain) under the following path: /Applications/Pentaho-7/server/pentaho-server/pentaho-solutions/system/MakeDonaldTweetAgain/. We now copy our transformations, the job, the Python script and the text file with Trump’s tweets into the folder endpoints/kettle. Now we add a dashboard by clicking Add New Element on the Elements tab.
Our app now consists of the following elements:
- dashboarddonald (CDE Dashboard as Frontend)
- kettledonald (Job that generates a new tweet and writes it to a database table)
- 0_import (Transformation that generates a text file with all tweets of Trump - can be ignored)
- 1_tweet (Transformation that generates a new tweet - can be ignored, is part of the Job)
Let’s start by defining the layout for our dashboard. As we don’t need many elements, this is quickly created. All we need is two components, a button that launches our job, and a table that displays the generated tweet.
Now we add the two needed components (button and table) to our dashboard. In order for the table to refresh itself after running the job and display the new tweet, we need to write a callback function that updates our dashboard. To do this we need to write the following code snippet into the Success Callback function.
The entire thing looks like this:
Action Datasource is the Kettle job that is executed when you click on the button. We add this DataSource in the Datasources tab. In our case, the name of the output step is OUT and the type is JSON.
We retrieve the generated tweet that we want to display in our table component from the database - a simple SQL query is sufficient.
That was it in principle - we can now create new tweets from the dashboard with a single click of a button and display them directly.
Posting to Twitter (Bonus)
To make the whole thing even more fun, we have created a fake Twitter account called @faketrumpagain (read: fake Trump again). Every tweet we create with our job should now be published automatically on the Twitter profile. With Python, this is not very complex either. To do this, we first had to create the Twitter profile and then an app for it. Use this link to create an app - make sure you set the access level to Read and write.
There is also a suitable module for Twitter available. We can install this for Python as follows:
To be able to post the tweet directly to your Twitter profile, all you have to do is integrate the module and specify your newly created keys from the Twitter App.
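Since the name of the Twitter module is not preserved here, the following sketch keeps the actual Twitter client behind a plain callable and only shows the part that is independent of it: checking the generated text before sending it. The function name publish and the post_status parameter are hypothetical; in a real script, post_status would be the status-update function of whichever Twitter module you installed, configured with the keys from your Twitter app.

```python
def publish(tweet, post_status, max_chars=140):
    """Validate a generated tweet and hand it to post_status, which sends it."""
    text = tweet.strip()
    if not text:
        raise ValueError("refusing to post an empty tweet")
    if len(text) > max_chars:
        raise ValueError(f"tweet has {len(text)} characters, limit is {max_chars}")
    return post_status(text)


# Stand-in for a real Twitter client: collect the "posted" tweets in a list
sent = []
publish("  A generated tweet that could have been from Trump  ", sent.append)
print(sent)
```

Passing the client in as a callable also makes it easy to try out the job without publishing anything, which is handy while the generator is still being tuned.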
We did it! I hope you had as much fun at the hackathon as I did! I will clean up the project in the next few days and publish it here. 😋