Python 101 for Aspiring Data Nerds
Last updated April 24, 2015
As a data scientist, or anyone interested in data for that matter, it helps to know how to collect the data in your app: data that you’ll later want to query and analyze.
Here, we’ll build an app in Python from A-Z, iterate on it to make it more robust, and finally add application event logging with Fluentd and Treasure Data. We chose Python because it’s quickly becoming the language of choice among aspiring data scientists. In our examples, we’ll use Python version 2.7.
In a later segment, we’ll cover more programming languages and environments, as well as visualization (using tools like Tableau and Chartio), but for now, let’s get started by brushing up on some Python basics before we move on to data collection.
Rock, Scissors… Paper?
For our example today, we’ll create a basic rock, paper, scissors app. (As for you job seekers out there, this may or may not be something you’ll get asked about in a random coding interview, so pay attention!) With it, we’ll collect data on:
- Player
- Choice
- Verdict
For those of you unfamiliar with the Rock-Paper-Scissors concept, let’s ask Wikipedia:
Rock-paper-scissors … is a zero-sum hand game where each player simultaneously forms one of three shapes with an outstretched hand. These shapes are “rock” (a simple fist), “paper” (a flat hand), and “scissors” (a fist with the index and middle fingers together forming a V). The game has only three possible outcomes other than a tie: a player who decides to play rock will beat another player who has chosen scissors (“rock crushes scissors”) but will lose to one who has played paper (“paper covers rock”); a play of paper will lose to a play of scissors (“scissors cut paper”). If both players throw the same shape, the game is tied and is usually immediately replayed to break the tie.
As for me, I learned the game by watching a few kids play it on the street in front of my house. There may also have been a coding interview involved.
The Basic App
import random

tie = "a tie"
p1 = "Player 1; Player 2 loses"
p2 = "Player 2; Player 1 loses"

myDict = {('rock', 'rock'): tie, ('rock', 'paper'): p2, ('rock', 'scissors'): p1,
          ('paper', 'rock'): p1, ('paper', 'paper'): tie, ('paper', 'scissors'): p2,
          ('scissors', 'rock'): p2, ('scissors', 'paper'): p1, ('scissors', 'scissors'): tie}


def throw(player1, player2):
    verdict = myDict[(player1, player2)]
    print "The game goes to " + verdict

print "Ya wanna play Rock, Paper, Scissors?"
player1 = raw_input("Throw! Choose 'rock', 'paper', or 'scissors': ")
print "Player 1 chooses: " + player1
player2 = random.choice(['rock', 'paper', 'scissors'])
print "Player 2 chooses: " + player2
throw(player1, player2)
Listing 1: rsp.py
A couple of points of note:
- For our purposes, we’ll have the player play against the computer, so we’ll use Python’s raw_input() to get the player’s input, and random.choice() to generate the computer’s input. Note that we’ve imported the random library at the beginning.
- We’ll use a Python dictionary (myDict) to look up the verdict. The key is the pair (a tuple) of the player1 and player2 choices, and the value is the verdict for that pair. Our throw() function does this lookup and stores the result in the verdict variable. Incidentally, dictionaries are generally a good way to do things in Python and are much better than a very long chain of if…elif…else statements.
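To see the lookup in isolation, here is a minimal sketch of the same idea. The names CHOICES, BEATS, VERDICTS and verdict_for are hypothetical (they aren't in the listing), and the nine-entry table is generated from a smaller "what beats what" mapping instead of being typed out. It should run on Python 2.7 as well as Python 3.

```python
# Hypothetical names throughout; a sketch of tuple-keyed dictionary lookup.
CHOICES = ['rock', 'paper', 'scissors']

# Each key beats the value it maps to.
BEATS = {'rock': 'scissors', 'paper': 'rock', 'scissors': 'paper'}


def build_verdicts():
    """Build the nine-entry (player1, player2) -> verdict table."""
    table = {}
    for first in CHOICES:
        for second in CHOICES:
            if first == second:
                table[(first, second)] = 'tie'
            elif BEATS[first] == second:
                table[(first, second)] = 'player1'
            else:
                table[(first, second)] = 'player2'
    return table


VERDICTS = build_verdicts()


def verdict_for(player1, player2):
    # A tuple is hashable, so the pair itself can serve as the dictionary key.
    return VERDICTS[(player1, player2)]


print(verdict_for('rock', 'scissors'))  # player1
```

The point is the last function: one dictionary access replaces nine branches, and adding a new shape only means adding data, not more conditionals.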
Running the game with
$ python rsp.py
produces the following output:
Ya wanna play Rock, Paper, Scissors?
Throw! Choose 'rock', 'paper', or 'scissors': rock
Player 1 chooses: rock
Player 2 chooses: scissors
The game goes to Player 1; Player 2 loses
It’s interesting to note that we don’t currently handle any errors. If we play the game and choose anything other than rock, paper or scissors, we get an error:
Throw! Choose 'rock', 'paper', or 'scissors': moar paper
Player 1 chooses: moar paper
Player 2 chooses: rock
Traceback (most recent call last):
  File "rsp.py", line 21, in <module>
    throw(player1, player2)
  File "rsp.py", line 13, in throw
    verdict = myDict[(player1, player2)]
KeyError: ('moar paper', 'rock')
To err is… well, to err
Let’s handle this error. We’ll move our player inputs and the throw() call into a try/except block inside a new function, so that our application can recover (rather than crash) if the user enters an invalid choice. When that happens, the except branch calls the function again so the user gets another try:
def lets_get_started():
    try:
        player1 = raw_input("Throw! Enter 'rock', 'paper', or 'scissors': ")
        print "You chose: " + player1
        player2 = random.choice(['rock', 'paper', 'scissors'])
        print "The computer chose: " + player2
        throw(player1, player2)
    except KeyError:
        print "Hey! I said enter 'rock', 'paper' or 'scissors'! Try again."
        lets_get_started()
Edit your file so it looks like this:
import random

tie = "is a tie"
p1 = "goes to you; the computer loses."
p2 = "goes to the computer; you lose."

myDict = {('rock', 'rock'): tie, ('rock', 'paper'): p2, ('rock', 'scissors'): p1,
          ('paper', 'rock'): p1, ('paper', 'paper'): tie, ('paper', 'scissors'): p2,
          ('scissors', 'rock'): p2, ('scissors', 'paper'): p1, ('scissors', 'scissors'): tie}


def throw(player1, player2):
    verdict = myDict[(player1, player2)]
    print "The game " + verdict


def lets_get_started():
    try:
        player1 = raw_input("Throw! Enter 'rock', 'paper', or 'scissors': ")
        print "You chose: " + player1
        player2 = random.choice(['rock', 'paper', 'scissors'])
        print "The computer chose: " + player2
        throw(player1, player2)
    except KeyError:
        print "Hey! I said enter 'rock', 'paper' or 'scissors'! Try again."
        lets_get_started()

print "Ya wanna play Rock, Paper, Scissors?"
lets_get_started()
Listing 2: rsp-better.py
Try it out! Try it out with incorrect inputs, too. Now that this is working better and handling incorrect inputs, let’s think about how we can collect our user input data and log it to the cloud.
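As an aside, Listing 2 retries by calling lets_get_started() again from inside the except block, which works fine for a game like this. A loop that checks the input before the lookup is another common way to do it. The sketch below is one possible variant, not a replacement for the listing: ask_for_choice, VALID_CHOICES and max_attempts are hypothetical names, the whitespace and case normalization is an extra assumption, and the input function is passed in so the snippet runs the same on Python 2.7 (raw_input) and Python 3 (input).

```python
VALID_CHOICES = ('rock', 'paper', 'scissors')


def ask_for_choice(read_line, max_attempts=3):
    """Return a valid choice, or None after max_attempts bad entries.

    read_line is any callable that takes a prompt and returns a string,
    for example raw_input on Python 2.7 or input on Python 3.
    """
    for _ in range(max_attempts):
        answer = read_line("Throw! Enter 'rock', 'paper', or 'scissors': ")
        # Assumption: be forgiving about stray spaces and capital letters.
        choice = answer.strip().lower()
        if choice in VALID_CHOICES:
            return choice
        print("Hey! I said enter 'rock', 'paper' or 'scissors'! Try again.")
    return None


# Simulate a user who mistypes once, then answers correctly.
scripted_answers = iter(['moar paper', 'Paper'])
print(ask_for_choice(lambda prompt: next(scripted_answers)))  # paper
```

Capping the number of attempts is a design choice: it keeps a stuck script from prompting forever, at the cost of having to handle the None case in the caller.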
Log it!
Before you set up logging in your app, you’ll need to enable a few things in your environment. (For more information, see https://docs.treasuredata.com/articles/python.)
First, install Treasure Agent (td-agent):
$ curl -L http://toolbelt.treasuredata.com/sh/install-ubuntu-trusty-td-agent2.sh | sh
Next, update /etc/td-agent/td-agent.conf to include your API key. In the default file, these settings sit inside the <match td.*.*> section:
type tdlog
apikey 5919/
auto_create_table
buffer_type file
buffer_path /var/log/td-agent/buffer/td
...
Next, restart Treasure Agent:
cookie@monster-ThinkPad-X1:~omnomnom/python$ sudo /etc/init.d/td-agent restart
You should see the following status on console when Treasure Agent restarts:
* Restarting td-agent td-agent [ OK ]
Last, but not least, make sure you have installed fluent-logger in your Python environment:
$ pip install fluent-logger
As we mentioned before, we now want to collect data on
- our user’s choice;
- the computer’s choice;
- the verdict for the game.
Sending log events to Treasure Data via Treasure Agent is easy. With the event module from the fluent library, each event takes a single line of code:
from fluent import event
…
event.Event('game_data', {'player': 'Player 1', 'choice': player1 })
This writes a row to the ‘game_data’ table in our database with two columns: ‘player’ (containing the string ‘Player 1’) and ‘choice’ (containing the contents of the variable player1, that is, the ‘rock’, ‘paper’, or ‘scissors’ choice we got as raw input from the user).
Let’s update our program to log both choices and the game verdict to Treasure Data.
Underneath the import random statement, but before our variable declarations, add these two import statements:
from fluent import sender
from fluent import event
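Next, let’s set up our sender so that it can pass data to Treasure Data through the Treasure Agent running on localhost. The td.rsp_db tag tells the agent to store our events in a database named rsp_db: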
sender.setup('td.rsp_db', host='localhost', port=24224)
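It helps to see how the tag and the event label fit together. As far as I understand the convention, fluent-logger joins the setup() prefix and the event label with a dot, and Treasure Agent routes a tag of the form td.<database>.<table> to that database and table. The helper below only illustrates that naming. split_td_tag is a hypothetical name, not part of fluent-logger, and it never talks to Fluentd.

```python
def split_td_tag(prefix, label):
    """Illustrate how a setup() prefix and an event label combine.

    Hypothetical helper, not part of fluent-logger: it mimics the
    'td.<database>.<table>' naming convention for explanation only.
    """
    full_tag = prefix + '.' + label
    parts = full_tag.split('.')
    if len(parts) != 3 or parts[0] != 'td':
        raise ValueError("expected a tag like 'td.<database>.<table>'")
    return {'tag': full_tag, 'database': parts[1], 'table': parts[2]}


info = split_td_tag('td.rsp_db', 'game_data')
print(info['tag'])       # td.rsp_db.game_data
print(info['database'])  # rsp_db
print(info['table'])     # game_data
```

So with the setup() call above, every event labeled 'game_data' should end up in the game_data table of the rsp_db database.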
Finally, we’ll add our logging events. First, in our throw() function, after we print the verdict, let’s log it:
event.Event('game_data', { 'verdict': verdict })
Log both players’ choices. After you print player 1’s choice, log it:
event.Event('game_data', { 'player': 'Player 1', 'choice': player1 })
Do the same for player 2, replacing ‘Player 1’ with ‘Player 2’ and the variable accordingly. Once you’re done, your code should look like this file on GitHub.
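Put together, the three logging calls might be arranged as in the sketch below. Treat it as an illustration under assumptions rather than the finished file: log_round and fake_emit are hypothetical names, and the emit argument stands in for event.Event so that the snippet runs without Treasure Agent or fluent-logger installed.

```python
tie = "is a tie"
p1 = "goes to you; the computer loses."
p2 = "goes to the computer; you lose."

myDict = {('rock', 'rock'): tie, ('rock', 'paper'): p2, ('rock', 'scissors'): p1,
          ('paper', 'rock'): p1, ('paper', 'paper'): tie, ('paper', 'scissors'): p2,
          ('scissors', 'rock'): p2, ('scissors', 'paper'): p1, ('scissors', 'scissors'): tie}


def log_round(player1, player2, emit):
    """Log both choices and the verdict; emit(label, record) sends one row."""
    emit('game_data', {'player': 'Player 1', 'choice': player1})
    emit('game_data', {'player': 'Player 2', 'choice': player2})
    verdict = myDict[(player1, player2)]
    emit('game_data', {'verdict': verdict})
    return verdict


# Stand-in for event.Event: collect records in a list instead of sending them.
sent = []


def fake_emit(label, record):
    sent.append((label, record))


print("The game " + log_round('scissors', 'paper', fake_emit))
print(len(sent))  # 3
```

In your real script, the assumption is that you would call sender.setup() first and then pass event.Event itself as emit, since it accepts the same (label, record) pair.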
Now, try running your file a few times to populate the database. How did it work for you? Did you run into any problems? Please leave comments below.
Inquiring minds want to know
So we’ve played our game a few thousand times and populated the database. How do we query it?
One way to see what’s cooking in your local database instance is to query and view it with the TD Toolbelt, directly from bash.
To view the tables in your running instance, try the td tables command.
You can also run a tail command to see the most recent rows of your local database:
$ td table:tail rsp_db game_data
{"time":1429579380,"player":"Player 2","choice":"paper"}
{"time":1429579383,"player":"Player 1","choice":"scissors"}
{"time":1429579383,"verdict":"goes to you; the computer loses."}
…
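Since each row comes back as one JSON object per line, a few lines of Python are enough for a first look at the data. The sketch below is only an illustration: tally is a hypothetical helper, and the three sample rows are the ones from the tail output above, pasted into a string rather than read from the td command.

```python
import json
from collections import Counter

# Rows copied from the tail output above, one JSON object per line.
tail_output = """
{"time":1429579380,"player":"Player 2","choice":"paper"}
{"time":1429579383,"player":"Player 1","choice":"scissors"}
{"time":1429579383,"verdict":"goes to you; the computer loses."}
"""


def tally(lines):
    """Count choices and verdicts from JSON rows, one row per line."""
    choices = Counter()
    verdicts = Counter()
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        # Choice rows and verdict rows are logged separately, so check each key.
        if 'choice' in row:
            choices[(row['player'], row['choice'])] += 1
        if 'verdict' in row:
            verdicts[row['verdict']] += 1
    return choices, verdicts


choices, verdicts = tally(tail_output)
print(choices[('Player 1', 'scissors')])  # 1
print(sum(verdicts.values()))             # 1
```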
Ultimately, however, you will want to query your data not from the device your app is running on, but from the cloud.
Sometimes it helps to flush Treasure Agent’s buffer in order to upload the data more quickly to Treasure Data in the cloud. To do that, run the following command:
$ kill -USR1 $(cat /var/run/td-agent/td-agent.pid)
We should now be able to go to the Treasure Data console to run some different queries. To do this, log into https://www.treasuredata.com and hit the “New Query” button on the top right.
At the New Query window, we should select the database. Try running a few simple queries in the console to get you familiar with what’s currently in your table:
Select * from game_data
This will show you the top 100 rows in your table, in descending order. To get the number of rows in your table, run select count(1) from game_data.
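If you’d like to experiment with these two queries before opening the console, the sketch below runs them against an in-memory SQLite table using Python’s built-in sqlite3 module. SQLite is only a stand-in here, an assumption made for illustration: Treasure Data runs Hive or Presto, whose SQL dialects differ in places. The column layout is a guess based on the logged fields, and the three rows are the ones from the tail output.

```python
import sqlite3

# In-memory stand-in for the game_data table (assumed column layout).
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE game_data (time INTEGER, player TEXT, choice TEXT, verdict TEXT)'
)
rows = [
    (1429579380, 'Player 2', 'paper', None),
    (1429579383, 'Player 1', 'scissors', None),
    (1429579383, None, None, 'goes to you; the computer loses.'),
]
conn.executemany('INSERT INTO game_data VALUES (?, ?, ?, ?)', rows)

# The two queries from the text.
all_rows = conn.execute('SELECT * FROM game_data').fetchall()
row_count = conn.execute('SELECT count(1) FROM game_data').fetchone()[0]

# One step further: how often was each shape thrown?
choice_counts = conn.execute(
    'SELECT choice, count(1) FROM game_data '
    'WHERE choice IS NOT NULL GROUP BY choice ORDER BY choice'
).fetchall()

print(row_count)  # 3
print(choice_counts)
conn.close()
```

Because choices and verdicts are logged as separate rows, the choice and verdict columns are each empty on some rows, which is why the last query filters out the NULLs.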
Of course, when you start to do analytics on your data, you’ll want to do more than just these simple queries, so get familiar with the nuances of Hive and Presto queries. Presto is still a work in progress, so while it is very powerful, it has a few limitations, and some standard SQL doesn’t translate directly.
We can also run the queries directly from our script, provided we’re hooked up to the correct API key.
From here, you can query the data in Treasure Data for use on the visualization platform of your choice, such as Tableau, Chartio, and others. We’ll cover this in a later post. Stay tuned!
Learn more at https://docs.treasuredata.com/articles/python