# Text Message Analytics — Numbers

People communicate a lot through text messages, and lucky for me, iPhones keep track of the texts I’ve sent and received. iPhones store your text messages in a SQLite database, and this database is readily accessible in the iPhone backup on your computer. [This is why encrypting your backup might be a good idea if you have sensitive data.] Eventually I want to perform some more advanced text analytics to try to interpret the content of the messages. This post only looks at the ‘numbers’ aspect of my text messages. All the numbers below include both sent and received texts, and exclude texts that either I deleted or that were deleted by the system. [I know I’ve deleted threads. I don’t think iOS deletes old messages, but it’s a possibility until I know otherwise.]

The simplest stat from text messages is how many I’ve sent/received per day or per week. The chart below has both. The notable trend is that there have been more text messages sent/received the longer I’ve had my iPhones. I’d suggest this is a little biased, since I would be more likely to have deleted text threads that are much older, but I still think there would be a slight upward trend regardless.
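As a rough illustration of the per-day and per-week tally, here is a minimal Python sketch. It assumes the message timestamps have already been pulled out of the database and converted to `datetime` objects; the function name and the sample timestamps are made up for the example, and this is not necessarily how the chart was produced.

```python
from collections import Counter
from datetime import datetime


def counts_per_day_and_week(timestamps):
    """Count messages per calendar day and per ISO (year, week) pair."""
    per_day = Counter(ts.date() for ts in timestamps)
    per_week = Counter(tuple(ts.isocalendar())[:2] for ts in timestamps)
    return per_day, per_week


# Hypothetical sample data, not real messages.
sample = [
    datetime(2013, 3, 4, 9, 15),
    datetime(2013, 3, 4, 21, 2),
    datetime(2013, 3, 6, 12, 30),
    datetime(2013, 3, 11, 8, 45),
]
per_day, per_week = counts_per_day_and_week(sample)
```

Grouping by ISO week keeps each week running Monday through Sunday, which lines up with the work week vs. weekend split used later.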

I wanted to look at area codes just out of curiosity. I thought that I would have the most texts between me and a 412 or 724 number. I’m a little surprised how many 412 numbers there are, given how many people I know living in the Pittsburgh suburbs and that I’m a 724 myself. I think 412 is Allegheny County, while 724 is anything outside of that. I’m also a little surprised how close the count of traditional SMS text messages, which go through your carrier network as opposed to over the Internet, is to the iMessage count, since most of my friends have iPhones.
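Pulling the area code out of each number could look something like the sketch below. It assumes US numbers in a handful of common formats; the helper name and the sample numbers are hypothetical, and anything that doesn’t look like a 10-digit US number (short codes, for instance) is skipped.

```python
import re
from collections import Counter


def area_code(number):
    """Return the 3-digit area code of a US number, or None if it doesn't look like one."""
    digits = re.sub(r"\D", "", number)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    return digits[:3] if len(digits) == 10 else None


# Hypothetical sample numbers in different formats.
sample = ["+14125550101", "(724) 555-0123", "412-555-0199", "55512"]
codes = [area_code(n) for n in sample]
counts = Counter(c for c in codes if c is not None)
```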

The last chart is my favorite: a breakdown of how often I text during each hour of the day. I think this tells you something about my behavior, albeit nothing common sense won’t tell you. I text in the early morning (7am to 10am) more often during the work week than on the weekend. There’s a spike at 12pm (lunch time) and 9pm (making plans/socializing) on any day of the week. There’s virtually no texting between 4am and 6am. There are some texts after the ‘Ted-Mosby-Hour’ of 2am, after which nothing good happens, but not a spike like there might have been in college.
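The hour-of-day breakdown boils down to bucketing each timestamp by its hour and by whether it falls on a weekday or a weekend. A minimal sketch, assuming the timestamps are already local-time `datetime` objects (the function name and sample data are made up):

```python
from collections import Counter
from datetime import datetime


def hourly_counts(timestamps):
    """Count messages per hour of day, split into weekday and weekend."""
    weekday, weekend = Counter(), Counter()
    for ts in timestamps:
        # datetime.weekday(): Monday is 0, so 5 and 6 are Saturday and Sunday.
        bucket = weekend if ts.weekday() >= 5 else weekday
        bucket[ts.hour] += 1
    return weekday, weekend


# Hypothetical sample data.
sample = [
    datetime(2013, 3, 4, 8, 5),    # Monday
    datetime(2013, 3, 4, 12, 10),  # Monday
    datetime(2013, 3, 5, 8, 30),   # Tuesday
    datetime(2013, 3, 9, 12, 40),  # Saturday
    datetime(2013, 3, 10, 21, 15), # Sunday
]
weekday, weekend = hourly_counts(sample)
```

Since there are five weekdays and only two weekend days, dividing each bucket by the number of days it covers would make the two series easier to compare.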

‘Kids, if it’s after 2am, don’t text, just go home….and watch How I Met Your Mother.’

# Charlie Morton — PitchFX

I’m in a predictive modeling class for my grad program at NU, and we are learning a statistical programming language called SAS. One of the things we are trying early on is cluster analysis to determine if variables are related. I decided to play around with data that’s a little more interesting than housing prices. Charlie Morton has been one of my favorite pitchers to watch. His curveball is just sexy. Cluster analysis can help us separate Morton’s pitches into different pitch types using PitchFX data I’ve been scraping.

I’ve plotted two charts. The first is vertical movement vs. release speed; the second is vertical movement vs. horizontal movement. [The movement parameters are calculated from the deviation of the ball from a straight path with no spin. The horizontal movement is from the perspective of the catcher/batter, so imagine that Morton is throwing toward you.] Fastballs with backspin will have positive vertical movement. Curveballs with topspin will have negative vertical movement. I used SAS to look at the speed, vertical movement, and horizontal movement and cluster similar pitches together. Without much tweaking, I was able to identify Morton’s fastballs and curveballs. He also has a third group, which is a splitter according to brooksbaseball.net.
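To give a feel for what the clustering step does, here is a small Python sketch using plain k-means. This is a swapped-in technique for illustration only: the analysis above was done in SAS, and the post doesn’t say which clustering procedure was used. The pitch values are invented, each given as (release speed in mph, horizontal movement in inches, vertical movement in inches), and seeding the centroids from the first `k` points is a simplification to keep the example deterministic.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each pitch joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Invented pitches: (speed mph, horizontal movement in, vertical movement in).
pitches = [
    (93.0, -8.0, 6.0),   # fastball-like
    (79.0, 8.0, -9.0),   # curveball-like
    (92.5, -9.0, 5.0),
    (78.0, 9.0, -8.0),
    (94.0, -7.5, 7.0),
    (80.0, 7.0, -10.0),
]
labels, centroids = kmeans(pitches, k=2)
```

With real data it would probably be worth standardizing the three columns first, so that speed in mph doesn’t outweigh movement in inches, and trying several starting points rather than the first `k` rows.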

Morton is famous for his sinker, which is a two-seam fastball that ‘sinks’ relative to a four-seam fastball thrown at the same angle. I’ve annotated the sinker on the vertical movement vs. release speed chart below. Morton’s sinker is hard to differentiate because it’s almost as fast as his four-seamer (low 90s). It doesn’t stay as high because its spin differs from the four-seam fastball’s. The advantage here is that a batter will swing as if to hit the four-seam fastball, but the sinker will be an inch or two lower than what the batter adjusted for. Since the bat is round, the ball will come off the bat at a low angle, and bam! Ground ball.

Brooksbaseball.net has updated and historical PitchFX data presented very nicely. I suggest checking them out if you want to see visualizations like this for other games or pitchers. Their visualization tools are easy to use and updated right after games end.

# #SeanTrek GeoTracks 2012

You might remember #SeanTrek, the 46-day, 12,000-mile, 34-state excursion I took back at the very end of 2012. I didn’t know how I was going to use it at the time, but I geotagged just about everything I did on the trip. I checked in to every place on Foursquare and obtained over 700 points in Portland and San Francisco, which is insane, but that’s what happens when you check in at just about everything you do and every place you go. On top of Foursquare, I geotagged every tweet I sent and picture I took. As a result, I now have thousands of data points with both timestamps and location data.

The above map is what happens when you put all of them together. It outlines my entire trip! The denser the marks, the longer I spent exploring one place. Sparse points mean I was driving a lot. You’ll find a lot of marks around Pittsburgh, Portland, SF, LA, Austin, and New Orleans, because I spent the most time there and didn’t drive much in most of those cities. I have a rather nice record of a long trip that didn’t require me to painstakingly record exactly what I did.

This map only has geotag data and the type of media. I’m hoping to use the geotag data and the timestamps to get an average speed between consecutive points. I also want to geocode some of the tweets and photos that were not geotagged in 2012 by interpolating from their timestamps.
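Both ideas can be sketched in a few lines of Python. This is only an outline under some assumptions: each point is a hypothetical `(timestamp, latitude, longitude)` tuple, distance is the great-circle (haversine) distance rather than actual road miles, and the interpolation is a straight line in latitude/longitude, which is only reasonable when the two known points are close together.

```python
import math
from datetime import datetime

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def average_speed_mph(a, b):
    """Average speed between two (timestamp, lat, lon) points; None if no time elapsed."""
    hours = (b[0] - a[0]).total_seconds() / 3600
    if hours <= 0:
        return None
    return haversine_miles(a[1], a[2], b[1], b[2]) / hours


def interpolate_position(a, b, when):
    """Linear lat/lon estimate at time `when`, between known points a and b."""
    span = (b[0] - a[0]).total_seconds()
    if span == 0:
        return (a[1], a[2])
    frac = (when - a[0]).total_seconds() / span
    return (a[1] + frac * (b[1] - a[1]), a[2] + frac * (b[2] - a[2]))


# Hypothetical points: one degree of longitude apart on the equator, an hour apart.
start = (datetime(2012, 12, 1, 12, 0), 0.0, 0.0)
end = (datetime(2012, 12, 1, 13, 0), 0.0, 1.0)
speed = average_speed_mph(start, end)
midpoint = interpolate_position(start, end, datetime(2012, 12, 1, 12, 30))
```

Because the distance is straight-line, the speed will understate real driving speed on winding roads, so it is better read as a lower bound.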

Once I properly extract the data from the tweets, I can make hashtags or mentions searchable by frequency and location. I used #SeanTrek a lot more than any other hashtag on the trip. Curiously enough, though, the first tweet mentioning #SeanTrek is not geotagged (technical glitch). Hopefully, I’ll get some more things mapped out in the future.

# Pirates — Run Probability

Presented without much commentary or analysis. This is how the Pirates fared last year given a certain number of outs and with runners on specific bases. So, for example, with no outs and nobody on base, the Pirates had a 26% chance of scoring a run from that point in the inning until the end. That means they would score a run in about one of every 4 such innings. The stat I always reference is bases loaded and no outs. It should be the highest; for the Pirates, it’s not. Runners on 2nd and 3rd with no outs is the highest.
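For anyone curious how a number like that is computed, here is a minimal Python sketch. It assumes one record per time a base-out state occurred, with the number of runs that scored from that point until the end of the inning; the record layout, the `"---"` notation for empty bases, and the sample values are all made up for illustration.

```python
from collections import defaultdict


def run_probability(records):
    """Share of occurrences of each (outs, runners) state after which at least one run scored."""
    totals = defaultdict(lambda: [0, 0])  # state -> [times seen, times a run scored]
    for outs, runners, runs_after in records:
        tally = totals[(outs, runners)]
        tally[0] += 1
        if runs_after > 0:
            tally[1] += 1
    return {state: scored / seen for state, (seen, scored) in totals.items()}


# Invented records: (outs, runners on base, runs scored in the rest of the inning).
records = [
    (0, "---", 0),
    (0, "---", 0),
    (0, "---", 1),
    (0, "---", 0),
    (0, "123", 2),
    (0, "123", 0),
]
probabilities = run_probability(records)
```

In this toy data, a run scores after one of the four no-outs, bases-empty situations, which is the same "about once every 4 innings" reading as above.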

For a point of comparison, the black reference lines on the bar graph show the MLB average for that specific base-out state.