An Analysis of Trump’s Tweets
Background
The following analysis uses several of my software tools under development, but is inspired by the work of David Robinson and amounts to little more than a replication of the analysis he shared on his blog, Variance Explained. The long-term goal of this project is to make such analyses simple(r) by incorporating this workflow into an upcoming user- and student-friendly analysis package called MassMine Analytics (or mmtool, as it’s temporarily known). Data for this analysis were collected using MassMine, and the underlying tooling is provided by my Racket data science package.
The Motivation
An astute Twitter user noticed that when Trump said something positive, such as wishing the Olympic team good luck, “he” tweeted from an iPhone (even though he is known to use a Samsung Galaxy). When his account tweeted something negative, such as criticism of his opponents, it came from an Android device. This led to the original analysis, recreated here with my own tooling, to determine whether tweets from these two devices were categorically different, and specifically whether his PR team was responsible for the iPhone-derived messages.
The Data Set
We will be working with tweets taken from Donald Trump’s Twitter account. The maximum we can collect from Twitter is his past 3200 tweets. With MassMine, getting this data is easy:
massmine --task=twitter-user --user=realDonaldTrump --count=3200 --out=trump_tweets.json
Preparing for the Analysis
To work with the resulting data (which is in JSON format), we load up a few tools in Racket scheme.
To begin, load the necessary dependencies:
#lang racket
(require data-science)
(require plot)
(require math)
(require json)
(require pict)
Before we begin, we define several helper functions for parsing and working with the data set:
;;; This function reads line-oriented JSON (as output by massmine),
;;; and packages it into an array. For very large data sets, loading
;;; everything into memory like this is heavy handed. For data this small,
;;; working in memory is simpler
(define (json-lines->json-array #:head [head #f])
  (let loop ([num 0]
             [json-array '()]
             [record (read-json (current-input-port))])
    (if (or (eof-object? record)
            (and head (>= num head)))
        (jsexpr->string json-array)
        (loop (add1 num) (cons record json-array)
              (read-json (current-input-port))))))
;;; Normalize case, remove URLs, remove punctuation, and remove spaces
;;; from each tweet. This function takes a list of words and returns a
;;; preprocessed subset of words/tokens as a list
(define (preprocess-text lst)
  (map (λ (x)
         (string-normalize-spaces
          (remove-punctuation
           (remove-urls
            (string-downcase x)))))
       lst))
Using Racket’s JSON package and the helper functions defined above, we can read in the tweet data (as raw JSON) and parse it as a list of hashes in Racket:
;;; Read in the entire tweet database (3200 tweets from Trump's timeline)
(define tweets (string->jsexpr
                (with-input-from-file "trump_tweets.json" (λ () (json-lines->json-array)))))
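To make the line-oriented format concrete, here is a minimal, self-contained sketch that parses two made-up records from a string instead of a file. The sample records and the helper name `read-json-lines` are illustrative assumptions, not part of MassMine or the data-science package; only `read-json` from Racket's `json` library is used.

```racket
#lang racket
(require json)

;; Two made-up records, one JSON object per line (hypothetical data).
(define sample
  (string-append
   "{\"text\": \"hello\", \"source\": \"iphone\"}\n"
   "{\"text\": \"RT hi\", \"source\": \"android\"}\n"))

;; Read JSON values from the port until end of file and return them
;; as a list of hashes, in the order they appear in the input.
(define (read-json-lines in)
  (let loop ([acc '()])
    (define record (read-json in))
    (if (eof-object? record)
        (reverse acc)
        (loop (cons record acc)))))

(define records (read-json-lines (open-input-string sample)))

;; Each record is a hash keyed by symbols, as with the real tweets.
(define texts (map (λ (r) (hash-ref r 'text)) records))
texts
```

Unlike `json-lines->json-array`, this sketch returns the list directly rather than going through an intermediate JSON string, and it preserves the order of the input lines.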
Each tweet includes a lot of metadata. For this analysis we’ll keep only the text of each tweet, the timestamp from when it was posted, and its source (whether an iPhone or an Android device).
;;; Keep just the tweet text, source, and timestamp from each tweet
;;; hash. Finally, remove retweets.
(define t
  (let ([tmp (map (λ (x) (list (hash-ref x 'text) (hash-ref x 'source)
                               (hash-ref x 'created_at))) tweets)])
    (filter (λ (x) (not (string-prefix? (first x) "RT"))) tmp)))
;;; Label each tweet as coming from iphone, android, or other.
(define tweet-by-type
  (map (λ (x) (list (first x)
                    (cond [(string-contains? (second x) "android") "android"]
                          [(string-contains? (second x) "iphone") "iphone"]
                          [else "other"])))
       t))
;;; Separate tweets by their source (created on iphone vs
;;; android... we aren't interested in tweets from other sources)
(define android (filter (λ (x) (string=? (second x) "android")) tweet-by-type))
(define iphone (filter (λ (x) (string=? (second x) "iphone")) tweet-by-type))
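The labeling and filtering steps can be tried on a handful of made-up tweets without any of the real data. In this sketch the tweet texts, the source strings, and the names `toy`, `device-of`, `android-toy`, and `iphone-toy` are all hypothetical; it also lowercases the source before matching, which is an assumption of the sketch rather than something the listing above does.

```racket
#lang racket

;; Made-up (text source) pairs for illustration only.
(define toy
  (list (list "Thank you Ohio!" "Twitter for iPhone")
        (list "RT @someone: hello" "Twitter for Android")
        (list "Sad!" "Twitter for Android")
        (list "New poll numbers" "Twitter Web Client")))

;; Map a raw source string onto one of three device labels.
(define (device-of source)
  (define s (string-downcase source))
  (cond [(string-contains? s "android") "android"]
        [(string-contains? s "iphone") "iphone"]
        [else "other"]))

;; Drop retweets, then label what is left.
(define originals
  (filter (λ (x) (not (string-prefix? (first x) "RT"))) toy))

(define labeled
  (map (λ (x) (list (first x) (device-of (second x)))) originals))

;; Split by device, ignoring the "other" category.
(define android-toy (filter (λ (x) (string=? (second x) "android")) labeled))
(define iphone-toy (filter (λ (x) (string=? (second x) "iphone")) labeled))

labeled
```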
Results
Time of Tweets
If different people are using the Android and iPhone devices, we might expect a temporal signature in when tweets are posted. We can visualize when tweets typically occur to test this suspicion:
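;;; Helper function for converting raw JSON timestamps to something we
;;; can use. The template-based string->date and date->string used here
;;; are assumed to come from SRFI 19; the import is not shown in the
;;; original listing.
(require (only-in srfi/19 string->date date->string))
(define (convert-timestamp str)
  (string->date str "~a ~b ~d ~H:~M:~S ~z ~Y"))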
;;; Timestamps by device
(define timestamp-by-type
  (map (λ (x) (list (third x)
                    (cond [(string-contains? (second x) "android") "android"]
                          [(string-contains? (second x) "iphone") "iphone"]
                          [else "other"])))
       t))
;;; Timestamp binning. Simple counts of data records by unit time.
(define (bin-timestamps timestamps)
  (let ([time-format "~H"])
    ;; Return
    (sorted-counts
     (map (λ (x) (date->string (convert-timestamp x) time-format)) timestamps))))
;;; Android binned times
(define a-time
  (bin-timestamps ($ (subset timestamp-by-type 1 (λ (x) (string=? x "android"))) 0)))
;;; iPhone binned times
(define i-time
  (bin-timestamps ($ (subset timestamp-by-type 1 (λ (x) (string=? x "iphone"))) 0)))
;;; Convert bin names to numbers
(define a-time (map (λ (x) (list (string->number (first x)) (second x))) a-time))
(define i-time (map (λ (x) (list (string->number (first x)) (second x))) i-time))
;;; Fill in missing data. This helper makes sure we have a bin for
;;; every hour, even if zero tweets were observed. `bins` should
;;; contain a list of all required bins. Missing bins will be filled
;;; with a zero
(define (fill-missing-bins lst bins)
  (define (get-count lst val)
    (let ([count (filter (λ (x) (equal? (first x) val)) lst)])
      (if (null? count) (list val 0) (first count))))
  (map (λ (bin) (get-count lst bin))
       bins))
;;; Convert UTC time to EST
(define (time-EST lst)
  (map list
       ($ lst 0)
       (append (drop ($ lst 1) 4) (take ($ lst 1) 4))))
;;; Convert bin counts to percentages
(define (count->percent lst)
  (let ([n (sum ($ lst 1))])
    (map list
         ($ lst 0)
         (map (λ (x) (* 100 (/ x n))) ($ lst 1)))))
;;; What time of day do the different devices tend to tweet?
(let ([a-data (count->percent (time-EST (fill-missing-bins a-time (range 24))))]
      [i-data (count->percent (time-EST (fill-missing-bins i-time (range 24))))])
  (parameterize ([plot-legend-anchor 'top-right]
                 [plot-width 600])
    (plot (list
           (tick-grid)
           (lines a-data
                  #:color "OrangeRed"
                  #:width 2
                  #:label "Android")
           (lines i-data
                  #:color "LightSeaGreen"
                  #:width 2
                  #:label "iPhone"))
          #:x-label "Hour of day (EST)"
          #:y-label "% of tweets")))
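The three transformations in this pipeline (fill empty hours, shift from UTC to local time, convert to percentages) are easy to check by hand on a tiny example. The sketch below uses only plain Racket lists and a hash; the counts and the names `utc-counts`, `rotate-left`, and `to-percent` are hypothetical and stand in for the data-science helpers used above.

```racket
#lang racket

;; Hypothetical tweet counts keyed by UTC hour; hours with no tweets
;; are simply absent from the table.
(define utc-counts (hash 4 3 16 1))

;; One count per hour 0..23, with zero for missing hours.
(define filled
  (map (λ (hour) (hash-ref utc-counts hour 0)) (range 24)))

;; Rotate left by k: position j now holds the count from UTC hour j + k.
(define (rotate-left lst k)
  (append (drop lst k) (take lst k)))

;; Express each count as a percentage of the total.
(define (to-percent counts)
  (define n (apply + counts))
  (map (λ (c) (* 100 (/ c n))) counts))

(define local-percent (to-percent (rotate-left filled 4)))

(list-ref local-percent 0)   ; share of tweets in the first local hour
(list-ref local-percent 12)  ; share of tweets in the local noon hour
```

A rotation by four positions corresponds to an offset of UTC-4, which is US Eastern daylight time; Eastern standard time would need a rotation by five.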
Trump’s Quotations
A quick look at Trump’s Twitter timeline reveals that many tweets quote someone else. We can plot what proportion of tweets from each device consist of quotations versus original content:
;;; Let's count how many tweets were quotes of someone else from both
;;; sources.
(define android-quotes
  (let ([quotes (map (λ (x) (if (string-prefix? x "\"") "Quoted" "Not Quoted"))
                     ($ android 0))])
    (let-values ([(label n) (count-samples quotes)])
      (map list label n))))
(define iphone-quotes
  (let ([quotes (map (λ (x) (if (string-prefix? x "\"") "Quoted" "Not Quoted"))
                     ($ iphone 0))])
    (let-values ([(label n) (count-samples quotes)])
      (map list label n))))
;;; Restructure the data for our histogram below
(define quoted
  (list `("Android" ,(second (first android-quotes)))
        `("iPhone" ,(second (second iphone-quotes)))))
(define not-quoted
  (list `("Android" ,(second (second android-quotes)))
        `("iPhone" ,(second (first iphone-quotes)))))
;;; Plot number of quoted vs unquoted tweets per device
(plot (list (discrete-histogram not-quoted
                                #:label "Not Quoted"
                                #:skip 2.5
                                #:x-min 0
                                #:color "OrangeRed"
                                #:line-color "OrangeRed")
            (discrete-histogram quoted
                                #:label "Quoted"
                                #:skip 2.5
                                #:x-min 1
                                #:color "LightSeaGreen"
                                #:line-color "LightSeaGreen"))
      #:y-max 1500
      #:x-label ""
      #:y-label "Number of tweets")
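The restructuring step in this listing picks counts by position (`first` or `second`), so it depends on the order in which the labels come back for each device. As a minimal sketch of the same tally with lookups by label instead, the following uses only core Racket; the tweets are made up and `quote-label` and `tally` are hypothetical helper names.

```racket
#lang racket

;; Made-up tweet texts for illustration only.
(define toy-tweets
  '("\"@fan: great speech!\""
    "Join me in Ohio tomorrow"
    "\"@voter: so true\""
    "Thank you Iowa!"
    "Big rally tonight"))

;; A tweet counts as a quotation when it starts with a double quote.
(define (quote-label text)
  (if (string-prefix? text "\"") "Quoted" "Not Quoted"))

;; Count how often each label occurs, keyed by the label itself.
(define (tally labels)
  (for/fold ([counts (hash)]) ([label labels])
    (hash-update counts label add1 0)))

(define counts (tally (map quote-label toy-tweets)))

(hash-ref counts "Quoted" 0)
(hash-ref counts "Not Quoted" 0)
```

Looking counts up by label also returns zero cleanly when a device has no tweets in one of the categories.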
This figure demonstrates clearly that the iPhone user generates almost entirely novel content. Many tweets from this device function as press releases, announcements for upcoming events, etc.
Linked Content (Images)
Moving forward, we want to focus on tweets containing novel content (the red-orange bars in the figure above). We remove any tweet beginning with quotation marks.
;;; From here forward, we remove quoted tweets to focus exclusively on
;;; content unique to the Trump twitter feed
(define android
  (filter (λ (x) (not (string-prefix? (first x) "\""))) android))
(define iphone
  (filter (λ (x) (not (string-prefix? (first x) "\""))) iphone))
Trump’s Twitter account often shares images and links related to campaign events. Is this content shared equally across devices?
;;; Next we count how many tweets from each device link to an image
(define android-pics
  (let ([pics (map (λ (x)
                     (if (string-contains? x "t.co")
                         "Picture/Link"
                         "No picture/link"))
                   ($ android 0))])
    (let-values ([(label n) (count-samples pics)])
      (map list label n))))
(define iphone-pics
  (let ([pics (map (λ (x)
                     (if (string-contains? x "t.co")
                         "Picture/Link"
                         "No picture/link"))
                   ($ iphone 0))])
    (let-values ([(label n) (count-samples pics)])
      (map list label n))))
;;; Restructure the data for our histogram below
(define pics
  (list `("Android" ,(second (second android-pics)))
        `("iPhone" ,(second (second iphone-pics)))))
(define no-pics
  (list `("Android" ,(second (first android-pics)))
        `("iPhone" ,(second (first iphone-pics)))))
;;; Finally, we can plot the ratio of tweets with pics to no pics
(parameterize ([plot-legend-anchor 'top-right])
  (plot (list (discrete-histogram no-pics
                                  #:label "No picture/link"
                                  #:skip 2.5
                                  #:x-min 0
                                  #:color "OrangeRed"
                                  #:line-color "OrangeRed")
              (discrete-histogram pics
                                  #:label "Picture/link"
                                  #:skip 2.5
                                  #:x-min 1
                                  #:color "LightSeaGreen"
                                  #:line-color "LightSeaGreen"))
        #:y-max 1100
        #:x-label ""
        #:y-label "Number of tweets"))
Once again, we see a distinct pattern. The iPhone device shares a rich amount of media content.
Most Common Words Across Devices
For the remaining analysis, we’ll work at the level of the text itself. To begin, we clean up using typical text-processing strategies: normalize case and remove URLs, punctuation, stop words, and extra spaces. Then we split tweets into words and combine them across tweets for each device, as well as for both devices combined.
Finally, we plot the 20 most frequent words used across both devices.
;;; Normalize case, remove URLs, remove punctuation, and remove spaces
;;; from each tweet.
(define (preprocess-text str)
  (string-normalize-spaces
   (remove-punctuation
    (remove-urls
     (string-downcase str)) #:websafe? #t)))
(define a (map (λ (x)
                 (remove-stopwords x))
               (map (λ (y)
                      (string-split (preprocess-text y)))
                    ($ android 0))))
(define i (map (λ (x)
                 (remove-stopwords x))
               (map (λ (y)
                      (string-split (preprocess-text y)))
                    ($ iphone 0))))
;;; Remove empty strings and flatten tweets into a single list of
;;; words
(define a (filter (λ (x) (not (equal? x ""))) (flatten a)))
(define i (filter (λ (x) (not (equal? x ""))) (flatten i)))
;;; All words from both sources of tweets
(define b (append a i))
;;; Only words that are used by both devices
(define c (set-intersect a i))
;;; Word list from android and iphone tweets
(define awords (sort (sorted-counts a)
                     (λ (x y) (> (second x) (second y)))))
(define iwords (sort (sorted-counts i)
                     (λ (x y) (> (second x) (second y)))))
(define bwords (sort (sorted-counts b)
                     (λ (x y) (> (second x) (second y)))))
;;; Plot the top 20 words from both devices combined
(parameterize ([plot-width 600]
               [plot-height 600])
  (plot (list
         (tick-grid)
         (discrete-histogram (reverse (take bwords 20))
                             #:invert? #t
                             #:color "DimGray"
                             #:line-color "DimGray"
                             #:y-max 450))
        #:x-label "Occurrences"
        #:y-label "word"))
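For readers without the data-science package installed, the cleaning and counting steps can be approximated with core Racket alone. This is a rough sketch under stated assumptions: the stop-word list is a tiny made-up stand-in, the regular expressions are simplifications of what `remove-urls` and `remove-punctuation` do, and the tweets and the names `clean`, `tokens`, and `word-counts` are hypothetical.

```racket
#lang racket

;; A tiny stand-in stop-word list (hypothetical; the analysis above
;; relies on remove-stopwords from the data-science package).
(define stop-words '("the" "a" "and" "to" "is"))

;; Lowercase, collapse whitespace, strip URLs, then drop everything
;; except lowercase letters, digits, spaces, # and @.
(define (clean str)
  (let* ([s (string-downcase str)]
         [s (string-normalize-spaces s)]
         [s (regexp-replace* #rx"https?://[^ ]+" s "")]
         [s (regexp-replace* #rx"[^a-z0-9#@ ]" s "")])
    s))

;; Split a cleaned tweet into words and drop stop words.
(define (tokens str)
  (filter (λ (w) (not (member w stop-words)))
          (string-split (clean str))))

;; Count occurrences and sort from most to least frequent.
(define (word-counts words)
  (define counts
    (for/fold ([h (hash)]) ([w words])
      (hash-update h w add1 0)))
  (sort (hash->list counts) > #:key cdr))

;; Made-up tweets for illustration only.
(define toy-tweets
  '("Make America GREAT again! https://t.co/abc"
    "America first. America is the best #MAGA"))

(define top (word-counts (append-map tokens toy-tweets)))
(first top)
```

Hashtags and mentions survive the cleaning step because `#` and `@` are kept, which mirrors the intent of the `#:websafe? #t` option in the listing above.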
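These terms should be familiar if you’ve been following the Trump spin machine. Even more interesting is a comparison between the words from the Android and iPhone devices. We use the log odds ratio, which, unlike a raw ratio, is symmetric around zero: with base 2, a word whose relative frequency is twice as high on one device scores +1, and a word twice as frequent on the other device scores -1.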
;;; Now we calculate the log odds ratio of words showing up in tweets
;;; across both devices
(define (get-word-freq w lst)
  (let ([word-freq (filter (λ (x) (equal? (first x) w)) lst)])
    (if (null? word-freq) 0 (second (first word-freq)))))
;;; Next, calculate the log odds for the top-20 words from the
;;; android, and then the log odds for the top-20 words from the
;;; iphone.
(define log-odds
  (map (λ (x)
         `(,x
           ,(log-base (/
                       (/ (add1 (get-word-freq x awords)) (add1 (length a)))
                       (/ (add1 (get-word-freq x iwords)) (add1 (length i))))
                      #:base 2)))
       c))
;;; Plot the results
(let ([android-words (take (sort log-odds (λ (x y) (> (second x) (second y)))) 20)]
      [iphone-words (take (sort log-odds (λ (x y) (< (second x) (second y)))) 20)])
  (parameterize ([plot-width 600]
                 [plot-height 600])
    (plot (list
           (discrete-histogram iphone-words
                               #:invert? #t
                               #:y-min -7.5
                               #:y-max 5
                               #:color "LightSeaGreen"
                               #:line-color "LightSeaGreen")
           (discrete-histogram (reverse android-words)
                               #:invert? #t
                               #:x-min 20
                               #:y-min -7.5
                               #:y-max 5
                               #:color "OrangeRed"
                               #:line-color "OrangeRed"))
          #:x-label "Android / iPhone log ratio"
          #:y-label "word")))
In the above figure, positive log ratios reflect words more commonly found in Android tweets, and negative log ratios reflect words more common in iPhone-derived tweets. A quick visual scan reveals a clear distinction between the two groups of tweets. The iPhone features campaign slogans and hashtags, dates, and event information. The Android device, by comparison, does not, but it does feature many more emotionally charged words.
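The sign convention is easiest to see with small numbers. The sketch below recomputes the same add-one smoothed, base-2 ratio using only core Racket; the counts are made up, and `log-base-2` and `log-odds-ratio` are hypothetical names standing in for the `log-base` call used in the analysis.

```racket
#lang racket

(define (log-base-2 x)
  (/ (log x) (log 2)))

;; Add-one smoothed log2 ratio of a word's relative frequency on
;; device A versus device B. Positive means more typical of A.
(define (log-odds-ratio count-a total-a count-b total-b)
  (log-base-2
   (exact->inexact
    (/ (/ (add1 count-a) (add1 total-a))
       (/ (add1 count-b) (add1 total-b))))))

;; Hypothetical counts, with 99 words observed on each device.
(define more-on-a (log-odds-ratio 7 99 1 99))   ; (8/100) / (2/100) = 4
(define more-on-b (log-odds-ratio 1 99 7 99))   ; (2/100) / (8/100) = 1/4
(define absent-on-b (log-odds-ratio 1 99 0 99)) ; (2/100) / (1/100) = 2

(list more-on-a more-on-b absent-on-b)
```

Swapping the two devices flips the sign but not the magnitude, and the add-one smoothing keeps the ratio finite even when a word never appears on one device.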
Sentiment Analysis
In the figure above, the top words suggest a difference in sentiment, or emotional valence, between the two devices. We can assess this further with sentiment analysis by assigning an affective label to each word in the tweets. Words are assigned one of the following labels (or none at all):
- sadness
- fear
- anger
- disgust
- surprise
- anticipation
- trust
- joy
By tabulating the frequency of sentiment scores, we can inspect the top 10 most influential words from each emotional label, and determine whether they occur most frequently in android vs iPhone tweets.
;;; Sentiment from words coming from either device
(define bsentiment (filter (λ (x) (second x)) (list->sentiment bwords #:lexicon 'nrc)))
;;; We calculate the log odds for each affective label from the
;;; sentiment analysis
(define (find-sentiment-log-odds str)
  (map (λ (x)
         `(,x
           ,(log-base (/
                       (/ (add1 (get-word-freq x awords)) (add1 (length a)))
                       (/ (add1 (get-word-freq x iwords)) (add1 (length i))))
                      #:base 2)))
       ($ (subset bsentiment 1 (λ (x) (string=? x str))) 0)))
;;; Apply the above helper to each affective label
(define sadness-lo (find-sentiment-log-odds "sadness"))
(define fear-lo (find-sentiment-log-odds "fear"))
(define anger-lo (find-sentiment-log-odds "anger"))
(define disgust-lo (find-sentiment-log-odds "disgust"))
(define surprise-lo (find-sentiment-log-odds "surprise"))
(define anticipation-lo (find-sentiment-log-odds "anticipation"))
(define trust-lo (find-sentiment-log-odds "trust"))
(define joy-lo (find-sentiment-log-odds "joy"))
;;; Helper plotting function
(define (plot-sentiment lst)
  (let* ([n (min 10 (length lst))]
         [top-words (take (sort lst (λ (x y) (> (abs (second x)) (abs (second y))))) n)]
         [android-words (filter (λ (x) (positive? (second x))) top-words)]
         [iphone-words (filter (λ (x) (negative? (second x))) top-words)])
    (parameterize ([plot-width 300]
                   [plot-height 400]
                   [plot-x-tick-label-anchor 'right]
                   [plot-x-tick-label-angle 90])
      (plot-pict (list
                  (discrete-histogram android-words
                                      #:y-min -4
                                      #:y-max 4
                                      #:color "OrangeRed"
                                      #:line-color "OrangeRed")
                  (discrete-histogram (reverse iphone-words)
                                      #:x-min (length android-words)
                                      #:y-min -4
                                      #:y-max 4
                                      #:color "LightSeaGreen"
                                      #:line-color "LightSeaGreen"))
                 #:x-label ""
                 #:y-label ""))))
;;; Plot everything together
(vl-append
 (hc-append (ct-superimpose (plot-sentiment sadness-lo) (text "sadness" null 20))
            (ct-superimpose (plot-sentiment fear-lo) (text "fear" null 20))
            (ct-superimpose (plot-sentiment anger-lo) (text "anger" null 20))
            (ct-superimpose (plot-sentiment disgust-lo) (text "disgust" null 20)))
 (hc-append (ct-superimpose (plot-sentiment surprise-lo) (text "surprise" null 20))
            (ct-superimpose (plot-sentiment anticipation-lo) (text "anticipation" null 20))
            (ct-superimpose (plot-sentiment trust-lo) (text "trust" null 20))
            (ct-superimpose (plot-sentiment joy-lo) (text "joy" null 20))))
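The selection logic inside `plot-sentiment` (take the words with the largest absolute log ratio, then split them by sign) can be checked separately from the plotting. In this minimal sketch the word scores are invented and `top-by-magnitude` is a hypothetical helper name; no lexicon or plotting library is involved.

```racket
#lang racket

;; Hypothetical (word log-ratio) pairs for a single affective label.
(define scores
  '(("crooked" 3.0) ("join" -2.5) ("weak" 1.5) ("thank" -1.0) ("bad" 0.5)))

;; Keep the n entries with the largest absolute log ratio.
(define (top-by-magnitude lst n)
  (take (sort lst > #:key (λ (x) (abs (second x))))
        (min n (length lst))))

(define top-words (top-by-magnitude scores 3))

;; Positive ratios lean Android, negative ratios lean iPhone.
(define android-side (filter (λ (x) (positive? (second x))) top-words))
(define iphone-side (filter (λ (x) (negative? (second x))) top-words))

(list android-side iphone-side)
```

Because the ranking uses the absolute value, a panel can end up with words from only one device when that device dominates a given emotion.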