Steve and Leo often joke about how long Security Now! episodes have been getting, and I'd sort of had it in the back of my mind for a while to actually look into it. The other day the moon was finally in the right phase, so I typed up a quick script to scrape the Security Now! page on GRC.com for the episode lengths and plot them on a chart.
A chart showing that the length of Security Now! episodes has been increasing over time
The chart got a positive reception from Steve and from many people who saw him retweet it, and I wondered what some deeper analysis might turn up. So after doing one more on-the-fly chart separating feedback from non-feedback episodes, I started over, intending to do things in a more scalable and flexible way than the off-the-cuff code I'd used before.
You can get in touch with me at or on Twitter @cyphase. I'm also cyphase on TWiT IRC (webchat).
If you appreciate this work, and/or want to see more sooner rather than maybe someday, you can send me bitcoins: 1AWZy5X89KH54ntYcZXELgDfuuoFosmm7q
Steve Gibson (@SGgrc), for all his work on the podcast, and everything else.
Leo Laporte (@leolaporte), for convincing Steve to do the show and making it possible.
Elaine Farris (@ElaineFarris), for her awesome transcripts, without which much of this analysis would have been impossible.
TWiT, for giving us something to listen to.
I'm also grateful for all the too-numerous-to-mention developers of Python, IPython, NumPy, pandas, matplotlib, Vincent, prettyplotlib, seaborn, Anaconda, and everything else in the PyData stack and beyond.
Non-code text like this is the main textual content that most people will be interested in, along with the charts and tables, of course.
The code is displayed by default; if you're not interested in it, you can toggle it off here, or with the bigger link below. JavaScript must be enabled for the toggle to work.
%matplotlib inline
""" This is a Python docstring. """
def make_cool_charts(data):
    print "Processing %s data..." % data
    print "Done!"
security_now_data = "Security Now!"
make_cool_charts(security_now_data)
This section is mostly about initializing, retrieving the data and cleaning it up; there isn't any analysis here.
#
# Import dependencies and initialize
#
import re
from collections import namedtuple, Counter
from datetime import datetime
from math import log
from urllib2 import urlopen
from HTMLParser import HTMLParser
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
# Set the default matplotlib figure size
pylab.rcParams['figure.figsize'] = (20, 12)
#import vincent
## Initialize vincent's IPython integration
#vincent.core.initialize_notebook()
# Whether the transcripts are available locally
TRANSCRIPTS_EXIST_LOCALLY = True
# Should the transcripts be saved to disk once they're loaded
TRANSCRIPTS_SAVE_TO_DISK = not TRANSCRIPTS_EXIST_LOCALLY
# The remote URL as a format string that takes the episode number
TRANSCRIPTS_REMOTE_URL = 'https://www.grc.com/sn/sn-%03d.txt'
# The local path as a format string that takes the episode number
TRANSCRIPTS_LOCAL_PATH = 'sn_transcripts/%03d.txt'
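# For example, episode 1 formats to 'https://www.grc.com/sn/sn-001.txt'
# remotely, and 'sn_transcripts/001.txt' locally.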
#
# Get raw HTML from episode archive pages
#
# Order of the URLs doesn't matter of course, but doing it this way will
# retrieve the information in order of descending episode number
sn_episode_archive_urls = ['https://www.grc.com/securitynow.htm']
sn_episode_archive_urls.extend('https://www.grc.com/sn/past/%s.htm' % year for year in range(2011,2004,-1))
sn_episode_archive_html = '\n'.join(urlopen(url).read() for url in sn_episode_archive_urls)
#
# Get episode information from the raw HTML of the archive pages
#
raw_episodes_info_regex = re.compile(
    r'Episode.*?(?P<ep_num>\d*?) \| (?P<ep_date>.*?) \| (?P<ep_len>\d*).*?'
    '<b>(?P<ep_title>.*?)</b>.*?', flags=re.MULTILINE|re.DOTALL)
raw_episodes_info = raw_episodes_info_regex.findall(sn_episode_archive_html)
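To show what the pattern extracts, here it is run against a made-up fragment shaped roughly like the archive markup (not actual GRC HTML); each match is a tuple of episode number, date, length in minutes, and title:
# Illustrative only: a fabricated snippet in the general shape the regex expects.
sample_fragment = 'Episode #123 | 01 Jan 2010 | 99 min.<br><b>Example Title</b>'
raw_episodes_info_regex.findall(sample_fragment)
# [('123', '01 Jan 2010', '99', 'Example Title')]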
Here's a sample of the collected data (the most recent five episodes), fresh from the regular expression used to scrape the archive pages:
raw_episodes_info[:5]
#
# Create DataFrame and clean the data
#
episodes_info = pd.DataFrame(raw_episodes_info)
episodes_info.columns = ['Episode #', 'Date', 'Length (mins)', 'Title']
func_list = (int, pd.to_datetime, int, HTMLParser().unescape)
for name,func in zip(episodes_info.columns, func_list):
    episodes_info[name] = episodes_info[name].apply(func)
episodes_info.index = episodes_info['Episode #']
del episodes_info['Episode #']
episodes_info.sort_index(inplace=True)
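As a quick illustration of why the HTMLParser().unescape step is there (made-up string, not an actual episode title), titles in the archive HTML contain entities that need decoding:
# Illustrative only: decoding HTML entities the way the Title column is cleaned.
print HTMLParser().unescape('Listener Feedback &amp; Questions')
# Listener Feedback & Questions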
The most recent five episodes, cleaned up and loaded into a pandas.DataFrame:
episodes_info.tail(5)
# Create "Is Feedback" column
episodes_info['Is Feedback'] = episodes_info['Title'].str.contains(
    r'(?=Listener)|(?=Feedback)|(?=Q\&A)')
Added a new column, Is Feedback, indicating whether the episode is a feedback episode:
episodes_info.tail(5)
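To show what the pattern flags, here it is applied to a few made-up titles (illustrative only, not actual episode titles):
sample_titles = pd.Series(['Listener Feedback #1', 'Your Questions, Q&A Style', 'Some Deep Dive'])
print sample_titles.str.contains(r'(?=Listener)|(?=Feedback)|(?=Q\&A)')
# -> True, True, False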
#
# Get/load transcripts, and possibly save them to disk
#
def get_transcript(ep_num):
    if TRANSCRIPTS_EXIST_LOCALLY:
        with open(TRANSCRIPTS_LOCAL_PATH % ep_num, 'r') as f:
            data = f.read()
    else:
        data = urlopen(TRANSCRIPTS_REMOTE_URL % ep_num).read()
    return data
episodes_info['Raw Transcript'] = [get_transcript(ep_num)
                                   for ep_num in episodes_info.index]
# Possibly save the transcripts to disk
if TRANSCRIPTS_SAVE_TO_DISK and TRANSCRIPTS_LOCAL_PATH:
    for idx in episodes_info.index:
        with open(TRANSCRIPTS_LOCAL_PATH % idx, 'w') as f:
            f.write(episodes_info['Raw Transcript'][idx])
The raw transcript information has been added for each episode under the column Raw Transcript:
episodes_info.head(5)
The following blob of code implements just enough transcript parsing and querying for the current analysis.
ParsedLine = namedtuple('ParsedLine', ['kind', 'tag', 'content'])
AnalyzedLine = namedtuple('AnalyzedLine', ['kind', 'metadata', 'content'])
Speech = namedtuple('Speech', ['speaker', 'content'])
colontag_pattern = re.compile(r'^([A-Z][A-Z ]*):')
brackettag_pattern = re.compile(r'^\[(.*?)\]')
words_pattern = re.compile(r'[a-z][-a-z]*[a-z]')
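# A quick, made-up illustration (not from a real transcript) of what the three
# patterns pick out of a line:
print colontag_pattern.match('STEVE: Hello, Leo.').group(1)    # STEVE
print brackettag_pattern.match('[SOUND:LAUGHTER]').group(1)    # SOUND:LAUGHTER
print words_pattern.findall('so-called zero-day exploits')     # ['so-called', 'zero-day', 'exploits']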
class Transcript(object):
    def __init__(self, raw_transcript):
        # Clean the raw text, split it into non-empty lines, then parse and
        # analyze each line so the transcript can be queried by speaker.
        self._cleaned_raw_transcript = self._clean_transcript(raw_transcript)
        self._lines = [line.strip()
                       for line in self._cleaned_raw_transcript.splitlines()
                       if line.strip()]
        self._parsed_lines = [self._parse_line(line) for line in self._lines]
        self._analyzed_lines = self._analyze_lines(self._parsed_lines)
        self._normalize_speakers()
    def get_speech_by(self, speakers=[], speech_type='', not_in=False):
        # Yield Speech tuples for the given speakers (or, with not_in=True,
        # for everyone except the given speakers).
        speech_type = "speech.%s" % speech_type
        for a_line in self._analyzed_lines:
            if a_line.kind.startswith(speech_type):
                speaker_test = a_line.metadata['speaker'] in speakers
                if not_in:
                    speaker_test = not speaker_test
                if speaker_test:
                    yield Speech(a_line.metadata['speaker'], a_line.content)
    @property
    def all_speech(self):
        return self.get_speech_by(not_in=True)
    def get_word_freq(self, speakers=[], speech_type='', not_in=False):
        # Count word occurrences (case-insensitively) in the selected speech.
        speeches = self.get_speech_by(speakers, speech_type, not_in)
        c = Counter()
        for speech in speeches:
            c.update(words_pattern.findall(speech.content.lower()))
        return c
    def _normalize_speakers(self):
        pass
    @staticmethod
    def _clean_transcript(raw_transcript):
        # Normalize inconsistent annotations in the raw transcripts so that
        # they parse uniformly.
        replacement_list = [
            ('CLIP:', '[CLIP:ONELINE]'),
            ('[Clip]', '[CLIP:START]'), ('[CLIP]', '[CLIP:START]'),
            ('[Begin clip]', '[CLIP:START]'), ('[Video clip]', '[CLIP:START]'), ('[Pause clip]', '[CLIP:PAUSE]'),
            ('[Resume clip]', '[CLIP:RESUME]'), ('[End clip]', '[CLIP:END]'), ('[End video clip]', '[CLIP:END]'),
            ('[Begin KABC7 interview]', '[CLIP:START]'), ('[End interview]', '[CLIP:END]'),
            ('[Begin embedded clip]', '[SUBCLIP:START]'), ('[Begin 1990 recording]', '[CLIP:START]'),
            ('[Applause and SpinRite giveaway]', '[SOUND:APPLAUSE]\n[CUSTOM:SpinRite giveaway]\n[CLIP:END]'),
            ('[Talking simultaneously]', '[SOUND:CROSSTALK]'), ('[Speaking simultaneously]', '[SOUND:CROSSTALK]'),
            ('[Crosstalk]', '[SOUND:CROSSTALK]'), ('[Laughter]', '[SOUND:LAUGHTER]'), ('[Laughing]', '[SOUND:LAUGHTER]'),
            ('[laughing]', '[SOUND:LAUGHTER]'), ('[laughter]', '[SOUND:LAUGHTER]'), ('[Music]', '[SOUND:MUSIC]'),
            ('[Commercial break]', '[COMMERCIAL]'), ('[Interruption]', '[INTERRUPTION]'),
            ('[Loud yabba-dabba do]', '[YABBA_DABBA_DO]'), ('[Barely audible "yabba-dabba do"]', '[YABBA_DABBA_DO]'),
            ('[indiscernible]', '[SOUND:INDISCERNIBLE]'), ('[Indiscernible]', '[SOUND:INDISCERNIBLE]'), ('[sic]', '[SIC]'),
            ('[Sighing]', '[SOUND:SIGHING]'), ('[sighing]', '[SOUND:SIGHING]'), ('[Australian accent]', '[ACCENT:AUSTRALIAN]'),
            ('[Indian accent]', '[ACCENT:INDIAN]'), ('[With accent]', '[ACCENT]'), ('[Italian accent]', '[ACCENT:ITALIAN]'),
            ('[Bad accent]', '[ACCENT]'), ('[in bad Italian accent]', '[ACCENT:ITALIAN]'),
            ('[In a British accent]', '[ACCENT:BRITISH]'), ('[Accent]', '[ACCENT]'), ('[Dracula accent]', '[ACCENT:DRACULA]')]
        replacement_list.extend([('Title:\t\t', 'TITLE:\t\t')])
        for start,end in replacement_list:
            raw_transcript = raw_transcript.replace(start, end)
        return raw_transcript
    @staticmethod
    def _parse_line(line):
        colontag_match = colontag_pattern.match(line)
        if colontag_match:
            pl_kind = 'colon'
            pl_tag = colontag_match.group(1)
            pl_content = line[len(colontag_match.group(0)):].strip()
            parsed_line = ParsedLine(pl_kind, pl_tag, pl_content)
        else:
            brackettag_match = brackettag_pattern.match(line)
            if brackettag_match:
                pl_kind = 'bracket'
                pl_tag = brackettag_match.group(1)
                pl_content = line[len(brackettag_match.group(0)):].strip()
                parsed_line = ParsedLine(pl_kind, pl_tag, pl_content)
            elif line.startswith('GIBSON RESEARCH CORPORATION'):
                parsed_line = ParsedLine(kind='head', tag=None, content=line)
            else:
                parsed_line = ParsedLine(kind=None, tag=None, content=line)
        return parsed_line
    @staticmethod
    def _analyze_lines(parsed_lines):
        header_colontags = ['DATE', 'DESCRIPTION', 'EPISODE', 'FILE ARCHIVE',
                            'GUEST', 'INTRO', 'SERIES', 'SHOW TEASE',
                            'SOURCE FILE', 'SPEAKERS', 'TITLE']
        misc_colontags = ['BOTH', 'CLIP', 'DNS']
        analyzed_lines = []
        for idx,p_line in enumerate(parsed_lines):
            if p_line.kind == 'colon':
                if p_line.tag in header_colontags:
                    al_kind = p_line.tag
                    al_metadata = None
                    al_content = p_line.content
                    analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
                else:
                    al_kind = 'speech.initial'
                    al_metadata = {'speaker': p_line.tag.lower()}
                    al_content = p_line.content
                    analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            elif p_line.kind == 'bracket':
                al_kind = 'bracket'
                al_metadata = None
                al_content = p_line.content
                analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            elif p_line.kind is None:
                # An untagged line continues the most recent speaker's speech.
                last_speech_line = (analyzed_lines[i] for i in xrange(len(analyzed_lines)-1, -1, -1)
                                    if analyzed_lines[i].kind.startswith('speech.')).next()
                al_kind = 'speech.cont'
                al_metadata = {'speaker': last_speech_line.metadata['speaker']}
                al_content = p_line.content
                analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            elif p_line.kind == 'head':
                al_kind = 'head'
                al_metadata = None
                al_content = p_line.content
                analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            else:
                raise Exception("This should never happen")
            analyzed_lines.append(analyzed_line)
        return analyzed_lines
STEVE = ['steve', 'steve gibson']
LEO = ['leo', 'leo laporte']
transcripts = dict((ep_num,Transcript(episodes_info["Raw Transcript"][ep_num]))
                   for ep_num in episodes_info.index)
word_counts = dict((ep,transcripts[ep].get_word_freq(not_in=True)) for ep in episodes_info.index)
steve_word_counts = dict((ep,transcripts[ep].get_word_freq(speakers=STEVE)) for ep in episodes_info.index)
leo_word_counts = dict((ep,transcripts[ep].get_word_freq(speakers=LEO)) for ep in episodes_info.index)
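As a quick sanity check on the parser, here it is run on a tiny made-up snippet (these lines are illustrative, not from a real episode). Each piece of speech comes back tagged with its speaker, and the untagged continuation line is attributed to the previous speaker:
sample_transcript = """GIBSON RESEARCH CORPORATION    http://www.GRC.com/

LEO: Welcome to the show.

STEVE: Thanks, Leo. Let's talk about security.
More on that in a moment.

[Laughter]
"""
toy = Transcript(sample_transcript)
for speech in toy.all_speech:
    print speech.speaker, '->', speech.content
# leo -> Welcome to the show.
# steve -> Thanks, Leo. Let's talk about security.
# steve -> More on that in a moment.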
A simple chart without any smoothing or other modifications. It's very volatile, but you can see the upward trend.
devnull = episodes_info['Length (mins)'].plot(legend=False, xticks=range(0,500,25), yticks=range(0,140,10))
This moving average calculation uses a window centered on the x coordinate; the 12-episode average, for example, includes the preceding 6 and following 5 values. Because 6 preceding values are required for the calculation, the lines don't start at the y-axis (x = 0).
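As a small illustration of how the centered window behaves, here it is on some toy numbers (not the real series); points near the edges come out as NaN because the window would run past the data:
demo = pd.Series([10, 20, 30, 40, 50, 60])
print pd.rolling_mean(demo, 4, center=True)
# With a centered window of 4, each point averages the 2 preceding values, the
# point itself, and the 1 following value, so the expected output is roughly:
# NaN, NaN, 25.0, 35.0, 45.0, NaN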
episodes_info['4-episode Moving Average'] = pd.rolling_mean(episodes_info['Length (mins)'], 4, center=True)
episodes_info['12-episode Moving Average'] = pd.rolling_mean(episodes_info['Length (mins)'], 12, center=True)
to_plot = ['4-episode Moving Average', '12-episode Moving Average']
devnull = episodes_info[to_plot].plot(style=['g', 'r'], legend=True, xticks=range(0,500,25), yticks=range(0,140,10))
length_expanding_max = pd.expanding_max(episodes_info['Length (mins)'])
dups = length_expanding_max.duplicated()
max_pusher_table = pd.DataFrame(episodes_info['Length (mins)'][dups == False]).T
The expanding maximum is the maximum of all the episode lengths up to the current episode. As you can see, there have only been 18 episodes that increased the maximum episode length.
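A minimal illustration with toy numbers (not the real data): each entry of the expanding maximum is the largest value seen so far, and the episodes where it increases are the ones kept by the duplicated() filter above.
demo_lengths = pd.Series([30, 45, 40, 60, 55])
print pd.expanding_max(demo_lengths)
# -> 30, 45, 45, 60, 60; only the entries at positions 0, 1 and 3 pushed the maximum up.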
devnull = pd.expanding_max(episodes_info['Length (mins)']).plot(legend=False, xticks=range(0,500,25), yticks=range(0,140,10))
max_pusher_table
This is just an educated guess, but I believe the reason for the drop in WPM between episodes 75-100 is that the transcripts stopped including the advertising live reads from Leo. That reduced the number of transcribed words per episode, but not the episode length, so the WPM decreased. See the next chart, showing the frequency of ad-related words, for some evidence of this.
# 12-episode rolling mean of words per minute: total transcribed words in each
# episode divided by the episode length in minutes.
devnull = pd.rolling_mean(pd.Series((
    sum(word_counts[ep_num].values()) / episodes_info['Length (mins)'][ep_num]
    for ep_num in episodes_info.index), index=episodes_info.index), 12).plot(legend=True,
        xticks=range(0,500,25),
        yticks=range(125,160,5))
def word_frequency_chart(words, xticks=None, yticks=None):
    # Cumulative number of times each word has been said, by episode.
    return pd.DataFrame(dict((word, [word_counts[ep_num].get(word.lower(), 0) for ep_num in episodes_info.index])
                             for word in words)).cumsum().plot(legend=True, xticks=xticks, yticks=yticks)
def word_occurence_chart(words, xticks=None, yticks=None):
    # Cumulative number of episodes that have mentioned each word at least once.
    return pd.DataFrame(dict((word, [int(word.lower() in word_counts[ep_num]) for ep_num in episodes_info.index])
                             for word in words)).cumsum().plot(legend=True, xticks=xticks, yticks=yticks)
The Astaro and Dell data may be skewed by the fact that their sudden change in frequency happened around the same time the transcripts stopped including the advertising live reads from Leo. You can see a subtler change in the frequency of 'Sponsor' around the same time.
words = ['AOL', 'Astaro', 'Audible', 'Carbonite', 'Dell', 'ProXPN', 'Sponsor']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,700,25))
words = ['Krebs', 'Schneier', 'Kaminsky', 'Bernstein']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,140,10))
words = ['Flash', 'Java', 'Reader']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,1400,50))
Currently the code can't easily combine IE and Explorer, but if you add the two lines' final values, you can see that Firefox just barely edges out Internet Explorer. A rough way to combine them is sketched after the chart.
words = ['Firefox', 'Chrome', 'Chromium', 'Opera', 'Explorer', 'IE']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,2000,50))
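A rough sketch of one way to combine them (not part of the charts above): just sum the per-episode counts of 'ie' and 'explorer' before taking the cumulative sum. Note that this may double-count sentences that mention both forms.
ie_combined = pd.Series([word_counts[ep_num].get('ie', 0) + word_counts[ep_num].get('explorer', 0)
                         for ep_num in episodes_info.index], index=episodes_info.index)
devnull = ie_combined.cumsum().plot(legend=False, xticks=range(0,500,25))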
words = ['NSA', 'Snowden', 'ProXPN', 'VPN']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,1000,50))
words = ['Sci-Fi', 'Breach', 'Jenny', 'Bitcoin', 'Zerafa']
devnull = word_frequency_chart(words=words, xticks=range(0,475,25), yticks=range(0,800,50))
Grouping the episodes by their position within a 52-episode (roughly one-year) cycle, there's only about an 11-minute spread in average length.
# Average episode length by position within a 52-episode cycle (offset by 21
# episodes), smoothed with a 4-point rolling mean.
x = pd.rolling_mean(episodes_info['Length (mins)'].groupby(lambda x: -1 if x < 0 else (x-21)%52).mean(),4)
devnull = x.plot(xticks=range(0,53,4))
# The expanding mean: the average length of all episodes up to each point.
devnull = pd.expanding_mean(episodes_info['Length (mins)']).plot()