Security Now! Stats

Steve and Leo often joke about how long Security Now! episodes have been getting, and I'd sort of had it in the back of my mind for a while to actually look into it. The other day the moon was finally in the right phase, so I typed up a quick script to scrape the Security Now! page on GRC.com for the episode lengths and plot them on a chart.

A chart showing that the length of Security Now! episodes has been increasing over time

It got a positive reception from Steve and from many people who saw him retweet it, and I wondered what deeper analysis might turn up. So, after doing one more on-the-fly chart separating feedback and non-feedback episodes, I started over with the intention of doing things in a more scalable, flexible manner than the off-the-cuff code I had used before.

A chart showing episode length, but separating feedback and non-feedback episodes; the two lines have similar shapes, but the feedback episodes line is higher than the non-feedback line at all but two places.

Get in touch

You can get in touch with me at or on Twitter @cyphase. I'm also cyphase on TWiT IRC (webchat).

If you appreciate this work, and/or want to see more sooner rather than maybe someday, you can send me bitcoins: 1AWZy5X89KH54ntYcZXELgDfuuoFosmm7q

A Bitcoin QR code

Credits & Thanks

Steve Gibson (@SGgrc), for all his work on the podcast, and everything else.

Leo Laporte (@leolaporte), for convincing Steve to do the show and making it possible.

Elaine Farris (@ElaineFarris), for her awesome transcripts, without which much of this analysis would have been impossible.

TWiT, for giving us something to listen to.

I'm also grateful for all the too-numerous-to-mention developers of Python, IPython, NumPy, pandas, matplotlib, Vincent, prettyplotlib, seaborn, Anaconda, and everything else in the PyData stack and beyond.

Document Notes

Non-code text like this is the main textual content that most people will be interested in, along with the charts and tables, of course.

The code is displayed by default; if you're not interested in seeing it, you can toggle it here, or with the bigger link below. JavaScript must be enabled for the toggle to work.

In [1]:
%matplotlib inline
""" This is a Python docstring. """

def make_cool_charts(data):
    print "Processing %s data..." % data
    print "Done!"

security_now_data = "Security Now!"

make_cool_charts(security_now_data)
Processing Security Now! data...
Done!

Let's begin!

Initialization and data-gathering

This section is mostly about initializing, retrieving the data and cleaning it up; there isn't any analysis here.

In [2]:
#
# Import dependencies and initialize
#

import re

from collections import namedtuple, Counter
from datetime import datetime
from math import log
from urllib2 import urlopen
from HTMLParser import HTMLParser

import pandas as pd

import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
# Set the default matplotlib figure size
pylab.rcParams['figure.figsize'] = (20, 12)

#import vincent
## Initialize vincent's IPython integration
#vincent.core.initialize_notebook()

# Whether the transcripts are available locally
TRANSCRIPTS_EXIST_LOCALLY = True
# Should the transcripts be saved to disk once they're loaded
TRANSCRIPTS_SAVE_TO_DISK = not TRANSCRIPTS_EXIST_LOCALLY
# The remote URL as a format string that takes the episode number
TRANSCRIPTS_REMOTE_URL = 'https://www.grc.com/sn/sn-%03d.txt'
# The local path as a format string that takes the episode number
TRANSCRIPTS_LOCAL_PATH = 'sn_transcripts/%03d.txt'
In [3]:
#
# Get raw HTML from episode archive pages
#

# Order of the URLs doesn't matter of course, but doing it this way will
# retrieve the information in order of descending episode number
sn_episode_archive_urls = ['https://www.grc.com/securitynow.htm']
sn_episode_archive_urls.extend('https://www.grc.com/sn/past/%s.htm' % year for year in range(2011,2004,-1))

sn_episode_archive_html = '\n'.join(urlopen(url).read() for url in sn_episode_archive_urls)
In [4]:
#
# Get episode information from the raw HTML of the archive pages
#

raw_episodes_info_regex = re.compile(
    r'Episode.*?(?P<ep_num>\d*?) \| (?P<ep_date>.*?) \| (?P<ep_len>\d*).*?'
    '<b>(?P<ep_title>.*?)</b>.*?', flags=re.MULTILINE|re.DOTALL)
raw_episodes_info = raw_episodes_info_regex.findall(sn_episode_archive_html)

Here's a sample of the collected data (the most recent five episodes), fresh from the regular expression used to scrape the archive pages:

In [5]:
raw_episodes_info[:5]
Out[5]:
[('453', '29 Apr 2014', '111', 'Certificate Revocation Part 1'),
 ('452', '22 Apr 2014', '103', 'Listener Feedback #186'),
 ('451', '15 Apr 2014', '101', 'TrueCrypt &amp; Heartbleed Part 2'),
 ('450', '08 Apr 2014', '96', 'How the Heartbleeds'),
 ('449', '01 Apr 2014', '128', 'Listener Feedback #185')]
In [6]:
#
# Create DataFrame and clean the data
#

episodes_info = pd.DataFrame(raw_episodes_info)

episodes_info.columns = ['Episode #', 'Date', 'Length (mins)', 'Title']

func_list = (int, pd.to_datetime, int, HTMLParser().unescape)

for name,func in zip(episodes_info.columns, func_list):
    episodes_info[name] = episodes_info[name].apply(func)

episodes_info.index = episodes_info['Episode #']
del episodes_info['Episode #']
episodes_info.sort_index(inplace=True)

The most recent five episodes, cleaned up and loaded into a pandas.DataFrame:

In [7]:
episodes_info.tail(5)
Out[7]:
Date Length (mins) Title
Episode #
449 2014-04-01 128 Listener Feedback #185
450 2014-04-08 96 How the Heartbleeds
451 2014-04-15 101 TrueCrypt & Heartbleed Part 2
452 2014-04-22 103 Listener Feedback #186
453 2014-04-29 111 Certificate Revocation Part 1

5 rows × 3 columns

In [8]:
# Create "Is Feedback" column
episodes_info['Is Feedback'] = episodes_info['Title'].str.contains(
    r'Listener|Feedback|Q&A')

Added a new column, Is Feedback, indicating whether the episode is a feedback episode:

In [9]:
episodes_info.tail(5)
Out[9]:
Date Length (mins) Title Is Feedback
Episode #
449 2014-04-01 128 Listener Feedback #185 True
450 2014-04-08 96 How the Heartbleeds False
451 2014-04-15 101 TrueCrypt & Heartbleed Part 2 False
452 2014-04-22 103 Listener Feedback #186 True
453 2014-04-29 111 Certificate Revocation Part 1 False

5 rows × 4 columns

In [10]:
#
# Get/load transcripts, and possibly save them to disk
#

def get_transcript(ep_num):
    if TRANSCRIPTS_EXIST_LOCALLY:
        with open(TRANSCRIPTS_LOCAL_PATH % ep_num, 'r') as f:
            data = f.read()
    else:
        data = urlopen(TRANSCRIPTS_REMOTE_URL % ep_num).read()
    
    return data

episodes_info['Raw Transcript'] = [get_transcript(ep_num)
                                   for ep_num in episodes_info.index]

# Possibly save the transcripts to disk
if TRANSCRIPTS_SAVE_TO_DISK and TRANSCRIPTS_LOCAL_PATH:    
    for idx in episodes_info.index:
        with open(TRANSCRIPTS_LOCAL_PATH % idx, 'w') as f:
            f.write(episodes_info['Raw Transcript'][idx])

The raw transcript information has been added for each episode under the column Raw Transcript:

In [11]:
episodes_info.head(5)
Out[11]:
Date Length (mins) Title Is Feedback Raw Transcript
Episode #
1 2005-08-19 18 As the Worm Turns False GIBSON RESEARCH CORPORATION\thttp://www.GRC.co...
2 2005-08-25 25 " HoneyMonkeys " False GIBSON RESEARCH CORPORATION\thttp://www.GRC.co...
3 2005-09-01 25 NAT Routers as Firewalls False GIBSON RESEARCH CORPORATION\thttp://www.GRC.co...
4 2005-09-08 24 Personal Password Policy False GIBSON RESEARCH CORPORATION\thttp://www.GRC.co...
5 2005-09-15 20 Personal Password Policy — Part 2 False GIBSON RESEARCH CORPORATION\thttp://www.GRC.co...

5 rows × 5 columns

Transcript Parsing

The following block of code implements just enough transcript parsing and querying for the analyses in this document.

In [12]:
# Raw lines are parsed into ParsedLines, classified into AnalyzedLines,
# and speech is ultimately returned as (speaker, content) pairs
ParsedLine = namedtuple('ParsedLine', ['kind', 'tag', 'content'])
AnalyzedLine = namedtuple('AnalyzedLine', ['kind', 'metadata', 'content'])
Speech = namedtuple('Speech', ['speaker', 'content'])

# Upper-case colon tags at the start of a line ("STEVE:", "SHOW TEASE:", etc.)
colontag_pattern = re.compile(r'^([A-Z][A-Z ]*):')
# "[Laughter]" style stage-direction tags at the start of a line
brackettag_pattern = re.compile(r'^\[(.*?)\]')
# Lowercased words: letters with optional internal hyphens, at least two characters
words_pattern = re.compile(r'[a-z][-a-z]*[a-z]')

class Transcript(object):
    def __init__(self, raw_transcript):
        self._cleaned_raw_transcript = self._clean_transcript(raw_transcript)
        self._lines = [line.strip()
                       for line in self._cleaned_raw_transcript.splitlines()
                       if line.strip()]
        self._parsed_lines = [self._parse_line(line) for line in self._lines]
        self._analyzed_lines = self._analyze_lines(self._parsed_lines)
        self._normalize_speakers()
    
    def get_speech_by(self, speakers=[], speech_type='', not_in=False):
        speech_type = "speech.%s" % speech_type
        for a_line in self._analyzed_lines:
            if a_line.kind.startswith(speech_type):
                speaker_test = a_line.metadata['speaker'] in speakers
                if not_in:
                    speaker_test = not speaker_test
                if speaker_test:
                    yield Speech(a_line.metadata['speaker'], a_line.content)
    
    @property
    def all_speech(self):
        return self.get_speech_by(not_in=True)
    
    def get_word_freq(self, speakers=[], speech_type='', not_in=False):
        speeches = self.get_speech_by(speakers, speech_type, not_in)
        
        c = Counter()
        for speech in speeches:
            c.update(words_pattern.findall(speech.content.lower()))
        
        return c
    
    def _normalize_speakers(self):
        pass
    
    @staticmethod
    def _clean_transcript(raw_transcript):
        replacement_list = [
            ('CLIP:', '[CLIP:ONELINE]'),
            ('[Clip]', '[CLIP:START]'), ('[CLIP]', '[CLIP:START]'),
            ('[Begin clip]', '[CLIP:START]'), ('[Video clip]', '[CLIP:START]'), ('[Pause clip]', '[CLIP:PAUSE]'),
            ('[Resume clip]', '[CLIP:RESUME]'), ('[End clip]', '[CLIP:END]'), ('[End video clip]', '[CLIP:END]'),
            ('[Begin KABC7 interview]', '[CLIP:START]'), ('[End interview]', '[CLIP:END]'),
            ('[Begin embedded clip]', '[SUBCLIP:START]'), ('[Begin 1990 recording]', '[CLIP:START]'),
            ('[Applause and SpinRite giveaway]', '[SOUND:APPLAUSE]\n[CUSTOM:SpinRite giveaway]\n[CLIP:END]'),
            ('[Talking simultaneously]', '[SOUND:CROSSTALK]'), ('[Speaking simultaneously]', '[SOUND:CROSSTALK]'),
            ('[Crosstalk]', '[SOUND:CROSSTALK]'), ('[Laughter]', '[SOUND:LAUGHTER]'), ('[Laughing]', '[SOUND:LAUGHTER]'),
            ('[laughing]', '[SOUND:LAUGHTER]'), ('[laughter]', '[SOUND:LAUGHTER]'), ('[Music]', '[SOUND:MUSIC]'),
            ('[Commercial break]', '[COMMERCIAL]'), ('[Interruption]', '[INTERRUPTION]'),
            ('[Loud yabba-dabba do]', '[YABBA_DABBA_DO]'), ('[Barely audible "yabba-dabba do"]', '[YABBA_DABBA_DO]'),
            ('[indiscernible]', '[SOUND:INDISCERNIBLE]'), ('[Indiscernible]', '[SOUND:INDISCERNIBLE]'), ('[sic]', '[SIC]'),
            ('[Sighing]', '[SOUND:SIGHING]'), ('[sighing]', '[SOUND:SIGHING]'), ('[Australian accent]', '[ACCENT:AUSTRALIAN]'),
            ('[Indian accent]', '[ACCENT:INDIAN]'), ('[With accent]', '[ACCENT]'), ('[Italian accent]', '[ACCENT:ITALIAN]'),
            ('[Bad accent]', '[ACCENT]'), ('[in bad Italian accent]', '[ACCENT:ITALIAN]'),
            ('[In a British accent]', '[ACCENT:BRITISH]'), ('[Accent]', '[ACCENT]'), ('[Dracula accent]', '[ACCENT:DRACULA]')]
        
        replacement_list.extend([('Title:\t\t', 'TITLE:\t\t')])
        
        for start,end in replacement_list:
            raw_transcript = raw_transcript.replace(start, end)
        
        return raw_transcript
    
    @staticmethod
    def _parse_line(line):
        colontag_match = colontag_pattern.match(line)
        if colontag_match:
            pl_kind = 'colon'
            pl_tag = colontag_match.group(1)
            pl_content = line[len(colontag_match.group(0)):].strip()
            parsed_line = ParsedLine(pl_kind, pl_tag, pl_content)
        else:
            brackettag_match = brackettag_pattern.match(line)
            if brackettag_match:
                pl_kind = 'bracket'
                pl_tag = brackettag_match.group(1)
                pl_content = line[len(brackettag_match.group(0)):].strip()
                parsed_line = ParsedLine(pl_kind, pl_tag, pl_content)
            elif line.startswith('GIBSON RESEARCH CORPORATION'):
                parsed_line = ParsedLine(kind='head', tag=None, content=line)
            else:
                parsed_line = ParsedLine(kind=None, tag=None, content=line)
        
        return parsed_line
    
    @staticmethod
    def _analyze_lines(parsed_lines):
        header_colontags = ['DATE', 'DESCRIPTION', 'EPISODE', 'FILE ARCHIVE',
                            'GUEST', 'INTRO', 'SERIES', 'SHOW TEASE',
                            'SOURCE FILE', 'SPEAKERS', 'TITLE']
        misc_colontags = ['BOTH', 'CLIP', 'DNS']
        
        analyzed_lines = []
        for idx,p_line in enumerate(parsed_lines):
            if p_line.kind == 'colon':
                if p_line.tag in header_colontags:
                    al_kind = p_line.tag
                    al_metadata = None
                    al_content = p_line.content
                    analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
                else:
                    al_kind = 'speech.initial'
                    al_metadata = {'speaker': p_line.tag.lower()}
                    al_content = p_line.content
                    analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            elif p_line.kind == 'bracket':
                al_kind = 'bracket'
                al_metadata = None
                al_content = p_line.content
                analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            elif p_line.kind is None:
                # An untagged line continues the most recent speaker's speech
                last_speech_line = (analyzed_lines[i] for i in xrange(len(analyzed_lines)-1, -1, -1)
                                    if analyzed_lines[i].kind.startswith('speech.')).next()
                al_kind = 'speech.cont'
                al_metadata = {'speaker': last_speech_line.metadata['speaker']}
                al_content = p_line.content
                analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            elif p_line.kind == 'head':
                al_kind = 'head'
                al_metadata = None
                al_content = p_line.content
                analyzed_line = AnalyzedLine(al_kind, al_metadata, al_content)
            else:
                raise Exception("This should never happen")
            analyzed_lines.append(analyzed_line)
        return analyzed_lines

STEVE = ['steve', 'steve gibson']
LEO = ['leo', 'leo laporte']
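
As a quick illustration of what the parser produces, here's a tiny made-up transcript run through the class above (this snippet is purely illustrative and isn't part of the analysis):

toy_raw = "STEVE:  Hello, Leo.\nLEO:  Hello, Steve.\nAnd welcome back."
toy = Transcript(toy_raw)

list(toy.all_speech)
# [Speech(speaker='steve', content='Hello, Leo.'),
#  Speech(speaker='leo', content='Hello, Steve.'),
#  Speech(speaker='leo', content='And welcome back.')]

toy.get_word_freq(speakers=LEO)
# Counter({'and': 1, 'back': 1, 'hello': 1, 'steve': 1, 'welcome': 1})  (order may vary)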
In [13]:
transcripts = dict((ep_num,Transcript(episodes_info["Raw Transcript"][ep_num]))
                   for ep_num in episodes_info.index)
In [14]:
word_counts = dict((ep,transcripts[ep].get_word_freq(not_in=True)) for ep in episodes_info.index)
steve_word_counts = dict((ep,transcripts[ep].get_word_freq(speakers=STEVE)) for ep in episodes_info.index)
leo_word_counts = dict((ep,transcripts[ep].get_word_freq(speakers=LEO)) for ep in episodes_info.index)

Finally, some charts!

Basic Chart

A simple chart without any smoothing or other modifications. It's very volatile, but you can see the upward trend.

In [15]:
devnull = episodes_info['Length (mins)'].plot(legend=False, xticks=range(0,500,25), yticks=range(0,140,10))

4-episode and 12-episode Moving Average

This moving average calculation uses a window centered on the x coordinate; for the 12-episode line, the window includes the 6 preceding values, the value itself, and the 5 following values (the 4-episode window works the same way, with 2 preceding and 1 following). Because values are required on both sides, the lines don't start right at the y-axis (x = 0).
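
For example, on a toy series of the numbers 1 through 13 (not show data), a 12-value centered window only produces two non-NaN points:

toy = pd.Series(range(1, 14))
pd.rolling_mean(toy, 12, center=True)
# The first 6 and last 5 positions are NaN; index 6 averages the values 1-12 (6.5)
# and index 7 averages the values 2-13 (7.5).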

In [16]:
episodes_info['4-episode Moving Average'] = pd.rolling_mean(episodes_info['Length (mins)'], 4, center=True)
episodes_info['12-episode Moving Average'] = pd.rolling_mean(episodes_info['Length (mins)'], 12, center=True)

to_plot = ['4-episode Moving Average', '12-episode Moving Average']
devnull = episodes_info[to_plot].plot(style=['g', 'r'], legend=True, xticks=range(0,500,25), yticks=range(0,140,10))

Expanding Maximum of Episode Length

In [17]:
length_expanding_max = pd.expanding_max(episodes_info['Length (mins)'])
dups = length_expanding_max.duplicated()
max_pusher_table = pd.DataFrame(episodes_info['Length (mins)'][dups == False]).T

The expanding maximum is the maximum of all the episode lengths up to the current episode.

As you can see, there have only been 18 episodes that increased the maximum episode length.
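
To show how the "max pusher" table below is built, here's the same expanding_max + duplicated() combination from above applied to a handful of made-up lengths:

toy_lengths = pd.Series([18, 25, 22, 36, 30, 36, 43], index=range(1, 8))
toy_max = pd.expanding_max(toy_lengths)   # 18, 25, 25, 36, 36, 36, 43
toy_lengths[toy_max.duplicated() == False]
# Selects episodes 1, 2, 4 and 7: the only ones that set a new record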

In [18]:
devnull = pd.expanding_max(episodes_info['Length (mins)']).plot(legend=False, xticks=range(0,500,25), yticks=range(0,140,10))
max_pusher_table
Out[18]:
Episode # 1 2 7 11 15 19 20 32 36 40 68 126 151 165 177 196 208 449
Length (mins) 18 25 36 38 43 53 54 55 56 71 97 101 107 108 118 121 123 128

1 rows × 18 columns

12-episode Moving Average of Average Words Per Minute (WPM) Per Episode

This is just an educated guess, but I believe the reason for the drop in WPM between episodes 75 and 100 is that the transcripts stopped including Leo's advertising live reads. That dropped the number of transcribed words per episode, but not the episode length, and so the WPM decreased. See the next chart, showing the frequency of ad-related words, for some evidence of this.
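
Spelled out, the series being smoothed below is just the total number of transcribed words divided by the episode length in minutes. Here's an equivalent, more verbose version of the one-liner that follows (the variable names are only for this illustration):

wpm_per_episode = pd.Series(
    [float(sum(word_counts[ep_num].values())) / episodes_info['Length (mins)'][ep_num]
     for ep_num in episodes_info.index],
    index=episodes_info.index)
smoothed_wpm = pd.rolling_mean(wpm_per_episode, 12)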

In [19]:
devnull = pd.rolling_mean(pd.Series((
    float(sum(word_counts[ep_num].values())) / episodes_info['Length (mins)'][ep_num]
    for ep_num in episodes_info.index), index=episodes_info.index), 12).plot(legend=True,
                                                                             xticks=range(0,500,25),
                                                                             yticks=range(125,160,5))

Word Frequency Over Time (Cumulative Sum)

In [20]:
def word_frequency_chart(words, xticks=None, yticks=None):
    # Cumulative count of how many times each word has been spoken, by episode
    return pd.DataFrame(dict((word, [word_counts[ep_num].get(word.lower(), 0) for ep_num in episodes_info.index])
                             for word in words)).cumsum().plot(legend=True, xticks=xticks, yticks=yticks)

def word_occurence_chart(words, xticks=None, yticks=None):
    # Cumulative count of episodes in which each word has been spoken at least once
    return pd.DataFrame(dict((word, [int(word.lower() in word_counts[ep_num]) for ep_num in episodes_info.index])
                             for word in words)).cumsum().plot(legend=True, xticks=xticks, yticks=yticks)

Sponsors

The Astaro and Dell data may be skewed by the fact that their sudden change in frequency happened around the time the transcripts stopped including Leo's advertising live reads. You can see a subtler change in the frequency of 'Sponsor' around the same time.

In [21]:
words = ['AOL', 'Astaro', 'Audible', 'Carbonite', 'Dell', 'ProXPN', 'Sponsor']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,700,25))

Security Guys

In [22]:
words = ['Krebs', 'Schneier', 'Kaminsky', 'Bernstein']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,140,10))

Attack Surfaces

In [23]:
words = ['Flash', 'Java', 'Reader']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,1400,50))

Browsers

Currently the code can't easily combine 'IE' and 'Explorer' into a single line, but if you add up where the two lines end, you can see that Firefox just barely edges out Internet Explorer. A rough sketch of one way to combine them follows.
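
Here's that sketch: build the same per-episode counts that word_frequency_chart builds internally, sum the 'IE' and 'Explorer' columns into one, and then take the cumulative sum (browser_words and browser_counts are names used only in this sketch):

browser_words = ['Firefox', 'Chrome', 'Opera', 'Explorer', 'IE']
browser_counts = pd.DataFrame(dict(
    (word, [word_counts[ep_num].get(word.lower(), 0) for ep_num in episodes_info.index])
    for word in browser_words))
# Fold the two spellings into a single column, then drop the originals
browser_counts['IE/Explorer'] = browser_counts['IE'] + browser_counts['Explorer']
browser_counts = browser_counts.drop(['IE', 'Explorer'], axis=1)
devnull = browser_counts.cumsum().plot(legend=True, xticks=range(0,500,25), yticks=range(0,2000,50))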

In [24]:
words = ['Firefox', 'Chrome', 'Chromium', 'Opera', 'Explorer', 'IE']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,2000,50))
In [25]:
words = ['NSA', 'Snowden', 'ProXPN', 'VPN']
devnull = word_frequency_chart(words=words, xticks=range(0,500,25), yticks=range(0,1000,50))

Miscellaneous

In [26]:
words = ['Sci-Fi', 'Breach', 'Jenny', 'Bitcoin', 'Zerafa']
devnull = word_frequency_chart(words=words, xticks=range(0,475,25), yticks=range(0,800,50))

Average Episode Length Throughout the Calendar Year (by week with a 4-week rolling mean)

There's only about an 11-minute spread.
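
The grouping key in the cell below maps each episode number to an approximate week of the year: assuming the weekly cadence held from episode 1 (2005-08-19), episode 21 landed right at the start of January 2006, so (episode number - 21) % 52 puts group 0 near the beginning of the calendar year. For example:

# How the grouping key maps episode numbers onto approximate week-of-year groups
[(ep_num, (ep_num - 21) % 52) for ep_num in (1, 21, 22, 72, 453)]
# [(1, 32), (21, 0), (22, 1), (72, 51), (453, 16)]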

In [27]:
x = pd.rolling_mean(episodes_info['Length (mins)'].groupby(lambda x: -1 if x < 0 else (x-21)%52).mean(),4)
devnull = x.plot(xticks=range(0,53,4))

Expanding Mean of Episode Length

In [28]:
devnull = pd.expanding_mean(episodes_info['Length (mins)']).plot()

More to come...