Exporting your Last.fm history into an SQLite DB for analysis

Ever heard an album or a song for the first time and loved it so much you played it on repeat for three days straight? That's happened to me a few times, and I wanted to dig into my listening history to find out which songs I loved, got sick of, and forgot about for years.

Good thing I use Last.fm obsessively. Last.fm is a service that "scrobbles" (keeps track of) every song you listen to past the halfway mark. I've been using Last.fm since 2006 and have a seriously extensive scrobble log. I wanted to move that log into an SQL database so I could analyze my listening patterns or tie it into other apps.

How to scrape your Last.fm history

Last.fm has a JSON API for accessing your music history. As with many APIs, it requires an API key; you can sign up for an API account here. The API is also paginated, so we'll fetch our listening history in pages of 200 tracks, the maximum the API allows per request.

We'll be using the user.getRecentTracks API method to access our listening history. We're going to start at the most recent page and work backwards until we reach the end.

Here's the code I used to retrieve all pages of my scrobble log and store the raw paginated data in a simple Python list:

In [1]:
import requests

# Enter your own username and API key here.
user = 'trickybeta'
api_key = 'YOUR_LAST_FM_API_KEY_GOES_HERE'

# These are the API parameters for our scraping requests.
per_page = 200
api_url = 'http://ws.audioscrobbler.com/2.0/?method=user.getrecenttracks&user=%s&api_key=%s&format=json&page=%s&limit=%s'

def recent_tracks(user, api_key, page, limit):
    """Get the most recent tracks from `user` using `api_key`. Start at page `page` and limit results to `limit`."""
    return requests.get(api_url % (user, api_key, page, limit)).json()

# We need to get the first page so we can find out how many total pages there are in our listening history.
resp = recent_tracks(user, api_key, 0, per_page)
total_pages = int(resp['recenttracks']['@attr']['totalPages'])

# Retrieve all pages of user listening history and store them into all_pages.
all_pages = []
for page_num in xrange(1, total_pages + 1):
    print 'Page', page_num, 'of', total_pages
    page = recent_tracks(user, api_key, page_num, per_page)
    all_pages.append(page)
Page 1 of 211
Page 2 of 211
Page 3 of 211
Page ...
Page 209 of 211
Page 210 of 211
Page 211 of 211

Cool. We have all the paginated data, but it's not exactly clean or orderly.

Cleaning up the raw API data

Let's take a look at what one of the tracks looks like after decoding its JSON:

In [2]:
raw_track = all_pages[0]['recenttracks']['track'][0]
raw_track
Out[2]:
{u'album': {u'#text': u'', u'mbid': u''},
 u'artist': {u'#text': u'John Legend',
  u'mbid': u'75a72702-a5ef-4513-bca5-c5b944903546'},
 u'date': {u'#text': u'9 Apr 2014, 08:24', u'uts': u'1397031847'},
 u'image': [{u'#text': u'', u'size': u'small'},
  {u'#text': u'', u'size': u'medium'},
  {u'#text': u'', u'size': u'large'},
  {u'#text': u'', u'size': u'extralarge'}],
 u'mbid': u'',
 u'name': u'All Of Me (Steve James Remix)',
 u'streamable': u'0',
 u'url': u'http://www.last.fm/music/John+Legend/_/All+Of+Me+(Steve+James+Remix)'}

There's lots of nesting, and the # characters in the keys would make for nasty SQL column names. Here are some helper functions to clean up our data into something that plays nicely with a row-and-column database:

In [3]:
import collections

def flatten(d, parent_key=''):
    """From http://stackoverflow.com/a/6027615/254187. Modified to strip # symbols from dict keys."""
    items = []
    for k, v in d.items():
        new_key = parent_key + '_' + k if parent_key else k
        if isinstance(v, collections.MutableMapping):
            items.extend(flatten(v, new_key).items())
        else:
            new_key = new_key.replace('#', '')  # Strip pound symbols from column names
            items.append((new_key, v))
    return dict(items)

def process_track(track):
    """Removes `image` keys from track data. Replaces empty strings for values with None."""
    if 'image' in track:
        del track['image']
    flattened = flatten(track)
    for key, val in flattened.iteritems():
        if val == '':
            flattened[key] = None
    return flattened

Now that we can clean up our data, let's see how process_track operates on one of our tracks:

In [4]:
process_track(raw_track)
Out[4]:
{u'album_mbid': None,
 u'album_text': None,
 u'artist_mbid': u'75a72702-a5ef-4513-bca5-c5b944903546',
 u'artist_text': u'John Legend',
 u'date_text': u'9 Apr 2014, 08:24',
 u'date_uts': u'1397031847',
 u'mbid': None,
 u'name': u'All Of Me (Steve James Remix)',
 u'streamable': u'0',
 u'url': u'http://www.last.fm/music/John+Legend/_/All+Of+Me+(Steve+James+Remix)'}

Looks much nicer! Now 'album_text' will be SQL NULL instead of the empty string.
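
Since date_uts is just a Unix timestamp (seconds since the epoch, in UTC), turning it into a real datetime for later date math is a one-liner. Here's a quick sketch using the processed track from above:

from datetime import datetime

processed = process_track(raw_track)
# u'1397031847' -> 2014-04-09 08:24:07 UTC, matching `date_text` above
print datetime.utcfromtimestamp(int(processed['date_uts']))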

Exporting to a database

Our data is relatively structured. The Last.fm API returns every track in the same consistent shape, so we don't need to do any heavy-duty database or schema design up front.

Accessing a database using Dataset

Instead of declaring our DB schema explicitly, let's save time and use the Python Dataset package.

Why do we like Dataset? Because it gives us:

A simple abstraction layer removes most direct SQL statements without the necessity for a full ORM model - essentially, databases can be used like a JSON file or NoSQL store.

Basically, we can store stuff straight into our database without worrying about schema declarations. Great!
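
If you haven't used Dataset before, here's a tiny sketch of what "no schema declarations" means in practice (the demo table and its keys are made up for illustration):

import dataset

demo_db = dataset.connect('sqlite:///:memory:')  # throwaway in-memory database
demo = demo_db['demo']                           # get (or create) a table called `demo`
demo.insert(dict(artist='RAC', plays=7))         # columns are created to fit the dict
print demo.columns                               # ['id', 'artist', 'plays']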

To get started with the database, we need to first:

  • Connect to the SQLite DB, creating the DB file if it doesn't already exist
  • Create a table for our data

Iterating through our data

Once we have our DB and our shiny new table, all we need to do is:

  • Iterate through our raw page data
  • Iterate through the raw tracks on each page
  • Process each track
  • Insert each processed track into the table we just created

In [5]:
import dataset

# Create a SQLite database with the filename `lastfm.sqlite`
db = dataset.connect('sqlite:///lastfm.sqlite')

# Set up a table called `tracks`
tracks = db['tracks']

# Iterate through all pages
num_pages = len(all_pages)
for page_num, page in enumerate(all_pages):
    print 'Page', page_num + 1, 'of', num_pages
    # On each page, iterate through all tracks
    for track in page['recenttracks']['track']:
        # Process each track and insert it into the `tracks` table
        tracks.insert(process_track(track))

# Confirm our tracks were inserted into the database
print 'Done!', len(tracks), 'rows in table `tracks`.'
Page 1 of 211
Page 2 of 211
Page 3 of 211
Page ...
Page 209 of 211
Page 210 of 211
Page 211 of 211
Done! 42039 rows in table `tracks`.

Fantastic. Now let's verify our database's structure and data. We'll start by checking the columns in table tracks:

In [6]:
tracks.columns
Out[6]:
['id',
 'streamable',
 'date_uts',
 'name',
 'url',
 'album_mbid',
 'mbid',
 'album_text',
 'artist_mbid',
 'artist_text',
 'date_text']

And now let's grab the first track from the database to see how it looks:

In [7]:
dict(tracks.all().next())
Out[7]:
{u'album_mbid': None,
 u'album_text': None,
 u'artist_mbid': u'75a72702-a5ef-4513-bca5-c5b944903546',
 u'artist_text': u'John Legend',
 u'date_text': u'9 Apr 2014, 08:24',
 u'date_uts': u'1397031847',
 u'id': 1,
 u'mbid': None,
 u'name': u'All Of Me (Steve James Remix)',
 u'streamable': u'0',
 u'url': u'http://www.last.fm/music/John+Legend/_/All+Of+Me+(Steve+James+Remix)'}

Perfect. All our data is in our database and ready to query!

Querying your scrobble log

At this point, you can:

  • Use Dataset to query your DB
  • Open your lastfm.sqlite DB in an external program and do the querying there

Let's try running a query in Python using Dataset. Say I want to find all the times I played the song "Cheap Sunglasses" by RAC. (Handily, SQLite's LIKE is case-insensitive for ASCII characters by default, so '%cheap sunglasses%' matches "Cheap Sunglasses".)

In [8]:
query = '''
SELECT name, artist_text, date_text
FROM tracks t
WHERE t.name LIKE '%cheap sunglasses%'
ORDER BY date_uts ASC;
'''

results = db.query(query)
for result in results:
    print dict(result)
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 20:21'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 20:25'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 20:37'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses (feat. Matthew Koma)', u'date_text': u'8 Apr 2014, 20:56'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 21:11'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'8 Apr 2014, 21:18'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'9 Apr 2014, 02:40'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses', u'date_text': u'9 Apr 2014, 05:45'}
{u'artist_text': u'RAC', u'name': u'Cheap Sunglasses (feat. Matthew Koma)', u'date_text': u'9 Apr 2014, 08:04'}

Fantastic. I really love that song.

Back to the Start

At the start of this experiment, I wanted to find the songs I played obsessively within the first week of hearing them. We can use SQL to look for exactly that!

Let's define our heuristic:

I want to see a list of all the tracks I played more than 3 times within a week of their first play.

This may not sound like a strict filter, but it turns out to weed out most tracks and leave me with a list of songs I found genuinely memorable.

Structuring our SQL Query

Here's the query I'm going to run to implement the above heuristic:

SELECT COUNT(fp.url) AS plays, fp.date_text AS first_played, fp.artist_text AS artist, fp.name AS track, fp.album_text AS album

Title the columns nicely, and create a plays column that counts each track's plays within its first week.

Thanks to how Last.fm organizes their track data, we can tell one track from another by comparing URLs.

The url column is a link to the Last.fm page for that track. If two rows have the same value in the URL column, we know that the two rows represent two different plays of the same track.
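
If you want to sanity-check that assumption against your own data, a quick count of total plays versus distinct URLs (reusing the db handle from earlier) does the trick:

# Every row is one play; distinct URLs are unique tracks.
for row in db.query('SELECT COUNT(*) AS plays, COUNT(DISTINCT url) AS unique_tracks FROM tracks'):
    print dict(row)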

FROM (SELECT * FROM tracks GROUP BY url ORDER BY date_uts) fp

Subquery: collapse the scrobble log down to one row per unique track, in date_uts order.

This leaves us with a table where:

  • each track is represented exactly once
  • each track carries a date_uts value that we'll treat as when the track was first played

One caveat: SQLite lets you SELECT bare columns alongside a GROUP BY and keeps a single, unspecified row from each group. With our newest-first insertion order, the row it picks happens to be each track's earliest play, but see the variant below if you'd rather not rely on that.
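
To make "first play" explicit instead, you can lean on a documented SQLite guarantee: when a query uses the min() aggregate, bare columns take their values from the row holding the minimum. A variant of the subquery using that guarantee (first_uts is just a made-up alias) could look like:

FROM (SELECT *, MIN(date_uts) AS first_uts FROM tracks GROUP BY url) fp

The rest of the query would work unchanged, since fp.date_uts then definitely comes from each track's earliest row.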

INNER JOIN tracks ap ON fp.url = ap.url AND ap.date_uts - fp.date_uts <= (86400 * 7)

Join our subquery and our full table, keeping only rows where the play date is within 7 days of the first-played date.

GROUP BY ap.url

Group duplicate tracks into one row...

HAVING COUNT(ap.url) > 3

...but only keep rows with more than 3 plays (that is, 4 or more) within 7 days of the first-played date.

ORDER BY ap.date_uts DESC

Put most recently discovered tracks at the top of the list.

LIMIT 10

For this experiment, let's just check out my 10 most recently obsessed-over tracks instead of grabbing all of them.

In [9]:
query = '''
SELECT COUNT(fp.url) AS plays, fp.date_text AS first_played, fp.artist_text AS artist, fp.name AS track, fp.album_text AS album FROM
    (SELECT * FROM tracks GROUP BY url ORDER BY date_uts) fp
INNER JOIN tracks ap ON
    fp.url = ap.url AND
    ap.date_uts - fp.date_uts <= (86400 * 7)
GROUP BY ap.url
HAVING COUNT(ap.url) > 3
ORDER BY ap.date_uts DESC
LIMIT 10
'''

results = db.query(query)
for result in results:
    r = dict(result)
    print r['plays'], 'plays', '\t', r['first_played'], '\t', r['artist'], '-', r['track']
7 plays 	8 Apr 2014, 20:21 	RAC - Cheap Sunglasses
5 plays 	4 Apr 2014, 21:23 	Basement Jaxx Vs. Metropole Orkest - Do Your Thing
5 plays 	4 Apr 2014, 22:46 	Basement Jaxx Vs. Metropole Orkest - Where’s Your Head At
4 plays 	18 Mar 2014, 06:12 	The-Dream, Pusha T, Ma$e, Cocaine 80s - Higher (Album Version (Edited))
4 plays 	9 Mar 2014, 16:49 	Robert DeLong - Global Concepts (Robert DeLong Club Remix)
4 plays 	12 Mar 2014, 01:29 	Skrillex - All Is Fair In Love And Brostep
4 plays 	5 Mar 2014, 05:51 	RAC - Let Go (feat. Kele & MNDR)
4 plays 	28 Feb 2014, 19:57 	Kendrick Lamar - Sing That Shit (20syl Juicy Remix)
4 plays 	28 Feb 2014, 19:44 	Pharrell Williams - Gust of Wind
4 plays 	20 Feb 2014, 06:17 	Com Truise - Mind

$wag.

Conclusion

Now you've got an SQLite database that has all of your music history, ready for analysis.
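
If you need a starting point, here's one more query to riff on, reusing the db handle from above: your top 10 most-played artists of all time.

results = db.query('''
SELECT artist_text AS artist, COUNT(*) AS plays
FROM tracks
GROUP BY artist_text
ORDER BY plays DESC
LIMIT 10
''')
for result in results:
    print dict(result)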

Let me know if you come up with something cool! Email me at matt@mplewis.com or tweet me @mplewis.

Here are some of my humble suggestions for external SQLite management programs:

  • Base 2, the finest SQLite manager known to OS X
  • Valentina Studio, an excellent free option for all platforms