del.icio.us Link Fetching
Below is an outline for building a rough & ready script to get del.icio.us bookmark data - a script very much driven by necessity.
Rationale
Why am I doing this? Well, I've been bookmarking links using del.icio.us since March 2009 and have, very slowly, built up a steady collection of over 900 bookmarks, most of which are probably still quite useful.
The ups and downs of social bookmarking service del.icio.us are very well documented here, so I won't go into all the details, but suffice to say that reading the aforementioned article gave me all the impetus I needed to go ahead and homestead my list of bookmarks. Simple, I thought.
Then came the snag...
We're sorry, but due to heavy load on our database we are no longer able to offer an export function. Our engineers are working on this and we will restore it as soon as possible.
Great... what now? Time to write a script of course. I've initially created this article using a Jupyter Notebook which may, I hope, explain the recipe feel to the writing.
The Plan
Well, initially the plan was just to grab the bookmarks by scraping my del.icio.us pages from beginning to end and turn that data straight into some form of bookmark HTML. Thinking about it more, I decided to opt for a JSON data format; that way I could easily turn the data into any format I like at my leisure.
What Data?
From a few investigations I found that for a bookmark to be successfully imported into most browsers, one ideally needs a few key pieces of data:
- Title
- URL
- Date added (Human readable and epoch)
- Private or public
- Tags list
- Comment (or title if no description)
- Icon data (Optional, but provides a nice icon by each link)
Annoyingly, there isn't really a standard that browsers use for importing or exporting bookmarks; it all seems a little ad-hoc, so I've generalised somewhat.
I would like to finish with some JSON that describes the bookmark like so:
{
"comment":"Docker Compose - Docker Documentation",
"add_epoch":"1431953037",
"title":"Docker Compose - Docker Documentation",
"url":"http://docs.docker.com/compose/",
"tags":[
"docker",
"development",
"python"
],
"icon_uri":"http://docs.docker.com/favicons/favicon-32x32.png",
"private":0,
"add_date":"2015-05-18 12:43:57",
"icon_data":""
}
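As a rough check on that shape, something like the following could confirm that an entry carries every key before it is written out. This is only a sketch: `REQUIRED_KEYS` and `missing_keys` are hypothetical names, the key list is simply read off the JSON above, and the code is written for Python 3 rather than the Python 2 used in the rest of this article.

```python
# Hypothetical helper: the key list mirrors the target JSON shown above.
REQUIRED_KEYS = ('title', 'url', 'add_date', 'add_epoch', 'private',
                 'tags', 'comment', 'icon_uri', 'icon_data')


def missing_keys(entry):
    """Return the required keys that are absent from an entry dict, sorted."""
    return sorted(key for key in REQUIRED_KEYS if key not in entry)


sample = {
    'title': 'Docker Compose - Docker Documentation',
    'url': 'http://docs.docker.com/compose/',
    'add_epoch': '1431953037',
}
print(missing_keys(sample))
```

An empty list back means the entry has everything the target format expects.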
Getting the data
I'll now go ahead and start getting the pieces of data I need. Firstly I am importing all the libraries I'll need for the entire exercise, as well as a bit of logging:
import datetime
import urllib
import ssl
import json
import base64
import codecs
import logging
from urlparse import urlparse
from bs4 import BeautifulSoup
logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)
So I am starting this by fetching one links page from del.icio.us, breaking it down and getting each required data item piece by piece.
To be clear, before I started the next bit, I used a browser dev tool to identify the page structure. Firstly then, fetch the URL and parse the returned data with the fantastic BeautifulSoup library to get the outer block of each bookmark link.
url = 'https://del.icio.us/chrisramsay?&page=1'
try:
    r = urllib.urlopen(url)
except IOError:
    logging.warning('Could not open %s', url)
else:
    soup = BeautifulSoup(r, 'html.parser')
    bookmarks = soup.find_all("div", class_="articleThumbBlockOuter")
    print 'We have {} bookmarks'.format(len(bookmarks))
We have 10 bookmarks
Result! I have ten, seeing as there are that many per page. From here on in I am working on a single bookmark - this whole next section would be in a loop over the list of bookmarks. I am going to use item zero of the list from this point.
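For the paging side of that loop, a minimal sketch might look like the following. It assumes that a page past the end of the list comes back with no bookmark blocks, which is not something verified against del.icio.us here. `iter_bookmarks` and `fetch_page` are hypothetical names; `fetch_page` stands in for "fetch the URL with the page number and return the blocks found". The sketch is written for Python 3, unlike the Python 2 listings in this article.

```python
def iter_bookmarks(fetch_page, max_pages=1000):
    """Yield bookmark blocks page by page until an empty page turns up.

    fetch_page is a hypothetical callable taking a page number and
    returning the list of bookmark blocks found on that page.
    """
    for page in range(1, max_pages + 1):
        blocks = fetch_page(page)
        if not blocks:
            # Assumption: an empty page means we ran off the end.
            break
        for block in blocks:
            yield block


# Stand-in for the real fetch: three fake pages, the last one empty.
fake_pages = {1: ['a', 'b'], 2: ['c'], 3: []}
print(list(iter_bookmarks(lambda n: fake_pages.get(n, []))))
```

The `max_pages` cap is there so that a site which never returns an empty page cannot keep the loop running forever.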
Firstly I get the bookmark title:
entry = {}
bookmark_zero = bookmarks[0]
title = bookmark_zero.find_all('div', class_='articleTitlePan')[0]
entry['title'] = title.a.attrs['title']
print entry
{ 'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science' }
Next, the bookmark URL:
href = bookmark_zero.find_all('div', class_='articleInfoPan')[0].find_all('p')[0]
entry['url'] = href.a.attrs['href']
print entry
{ 'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/', 'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science' }
Get the bookmark favicon (ICON_URI):
parsed = urlparse(entry['url'])
entry['icon_uri'] = u'{}://{}/favicon.ico'.format(parsed.scheme, parsed.netloc)
Get the bookmark save date:
entry['add_date'] = str(datetime.datetime.fromtimestamp(int(bookmark_zero.attrs['date'])))
entry['add_epoch'] = bookmark_zero.attrs['date']
print entry
{ 'add_epoch': u'1431293424', 'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/', 'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science', 'icon_uri': u'https://outsidescience.wordpress.com/favicon.ico', 'add_date': '2015-05-10 21:30:24' }
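One thing worth knowing about the date step: `datetime.fromtimestamp()` without a timezone converts to the local time of whichever machine runs the script, so `add_date` can differ between machines. The sketch below pins the conversion to UTC instead. `epoch_to_utc` is a hypothetical name, and the code is written for Python 3 rather than the Python 2 used above.

```python
from datetime import datetime, timezone


def epoch_to_utc(epoch):
    """Format an epoch value (int or numeric string) as a UTC timestamp."""
    moment = datetime.fromtimestamp(int(epoch), tz=timezone.utc)
    return moment.strftime('%Y-%m-%d %H:%M:%S')


print(epoch_to_utc('1431293424'))  # 2015-05-10 21:30:24
```

The result matches the `add_date` shown above, which suggests the original run happened on a machine set to UTC.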
Get the bookmark tags (if there are any):
try:
    tags = bookmark_zero.find_all('ul', class_='tagName')[0].find_all('li')
except IndexError:
    tags = []
entry['tags'] = [f.a.text for f in tags]
print entry
{ 'add_epoch': u'1431293424', 'tags': [u'arduino', u'circuits', u'photodiodes', u'photodiode'], 'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/', 'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science', 'icon_uri': u'https://outsidescience.wordpress.com/favicon.ico', 'add_date': '2015-05-10 21:30:24' }
Get any comments:
comment = None
for a in [l for l in bookmark_zero.find('div', class_='thumbTBriefTxt').children]:
    try:
        comment = a.p.contents[0]
    except AttributeError:
        continue
if comment is not None:
    entry['comment'] = comment
else:
    entry['comment'] = entry['title']
print entry
{ 'comment': u'DIY Science: Measuring Light with a Photodiode II | Outside Science', 'add_epoch': u'1431293424', 'tags': [u'arduino', u'circuits', u'photodiodes', u'photodiode'], 'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/', 'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science', 'icon_uri': u'https://outsidescience.wordpress.com/favicon.ico', 'add_date': '2015-05-10 21:30:24' }
Taking a look at the whole bookmark as a JSON string dump:
json.dumps(entry)
'{ "comment":"DIY Science: Measuring Light with a Photodiode II | Outside Science", "add_epoch":"1431293424", "tags":[ "arduino", "circuits", "photodiodes", "photodiode" ], "url":"https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/", "title":"DIY Science: Measuring Light with a Photodiode II | Outside Science", "icon_uri":"https://outsidescience.wordpress.com/favicon.ico", "add_date":"2015-05-10 21:30:24" }'
Well, that looks about right to me. Obviously a collection of bookmarks in this JSON form (dictionary) will have to be expressed as a list of dictionaries.
Time to think about getting the icon data.
Aside - Icons
Take a look at existing bookmark HTML files from any browser and it will become apparent that they store icons locally in the form of a base 64 encoded string. This seems obvious once thought about; better to locally cache a copy than have to fetch a copy each time the browser is started up. As part of the bookmark fetching process I therefore decided to see if it would be possible to visit the bookmarked page itself and extract an icon.
The code is below; an explanation follows:
def get_icon(fetch):
    """Fetch an icon, base64 encode and return it."""
    icon_path = None
    icon_data = {'icon_uri': None, 'icon_data': None}
    # Fetch the root page, parse the HTML and get the LINK elements within HEAD.
    try:
        r_icon = urllib.urlopen(fetch)
    except (IOError, ssl.CertificateError):
        # Total fail, just return empty data
        return icon_data
    else:
        # Carry on & get the icon rel link
        icon_soup = BeautifulSoup(r_icon, 'html.parser')
        head = icon_soup.find('head')
        if head is None:
            return icon_data
        link = head.find('link', rel='icon')
        if link is not None:
            logging.debug('Icon URL: %s', link.attrs['href'])
            # Check for relative or absolute paths
            if ':' in link.attrs['href']:
                icon_path = link.attrs['href']
            else:
                icon_path = u'{}{}'.format(fetch, link.attrs['href'])
    # If we have icon_path, get the icon data
    if icon_path is not None:
        icon_data['icon_uri'] = icon_path
        try:
            icon_ulib = urllib.urlopen(icon_path)
        except (IOError, ssl.CertificateError):
            return icon_data
        # Get the content type to avoid things like 404 HTML;
        # the header may be missing altogether, hence the extra check
        content_type = icon_ulib.headers.getheader('Content-Type')
        if content_type and 'image' in content_type:
            icon_data['icon_data'] = 'data:{};base64,{}'.format(
                content_type, base64.b64encode(icon_ulib.read()))
    # We are done
    return icon_data
What I show in the code block above is neither pretty nor particularly discriminating; it grabs the first link element in the head of the page whose rel attribute has the value icon, then extracts the value of its href attribute.
Next, examining the href value, it decides whether the link is relative or absolute (crudely, by looking for a colon) and modifies the link appropriately. Whichever the case, an icon URL is formed, from which the icon itself is fetched. The response headers are checked for an image content type and, if we get to the end, we have a base 64 encoded string complete with content type information.
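The two fiddly steps described here, resolving the href and building the encoded string, can also be sketched with the standard library alone. This is an alternative under stated assumptions, not the code used above: it swaps the colon check for `urljoin`, which also copes with hrefs that lack a leading slash, and it is written for Python 3 (where `b64encode` returns bytes). `resolve_icon_url` and `to_data_uri` are hypothetical names.

```python
import base64
from urllib.parse import urljoin


def resolve_icon_url(page_url, href):
    """Resolve an icon href against the page it was found on."""
    return urljoin(page_url, href)


def to_data_uri(content_type, raw_bytes):
    """Build a data: URI from a content type and the raw image bytes."""
    encoded = base64.b64encode(raw_bytes).decode('ascii')
    return 'data:{};base64,{}'.format(content_type, encoded)


# A relative href is joined to the site root.
print(resolve_icon_url('http://scrapy.org', '/images/favicon.png?1486118657'))
# An absolute href is returned unchanged.
print(resolve_icon_url('http://scrapy.org', 'https://example.com/icon.ico'))
# Three placeholder bytes stand in for real image data.
print(to_data_uri('image/png', b'abc'))
```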
Below are a couple of test run results using absolute and relative icon href:
Testing: Absolute links to icons
fetch = 'https://www.gov.uk'
sixfour_img = get_icon(fetch)
DEBUG:Icon URL: https://assets.publishing.service.gov.uk/static/favicon-8d811b8c3badbc0b0e2f6e25d3660a96cc0cca7993e6f32e98785f205fc40907.ico
print sixfour_img
{ 'icon_data': '', 'icon_uri': u'https://assets.publishing.service.gov.uk/static/favicon-8d811b8c3badbc0b0e2f6e25d3660a96cc0cca7993e6f32e98785f205fc40907.ico' }
Testing: Relative links to icons
For example: /images/favicon.png?1486118657
fetch = 'http://scrapy.org'
sixfour_img = get_icon(fetch)
DEBUG:Icon URL: /images/favicon.png?1486118657
print sixfour_img
{ 'icon_data': '', 'icon_uri': u'http://scrapy.org/favicons/favicon-192x192.png' }
Now that I have a fairly reliable method for fetching icon data, in terms of getting all the data I need for the definitive bookmarks JSON, I am about ready to put everything together into a dirty fire-and-forget script!
Joining things up
So, here is a function that does both the bookmark parsing and the icon fetching. Simple and not too resilient, but it should just about get the job done. It makes use of the get_icon() function I defined earlier.
def get_entry(bookmark):
    """Parses an individual del.icio.us entry"""
    entry = {}
    # Locate and set Title
    title = bookmark.find_all('div', class_='articleTitlePan')[0]
    entry['title'] = title.a.attrs['title']
    # Locate and set bookmark URL
    href = bookmark.find_all('div', class_='articleInfoPan')[0].find_all('p')[0]
    entry['url'] = href.a.attrs['href']
    logging.debug('Getting URL: %s', entry['url'])
    # Parse the URL - we need this to get the Icon
    parsed = urlparse(entry['url'])
    # Get the icon URL and base64 encoded icon image
    icon = get_icon(u'{}://{}'.format(parsed.scheme, parsed.netloc))
    entry['icon_uri'] = icon['icon_uri']
    entry['icon_data'] = icon['icon_data']
    # Get the date the bookmark was added, both in readable and epoch format
    entry['add_date'] = str(datetime.datetime.fromtimestamp(int(bookmark.attrs['date'])))
    entry['add_epoch'] = bookmark.attrs['date']
    # Just set the bookmark to public
    entry['private'] = 0
    # Get all the tags (if there are any)
    try:
        tags = bookmark.find_all('ul', class_='tagName')[0].find_all('li')
        entry['tags'] = [tag.a.text for tag in tags]
    except IndexError:
        entry['tags'] = ['none']
    # Get the comment (if there is one, else use the title)
    comment = None
    for brief_text in [l for l in bookmark.find('div', class_='thumbTBriefTxt').children]:
        try:
            comment = brief_text.p.contents[0]
        except AttributeError:
            continue
    if comment is not None:
        entry['comment'] = comment
    else:
        entry['comment'] = entry['title']
    # We are done
    return entry
The above code is fairly well commented, so without further ado, here is some test output:
link = json.dumps(get_entry(bookmarks[2]))
print link
DEBUG:Getting URL: http://www.martyncurrey.com/arduino-nano-as-an-isp-programmer/
DEBUG:Icon URL: http://www.martyncurrey.com/wp-content/uploads/2016/10/favicon2.ico
{"comment": "Arduino Nano as an ISP Programmer | Martyn Currey", "add_epoch": "1430895155", "title": "Arduino Nano as an ISP Programmer | Martyn Currey", "url": "http://www.martyncurrey.com/arduino-nano-as-an-isp-programmer/", "tags": ["arduino", "programming"], "icon_uri": "http://www.martyncurrey.com/wp-content/uploads/2016/10/favicon2.ico", "private": 0, "add_date": "2015-05-06 06:52:35", "icon_data": ""}
And there it is. I ran the code against my own bookmarks in del.icio.us and, after a few false starts where their site was unreachable (no reason given by the guys at del.icio.us), I managed to safely retrieve all my data.
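For those false starts, one option would be to wrap each fetch in a small retry helper rather than restarting by hand. The following is only a sketch: `with_retries` is a hypothetical name, the attempt count and delay are arbitrary, and it retries on `IOError` simply because that is the exception the fetching code in this article catches. It is written for Python 3.

```python
import time


def with_retries(action, attempts=3, delay=5.0):
    """Call action() until it succeeds or the attempts run out.

    Only IOError is retried; the final failure is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except IOError:
            if attempt == attempts:
                raise
            time.sleep(delay)


calls = []


def flaky():
    """Fail twice, then succeed, to imitate an unreachable site."""
    calls.append(1)
    if len(calls) < 3:
        raise IOError('unreachable')
    return 'ok'


print(with_retries(flaky, delay=0))
```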
And now for a couple of other things which are connected with what I have been doing whilst working on this bookmark business.
Aside - Outputting the Results
At some point I wanted to incrementally write results to a file in JSON format. Occasionally a data fetch would fail and the entire process would end, resulting in a somewhat truncated file. So, I decided to take a three-part approach:
- Write an opening list delimiter.
- In a loop, write out fetched JSON dictionary objects, skipping failures.
- Write a closing list delimiter
Here's what I came up with; rough and ready, but worked quite happily.
filename = '/testing/links.json'
# Write starting character
with open(filename, 'w') as start:
    start.write('[')
# Here you might have something that fetched a page at a time and created
# a list of bookmarks per page - then iterate through that list next.
# For each, blah, append links - add a comma after each bookmark except for the last bookmark
# Here's hoping that the last list item is not a failure.
for idx, bookmark in enumerate(bookmarks):
    with codecs.open(filename, 'a', encoding='utf-8') as incr:
        incr.write(json.dumps(get_entry(bookmark)))
        if idx < len(bookmarks) - 1:
            incr.write(',')
# Done, we write the ending character
with open(filename, 'a') as end:
    end.write(']')
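The "skipping failures" step is not spelled out in the listing above, so here is one way it might be sketched. It catches errors per item and writes the comma before every entry except the first, which means a failure on the last item cannot leave a trailing comma. `write_entries` and `fake_parse` are hypothetical names (`fake_parse` stands in for `get_entry()`), and the code is written for Python 3.

```python
import json
import os
import tempfile


def write_entries(path, items, parse):
    """Write parse(item) for each item as one JSON list, skipping failures.

    Returns the number of entries written. The separator goes in front
    of every entry except the first.
    """
    written = 0
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write('[')
        for item in items:
            try:
                chunk = json.dumps(parse(item))
            except Exception:
                # Broad on purpose for a one-off script: skip and move on.
                continue
            if written:
                handle.write(',')
            handle.write(chunk)
            written += 1
        handle.write(']')
    return written


def fake_parse(item):
    """Stand-in for get_entry(): fails on one kind of item."""
    if item == 'bad':
        raise ValueError(item)
    return {'title': item}


target = os.path.join(tempfile.mkdtemp(), 'links.json')
print(write_entries(target, ['one', 'bad', 'two', 'bad'], fake_parse))
with open(target, encoding='utf-8') as handle:
    print(json.load(handle))
```

Unlike the listing above, this keeps the file open for the whole run instead of reopening it per bookmark; which is preferable depends on how likely the process is to die part-way through.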
Aside - Making an importable bookmarks HTML file
Here are some brief notes regarding turning JSON entries into bookmark HTML which I can import into a browser.
Below is a line of desired link HTML:
<DT><A HREF="http://www.something.com/some-article/" ADD_DATE="1395356195" PRIVATE="0" TAGS="stuff,things" ICON_URI="http://www.something.com/favicon2.ico" ICON="data:image/x-icon;base64,[...]">Article Title</A>
In the code below, link is a JSON string describing a dict in the same format as I generated earlier.
bm_load = json.loads(link)
outlink = u'<DT><A HREF="{}" ADD_DATE="{}" PRIVATE="0" TAGS="{}" ICON_URI="{}" ICON="{}">{}</A>'.format(
    bm_load['url'], bm_load['add_epoch'], ','.join(bm_load['tags']), bm_load['icon_uri'],
    bm_load['icon_data'], bm_load['comment']
)
print outlink
And here we are:
u'<DT><A HREF="http://www.martyncurrey.com/arduino-nano-as-an-isp-programmer/" ADD_DATE="1430895155" PRIVATE="0" TAGS="arduino,programming" ICON_URI="http://www.martyncurrey.com/wp-content/uploads/2016/10/favicon2.ico" ICON="">Arduino Nano as an ISP Programmer | Martyn Currey</A>'
Finally, writing all the links out to a file as HTML. Open and read the bookmarks data JSON file; for each entry, append a formatted piece of HTML:
import codecs
bookmarks = []
with codecs.open('/testing/links.json', 'r', encoding='utf8') as rfh:
    for bookmark in json.loads(rfh.read()):
        bookmarks.append(u'<DT><A HREF="{}" ADD_DATE="{}" PRIVATE="0" TAGS="{}" ICON_URI="{}" ICON="{}">{}</A>\n'.format(
            bookmark['url'], bookmark['add_epoch'], ','.join(bookmark['tags']), bookmark['icon_uri'],
            bookmark['icon_data'], bookmark['comment']
        ))
Create a suitable header:
header = u"""
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
"""
Now write the whole lot out to a file ready for consumption by a browser:
with codecs.open('/testing/links.html', 'w', encoding='utf8') as wfh:
    wfh.write(header)
    for line in bookmarks:
        wfh.write(line)
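One caveat with formatting HTML by hand: a title or URL containing characters such as &, < or a double quote would end up unescaped in the file. A hedged sketch of an escaping variant follows. `entry_to_html` is a hypothetical name, it uses the title rather than the comment as the link text, it treats missing icon values as empty strings, and it is written for Python 3.

```python
from html import escape


def entry_to_html(entry):
    """Render one entry dict as a bookmark <DT> line with escaped values."""
    template = ('<DT><A HREF="{}" ADD_DATE="{}" PRIVATE="{}" TAGS="{}" '
                'ICON_URI="{}" ICON="{}">{}</A>\n')
    return template.format(
        escape(entry['url']),
        escape(str(entry['add_epoch'])),
        escape(str(entry['private'])),
        escape(','.join(entry['tags'])),
        # get_icon() may leave these as None, so fall back to ''.
        escape(entry.get('icon_uri') or ''),
        escape(entry.get('icon_data') or ''),
        escape(entry['title']),
    )


sample = {
    'url': 'http://example.com/?a=1&b=2',
    'add_epoch': '1431953037',
    'private': 0,
    'tags': ['q&a', 'python'],
    'title': 'Tips <and> "tricks"',
}
print(entry_to_html(sample))
```

`html.escape` escapes quotes by default, which is what makes the values safe inside the double-quoted attributes.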
With that, we are done. Until next time.