Getting Bookmark Data from del.icio.us
######################################

:title: Getting Bookmark Data from del.icio.us
:date: 2017-02-19
:category: Software
:tags: programming, python
:slug: getting-bookmark-data-from-delicious
:author: Chris Ramsay
:status: published
:language: en
:show_source: True

.. role:: py(code)
   :language: python

.. contents::

del.icio.us Link Fetching
=========================

Below is an outline for building a rough & ready script to get `del.icio.us`_ bookmark data - a script very much driven by necessity.

Rationale
---------

.. PELICAN_BEGIN_SUMMARY

Why am I doing this? Well, I've been bookmarking links using del.icio.us since March 2009 and have, very slowly, built up a steady collection of over 900 bookmarks, most of which are probably still quite useful.

The ups and downs of social bookmarking service del.icio.us are `very well documented here`_, so I won't go into all the details, but suffice to say that reading the aforementioned article gave me all the impetus I needed to go ahead and *homestead* my list of bookmarks.

Simple, I thought. Then came the snag...

..

    We're sorry, but due to heavy load on our database we are no longer able to offer an export function. Our engineers are working on this and we will restore it as soon as possible.

Great... what now? Time to write a script of course. I initially created this article using a `Jupyter Notebook`_ which may, I hope, explain the recipe feel to the writing.

The Plan
--------

Well, initially the plan was just to grab the bookmarks by scraping my del.icio.us pages from beginning to end and turn that data straight into some form of bookmark HTML. Thinking about it more, I decided to opt for a JSON data format; that way I could easily turn the data into any format I like at my leisure.

.. PELICAN_END_SUMMARY

What Data?
----------

From a few investigations I found that for a bookmark to be successfully imported into most browsers, one ideally needs a few key pieces of data:

- Title
- URL
- Date added (human readable and epoch)
- Private or public
- Tags list
- Comment (or title if no description)
- Icon data (optional, but provides a nice icon by each link)

Annoyingly, there isn't really a standard that browsers use for importing or exporting bookmarks; it all seems a little ad hoc, so I've generalised somewhat. I would like to finish with some JSON that describes the bookmark like so:

.. code-block:: javascript

    {
        "comment":"Docker Compose - Docker Documentation",
        "add_epoch":"1431953037",
        "title":"Docker Compose - Docker Documentation",
        "url":"http://docs.docker.com/compose/",
        "tags":[
            "docker",
            "development",
            "python"
        ],
        "icon_uri":"http://docs.docker.com/favicons/favicon-32x32.png",
        "private":0,
        "add_date":"2015-05-18 12:43:57",
        "icon_data":""
    }

.. _`does the bookmark fetching`:

Getting the data
----------------

I'll now go ahead and start getting the pieces of data I need. Firstly I am importing all the libraries I'll need for the entire exercise, as well as setting up a bit of logging:

.. code-block:: python

    import datetime
    import urllib
    import ssl
    import json
    import base64
    import codecs
    import logging
    from urlparse import urlparse
    from bs4 import BeautifulSoup
    logging.basicConfig(format='%(levelname)s:%(message)s', level=logging.DEBUG)

I start by fetching one page of links from del.icio.us, breaking it down and getting each required data item piece by piece. To be clear, before I started the next bit, I used a browser dev tool to identify the page structure.

Firstly then, fetch the URL and parse the returned data with the fantastic `BeautifulSoup library`_ to get the outer block of each bookmark link.

.. code-block:: python

    url = 'https://del.icio.us/chrisramsay?&page=1'
    try:
        r = urllib.urlopen(url)
    except IOError:
        logging.warning('Could not open %s', url)
    else:
        soup = BeautifulSoup(r, 'html.parser')
        bookmarks = soup.find_all("div", class_="articleThumbBlockOuter")
        print 'We have {} bookmarks'.format(len(bookmarks))

.. parsed-literal::

    We have 10 bookmarks

Result! I have ten, seeing as there are that many per page. From here on in I am working on a single bookmark - this whole next section would be in a loop over the list of bookmarks. I am going to use item zero of the list from this point.

Firstly I get the bookmark title:

.. code-block:: python

    entry = {}
    bookmark_zero = bookmarks[0]
    title = bookmark_zero.find_all('div', class_='articleTitlePan')[0]
    entry['title'] = title.a.attrs['title']
    print entry

.. parsed-literal::

    {
        'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science'
    }

Next, the bookmark URL:

.. code-block:: python

    href = bookmark_zero.find_all('div', class_='articleInfoPan')[0].find_all('p')[0]
    entry['url'] = href.a.attrs['href']
    print entry

.. parsed-literal::

    {
        'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/',
        'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science'
    }

Get the bookmark favicon (ICON\_URI):

.. code-block:: python

    parsed = urlparse(entry['url'])
    entry['icon_uri'] = u'{}://{}/favicon.ico'.format(parsed.scheme, parsed.netloc)

Get the bookmark save date:

.. code-block:: python

    entry['add_date'] = str(datetime.datetime.fromtimestamp(int(bookmark_zero.attrs['date'])))
    entry['add_epoch'] = bookmark_zero.attrs['date']
    print entry

.. parsed-literal::

    {
        'add_epoch': u'1431293424',
        'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/',
        'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science',
        'icon_uri': u'https://outsidescience.wordpress.com/favicon.ico',
        'add_date': '2015-05-10 21:30:24'
    }

Get the bookmark tags (if there are any):

.. code-block:: python

    try:
        tags = bookmark_zero.find_all('ul', class_='tagName')[0].find_all('li')
    except IndexError:
        tags = []
    entry['tags'] = [f.a.text for f in tags]
    print entry

.. parsed-literal::

    {
        'add_epoch': u'1431293424',
        'tags': [u'arduino', u'circuits', u'photodiodes', u'photodiode'],
        'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/',
        'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science',
        'icon_uri': u'https://outsidescience.wordpress.com/favicon.ico',
        'add_date': '2015-05-10 21:30:24'
    }

Get any comments:

.. code-block:: python

    comment = None
    for a in [l for l in bookmark_zero.find('div', class_='thumbTBriefTxt').children]:
        try:
            comment = a.p.contents[0]
        except AttributeError:
            continue
    if comment is not None:
        entry['comment'] = comment
    else:
        entry['comment'] = entry['title']
    print entry

.. parsed-literal::

    {
        'comment': u'DIY Science: Measuring Light with a Photodiode II | Outside Science',
        'add_epoch': u'1431293424',
        'tags': [u'arduino', u'circuits', u'photodiodes', u'photodiode'],
        'url': u'https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/',
        'title': u'DIY Science: Measuring Light with a Photodiode II | Outside Science',
        'icon_uri': u'https://outsidescience.wordpress.com/favicon.ico',
        'add_date': '2015-05-10 21:30:24'
    }

.. _`same format as I generated earlier`:

Taking a look at the whole bookmark as a JSON string dump:

.. code-block:: python

    json.dumps(entry)

.. parsed-literal::

    '{
        "comment":"DIY Science: Measuring Light with a Photodiode II | Outside Science",
        "add_epoch":"1431293424",
        "tags":[
            "arduino",
            "circuits",
            "photodiodes",
            "photodiode"
        ],
        "url":"https://outsidescience.wordpress.com/2012/11/03/diy-science-measuring-light-with-a-photodiode-ii/",
        "title":"DIY Science: Measuring Light with a Photodiode II | Outside Science",
        "icon_uri":"https://outsidescience.wordpress.com/favicon.ico",
        "add_date":"2015-05-10 21:30:24"
    }'

Well, that looks about right to me. Obviously a collection of bookmarks in this JSON form (dictionary) will have to be expressed as a list of dictionaries. Time to think about getting the icon data.

.. _`method I defined earlier`:

Aside - Icons
-------------

Take a look at existing bookmark HTML files from any browser and it will become apparent that they store icons locally in the form of a base64-encoded string. This seems obvious once thought about; better to locally cache a copy than have to fetch a copy each time the browser is started up.

As part of the bookmark fetching process I therefore decided to see if it would be possible to visit the bookmarked page itself and extract an icon. The code is below; an explanation follows:

.. code-block:: python

    def get_icon(fetch):
        """Fetch an icon, base64 encode and return it."""
        icon_path = None
        icon_data = {'icon_uri': None, 'icon_data': None}
        # Fetch the root page, parse the HTML and get the LINK elements within HEAD.
        try:
            r_icon = urllib.urlopen(fetch)
        except (IOError, ssl.CertificateError):
            # Total fail, just return empty data
            return icon_data
        else:
            # Carry on & get the icon rel link
            icon_soup = BeautifulSoup(r_icon, 'html.parser')
            head = icon_soup.find('head')
            if head is None:
                return icon_data
            link = head.find('link', rel='icon')
            if link is not None:
                logging.debug('Icon URL: %s', link.attrs['href'])
                # Check for relative or absolute paths
                if ':' in link.attrs['href']:
                    icon_path = link.attrs['href']
                else:
                    icon_path = u'{}{}'.format(fetch, link.attrs['href'])
        # If we have icon_path, get the icon data
        if icon_path is not None:
            icon_data['icon_uri'] = icon_path
            try:
                icon_ulib = urllib.urlopen(icon_path)
            except (IOError, ssl.CertificateError):
                return icon_data
            # Get the content type to avoid things like 404 HTML
            content_type = icon_ulib.headers.getheader('Content-Type')
            # The header may be missing altogether, so guard against None
            if content_type and 'image' in content_type:
                icon_data['icon_data'] = 'data:{};base64,{}'.format(
                    content_type, base64.b64encode(icon_ulib.read()))
        # We are done
        return icon_data

What I show in the code block above is neither pretty nor particularly discriminating; it grabs the first `link` element it finds with a `rel` attribute of `icon` in the `head` of the page, then extracts the value of its `href` attribute. Next, it examines the `href` value to decide whether the link is relative or absolute and modifies the link appropriately. Either way, an icon URL is formed from which the icon itself is fetched. Headers are checked, the content type is determined, and if we get to the end we have a base64-encoded string complete with content type information.

Below are a couple of test run results using absolute and relative icon `href` values:

Testing: Absolute links to icons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For example: https://assets.publishing.service.gov.uk/static/favicon-8d811b8c3badbc0b0e2f6e25d3660a96cc0cca7993e6f32e98785f205fc40907.ico

.. code-block:: python

    fetch = 'https://www.gov.uk'
    sixfour_img = get_icon(fetch)

.. parsed-literal::

    DEBUG:Icon URL: https://assets.publishing.service.gov.uk/static/favicon-8d811b8c3badbc0b0e2f6e25d3660a96cc0cca7993e6f32e98785f205fc40907.ico

.. code-block:: python

    print sixfour_img

.. parsed-literal::

    {
        'icon_data': '',
        'icon_uri': u'https://assets.publishing.service.gov.uk/static/favicon-8d811b8c3badbc0b0e2f6e25d3660a96cc0cca7993e6f32e98785f205fc40907.ico'
    }

Testing: Relative links to icons
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For example: /images/favicon.png?1486118657

.. code-block:: python

    fetch = 'http://scrapy.org'
    sixfour_img = get_icon(fetch)

.. parsed-literal::

    DEBUG:Icon URL: /images/favicon.png?1486118657

.. code-block:: python

    print sixfour_img

.. parsed-literal::

    {
        'icon_data': '',
        'icon_uri': u'http://scrapy.org/favicons/favicon-192x192.png'
    }

Now that I have a fairly reliable method for fetching icon data, I have all the pieces I need for the definitive bookmarks JSON and am about ready to put everything together into a dirty fire-and-forget script!

Joining things up
-----------------

So, here is a function that basically `does the bookmark fetching`_ **and** icon fetching. It is simple and not too resilient, but should just about get the job done. It makes use of the :py:`get_icon()` `method I defined earlier`_.

.. code-block:: python

    def get_entry(bookmark):
        """Parses an individual del.icio.us entry"""
        entry = {}
        # Locate and set Title
        title = bookmark.find_all('div', class_='articleTitlePan')[0]
        entry['title'] = title.a.attrs['title']
        # Locate and set bookmark URL
        href = bookmark.find_all('div', class_='articleInfoPan')[0].find_all('p')[0]
        entry['url'] = href.a.attrs['href']
        logging.debug('Getting URL: %s', entry['url'])
        # Parse the URL - we need this to get the Icon
        parsed = urlparse(entry['url'])
        # Get the icon URL and base64 encoded icon image
        icon = get_icon(u'{}://{}'.format(parsed.scheme, parsed.netloc))
        entry['icon_uri'] = icon['icon_uri']
        entry['icon_data'] = icon['icon_data']
        # Get the date the bookmark was added, both in readable and epoch format
        entry['add_date'] = str(datetime.datetime.fromtimestamp(int(bookmark.attrs['date'])))
        entry['add_epoch'] = bookmark.attrs['date']
        # Just set the bookmark to public
        entry['private'] = 0
        # Get all the tags (if there are any)
        try:
            tags = bookmark.find_all('ul', class_='tagName')[0].find_all('li')
            entry['tags'] = [tag.a.text for tag in tags]
        except IndexError:
            entry['tags'] = ['none']
        # Get the comment (if there is one, else use the title)
        comment = None
        for brief_text in [l for l in bookmark.find('div', class_='thumbTBriefTxt').children]:
            try:
                comment = brief_text.p.contents[0]
            except AttributeError:
                continue
        if comment is not None:
            entry['comment'] = comment
        else:
            entry['comment'] = entry['title']
        # We are done
        return entry

The above code is fairly well commented, so without further ado here is some test output:

.. code-block:: python

    link = json.dumps(get_entry(bookmarks[2]))
    print link

.. parsed-literal::

    DEBUG:Getting URL: http://www.martyncurrey.com/arduino-nano-as-an-isp-programmer/
    DEBUG:Icon URL: http://www.martyncurrey.com/wp-content/uploads/2016/10/favicon2.ico

.. parsed-literal::

    {"comment": "Arduino Nano as an ISP Programmer | Martyn Currey", "add_epoch": "1430895155", "title": "Arduino Nano as an ISP Programmer | Martyn Currey", "url": "http://www.martyncurrey.com/arduino-nano-as-an-isp-programmer/", "tags": ["arduino", "programming"], "icon_uri": "http://www.martyncurrey.com/wp-content/uploads/2016/10/favicon2.ico", "private": 0, "add_date": "2015-05-06 06:52:35", "icon_data": ""}

And there it is. I ran the code against my own bookmarks in del.icio.us and, after a few false starts where their site was unreachable (no reason given by the guys at del.icio.us), I managed to safely retrieve all my data.

And now for a couple of other things which are connected with what I have been doing whilst working on this bookmark business.

Aside - Outputting the Results
------------------------------

At some point I wanted to incrementally write results to a file in JSON format. Occasionally a data fetch would fail and the entire process would end, resulting in a somewhat truncated file. So, I decided to take a three-part approach:

1. Write an opening list delimiter.
2. In a loop, write out fetched JSON dictionary objects, skipping failures.
3. Write a closing list delimiter.

Here's what I came up with; rough and ready, but it worked quite happily.

.. code-block:: python

    filename = '/testing/links.json'
    # Write starting character
    with open(filename, 'w') as start:
        start.write('[')
    # Here you might have something that fetched a page at a time and created
    # a list of bookmarks per page - then iterate through that list next.
    # For each, blah, append links - add a comma after each bookmark except for the last bookmark
    # Here's hoping that the last list item is not a failure.
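    for idx, bookmark in enumerate(bookmarks):
        with codecs.open(filename, 'a', encoding='utf-8') as incr:
            incr.write(json.dumps(get_entry(bookmark)))
            if idx < len(bookmarks) - 1:
                incr.write(',')
    # Done, we write the ending character
    with open(filename, 'a') as end:
        end.write(']')

Aside - Making an importable bookmarks HTML file
------------------------------------------------

Here are some brief notes regarding turning JSON entries into bookmark HTML which I can import into a browser.

Below is a line of desired link HTML:

.. code-block:: html

    <DT><A HREF="http://docs.docker.com/compose/" ADD_DATE="1431953037" TAGS="docker,development,python" ICON_URI="http://docs.docker.com/favicons/favicon-32x32.png" ICON="">Docker Compose - Docker Documentation</A>

Build one such line per JSON entry:

.. code-block:: python

    # Assumption: `entries` is the list of dictionaries read back from the JSON file written earlier.
    with codecs.open('/testing/links.json', 'r', encoding='utf-8') as rfh:
        entries = json.load(rfh)
    bookmarks = []
    for bookmark in entries:
        # Assumed markup: attribute names follow the Netscape bookmark file format.
        bookmarks.append(u'<DT><A HREF="{}" ADD_DATE="{}" TAGS="{}" ICON_URI="{}" ICON="{}">{}</A>\n'.format(
            bookmark['url'],
            bookmark['add_epoch'],
            ','.join(bookmark['tags']),
            bookmark['icon_uri'],
            bookmark['icon_data'],
            bookmark['comment']
        ))

Create a suitable header:

.. code-block:: python

    # Assumed markup: the usual Netscape bookmark file preamble.
    header = u"""<!DOCTYPE NETSCAPE-Bookmark-file-1>
    <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
    <TITLE>Bookmarks</TITLE>
    <H1>Bookmarks</H1>
    <DL><p>
    """

Now write the whole lot out to a file ready for consumption by a browser:

.. code-block:: python

    with codecs.open('/testing/links.html', 'w', encoding='utf8') as wfh:
        wfh.write(header)
        for line in bookmarks:
            wfh.write(line)

With that, we are done. Until next time.

.. footnotes

.. links

.. _`del.icio.us`: http://del.icio.us
.. _`very well documented here`: https://sixtwothree.org/posts/homesteading-a-decades-worth-of-shared-links
.. _`Jupyter Notebook`: http://jupyter.org/
.. _`BeautifulSoup library`: https://www.crummy.com/software/BeautifulSoup/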