Saturday, August 05, 2006

del.icio.us: Taxonomy / auto-categorization tools

DRAFT: Taxocop is all commercially licensed stuff. I fully intend to link to open-source tools, or ruminate on how to do it in Python, C, Weka, and R.

Taxocop's taxonomy tools
(found at Swik.)

Univ. of Western Ontario profs ruminate about tagging (Western, like Waterloo, Queen's, and U of Toronto, is a top-notch uni, but nobody in the U.S. has heard of them). Reminiscent of software people talking about Christopher Alexander & design patterns, so I probably won't read this either.
(found at Facetag)

Master's thesis about delicious and flickr. This one's interesting.
(from facetag also)

Facetmap: what's their algorithm? (start reading at page 8)

Interesting conceptual ruminations here

social bookmarking universe

1st things 1st (I said it before, I'll say it again):

Non-gentle (Stringent or Strident) Reminder: Read the "Terms of Service", "Conditions of Use", "Fair Use of our Site", whatever it's called, for any site you're thinking of scraping. Also the "robots.txt". Abide by them. We're talking about getting a handful of pages here, *not* about downloading dozens or more pages, let alone their whole database.
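Python's standard library will even do the robots.txt check for you. A minimal sketch, using Python 3's urllib.robotparser (the user-agent string is a placeholder of mine, not anything the sites require):

    import urllib.robotparser

    # Fetch and parse the site's robots.txt once, then ask before each page.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://del.icio.us/robots.txt")
    rp.read()

    # "TagScraper/0.1" is a made-up user-agent; identify yourself honestly.
    if rp.can_fetch("TagScraper/0.1", "http://del.icio.us/tag/python"):
        print("OK to fetch")
    else:
        print("Disallowed -- leave it alone")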

Choose your corpus:

Suppose you needed to learn everything about a del.icio.us tag. Delicious is hardly the best for social bookmarking / folksonomy / tag clouding: URLs are repeated ad nauseam with different user descriptions, you can't see the tagged URL til you mouse over it, you can't sort URLs by number of times tagged, etc. But it's got a lot of URLs tagged, and you can search on, say, "Ruby GC" and get a finite number of good tagged URLs. (That's one way to do a focused search on delicious; the other is to look thru the ruby tags of somebody you know is a solid developer. Both come down to picking the right tag-page URL, as in the sketch below.)
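If you go the tag-page route, the URLs are easy to build by hand. A quick sketch, assuming del.icio.us's URL scheme as I understand it (global tag pages under /tag/, a user's under /username/, multiple tags intersected with '+'; double-check against the live site):

    def tag_url(*tags, user=None):
        """Build a del.icio.us tag-page URL (the URL scheme is an assumption)."""
        path = "+".join(tags)
        base = f"http://del.icio.us/{user}" if user else "http://del.icio.us/tag"
        return f"{base}/{path}"

    print(tag_url("ruby", "gc"))                 # everybody's ruby+gc bookmarks
    print(tag_url("ruby", user="somesoliddev"))  # a solid developer's ruby tag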

What do I want in a bookmarking/blog-search site? A decent search engine, obviously: searching Digg for "IDE" turns up everything with "guide" or "video" in the title. A way to find newly tagged URLs that've been tagged only a few times. Easy markup and a lot of tagged URLs per page.

Reddit's not bad, tho you can't tell how many database hits you get for your tag or search term, and the markup's moderately messy to scrape.

ma.gnolia: very easy markup, lots of tech tagged URLs.

Digg: searches for "ruby GC" and "python GC" turned up nothing; lousy search engine, as noted above. But for the most part, the Python-tagged URLs are good, and the markup's not too bad to extract from.

Technorati covers tech blogs pretty well. You can filter out stuff in languages you don't read, the markup's not too bad to scrape, and, while you have to mouse over to see the underlying URL, it's got a nice extract of the web page as a summary for the tag.

Blinklist. Now this is nice: when you type "Python" into the search box, the drop-down list shows the most popular searches: "python tutorials", "python 3D", etc. But when you scroll thru the URLs for one tag, page refreshes are done thru AJAX, so you can't easily grab the HTML and run it thru an HTML parser or save it to your hard drive. Still working on this.


Haven't investigated yet:


Dzone.com: Digg for developers
Smarking: not bad, but they obviously didn't spend a lotta time on domain names or learning CSS / design

Fantacular: I think it's brand new: 20 tagged Python articles (!?)

Shadows: another one where you can't capture successive pages by requesting new URLs. But they expose the Rails FilterBySearchItem at the bottom, so you can push those buttons and capture all the pages for one tag (I think). Also, you can't "View Source" on the page you actually have up (in Firefox). Nasty markup too: <div> within <div>, etc. A good challenge.

Connotea: find out how scientists are using py and rb. The easiest-to-parse markup you'll ever see, 'cause physicists have already filled up their left brains.

Rexa.info, by Andrew McCallum. Info-extracted author info + pictures, citations, etc.

Citeulike.org. YAATS: yet another academic tagging site.

Feedster: Technorati's main competition, apparently. Poor spamblog filtering; probably not worth bothering with.
Sphere.com: a better Technorati work-alike

Bloglines: good lists of French links for rb/py. Lots of dups, but easy markup to parse.

co.mments
TailRank

Yahoo MyWeb
Scuttle: 1000s of py/rb links, not that interesting tho

Netvouz


Feed Me Links: Since 2002!

Delirious: not bad for py/rb, easy markup.

Also-rans:

Wists: social shopping??
Segnalo.com: practice your Italian. Only a few dozen Ruby and Python things here, but looks nice; some good URLs.

Furl (not bad, actually, and pretty easy markup to parse)

Unalog

RawSugar: a few good links

Spurl (broken functionality: you have to register to see any tags)

Simpy (4,400 URLs tagged "Python", messy deep markup)
Jots (96% spam, little technology content)
PubSub: I have to figure out how to get at other people's tag subscriptions

linkAGoGo: not much content, and this guy knows less about CSS and design than I do.
NewsVine
BlinkBits: truly sad: phpBB, so they'll be HaX0r'd from time to time, and no content.

Fark: no comment, but here's a scraper.

butterfly
StumbleUpon

ClipMarks: no content, and very hard to scrape
BlueDot.US: not much tech content, yet
Maple.NU

Q: Where the heck did you find all these?
A: affiliateBlog and Roxomatic.DE
And this guy and a Quimble poll
So there's the raw material. Now we have to figure out how to pull the pages and extract the URLs, page descriptions, and tag scores or numbers of people who've tagged each page. Save a few pages from each social bookmarking site into its own subdir, then fire up BeautifulSoup, or whatever HTML parser you like. The (X)HTML is generally correct, but markup trees can get deep, and some sites have a bunch of JavaScript for each tagged URL.
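To make that concrete, here's a minimal BeautifulSoup sketch over pages you've already saved. The directory name and the "post"/"popular" class names are placeholders of mine; every site's markup differs, so substitute whatever you see in View Source (the import shown is BeautifulSoup 4's; older releases import differently):

    import glob
    from bs4 import BeautifulSoup

    # Walk the saved pages for one site and pull out (url, description, score).
    for path in glob.glob("delicious/*.html"):
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f, "html.parser")
        # "div.post" and "span.popular" are made-up selectors -- adjust per site.
        for post in soup.find_all("div", class_="post"):
            link = post.find("a")
            if link is None:
                continue
            url = link.get("href", "")
            desc = link.get_text(strip=True)
            count = post.find("span", class_="popular")  # e.g. "saved by 42 people"
            score = count.get_text(strip=True) if count else ""
            print(url, desc, score, sep="\t")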

BTW, here's how to put a bazillion folksonomy buttons on your Rails app.

(Aside, from the Biting the Hand That Feeds You Dept.: Blogger's markup is odd; it has a huge stylesheet inline in every page, and the blog entry is all on one line with <br>'s. Easy to scrape tho. As another aside, Blogger's captchas are murderously difficult, even for somebody with 20/20 vision.)

Next we're ready to go to Starbucks and set our little script hoovering all the pages from each social bookmarking site. Like O'Reilly's Spidering Hacks says, be nice: download pages on a human scale, one every few seconds max, and don't DoS anybody. Save to your hard drive; don't hit their server over and over and over. You might have to set the User-Agent and HTTP Referer headers, as in the sketch below.
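Here's a sketch of that hoovering loop, with the politeness knobs spelled out. The user-agent and Referer values are placeholders, and it assumes you've already got a flat list of page URLs for the site:

    import time
    import urllib.request
    from pathlib import Path

    HEADERS = {
        # Some sites 403 the default Python user-agent; say who you are.
        "User-Agent": "TagScraper/0.1 (personal research)",
        # A few sites check Referer too (placeholder value).
        "Referer": "http://del.icio.us/",
    }

    def hoover(urls, outdir, delay=5.0):
        """Fetch each page at a human pace and save it to disk, so we
        never have to hit the server twice for the same page."""
        out = Path(outdir)
        out.mkdir(parents=True, exist_ok=True)
        for i, url in enumerate(urls):
            dest = out / f"page{i:04d}.html"
            if dest.exists():        # already saved -- don't re-fetch
                continue
            req = urllib.request.Request(url, headers=HEADERS)
            with urllib.request.urlopen(req) as resp:
                dest.write_bytes(resp.read())
            time.sleep(delay)        # one page every few seconds, max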

How do you categorize, dedup, and easily search all your URLs? How's about Excel spreadsheets? Yes, it's clumsy to right-click and copy/paste each link into the browser, but I haven't found a better tool for scrolling/visual inspection and easy searches.
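Getting the scraped links into a spreadsheet is a short script. Here's a sketch that dedups on URL and writes a CSV Excel will open; the four columns are my own guess at what's worth keeping:

    import csv

    def write_links_csv(records, path="links.csv"):
        """records: iterable of (url, description, tags, score) tuples.
        Keep the first row seen per URL, then dump the lot to CSV."""
        seen = {}
        for url, desc, tags, score in records:
            seen.setdefault(url, (desc, tags, score))
        with open(path, "w", newline="", encoding="utf-8") as f:
            w = csv.writer(f)
            w.writerow(["url", "description", "tags", "score"])
            for url, (desc, tags, score) in seen.items():
                w.writerow([url, desc, tags, score])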