Friday, June 16, 2006

del.icio.us to Knowledge base

Non-gentle (Stringent or Strident) Reminder: Read the "Terms of Service", "Conditions of Use", "Fair Use of our Site", whatever it's called, for any site you're thinking of scraping. Also the "robots.txt". Abide by them. We're talking about getting a handful of pages here, *not* about downloading dozens or more pages, let alone their whole database.

<aside> One of the ISI folk writes that this is an ontology waiting to happen. But how? Alright, so you reduce all of del.icio.us' software tags and tagged URLs to a term-document matrix (tags as terms, URLs as documents). This is the "n >> m" case discussed by Berry/Browne, pg 30. Then apply tf and idf weighting, normalize the columns (documents, or URLs in this case), and take the SVD to get a rank-200 approximation. </aside>
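A rough sketch of that pipeline, assuming NumPy is available and that you already have a small tags-by-URLs count matrix in hand. The function name `lsa_tag_vectors` and the particular idf formula (log of URL count over document frequency) are my own choices here, not anything from Berry/Browne.

```python
import numpy as np

def lsa_tag_vectors(counts, rank=200):
    """Weight a (n_tags, n_urls) count matrix and truncate its SVD.

    Returns (U, s, Vt) restricted to the leading `rank` singular triplets.
    """
    counts = np.asarray(counts, dtype=float)
    n_tags, n_urls = counts.shape

    # idf per tag: log(total URLs / URLs carrying the tag); assumed formula.
    doc_freq = np.count_nonzero(counts, axis=1)
    idf = np.log(n_urls / np.maximum(doc_freq, 1))
    weighted = counts * idf[:, None]

    # Normalize each column (one column per URL) to unit Euclidean length.
    norms = np.linalg.norm(weighted, axis=0)
    norms[norms == 0] = 1.0  # leave all-zero columns untouched
    weighted = weighted / norms

    # Full thin SVD, then keep only the leading singular triplets.
    U, s, Vt = np.linalg.svd(weighted, full_matrices=False)
    k = min(rank, len(s))
    return U[:, :k], s[:k], Vt[:k, :]
```

Rows of `U` scaled by `s` give one vector per tag, which is where you'd start looking for tags that cluster together. For a matrix the size of real del.icio.us data you'd want a sparse, truncated solver rather than a dense full SVD.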

Then we look at each tag's tokenization (stemming/stopwords), orthographic features, syntactic features (part of speech/composite word), chunking, and gazetteers. Viz., is this one of the 250 or 500 most common English words? Is it all caps, or 3 words run together like German? Is it a noun or adjective in WordNet? What % of tagged URLs does the tag occur in (unpleasant regex building/maintenance exercise to ensue...), and in what kind of chunks? At this point we have a big log-linear generative modelling exercise.
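A minimal sketch of a few of those per-tag features, in plain Python. The tiny `COMMON_WORDS` set stands in for a real 250- or 500-word list, and the `tagged_urls` mapping (URL to set of tags) is a made-up input shape. The WordNet part-of-speech lookup is left out because it needs an external corpus.

```python
import re

# Stand-in for a real list of the 250-500 most common English words.
COMMON_WORDS = {"the", "of", "and", "to", "in", "is", "for", "on", "with"}

def tag_features(tag, tagged_urls):
    """Return simple orthographic and frequency features for one tag.

    tagged_urls: dict mapping URL -> set of tags (hypothetical input shape).
    """
    n_urls = len(tagged_urls)
    n_with_tag = sum(1 for tags in tagged_urls.values() if tag in tags)
    return {
        "is_common_word": tag.lower() in COMMON_WORDS,
        "is_all_caps": tag.isalpha() and tag.isupper(),
        # Count capitalized chunks, a cheap signal for run-together words.
        "camel_parts": len(re.findall(r"[A-Z][a-z]+", tag)),
        "has_separator": bool(re.search(r"[-_.+/]", tag)),
        # Fraction of tagged URLs this tag occurs in.
        "url_fraction": n_with_tag / n_urls if n_urls else 0.0,
    }
```

Each dict is one row of the feature table you'd feed into the modelling step. All-lowercase run-togethers like rubyonrails slip past `camel_parts`, which is where the dictionary-based splitting (and the regex pain) would come in.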

It may help to think in terms of Software Design Patterns: are these tags "is-a", "has-a", "has-many", "produces-a", i.e. are they Adapters, Facades, Composition, Delegation, Factories, etc.? Facetmap is producing gazetteers, which is impressive if there's no human intervention, i.e. unsupervised or minimally supervised. But I think we really can think semantically rather than phonologically/syntactically.

Maybe that's how this big Swik.net tag cloud became this small one. Swik's interesting: they have this huge tag cloud from somewhere, but not a whole lot of URLs that they present as their content.

If Bill Joy, Matz, Damian Conway, Guido, Audrey Tang, Tim Peters & Mauricio Fernandez told you what their del.icio.us user IDs were, what would you do with the info? If you had access to click-thru data, you could calculate which tags, or users, had the highest click-thru rates.
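If such click-thru data existed, the calculation itself would be short. A sketch, assuming a purely hypothetical event record with `tag`, `user`, `shown` and `clicked` fields (del.icio.us exposes nothing like this that I know of):

```python
from collections import defaultdict

def click_through_by_key(events, key):
    """Rank tags or users by click-thru rate, highest first.

    events: iterable of dicts with hypothetical fields
            'tag', 'user', 'shown' (impressions) and 'clicked' (clicks).
    key:    'tag' or 'user', the field to aggregate over.
    """
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for event in events:
        shown[event[key]] += event["shown"]
        clicked[event[key]] += event["clicked"]
    # Skip keys that were never shown to avoid dividing by zero.
    rates = {k: clicked[k] / shown[k] for k in shown if shown[k] > 0}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)
```

Call it once with `key="tag"` and once with `key="user"` to get both rankings from the same events.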

One of my favorite strategies: ID del.icio.us users who are solid Python developers and early taggers of rb/py URLs: dehora, rtomayko, rhymes, masterchef, brunns.

We have polysemy/synonymy. Synonymy: Rails tags are rails, rubyonrails, RubyonRails, etc. Polysemy: Sam Ruby's articles sometimes get tagged ruby.
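The synonymy half can be attacked crudely by folding spelling variants onto one canonical tag. A minimal sketch: the `ALIASES` table is hand-made and hypothetical, and nothing here touches the polysemy half (it cannot tell Sam Ruby from the language).

```python
import re
from collections import defaultdict

# Hypothetical hand-maintained alias table: normalized key -> canonical tag.
ALIASES = {"rubyonrails": "rails"}

def canonical_tag(tag):
    """Lowercase, drop separators, then apply the alias table."""
    # Keep '+' and '#' so that tags like c++ and c# stay distinct.
    key = re.sub(r"[^a-z0-9+#]", "", tag.lower())
    return ALIASES.get(key, key)

def group_variants(tags):
    """Group raw tags under their canonical form."""
    groups = defaultdict(set)
    for tag in tags:
        groups[canonical_tag(tag)].add(tag)
    return dict(groups)
```

For example, `group_variants(["rails", "rubyonrails", "RubyonRails", "ruby"])` puts the three Rails spellings under `rails` and leaves `ruby` on its own.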

<aside> Another ISI guy idea: Rexa could bring google-trends type metrics to academia: "Seminal!" "Derivative!" "Debunked!" </aside>