Saturday, August 05, 2006; Taxonomy / auto-categorization tools

DRAFT: Taxocop is all commercially licensed stuff. I fully intend to link open source tools, or ruminate on how to do it in python, C, weka and R.

Taxocop's taxonomy tools
(found at Swik.)

Univ. of Western Ontario profs ruminate about tagging (Western, as well as Waterloo, Queen's, and U of Toronto, are top-notch unis, but nobody knows about them in the U.S.). Reminiscent of software people talking about Christopher Alexander & design patterns, so I probably won't read this either.
(found at Facetag)

Master's thesis about delicious and flickr. This one's interesting.
(from facetag also)

Facetmap: what's their algorithm? (start reading at page 8)

Interesting conceptual ruminations here

social bookmarking universe

1st things 1st (I said it before, I'll say it again):

Non-gentle (Stringent or Strident) Reminder: Read the "Terms of Service", "Conditions of Use", "Fair Use of our Site", whatever it's called, for any site you're thinking of scraping. Also the "robots.txt". Abide by them. We're talking about getting a handful of pages here, *not* about downloading dozens or more pages, let alone their whole database.

Choose your corpus:

Suppose you needed to learn everything about a tag. Delicious is hardly the best for social bookmarking / folksonomy / tag clouding: URLs are repeated ad nauseam with different user descriptions, you can't see the tagged URL until you mouse over it, you can't sort URLs by number of times tagged, etc. But it's got a lot of URLs tagged, and you can search on, say, "Ruby GC" and get a finite number of good tagged URLs. (That's one way to do focused search on delicious; the other is to look thru the ruby tags of somebody you know is a solid developer.)

What do I want in a bookmarking/blog-search site? A decent search engine, obviously: searching Digg for "IDE" turns up everything with "guide" or "video" in the title. A way to find newly tagged URLs that've been tagged only a few times. Easy markup and a lot of tagged URLs per page.

Reddit's not bad, tho you can't tell how many database hits you get for your tag or search term, and the markup's moderately messy to scrape.

ma.gnolia: very easy markup, lots of tech tagged URLs.

digg: searches for "ruby GC" and "python GC" turned up nothing; lousy search engine, as noted above. But for the most part, python-tagged URLs are good, and the markup's not too bad to extract from.

technorati covers tech blogs pretty well. You can filter out stuff in languages you don't read, the markup's not too bad to scrape, and, while you have to mouse over to see the underlying URL, it's got a nice extract of the web page as a summary for the tag.

Blinklist. Now this is nice: when you type "Python" into the search box, the drop-down list shows the most popular searches: "python tutorials", "python 3D", etc. But: when you scroll thru URLs for 1 tag, page refreshes are done thru AJAX, so you can't easily grab the HTML and run it thru an HTML parser or save it to your hard drive. Still working on this.

Haven't investigated yet: digg for developers
Smarking: not bad, but they obviously didn't spend a lotta time on domain names or learning CSS/design.

Fantacular: I think it's brand new: 20 tagged python articles(!?)

shadows: another one where you can't capture successive pages by requesting new URLs. But they expose the Rails FilterBySearchItem at the bottom, so you can push those buttons and capture all the pages for 1 tag (I think). Also can't "View Source" for the page you actually have up (in Firefox). Nasty markup too, <div> within <div>, etc. A good challenge.

Connotea: find out how scientists are using py and rb. Easiest-to-parse markup you'll ever see, 'cause physicists have already filled up their left brains. Another, by Andrew McCallum: info-extracted author info + pictures, citations, etc. YAATS (yet another academic tagging site).

Feedster: technorati's main competition, apparently. Poor spamblog filtering; probably not worth bothering with as a technorati work-alike.

bloglines: good lists of French links for rb/py. Lots of dups, but easy markup to parse.


Yahoo MyWeb
scuttle: 1000s of py/rb links, not that interesting tho.


Feed Me Links Since 2002!

delirious: not bad for py/rb, easy markup.

Also rans:

wists: Social shopping?? Practice your Italian. Only a few dozen ruby and python things here, but it looks nice; some good URLs.

Furl (not bad, actually, and pretty easy markup to parse),


RawSugar a few good links

Spurl (broken functionality, you have to register to see any tags),

simpy (4,400 URLs tagged "Python"; messy, deep markup),
jots (96% spam, little technology content),
pubsub: I have to figure out how to get at other people's tag subscriptions.

linkAGoGo: not much content; this guy knows less about CSS and design than I do.
BlinkBits: truly sad: PHPbb, so they'll be HaX0rd from time to time; no content.

fark No comment, but here's a scraper


ClipMarks: no content and very hard to scrape.
BlueDot.US: not much tech content, yet.

Q: Where the heck did you find all these?
A: affiliateBlog and Roxomatic.DE
And this guy, and a Quimble poll.

So there's the raw material. Now we have to figure out how to pull the pages and get the URLs, page descriptions, tag scores or numbers of people who've tagged each page, etc. Save a few pages from each social bookmarking site into its own subdir, then fire up BeautifulSoup, or whatever HTML parser you like. The (X)HTML is generally correct, but markup trees can get deep, and some sites have a bunch of javascript for each tagged URL.
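To make that concrete, here's a minimal sketch using Python's stdlib html.parser (BeautifulSoup does the same job with less code). The `taggedlink` class name and the sample markup are made up; check each site's real HTML first.

```python
from html.parser import HTMLParser

class TagLinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchors carrying a given
    CSS class. The class name is a placeholder, not any real site's."""
    def __init__(self, link_class="taggedlink"):
        super().__init__()
        self.link_class = link_class
        self.links = []          # list of (href, text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "a" and self.link_class in a.get("class", "").split():
            self._href = a.get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

page = '<div><a class="taggedlink" href="http://example.org/gc">Ruby GC notes</a></div>'
p = TagLinkExtractor()
p.feed(page)
print(p.links)   # [('http://example.org/gc', 'Ruby GC notes')]
```

For the deep-tree sites, you'd extend handle_starttag to track nesting depth, but the skeleton is the same.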

BTW, here's how to put a bazillion folksonomy buttons on your rails app.

(Aside, from the Biting-the-hand-that-feeds-you Dept.: Blogger's markup is odd; it has a huge stylesheet inline in every page, and the blog entry is all on 1 line with <br>'s. Easy to scrape, tho. As another aside, Blogger's captchas are murderously difficult, even for somebody with 20/20 vision.)

Next we're ready to go to Starbucks and start our little script hoovering the pages from each social bookmarking site. Like O'Reilly's Spidering Hacks says: be nice, download pages on a human scale, one every few seconds max, don't DoS anybody. Save to your hard drive; don't hit their server over and over and over. You might have to set the User-Agent and Referer headers.
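A sketch of a downloader along those lines; the User-Agent string, contact address, and 3-second delay are placeholder values, not recommendations from the book.

```python
import time
import urllib.request
from pathlib import Path

def polite_fetch(urls, outdir, delay=3.0,
                 user_agent="tag-research-bot (contact: me@example.org)",
                 referer=None):
    """Download a handful of pages on a human scale: one request every
    `delay` seconds, saved to disk so we never re-hit the server."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for i, url in enumerate(urls):
        headers = {"User-Agent": user_agent}
        if referer:
            headers["Referer"] = referer   # some sites want a Referer set
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            data = resp.read()
        (out / f"page{i:03d}.html").write_bytes(data)
        time.sleep(delay)                  # be nice: don't DoS anybody
```

Once the pages are on disk, everything downstream (parsing, re-parsing, experimenting) costs the site nothing.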

How do you categorize, dedup, and easily search all your URLs? How about Excel spreadsheets? Yes, it's clumsy to right-click and copy/paste your link into the browser, but I haven't found a better tool for scrolling/visual inspection and easy searches.
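Before the spreadsheet stage, a quick dedup pass saves a lot of scrolling. A sketch; the three-column (url, description, tag_count) row layout is just an example, not what any particular site gives you.

```python
import csv

def dedup_rows(rows):
    """Keep the first row seen for each URL. Assumes each row is
    (url, description, tag_count) -- a made-up column layout."""
    seen, out = set(), []
    for row in rows:
        key = row[0].rstrip("/").lower()   # crude URL normalization
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [("http://example.org/A", "first", "3"),
        ("http://example.org/a/", "dup", "1"),
        ("http://example.org/b", "other", "2")]

# Write the deduped rows to a .csv that Excel opens directly.
with open("urls.csv", "w", newline="") as f:
    csv.writer(f).writerows(dedup_rows(rows))
```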

Monday, July 24, 2006

Python metablog, newsgroup, mail list, digest, survey, aggregator, portal, manual summary, link dumps

Python has a bunch (ruby has more): monstrous blogroll
loosely mirrored at

3 areas to become proficient in:
- metaprogramming (the links below),
- Design Patterns / O-O design (no recommendations yet), and
- unit testing / Test-driven Development(NRY also/either)

python-dev read this 1st

Py wiki table of contents; daily URL; 90's-style link dump; Assoc. Francophone Py (lots of info on Python, in French; probably more ruby/rails mindshare now; the French guys work on Zope, etc.) YABR
Dr. Dobb's Python-URL No archive, but good to peruse

Sunday, July 09, 2006

NLP, IR, Data Mining books

- Jurafsky/ Martin updating SLP. Half the chapter drafts are up.

- Manning and Schütze, Foundations of Statistical NLP. Get the 6th printing, 2003, with most of the critical errata folded in (still have to look at the errata). <OT>why is it so hard to find at the Stanford bookstore?</OT>

- Manning, Prabhakar Raghavan, Schütze: IR book out in Oct.

- Jackson and Moulinier, NLP Online Apps (2002): recommended; 200-page survey of Retrieval, Info Extraction, Categorization/clustering, Text Mining. Footnotes on implementing: "labor intensive", "regular maintenance", etc.

- Oxford Handbook of Comp Linguistics (review here)
- Charniak, Statistical Language Learning, 1993
- Norvig and Russell, AI

- Bod, Hay, Jannedy, eds. Probabilistic Linguistics

- Hal Daumé's excellent blog suggests Allen's text as well
- Alias-I's Bob Carpenter's 20-book Amazon list. Excellent
Dissertations I like: Klein, Finn;
List of disserts from this UMass student, again via Hal Daumé's blog
- Witten and Frank, Data Mining
- Han and Kamber, Data Mining
- Hand, Mannila, Smyth, Data Mining

- Kumar, Intro Data Mining
- Chakrabarti, Mining the Web. Excellent. Fairly rigorous math; covers conditional probability modelling, supervised/semi-/unsupervised inference, how to build crawlers, graph algorithms (HITS, PageRank)

- Survey of Text Mining
- Springer is doing a series of Web Intelligence books.

- Hatcher, Lucene in Action: thorough coverage of the Java open-source search engine (how to roll it out, not architectural/algorithmic detail)

- Hemenway/Calishain: Spidering Hacks. Pragmatics, so your corpus gets collected in finite time. "Perl examples easily translated to ruby/python" is all I'll say about that ;-|

- Google Hacks
- Berry and Browne, Understanding Search Engines (2nd ed.)
Excellent; 100-page intro to the mechanics: stemming/stopwords, tf-idf, QR & SVD in C.
Not discussed: -other matrix decompositions;
-all the wrappers around SVDPACKC:
- by Doug Rohde
- by Stanford Computational Semantics lab
- by UT-Austin
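For flavor, here's a toy power-iteration PageRank of the sort Chakrabarti covers. The 4-node graph is made up; real web graphs are sparse and huge, so you'd never use dense dicts like this.

```python
def pagerank(links, d=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [outlinks]}.
    d is the usual damping factor; ranks sum to 1."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = d * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:                      # dangling node: spread rank evenly
                for v in nodes:
                    new[v] += d * rank[u] / n
        rank = new
    return rank

g = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
r = pagerank(g)
print(sorted(r, key=r.get, reverse=True))   # "c" ranks highest
```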
Data Mining has hit the mass market: Borders misshelves Data Mining books with books for Oracle DBAs. A few bookstores stock a non-trivial number of the above books: Barnes & Noble NYC, Seminary Co-op Chicago, Powell's Portland, Stanford Bookstore. Cody's on Telegraph (Berkeley) *was* a wonderful bookstore. There must be others in LA/Pasadena and Seattle or Vancouver as well.

Browse Amazon and click on "Customers who bought this also bought". They're pretty good at clustering ;-}

Friday, June 16, 2006 to Knowledge base

Non-gentle (Stringent or Strident) Reminder: Read the "Terms of Service", "Conditions of Use", "Fair Use of our Site", whatever it's called, for any site you're thinking of scraping. Also the "robots.txt". Abide by them. We're talking about getting a handful of pages here, *not* about downloading dozens or more pages, let alone their whole database.

<aside> One of the ISI folks writes that this is an ontology waiting to happen. But how? Alright, so you reduce all of the software tags and tagged URLs to a term-doc matrix. This is the "n >> m" case discussed by Berry/Browne, pg. 30. Then do tf, idf, and normalize the columns (documents, or URLs in this case). SVD to a rank-200 approximation. </aside>
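That pipeline is a few lines of numpy. A toy sketch, assuming numpy is available: a 3x3 matrix of made-up counts stands in for a real n >> m sparse matrix, and rank 2 stands in for rank 200.

```python
import numpy as np

# Toy term-document matrix A (terms x rows, docs/URLs x columns).
# The counts are invented for illustration.
A = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 4.]])

n_terms, n_docs = A.shape
tf = A / A.sum(axis=0)                  # term frequency within each doc
df = (A > 0).sum(axis=1)                # document frequency of each term
idf = np.log(n_docs / df)
W = tf * idf[:, None]                   # tf-idf weighting
W = W / np.linalg.norm(W, axis=0)       # normalize columns (docs/URLs)

# Rank-k approximation via SVD (rank 200 in the post; 2 for this toy).
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 2
W_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
```

On a real tag corpus you'd use a sparse matrix and a partial SVD (that's what the SVDPACKC wrappers above are for), not a dense decomposition.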

Then we look at each tag's tokenization (stemming/stopword), orthographic, syntactic (part of speech / composite word), chunking, and gazetteer features. Viz., is this one of the 250 or 500 most common English words? Is it all caps, 3 words run together like German, a noun or adjective in WordNet? What % of tagged URLs does the tag occur in (unpleasant regex building/maintenance exercise to ensue...), and in what kind of chunks? At this point we have a big log-linear generative modelling exercise.
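A sketch of what those surface features might look like as code. The feature names and the ten-word stand-in for "the 250 or 500 most common English words" are illustrative, not a fixed inventory.

```python
# A rough stand-in for the 250-500 most common English words.
COMMON = {"the", "of", "and", "a", "to", "in", "is", "it", "on", "for"}

def tag_features(tag, tagged_urls, all_urls):
    """Cheap orthographic/distributional features for one tag.
    tagged_urls / all_urls are counts from the scraped corpus."""
    return {
        "is_common_word": tag.lower() in COMMON,
        "all_caps":       tag.isupper(),
        "has_digit":      any(ch.isdigit() for ch in tag),
        "camel_case":     tag != tag.lower() and tag != tag.upper(),
        "pct_urls":       100.0 * tagged_urls / all_urls,
    }

print(tag_features("RubyOnRails", tagged_urls=120, all_urls=4000))
```

The WordNet and chunking features would bolt on the same way, as more keys in the dict, before everything gets fed to the model.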

It may help to think in terms of Software Design Patterns: are these tags "is-a", "has-a", "has-many", "produces-a"; i.e., are they Adapters, Facades, Composition, Delegation, Factories, etc.? Facetmap is producing gazetteers, which is impressive if there's no human intervention, i.e. unsupervised or minimally supervised. But I think we really can think semantically rather than phonologically/syntactically.

Maybe that's how this big tag cloud became this small one. Swik's interesting, they have this huge tag cloud from somewhere, but not a whole lot of URLs that they present as their content.

If Bill Joy, Matz, Damien Conway, Guido, Audrey Tang, Tim Peters & Mauricio Fernandez told you what their user IDs were, what would you do with the info? If you had access to click-thru rates, you could calculate which tags, or users, had highest click thru rates.

One of my favorite strategies: ID users who are solid python developers and early taggers of rb/py URLs. dehora, rtomayko, rhymes, masterchef, brunns.

We have polysemy/synonymy. Rails tags are: rails, rubyonrails, RubyOnRails, etc. Sam Ruby's articles sometimes get tagged ruby.
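The synonymy half is partly mechanical. A sketch that case-folds and strips separators, plus a hand-made synonym table; the table entries are assumptions, not a real mapping.

```python
from collections import defaultdict

# Assumed synonym table; in practice this would be built by hand
# or learned from tag co-occurrence.
SYNONYMS = {"rubyonrails": "rails", "ror": "rails"}

def collapse_variants(tags):
    """Group tag spellings under a crude canonical key:
    case-fold, strip separators, then apply the synonym table."""
    groups = defaultdict(list)
    for t in tags:
        key = t.lower().replace("-", "").replace("_", "").replace(".", "")
        key = SYNONYMS.get(key, key)
        groups[key].append(t)
    return dict(groups)

print(collapse_variants(["rails", "RubyOnRails", "rubyonrails", "Ruby", "ruby"]))
```

Polysemy (Sam Ruby vs. the language) is the hard half; no amount of string munging fixes that, which is where the context features above come in.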

<aside> Another ISI guy's idea: Rexa could bring google-trends-type metrics to academia: "Seminal!" "Derivative!" "Debunked!" </aside>