/blog: information, data, and scholarship

Scientometrics 2.0

I’m excited that I’ve had two papers accepted this week: “Scientometrics 2.0: Toward new metrics of scholarly impact on the social Web,” with Brad Hemminger, and “How and why scholars cite on Twitter” (online soon) with Kaitlin Costello.

What’s special about these two papers is that they are the start of  a research project that I hope will become my dissertation, an idea I’m somewhat reluctantly calling “scientometrics 2.0.” (do we really need more 2.0s?) Scientometrics is

…the science of measuring and analysing science. In practice, scientometrics is often done using bibliometrics which is a measurement of the impact of (scientific) publications. (Wikipedia)

My idea is that we should be looking beyond this, and starting to mine Web 2.0 sources for signals of scholarly impact. There are a few big advantages to this approach:

  1. It’s much faster.  Once a scholarly article is published, it takes years for citations to that article to accumulate.  But it can take just days for, say, Diggs or tweets to show up: in our Twitter sample we found that nearly half the links to peer-reviewed articles appeared within a week of those articles’ publication.  This speed could be harnessed to make real-time, personal filters that inform scholars what’s groundbreaking across a broad set of fields. As the velocity and volume of science grow, this could be very valuable.
  2. If I cite something, it probably had an impact in my work.  But what kind of impact?  What if I read it and talked about it, and it informed my general thinking–but not enough to cite?  Just looking at citations, we’re missing many other kinds of impact.  Ten years ago, this was the best we could do.  But today, scholars are using online tools like CiteULike, Mendeley, and Zotero to manage their libraries; Faculty of 1000 to review articles;  and Twitter, FriendFeed, and ResearchBlogging.org to discuss them.  Tools like these–and importantly, the open APIs many of them offer–allow us to lift the curtain and observe scholars in their native habitat.  Scientometrics 2.0 offers a chance for us to develop a richer, more nuanced picture of scholarly impact.
  3. Finally, this approach allows us to break the centuries-old monopoly of the peer-reviewed article or monograph on scientific communication.  We can measure reactions not just to these articles, but also to blog posts, datasets, or videos.  If a certain blog post in your field is generating lots of buzz, there’s a good chance it’s worth your time.  Scientometrics 2.0 can support a sort of informal, “soft peer-review” that works for free, on everything.
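The speed claim in the first point comes down to a simple measurement: for each link to an article, how many days passed between publication and the link? Here is a minimal Python sketch of that calculation. The function name, the (published, linked) pair format, and the seven-day default are my own assumptions for illustration, not the method used in the papers.

```python
from datetime import date


def share_linked_within(pairs, days=7):
    """Return the fraction of links posted within `days` of publication.

    `pairs` is a list of (published, linked) date tuples; this input
    format is hypothetical and chosen only to keep the sketch small.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1
        for published, linked in pairs
        # Ignore links dated before publication (e.g. preprint chatter).
        if 0 <= (linked - published).days <= days
    )
    return hits / len(pairs)
```

Run over a real sample of tweets, a figure near 0.5 would correspond to the "nearly half within a week" finding.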

At first, this approach will mostly be used for relatively “pure” academic study–learning more about how scholars communicate and how impact is transmitted.  Soon, however, young scholars will start making a case to tenure and promotion committees that their heavily tweeted or bookmarked article should count in their favor. Ultimately, I think we’ll see tools that leverage this information to help direct scholars to the most important and relevant work for them, kind of a PostRank for academics.

Of course, there are some obstacles to this.  The most important one for now is getting people to trust that these alternative sources really mean anything.  Who cares if an article is tweeted a lot?  Won’t people game this?  What about scholars who don’t use social media (a majority, for now)?  These questions have answers, but they need to be taken seriously (see the articles for more detailed discussions).

Ultimately, scientometrics 2.0 is going to have to be something we investigate very carefully, and in the proper context.  However, in that context I think it has the potential to be quite valuable, and I’m excited about working toward this in the next several years.

(Note: for a bunch of relevant citations, see the first article.)

Markup languages: who’s who?

markup languages timeline

Is HTML XML?  This question came up in a conversation with Sarah and @k8lin, and ended up being harder than I thought it’d be.  There seems to be a fair amount of confusion on the topic, especially given the W3C’s recent abandonment of XHTML 2.0 and growing use of HTML5.

So, I decided to lay it all out in a (relatively) simple timeline format; as far as I know, this doesn’t exist anywhere else.  You’re welcome, The Internet.  Below are my sources and some notes; where possible, links are to the original recommendations or RFCs:

SGML is an ISO standard from the 80’s.  Unlike the other standards on this list, it’s not open (the ISO sells copies for >$200).  HTML is an “SGML application”, and has been from the beginning. The Wikipedia article has a lot more information on its origins, as does the W3C.

HTML 2.0 (published by the IETF as an RFC) and HTML 3.2 (the first W3C Recommendation) are both pretty straightforward. Also straightforward is XML, which dropped in February 1998. Like HTML, XML is “an application profile…of SGML.”

In December 1999, the HTML 4.01 recommendation came out, followed a month later by XHTML 1.0.  The important thing to note is that both of these are still HTML 4; however, XHTML is “a reformulation of HTML 4 as an XML 1.0 application,” while HTML 4.01 is still plain ol’ SGML.
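One practical way to see the difference between the two: an XML parser insists on well-formed markup, while HTML tooling tolerates things like an unclosed <br>. The Python sketch below uses only the standard library. The two sample fragments and the helper names are mine, and html.parser is a lenient HTML tokenizer rather than a real SGML parser, so treat this as an illustration and not a conformance test.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative fragments: the same content in HTML 4 style and XHTML style.
HTML_STYLE = "<p>Line one<br>Line two</p>"
XHTML_STYLE = "<p>Line one<br />Line two</p>"


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record the start tags a lenient HTML tokenizer sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(markup):
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

The XML parser rejects the HTML-style fragment because <br> is never closed, while the HTML tokenizer reports the same tags for both fragments.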

No one knows yet exactly what HTML5 is going to look like, as it’s still several years off.  However, the W3C tells us that HTML5, like HTML 4, is going to have two different “serializations.”  One will be an XML syntax, and is currently being called XHTML 5 (wait, why not “XHTML 2?”  Hang on, we’ll get there). You might expect that the other serialization would be SGML a la HTML 4.01.  You’d be wrong.

Although HTML is technically SGML, most browsers and authoring tools couldn’t care less about the broader SGML standard; they just implement HTML.  So the W3C’s plan seems to be to ditch the SGML legacy and replace it with “html” (note the lowercase), an entirely new standard…which happens to look pretty much like HTML has always looked.

Whew, we’re almost done.  OK, what about XHTML 2?  Despite the name, the project was not a “next step;” it was a huge break with the whole HTML/XHTML tradition, an effort to completely remake web markup.  In July, the W3C decided to let it die on the vine and focus on HTML5.  So XHTML 5, with its HTML lineage, will be a more incremental change than XHTML 2 would’ve been.

There you have it.  If I missed anything or got something turned around, let me know.

Portrait of the artist as a phrenology illustration

An assignment in my infoVis class: self-portrait as a phrenology illustration

The first assignment in my infoVis class was to make a visual introduction to ourselves.  I drew a self-portrait in profile, then added my categorized interests in the style of a 19th-century phrenology illustration (compare with actual period illustrations here and here).

Phrenology is interesting stuff.  Though phrenologists had nearly everything wrong, modern neuroimaging has demonstrated that one  important part of their core idea was right: many psychological functions really are highly localized in the brain.  And they made a lot of really cool infographics.  Actually, maybe this is pseudoscience in general; palmistry and astrology also make silly data into some neat-looking infovis.  This same exercise would be fun with made-up star charts and palm diagrams.

$35 homemade whiteboard coffee table

Whiteboards are great infovis tools, but they’re expensive and need space.  Solution: the whiteboard coffee table.  It’s the very poor man’s Microsoft Surface (with no BSOD!).  Also, if your taste in home decor tends toward the spartan (as does mine), this makes a great dinner table; it’s durable and really easy to clean.  Most importantly, it’s cheap and you only need a drill and a few hours to make it.  Here’s how:

Materials:

  • Some 1×2 boards (you can get pre-sanded ones for about $2 apiece)
  • A panel of “tile board,” which you can get from Home Depot or whatever for about 10 bucks.
  • some 3″ drywall screws
  • some 1 1/2″ drywall screws
  • wood glue

Tools:

  • Drill with a screwdriver bit
  • handsaw (may need it, may not; see below)
  • tablesaw or circular saw to cut the tileboard (may need it, may not; see below)

coffee table copy

Construction:

  1. Decide on the dimensions you want, and figure how many 1×2’s you need (see the diagram above for the general plan).  You may need to be flexible here, depending on the size of tile board panel you’re able to procure.
  2. Get the materials.  If you ask nice, a lot of times the store will cut the tile board for you, or they may have a 2′ x 4′  piece available.  You can probably get them to cut the 1 x 2’s for you, as well.
  3. Once you get the materials home, cut anything that still needs cuttin’.
  4. Fasten everything together with the appropriate-sized drywall screws (The diagram shows where they go).  I added glue, but you don’t really need it.  Once the frame is done, glue the top on. Done!

Use Zotero in a separate window

zotero-two-screens1

As I’ve written before, I love the free citation manager Zotero.   And the group and sharing features that just dropped as part of v2.0b7, while still a little buggy, are taking the awesomeness up another level.

But one thing about Zotero has always really annoyed me: the horizontally split screen.  I never feel like I have enough vertical context for either my Zotero library or the web page I’m viewing.   Meanwhile, I’ve got a whole ‘nother monitor just sitting there empty. Some other folks have complained about this too, suggesting a sidebar view for Zotero.

Today, though, I realized that there’s a really obvious solution: just open up a new Firefox window (ctrl+n), put it on my other monitor, and display Zotero full-screen there.  Dual-monitor workflow bliss.

Obfuscate no more: why your email address should go au naturale

screenshot of the obfuscation decoder demo

I was recently redesigning my homepage, and I wanted to include my email address.  I knew that only n00b looz3rz display their addy in plain sight for spambots to harvest, so I applied a little light obfuscation, like they do on php.net and a million other sites: “myname at jasonpriem dot com.”

“Take that, spammer scum!” I thought as I finished, basking in my newfound invulnerability to the v1@gr@-hawking vermin.  After all, if lots of people use address munging, it must work, right?

Right?

Darn it, now I’ve got to start reading about it.  So I did.  And after a few hours of reading blogs and writing code, I am now an Expert With Advice (hey, this is the internet).  And the advice is this:

Stop trying to obfuscate your email address.  Stop now.

I’ve got two reasons (and for a few more, some other folks have blogged about this, too).  First, the more theoretical one:

Spam is a problem for you–obfuscation makes it a problem for your users.

After all, they’re the ones who are going to have to do all the de-munging.  Are they always going to notice that they have to remove “.invalid” from the end?  Do they all know that the English “at” means “@”?   Do they have time to edit text in their address lines?   Address munging is fundamentally inelegant, because it intentionally works against clarity.

People have been making this argument for a very long time. It’s particularly relevant nowadays, though, because of the growing promise of the semantic web.  We want data to be machine readable, because then we can do cool stuff with it.  FOAF and the hCard microformat are pretty pointless if they don’t have real email addresses to work with.  “Hide the data from the machines” is a good strategy for fighting Skynet, but not for the future of the web.  Ok, reason two:

Address munging just doesn’t work.

It can’t.  It’s putting glasses on Superman.  Although in theory a valid email can be pretty hard to identify, in practice, email addresses use a very limited vocabulary–and computers are good at identifying limited vocabularies.  Don’t forget, everyone has been using the same old [at] and “dot” tricks for decades–this is security through obscurity at its very worst.

But don’t take my word for it.  I took a couple hours and worked up a demo email obfuscation decoder that breaks the vast majority of text-based obfuscations; it’s also got an input field for you to test out your own munges (some other people have built similar demos, too).  It’s not perfect, but it correctly decodes most obfuscations–and remember that this is a novice programmer, working for an afternoon.  It’s that easy. Supporters of obfuscation argue that spammers will go after the low-hanging fruit; folks, text-based obfuscation is the low-hanging fruit.
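To give a sense of how little code this takes, here is a minimal Python sketch of a text-based de-munger. It is not the code behind the demo. The patterns, the function name, and the handful of munges it recognizes are assumptions picked for illustration, and a real harvester would cover many more variants.

```python
import re

# Common textual stand-ins for "@" and "."; illustrative, not exhaustive.
AT_PATTERN = re.compile(r"\s*(?:\[at\]|\(at\)|\{at\}|\bat\b)\s*", re.IGNORECASE)
DOT_PATTERN = re.compile(r"\s*(?:\[dot\]|\(dot\)|\{dot\}|\bdot\b)\s*", re.IGNORECASE)

# A deliberately loose idea of what an address looks like.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def demunge(text):
    """Return the email addresses recoverable from lightly obfuscated text."""
    candidate = AT_PATTERN.sub("@", text)
    candidate = DOT_PATTERN.sub(".", candidate)
    addresses = EMAIL_PATTERN.findall(candidate)
    # Drop the ".invalid" suffix some people append by hand.
    return [re.sub(r"\.invalid$", "", a, flags=re.IGNORECASE) for a in addresses]
```

Feeding it "myname at example dot com" or "myname [at] example (dot) com" gives back a plain address. Ordinary prose with a stray "at" mostly falls through, because what follows rarely looks like a domain.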

Now, the Alert Reader has by this time noticed that I’ve limited my critique to text-based munging.  “What about more sophisticated methods,” the Alert Reader now asks?  “What about using an image, or CSS, or Javascript to hide addresses?”  Good questions, Alert Reader; you are very alert.  Alright, let’s take a quick look at these, too:

Images

There’s not really much I can say about this one, save this: making content completely opaque to visually-impaired users simply shouldn’t be an option. And of course, spammers still can OCR your images.

CSS

Obviously, something like foo@bar<span style="display:none">NULL</span>.com is silly; the spambot can filter out “display:none” spans pretty easily, or even just discard everything in a span.  <span class='a'>foo</span><span class='b'>bar</span>@<span class='c'>foo</span><span class='d'>bar</span>.com at least requires the bot to open your stylesheet to see which spans are hidden.  But remember, your server will happily dish out your easily-parsed CSS to anyone who asks for it; this is not a good place to hide secrets.
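Here is roughly what "filter out the hidden spans" could look like from the bot’s side, as a small Python sketch using the standard library’s HTML parser. The class and function names are mine. It only handles inline display:none styles (not the stylesheet case), and it assumes hidden elements are properly closed and contain no unclosed void tags like <br>.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text, skipping anything inside an inline display:none element."""

    def __init__(self):
        super().__init__()
        self.hidden_depth = 0  # how many hidden elements we are nested inside
        self.parts = []

    def handle_starttag(self, tag, attrs):
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if self.hidden_depth or "display:none" in style:
            self.hidden_depth += 1

    def handle_endtag(self, tag):
        if self.hidden_depth:
            self.hidden_depth -= 1

    def handle_data(self, data):
        if not self.hidden_depth:
            self.parts.append(data)


def visible_text(markup):
    parser = VisibleText()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)
```

Run against the first munge above, this drops the hidden NULL and leaves the bare address. The class-based version takes one more step (fetching and reading the stylesheet), but the idea is the same.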

Javascript

There are too many js methods to cover in any detail here.  Some are better than others; a few try to degrade gracefully for users without Javascript support.  All of them, though, share the same weakness as CSS: everyone can read your Javascript.  And you certainly don’t need a browser to run it; there are lots of JS interpreters that are more than happy to run on a spammer’s server.

Sure, you can get pretty clever with this technique (I particularly like the idea of decoding not on the onload event, but on a click event), but you can’t change the fact that ultimately the bad guys can do everything with your code that a browser does–and eventually, they will.

Now, I’ll admit that images, CSS, and Javascript approaches are more effective than text-based ones.  All of them (when done properly) require the spammer to pay for more bandwidth and/or processor cycles.  But they all also inconvenience some or all of your users, and none of them are compatible with the semantic web.  They all give you a false sense of security, and they’re ugly, hackish solutions. True, some obfuscations have performed well empirically–but keep in mind that these (pretty informal) experiments are years old.  As more people have adopted these measures, be sure that more spammers are spending the time to counter them, as well.

Now, I can’t go so far as to condemn anyone who obfuscates an address; I get that spam is a pain, and filters aren’t perfect.  Sometimes an ugly, hackish solution is the only way.  But I’m suggesting that you think twice before you give in to the spammers and obfuscate, especially given the relative ineffectiveness of many commonly-used methods.  The Web reaches its full promise when information is made easier to find, not harder.

Prezi: presentation junk 2.0

prezi logo

It’s 2009.  I think everyone out there knows that Powerpoint is, at best, overused (at worst: Stalin).  Particularly gruesome is the animated slide-transition “feature,” which I think most agree has the same communication effectiveness and subtle charm as “<blink>” tags, mouse-cursor trails, and hilarious animated gifs of cats.

So how is it that presentation tool Prezi is suddenly the toast of the town?  The quick sell looks like this:

“Prezi allows anyone who can sketch an idea on a napkin to create and perform stunning non-linear presentations with relations, zooming into details, and adjusting to the time left without the need to skip slides.”

I love how the first phrase suggests that there’s this great mass of napkin-sketching geniuses out there who can’t get their ideas out (until now!).  I mean, I like mind maps, but turning one into an outline is pretty easy.   So the presentations are “non-linear.”  Does that mean the audience can interact with them, zooming in on sub-points of interest?  If it does, let me show you this thing called “hyperlinks.”   And is skipping slides really this tremendous problem?

When it comes down to it, the real selling point of Prezi is just the “stunning” presentation.  Now, perhaps I’m jaded, but “zoom-in/zoom-out” leaves me unstunned.  More importantly, though, this seems a textbook example of chartjunk: a “really great” visual effect that serves only to obscure or distract from real information.  I think (hope) it’ll have the lasting appeal of Powerpoint’s racecar-noise-with-flying-in-bullet-point.

Perhaps I’m missing something (feel free to correct me in the comments) or just being curmudgeonly, but I think Prezi is vastly overhyped.  Powerpoint is bad enough.  Also: I like how the Prezi logo, by mixing case, suggests that the product may in fact be called “Pretzl.”  Ok, now that’s definitely being curmudgeonly.

Quick book review: Dreaming in Code

I imagine Scott Rosenberg reckoned he’d picked a winner when he started Dreaming in Code, his 2007 book chronicling the development of the Chandler personal information manager. The project seemed to have everything going for it. It had all the fashionable features: GTD! Open Source! Peer-to-peer! Level the silos! It was headed by software legend Mitch Kapor. It had infinite funding. It had talented programmers with impeccable resumes—decades upon decades of successful experience creating good software.

Over the course of Dreaming, though, we see this elite team gradually self-destruct. We see vague specs. We see unrealistic deadlines. We see huge mid-stream course changes.  As Rosenberg writes, “By now, I know, any software developer reading this volume has likely thrown it across the room in despair, thinking, ‘Stop the madness! They’re making every mistake in the book!’”  Dreaming finally ends four years into Chandler’s development, with version 1.0 still a distant vision (it was finally released, mostly to yawns, last August).

Rosenberg, though, is savvy enough to turn the Chandler team’s failure into his own success.  Not only does he use the story to anchor an excellent (if basic) introduction into the practices and quirks of the industry as a whole, he weaves an engrossing and deeply human narrative.

Aristotle said tragedy should evoke fear and pity in the viewer, and Rosenberg deftly supplies us with both. On the one hand, Dreaming reads like watching a horror movie: “No! Why are you splitting up to explore the house!? Why do you keep changing the UI every 6 months!? Noooo!!!!” At the same time, Rosenberg does a pretty good job of making us really like many of the characters. Kapor, in particular, comes off as both an intelligent visionary and a genuinely good guy. Watching Chandler implode, I feel bad for him.

In interviews, Rosenberg shows again and again how the characters, all experienced programmers, understand the Classic Mistakes. Then he describes with agonizing clarity how they turn right around and proceed to make just those mistakes. I think it’s this quality that put me so in mind of classical tragedy, where the noble hero is undone by just these sorts of tragic flaws or mistakes.

Rosenberg resists the temptation to write another Lessons From Software Failure manual.  Instead he shows how smart, capable programmers working in an ideal environment can reenact the same fatal mistakes programmers were cataloging decades ago. Like Greek drama, Dreaming confronts the ineluctability of failure head-on.  Rosenberg’s ultimate thesis is nothing more or less than the classic words of Donald Knuth, with which he opens the book: Software is hard. Sophocles would be proud.

Other reviews I liked:

  • Amazon
  • Joel Spolsky: discusses the technical aspects more; doesn’t think Chandler was a very good idea to begin with.  Has some good points, here.
  • Adam Barr: discusses the individual parts of the book more.

FeedVis 2.0: custom visualization for your feeds

this is what feedvis looks like

My FeedVis project–the interactive tagcloud for a group of feeds–has been out for a week now, and I’ve been thrilled at the positive response I’ve gotten so far.  One rather glaring problem with the program, though, was that you could only look at the top 50 edublogs.

Not anymore.  After a few late nights, I’ve got a beta system for uploading and analyzing your own sets of feeds.  You just upload your OPML file, wait a few minutes, and you’re set: FeedVis gives you a custom page that you can bookmark and return to anytime you like; it’ll continue to update every time you visit.  You can also browse visualizations of other people’s feeds.
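If you’re curious what an upload like this involves, the first step is just pulling the feed URLs out of the file. FeedVis itself is written in PHP and JavaScript, so the Python below is not its code. It’s a minimal sketch that assumes the usual subscription-list layout, where each feed is an outline element carrying an xmlUrl attribute, and the function name is made up.

```python
import xml.etree.ElementTree as ET


def feed_urls_from_opml(opml_text):
    """Return the feed URLs listed in an OPML subscription list.

    Outlines without an xmlUrl attribute (e.g. folders) are skipped.
    """
    root = ET.fromstring(opml_text)
    return [
        outline.attrib["xmlUrl"]
        for outline in root.iter("outline")
        if "xmlUrl" in outline.attrib
    ]
```

Each URL can then be fetched and parsed on whatever schedule the visualization needs.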

It’s pretty untested, and I’m sure use will uncover some bugs.  But it’s got potential; I’m excited to see what people think.

FeedVis: a deeper tagcloud for edublogs

a screenshot of feedvis

Tagclouds have value, but, as I’ve written before, they’ve a number of shortfalls as well.  I’ve just finished my attempt to remedy some of these problems: FeedVis.  It’s an animated tagcloud that lets you compare word frequencies across different time periods and authors, then check out the posts that used the words.  The demo is using the feeds for Scott McLeod’s Technorati-compiled list of top 50 edublogs, since that’s what got me started about feeds and tagclouds in the first place (although the program will work with any set of feeds).  More details about how it works are on the demo page.
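The core comparison is easy to sketch: count words in two samples of posts, normalize, and look at what changed. The Python below shows the general idea and is not FeedVis’s actual code. The stopword list is a tiny placeholder, and the function names and the "difference in relative frequency" measure are my own choices for the example.

```python
import re
from collections import Counter

# A tiny placeholder stopword list; a real one would be much longer.
STOPWORDS = frozenset({"the", "a", "an", "and", "of", "to", "in", "is", "it", "for"})


def word_frequencies(posts):
    """Map each non-stopword to its share of all counted words in `posts`."""
    words = [
        word
        for post in posts
        for word in re.findall(r"[a-z']+", post.lower())
        if word not in STOPWORDS
    ]
    if not words:
        return {}
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def frequency_shift(earlier_posts, later_posts):
    """Return the change in relative frequency for every word in either sample."""
    before = word_frequencies(earlier_posts)
    after = word_frequencies(later_posts)
    return {
        word: after.get(word, 0.0) - before.get(word, 0.0)
        for word in set(before) | set(after)
    }
```

Positive values are words on the rise in the later sample and negative values are words fading out, which is the kind of change the animation makes visible.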

I think what I’m really most excited about is the way this uses animation to let you actually see the words changing from one sample to the next.  Motion is such an important part of the way we see the world, and it’s been underemployed in information visualization, I think (although this is changing; Hans Rosling’s TED talks have gotten a lot of buzz, for instance).

The project has been really fun, and a great learning experience; it’s gotten me really pumped about infoVis for learning about online interaction.  I think there is a lot of potential there for ed tech research.  I’m also pretty excited about programming; I started learning in February (with php), and then started javascript a couple months ago.  It’s been a really mind-expanding experience, and I’m looking forward to my next project, probably once I get done with grad school apps.