Links for 2010-02-19 [del.icio.us]

What Perl 5’s Version Numbers Mean

Perl 5.11.5 comes out tomorrow and Perl 5.12 should be out soon. (Much credit goes to people such as Jesse Vincent and David Golden, to name two, for getting Perl 5 on a regular release cycle.) I've long promised to write about the Perl 5 support and deprecation policy and how that affects users.

Perl 5.10.1 was, by definition, a minor revision. Perl 5.12 is a major revision. The nominal difference is which component of the version number increases. By intent, users of 5.10 (actually 5.10.0, but often abbreviated) should be able to upgrade that installation in place to any subsequent minor release in the 5.10 family. The upgrade isn't always completely transparent, but the intent is that, modulo bugfixes, it should be.

When 5.10.0 came out, work started on a new Perl 5 release family called 5.11 (that's not entirely true, but it's sufficiently true for this explanation). This is the unstable series intended for development and testing which will become 5.12 in the next couple of months. You are welcome to download, configure, build, test, and even install 5.11, but you should be comfortable without support from p5p for upgrades and changes.

The monthly releases in the 5.11 (and soon, the 5.13) series represent points of stability and review so that the Perl 5 developers can concentrate on the quality of what will become 5.12.0.
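
To make the numbering convention concrete, here is a minimal sketch that classifies a Perl 5 version string by its middle component. The helper name describe_perl_version is hypothetical, and the rule it encodes (odd middle number means a development series, even means a stable family) is only the convention described above for 5.10 through 5.13, not anything shipped with Perl itself.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a "5.x.y" version string.
# Assumes the convention described above: an odd middle number
# (5.11, 5.13) marks a development series, an even one (5.10, 5.12)
# marks a stable release family.
sub describe_perl_version {
    my ($version) = @_;

    my ($revision, $family, $release) =
        $version =~ /\A(\d+)\.(\d+)(?:\.(\d+))?\z/
        or die "Unrecognized version string: $version\n";

    # "5.10" is the common abbreviation of "5.10.0".
    $release = 0 unless defined $release;

    my $series = $family % 2 ? 'development' : 'stable';

    return "$revision.$family.$release is a $series release "
         . "in the $revision.$family family";
}

print describe_perl_version($_), "\n" for qw( 5.10.1 5.11.5 5.12 );
```

The interesting part is that only the middle number decides the kind of series; the last number tells you how far along that series is.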

When 5.12.0 comes out, you will notice changes from 5.10.0 in terms of new features, removed features, and upgrades to the standard library. While most code should work unmodified with 5.12.0 as it did with 5.10.0, some modules will need updates. You will likely also have to recompile any modules with XS components.

In subsequent entries, I'll write more about the implications of all of this, when you should upgrade, how deprecations and changes work, and the binary compatibility policies of Perl 5.

Why SDL Perl Matters

I read a book proposal years ago on the subject of teaching kids to program with C++. "After a week," it said, "children will know enough to create their own simple text games and animations."

I was perhaps six years old when I saw my first minicomputer. I flipped open the first page of the manual and typed in the lines verbatim—except I left off the line numbers, likely thinking that they were merely a convenience for readers. Perhaps I've had good taste from the beginning.

My typing skills were, as you might expect, abysmal. Even so, I had feedback from the computer within fifteen minutes or less. If I'd had to spend a week learning things to move characters around on the screen, I'd have given up.

I like games. I enjoy thinking about how they work. I like writing stories. I play games. The mechanics of rules and balance and design and enjoyment and player participation and perception are fascinating. Even more important is the idea that games can have a didactic purpose.

I spent a lot of time in my childhood years playing games but also breaking games. A bit of work with a hex editor could give my party more experience points so that two or three well-placed fireball spells would clean out the kobold lair. (Any role-playing system which starts magic users with four hit points won't have them surviving the tetanus shot before they get their passports.)

Because I could only get time on computers at school if they had an educational purpose, I taught myself how to write programs so I could write games. I don't suggest my experience is representative of all children, but it's not so far different from that of many of my friends.

A few years ago, I tried to help revive SDL Perl when the maintainer retired. The experience was difficult; it's a big wad of XS code that needs plenty of probing and configuration for a handful of somewhat-optional libraries. I don't even want to think about everything required to detect which version of OpenGL you have installed and available in a cross-platform fashion.

Fortunately, Kartik Thakore is everyone's hero (and plenty of other people are helping too).

I've heard the arguments that "Kids these days are too busy texting each other!" or "It's okay that kids make YouTube mashups of pop songs and clips of their favorite anime characters, that's creativity!" and "You can teach a kid PHP and HTML and call him a programmer, and that's super fun!" I don't believe any of them.

I think instead that you can plop your smart seven year old in front of a real computer with a real keyboard and show her that typing something makes a picture appear and typing something else makes it move and give her a few other commands and boom she'll play with that for a while. Not everyone's suited to the deep, dark logic of understanding the bindings from a high level language to a shared library and memory management techniques thereof, but what a privilege to teach a younger generation that a computer isn't merely an appliance to read Wikipedia and text their friends, but a general purpose device they can control.

Show a few of them how to make pretty graphics move around on screen per their command—per textual instructions they have to reason about and maintain themselves—and you just might have something. Sure, Pygame and Pyglet are great. I've used them productively. Even so, more options for free software and free environments can only help.

CPAN Testers 2.0 mid-February update

February is a short month and the last couple weeks have flown by. Since my last update, there have been a couple of significant milestones that bring us closer to CT2.0:

  • I successfully created a test Metabase backed by Amazon S3+SimpleDB and have created, retrieved and deleted some user profile facts. This built on some of the earlier work by Leon, but reflected the work I’ve been doing to revise the guts of the Metabase libraries.
  • I wrote a program that converts a given NNTP report into a CPAN::Testers::Report object (which is itself a Metabase Fact subclass.) This wound up being trickier than I expected due to potential for ambiguous mapping of distribution names to actual distribution files. (Thank you to Tokuhirom, Andreas, Takesako and Offer Kaye for reviewing and fixing my exception list.)
  • I generated 900+ Metabase user profiles for all the known CPAN Testers, which will be used to link testers with their reports during the conversion process. We still need to work out some logistics, including a way to distribute these profiles to testers so that new records will link up, but this was a necessary step to prepare for conversion.
  • I’ve started configuring Amazon EC2 instances: one for the parallel conversion of reports and one for the CT2.0 server itself. I’m still climbing the learning curve on EC2, but see no obstacles (except time).

My immediate goals are (1) to get a test CT2.0 server up and receiving reports so we can start testing it and (2) prep and run the parallel conversion process. I’ve already gotten some help on the #catalyst channel for #1 (thank you, hobbs) and #2 is now mostly a matter of programming it up now that all the foundational prep work is done.

Hitting the March 1 deadline is going to be tight. We’re still holding at about 2 weeks behind the original estimate (where we’ve been since mid-January). But I think progress is coming fast and furious and I hope to get a solid beta launch by March 1 and negotiate with the Perl NOC for an orderly transition as we ramp up the new service.

A Decade of Lexical Filehandles

Perl 5.6.0 is almost a decade old; perldoc perlhist gives a release date of 22 March 2000.

My favorite feature of Perl 5.6.0 is lexical filehandles. Instead of having to access the IO slot of package global typeglobs, I could use lexical variables to contain filehandles -- without having to muck about localizing symbol tables or worrying about action at a distance or lifetimes of global symbols.

Yet to this day, almost a decade later, I still see the old way, with all of its disadvantages, in new code. (Tell the truth: do you understand every word of "the IO slot of package global typeglobs"? Do you want to explain that to novice programmers?)

Perl 5.6.2 is long dead. Perl 5.8.9 is the last of its release series too. The argument for running new code on old installations of Perl 5 is awfully thin, in that light.

Likewise I can't make a simplicity argument for the old approach. Making old-style filehandles work like people might expect is anything but simple. Throw in a local here or there and the typeglob sigil and maybe a gensym() call for good measure. Fun!
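
For comparison, here is a minimal sketch of both styles side by side. The scratch file, the LOG handle name, and the read_lines helper are made up for the example; the point is only the difference in scope between a bareword handle and a lexical one.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Read every line of a file through a lexical filehandle.
sub read_lines {
    my ($path) = @_;

    open my $fh, '<', $path or die "Cannot read $path: $!";
    chomp( my @lines = <$fh> );

    return @lines;    # $fh goes out of scope here, which closes the file
}

# Scratch file so the sketch is self-contained.
my ($tmp, $path) = tempfile( UNLINK => 1 );
close $tmp or die "Cannot close $path: $!";

# Old style: LOG is a package global, shared with (and clobberable by)
# any other code in this package that also says LOG.
open LOG, '>', $path or die "Cannot write $path: $!";
print LOG "written through a bareword handle\n";
close LOG or die "Cannot close $path: $!";

# Lexical style: $out exists only inside this block.
{
    open my $out, '>>', $path or die "Cannot append to $path: $!";
    print {$out} "written through a lexical handle\n";
    close $out or die "Cannot close $path: $!";
}

print "$_\n" for read_lines($path);
```

The lexical version needs no local, no typeglob sigil, and no gensym() to stay out of the way of other code.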

Reasonable people differ on style and technique, but I wonder what makes a feature such as pseudohashes or 5.005-style threads so hated that it eventually gets deleted, while difficult-to-use-correctly features superseded by better replacements stick around far longer than necessary. My guess is that the Perl 5 world suffers here, as usual, from a questionable abundance of old code, old tutorials, old books, and copy and paste coding from ancient sources of dubious wisdom. (This probably means I should submit patches to perldoc perluniintro and other offenders in the core documentation.)

Perhaps it's time to consider a gradual, intentional, well-tested and well-reviewed campaign to update tutorials and example code with somewhat more modern examples of maintainable Perl.

(For fun, imagine a world where the canonical printed Perl 5 reference covered a version of Perl 5 released this millennium. Then again, Perl.com thinks that 5.6.2 is "the previous version of Perl" 5.)

Chunking, Subtlety, and Whitespace

I delayed writing about references in Perl 5 in the Modern Perl book for a long time. References in Perl 5 are useful. They have their warts. They're not as difficult as most people believe, however. Novices have trouble learning how to use references effectively because most tutorials and introductions explain them poorly.

I had to think about explanations for a long time before I found a way to explain them well.

Of course, the syntax for dereferencing gets complex very quickly—but it's also an effective example of what I've been discussing this week. Perl has a handful of subtle design consistencies that, if you understand them, help you read and skim code very effectively. If you don't learn them, you'll get lost in a sea of punctuation soup.

Consider an array reference $monkeys_ref. You can get the number of monkeys by evaluating that reference as an array in scalar context in one of two ways:

# the short way
my $count = @$monkeys_ref;

# the disambiguatey way
my $count = @{ $monkeys_ref };

The former way is shorter and more idiomatic. Anyone familiar with Perl 5 references should understand what the additional sigil means ("I want a list from the following reference"). The latter syntax has the same effect, but it means instead "I want a list coerced from the expression evaluated within this block." The difference is subtle and you don't have to understand the subtleties for this example.

Trouble arrives when you deal with nested data structures or more complex expressions, such as slices:

# the short way
my $monkeys = join ',', @$monkeys_ref[@indices];

# the clearer way
my $monkeys = join ',', @{ $monkeys_ref }[@indices];

The first expression is somewhat more difficult to parse; which takes precedence, the indexing operation represented by the square brackets or the dereferencing operation indicated by the leading sigil? The second expression works because the intended order of operation is clear, at least to anyone who understands how curly-brace grouping works with complex references.

The whitespace is unnecessary, of course, but I find that it adds clarity.
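
If you want to convince yourself that the short and block forms really are equivalent, a small runnable sketch helps. The monkey names and the helper subs here are invented for illustration.

```perl
use strict;
use warnings;

# Count elements: the referenced array evaluated in scalar context.
sub count_short { my ($ref) = @_; my $count = @$ref;      return $count }
sub count_block { my ($ref) = @_; my $count = @{ $ref };  return $count }

# Slice elements: the sigil applies to the reference first, then the
# subscript selects from the referenced array.
sub pick_short { my ($ref, @idx) = @_; return join ',', @$ref[@idx] }
sub pick_block { my ($ref, @idx) = @_; return join ',', @{ $ref }[@idx] }

# Sample data, made up for this example.
my $monkeys_ref = [ qw( capuchin howler macaque tamarin ) ];
my @indices     = ( 0, 2 );

print count_short($monkeys_ref), ' ', count_block($monkeys_ref), "\n";
print pick_short($monkeys_ref, @indices), "\n";
print pick_block($monkeys_ref, @indices), "\n";
```

Both count subs return 4 and both pick subs return the same two names; the braces change how easily a reader finds the chunk boundaries, not what Perl does.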

A little bit of disambiguation isn't necessary to help the Perl 5 parser in this case, but it does help the reader. Students of compiler design might argue that nested expressions this complex belong on separate lines. I can imagine how this would read in a pseudo assembly language (I work on Parrot, after all). There's definitely a balance between the complexity of nested expressions and dereferencing... but this is a place where I consider the idiomatic use of Perl 5 sufficiently expressive that spreading the list slice out over multiple lines would obfuscate the intent of the code.

Certainly it's possible to perform even more complex dereferences of data structures, but when it's difficult to identify individual chunks of the desired behavior, it's time to simplify the code or the expression or the design. Even still, readability of this code should not depend on the desire to avoid teaching novices about references.

Chunks and Syntax Highlighting

If I'm right—if reading source code requires identifying parts of speech—then familiarity with syntax and grammar is important to programming as an adept.

Consider Damian Conway's SelfGOL. As an experienced Perl programmer, I can pick out various pieces of the code at a glance. There's an assignment. There's quoting. That's a variable. That's a list slice.

If you've never encountered Perl before (or programming in general), you might recognize some English words, such as print and die, and that's all.

One of Perl's design ideas borrowed from linguistics is that "different things should look different". To novices, everything looks different. $name isn't obviously a single chunk. It's an English identifier and one of several punctuation symbols apparently sprinkled at random throughout the program.

Good use of whitespace helps. So does the good use of parentheses as grouping constructs (though as in prose, they often get overused by novices).

One of the most subtle mechanisms for identifying individual chunks floating in a sea of code is syntax highlighting. I can't prove this. I haven't studied it in repeatable situations. Even so, I hypothesize that (modulo color choice concerns) merely highlighting different types of terms in the grammar in different ways will help novices understand how to pick out individual chunks in code.

This requires training. This demands practice. Unless you spend time reading code, you won't understand how expressions fit together, and you have little hope of understanding code. I believe it's impossible to skip this step, and thus I don't care if someone who's used C or ML has trouble reading Perl 5 code. Of course people have trouble reading when they don't know the grammar.

(Don't worry, Lisp fans. Homoiconicity—apart from additional complexity of quoting forms and reader macros—means that novices have to spend their time learning to recognize idioms and abstractions at a level higher than tokens and chunks without the benefit of patterns of chunk types as mnemonics to idioms. Then again, I think in patterns, rarely words.)

Can you help identify ambiguous CPAN distributions?

Hello, Perl community. As I work on converting legacy CPAN Testers (CT1.0) reports to the new CPAN Testers 2.0 (CT2.0) format, I’ve encountered a curious conundrum and could use some volunteer help.

CT1.0 indexes reports based on the distribution name and version, e.g. “Foo-Bar-1.23”. This is an unfortunate historical accident, since PAUSE does not prevent uploads with the same file name to different author directories:

  • JDOE/Foo-Bar-1.23.tar.gz
  • JQPUBLIC/Foo-Bar-1.23.tar.gz

CT2.0 will index reports based on the full unique distribution file path. I’m currently working on a heuristic to link any given legacy test report (on “Foo-Bar-1.23”) with the correct distribution file path for that distribution name and version for the conversion to CT2.0.

For the most part, it works. Usually, there is only one distribution file path on BackPAN that matches. Sometimes there is more than one possibility, but I’ve worked out ways to resolve the ambiguity by comparing the possibilities to information in the 01mailrc files or the 02packages.details file.

But there are about 50 distribution name-version pairs on BackPAN that my heuristic fails to resolve. Since this is a one-time conversion from CT1.0 to CT2.0, all I need is a mapping file with entries like this for these ambiguous cases:

    YAML-0.39    INGY/YAML-0.39.tar.gz

If you think you can help, either through some automated approach or just by volunteering your human brain to do some basic research to identify the “authoritative” path (e.g. from the historical author list in the distribution documentation files), that would be a great help for me so I can keep plugging away on the conversion code and other todos.

Even confirming that the candidates on BackPAN have the same md5 sum would be helpful since then even if we guess the wrong author, the test results are still “good” for the mistaken distribution file.
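
For anyone who wants to try the checksum comparison, here is a minimal sketch using the core Digest::MD5 module. The sub names are hypothetical, and it assumes you have already downloaded the candidate tarballs to local files; how you fetch them from BackPAN is up to you.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of a file's contents.
sub md5_of_file {
    my ($path) = @_;

    open my $fh, '<', $path or die "Cannot read $path: $!";
    binmode $fh;    # tarballs are binary data

    return Digest::MD5->new->addfile($fh)->hexdigest;
}

# True when every candidate file has the same digest.
sub candidates_identical {
    my @paths = @_;

    my %seen = map { md5_of_file($_) => 1 } @paths;

    return keys(%seen) == 1;
}

# Usage, assuming local copies under a hypothetical directory $dir:
#   print "identical\n"
#       if candidates_identical( "$dir/INGY/YAML-0.39.tar.gz",
#                                "$dir/KING/YAML-0.39.tar.gz" );
```

If the candidates are identical, either path is a safe answer for the mapping file; if they differ, the pair still needs human research.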

Here is the list. The name-version pair is followed by an indented list of possible paths for that pair.

Attribute-Memoize-0.01
  DANKOGAI/Attribute-Memoize-0.01.tar.gz
  MARCEL/Attribute-Memoize-0.01.tar.gz
B-Generate-1.12_03
  JCROMIE/B-Generate-1.12_03.tar.gz
  JJORE/B-Generate-1.12_03.tar.gz
Bundle-Cobalt-0.01
  HARASTY/Bundle-Cobalt-0.01.tar.gz
  JPEACOCK/Bundle-Cobalt-0.01.tar.gz
CDDB-0.9
  FONKIE/CDDB-0.9.tar.gz
  KRAEHE/CDDB-0.9.tar.gz
Catalyst-Plugin-Session-Store-File-0.07
  ESSKAR/Catalyst-Plugin-Session-Store-File-0.07.tar.gz
  KARMAN/Catalyst-Plugin-Session-Store-File-0.07.tar.gz
Catalyst-Plugin-Static-0.05
  MRAMBERG/Catalyst-Plugin-Static-0.05.tar.gz
  SRI/Catalyst-Plugin-Static-0.05.tar.gz
Catalyst-Plugin-Static-Simple-0.14
  AGRUNDMA/Catalyst-Plugin-Static-Simple-0.14.tar.gz
  MRAMBERG/Catalyst-Plugin-Static-Simple-0.14.tar.gz
Crypt-SSLeay-0.51
  CHAMAS/Crypt-SSLeay-0.51.tar.gz
  TAKESAKO/Crypt-SSLeay-0.51.tar.gz
Curses-UI-0.72
  MARCUS/Curses-UI-0.72.tar.gz
  MMAKAAY/Curses-UI-0.72.tar.gz
Curses-UI-0.73
  MARCUS/Curses-UI-0.73.tar.gz
  MMAKAAY/Curses-UI-0.73.tar.gz
DateManip-5.20
  PHOENIX/DateManip-5.20.tar.gz
  SBECK/DateManip-5.20.tar.gz
Finance-Bank-HSBC-1.04
  BISSCUITT/Finance-Bank-HSBC-1.04.tar.gz
  MWILSON/Finance-Bank-HSBC-1.04.tar.gz
Finance-Bank-HSBC-1.05
  BISSCUITT/Finance-Bank-HSBC-1.05.tar.gz
  MWILSON/Finance-Bank-HSBC-1.05.tar.gz
Locale-Object-0.73
  EMARTIN/Locale-Object-0.73.tar.gz
  FOTANGO/Locale-Object-0.73.tar.gz
MARC-0.81
  BBIRTH/MARC-0.81.tar.gz
  ESUMMERS/MARC-0.81.tar.gz
MARC-1.13
  ESUMMERS/MARC-1.13.tar.gz
  PETDANCE/MARC-1.13.tar.gz
Mail-Thread-2.41
  RCLAMP/Mail-Thread-2.41.tar.gz
  SIMON/Mail-Thread-2.41.tar.gz
Math-MatrixReal-1.1
  ANDK/Math-MatrixReal-1.1.tar.gz
  STBEY/Math-MatrixReal-1.1.tar.gz
Maypole-Authentication-Abstract-0.6
  BOBTFISH/Maypole-Authentication-Abstract-0.6.tar.gz
  SRI/Maypole-Authentication-Abstract-0.6.tar.gz
Maypole-Config-YAML-0.1
  BOBTFISH/Maypole-Config-YAML-0.1.tar.gz
  SRI/Maypole-Config-YAML-0.1.tar.gz
Maypole-Loader-0.1
  BOBTFISH/Maypole-Loader-0.1.tar.gz
  SRI/Maypole-Loader-0.1.tar.gz
Maypole-Plugin-Authentication-Abstract-0.10
  BOBTFISH/Maypole-Plugin-Authentication-Abstract-0.10.tar.gz
  SRI/Maypole-Plugin-Authentication-Abstract-0.10.tar.gz
Maypole-Plugin-Component-0.05
  BOBTFISH/Maypole-Plugin-Component-0.05.tar.gz
  SRI/Maypole-Plugin-Component-0.05.tar.gz
Maypole-Plugin-Config-YAML-0.04
  BOBTFISH/Maypole-Plugin-Config-YAML-0.04.tar.gz
  SRI/Maypole-Plugin-Config-YAML-0.04.tar.gz
Maypole-Plugin-Exception-0.03
  BOBTFISH/Maypole-Plugin-Exception-0.03.tar.gz
  SRI/Maypole-Plugin-Exception-0.03.tar.gz
Maypole-Plugin-I18N-0.02
  BOBTFISH/Maypole-Plugin-I18N-0.02.tar.gz
  SRI/Maypole-Plugin-I18N-0.02.tar.gz
Maypole-Plugin-Loader-0.03
  BOBTFISH/Maypole-Plugin-Loader-0.03.tar.gz
  SRI/Maypole-Plugin-Loader-0.03.tar.gz
Maypole-Plugin-Relationship-0.03
  BOBTFISH/Maypole-Plugin-Relationship-0.03.tar.gz
  SRI/Maypole-Plugin-Relationship-0.03.tar.gz
Maypole-Plugin-Transaction-0.02
  BOBTFISH/Maypole-Plugin-Transaction-0.02.tar.gz
  SRI/Maypole-Plugin-Transaction-0.02.tar.gz
Maypole-Plugin-Untaint-0.04
  BOBTFISH/Maypole-Plugin-Untaint-0.04.tar.gz
  SRI/Maypole-Plugin-Untaint-0.04.tar.gz
Net-DNS-0.02
  ANDK/Net-DNS-0.02.tar.gz
  MFUHR/Net-DNS-0.02.tar.gz
Net-SSH2-0.07
  AWA/AWA/Net-SSH2-0.07.tar.gz
  DBROBINS/Net-SSH2-0.07.tar.gz
NetPacket-0.04
  ATRAK/NetPacket-0.04.tar.gz
  CGANESAN/NetPacket-0.04.tar.gz
PDL-2.3.2
  CSOE/PDL-2.3.2.tar.gz
  KGB/PDL-2.3.2.tar.gz
PNGgraph-1.11
  DMOW/PNGgraph-1.11.tar.gz
  SBONDS/PNGgraph-1.11.tar.gz
POE-Session-Attributes-0.01
  CFEDDE/POE-Session-Attributes-0.01.tar.gz
  JSN/POE-Session-Attributes-0.01.tar.gz
Plucene-1.19
  SIMON/Plucene-1.19.tar.gz
  STRYTOAST/Plucene-1.19.tar.gz
RT-Extension-MergeUsers-0.02
  JESSE/RT-Extension-MergeUsers-0.02.tar.gz
  KEVINR/RT-Extension-MergeUsers-0.02.tar.gz
SNMP-1.6
  GSM/SNMP-1.6.tar.gz
  WMARQ/SNMP-1.6.tar.gz
SXIP-Membersite-1.0.0
  KGRENNAN/SXIP-Membersite-1.0.0.tar.gz
  TOKUHIROM/SXIP-Membersite-1.0.0.tar.gz
Scalar-Defer-0.13
  AUDREYT/Scalar-Defer-0.13.tar.gz
  NUFFIN/Scalar-Defer-0.13.tar.gz
Term-Prompt-0.02
  ALLENS/Term-Prompt-0.02.tar.gz
  DAZJORZ/Term-Prompt-0.02.tar.gz
Term-Prompt-0.05
  ALLENS/Term-Prompt-0.05.tar.gz
  DAZJORZ/Term-Prompt-0.05.tar.gz
Test-Warn-0.07
  BIGJ/Test-Warn-0.07.tar.gz
  MPRESSLY/Test-Warn-0.07.tar.gz
Time-0.01
  JPRIT/Time-0.01.tar.gz
  PGOLLUCCI/Time-0.01.tar.gz
Tk-Wizard-Bases-1.07
  LGODDARD/Tk-Wizard-Bases-1.07.tar.gz
  MTHURN/Tk-Wizard-Bases-1.07.tar.gz
UUID-0.03
  CFABER/UUID-0.03.tar.gz
  LZAP/UUID-0.03.tar.gz
Win32-EventLog-Carp-1.21
  IKEBE/Win32-EventLog-Carp-1.21.tar.gz
  RRWO/Win32-EventLog-Carp-1.21.tar.gz
YAML-0.39
  INGY/YAML-0.39.tar.gz
  KING/YAML-0.39.tar.gz
finance-yahooquote_0.19
  DJPADZ/finance-yahooquote_0.19.tar.gz
  EDD/finance-yahooquote_0.19.tar.gz
libapreq-1.33
  GEOFF/libapreq-1.33.tar.gz
  STAS/libapreq-1.33.tar.gz
pg95perl5-1.2.0
  MERGL/pg95perl5-1.2.0.tar.gz
  YVESP/pg95perl5-1.2.0.tar.gz
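
If you would rather script against the list than read it by eye, here is a minimal sketch that parses the format above (a name-version line followed by indented candidate paths) and formats a mapping-file entry. The sub names are hypothetical, and choosing which candidate is authoritative is still the research step; the code only handles the bookkeeping.

```perl
use strict;
use warnings;

# Parse the listing format: an unindented name-version line followed
# by indented candidate paths.
# Returns a hash reference: { 'name-version' => [ paths ] }.
sub parse_candidates {
    my ($text) = @_;
    my (%candidates, $current);

    for my $line (split /\n/, $text) {
        next unless $line =~ /\S/;    # skip blank lines

        if ($line =~ /\A\s+(\S+)/) {
            die "Path before any name-version line: $line\n"
                unless defined $current;
            push @{ $candidates{$current} }, $1;
        }
        else {
            ($current) = $line =~ /\A(\S+)/;
            $candidates{$current} ||= [];
        }
    }

    return \%candidates;
}

# Format one mapping-file entry: name-version, four spaces, chosen path.
sub mapping_line {
    my ($dist, $path) = @_;
    return "$dist    $path";
}

my $candidates = parse_candidates(<<'END');
YAML-0.39
  INGY/YAML-0.39.tar.gz
  KING/YAML-0.39.tar.gz
END

for my $dist (sort keys %$candidates) {
    print "$dist has ", scalar @{ $candidates->{$dist} }, " candidates\n";
}
```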

Chunking and Programming Languages

Some of my biases are transparent. For example, I believe that many of the complaints of Perl's "unreadability" are from people who've never bothered to learn how to read the language. You often see this from people who say "Sigils? Pfft. They're useless—mere syntactic noise!"

Linguists may disagree.

One of the early inventions in written language was punctuation. Specifically, adding spaces between words (and even vowels, in some languages... yes, my history studies have come in useful while programming) makes documents easier to read. The same goes for punctuation marks. It's easy enough to write sentences with ambiguous meanings, depending on where you put a comma to delineate logically separate clauses. (Languages with greater riches of declensions and tenses and numbers and other forms are more flexible in word order, but they do retain some degree of poetic license. It's not all meter and rhyme scheme however.)

The basic idea behind all of these ancient inventions is that "Communicating is difficult enough without verbal and body language cues. Making different things look different helps."

To read source code, you have to be able to identify nouns and verbs. You have to be able to group related items and ideas while not grouping unrelated ideas. You need to be able to identify separate expressions as well as idioms.

One reason assembly language can be difficult to read is that its regularity (op arg1, arg2 or op arg1, arg2, arg3) precludes skimmability. That may sound odd (if you're reading code, why would you need to skim it?), but it's important. Programming encompasses so many small details that you must understand the code in the small in the context of the local component as a part of the system as a whole.

Uniformity of syntax means that you have to rely on cues external to the source code or patterns of repeated details within the source code to indicate structure.

I have the same problem reading Lisp code, with its homoiconicity; the shape of the code gives me few cues as to what's different between sections of code. As well, Python's use of indentation alone to end blocks means that my eyes slip off of the end of logical blocks and I can't tell what happens where.

A lot of that is familiarity and personal preference (or quirks of the way my brain works). Some of that is the effect of deliberate design decisions.

If you embrace the idea, like Perl does, that different things should look different, you reach some interesting conclusions. I don't think you can learn Perl effectively without understanding those conclusions, at least at an intuitive level. I'll write about that next time.

Help keep the world safe from SQL injection

A while back, I put up bobby-tables.com as a repository for showing people the right way to handle external data in their SQL calls. Whenever someone pops up on a mailing list or IRC and they're building SQL statements using external tainted data, you can just refer them to the site.

In the past few days, I've spiffed up the site (with design help from Jeana Clark) and added pages on Perl and PHP. I need more examples, though. It's 2010, and there's no reason anyone shouldn't know about parameterized SQL calls.
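
For readers who have not seen one, here is a minimal sketch of a parameterized call in Perl. It assumes $dbh is an already-connected DBI database handle; the find_students sub and the students table are made up for the example and are not taken from the site.

```perl
use strict;
use warnings;

# Look up students by name using a placeholder. $dbh is assumed to be
# a connected DBI database handle; the table and column names are
# hypothetical.
sub find_students {
    my ($dbh, $name) = @_;

    # Unsafe alternative, shown only for contrast: interpolating
    # external data lets that data rewrite the statement.
    #   "SELECT id, name FROM students WHERE name = '$name'"

    # The ? placeholder keeps the data out of the SQL text; the value
    # of $name travels separately as a bind value.
    my $sql = 'SELECT id, name FROM students WHERE name = ?';

    return $dbh->selectall_arrayref($sql, undef, $name);
}

# Usage, assuming DBI and a driver such as DBD::SQLite are installed:
#   use DBI;
#   my $dbh  = DBI->connect('dbi:SQLite:dbname=school.db', '', '',
#                           { RaiseError => 1 });
#   my $rows = find_students($dbh, "Robert'); DROP TABLE students;--");
```

With the placeholder, the hostile name is just an odd-looking string to search for; it never becomes part of the SQL.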

The site source is hosted on github, so if you have any contributions, please fork it and let me know about your applied changes, or you can email me directly.

Thanks!

P.S. In the next few days, I hope to fire up some redesign on perl101.org, too.