Perl::Critic finds annoying little bugs in your code.

My work colleague Mike O'Regan created a policy for the latest version of Perl::Critic.

Now if you have a line of code like this:


my $n += somefunc();
# Should be my $n = somefunc();

Perl::Critic will tell you

Augmented assignment operator '+=' used in declaration at line X, column Y. Use simple assignment when initializing variables.
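
To see why this is more than a style nit, here is a minimal sketch of how the mistake can hide a real bug. The variable names and the loop are illustrative assumptions, not taken from the policy itself.

```perl
use strict;
use warnings;

my $total = 10;

for my $amount (1 .. 3) {
    # Bug: "my" declares a brand-new $total on every pass, so "+=" adds to
    # an undefined value and the result vanishes at the end of the block.
    my $total += $amount;
}

# The outer $total was never touched.
print "after buggy loop: $total\n";

for my $amount (1 .. 3) {
    $total += $amount;    # what was almost certainly intended
}

print "after fixed loop: $total\n";
```

Perl does not warn about "+=" on an undefined value, so the first loop runs silently, which is what makes a static check useful here.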

If you haven't let Perl::Critic loose on your code yet, now's a great time to try.

To the loyal Perl::Critic users, what's the nastiest bug Perl::Critic found for you? Let me know in the comments.

Do It Wrong Sometimes

The YAPC::NA 2012 call for presentations has opened! As with every YAPC I've attended, this is a great opportunity to meet other programmers, to learn more about things you already know as well as things you don't know yet, and to practice your presentation skills.

A few months ago I exchanged emails with JT Smith about my idea for a talk this year. I've mentioned in passing a few times a small side project my business is investing in. It's deliberately minimal and, from the development side, definitely the kind of skunkworks software you're likely to find: just get it working, maintain it as little as possible, and let it run uninterrupted.

That doesn't mean it's quick or dirty. That doesn't mean it's not tested well, or that it has a slapdash design. All it means is that the most important criterion for any design or implementation decision is "is this the simplest thing that could possibly work" instead of "is this elegant" or "what's the standard modern Perl orthodoxy for this problem".

So far the results have been enlightening.

I don't want to give away too many of the details of my talk (if it's accepted), but here are two small hints which may or may not help you.

First, just because a good ORM such as DBIx::Class makes searching and manipulating existing data easy doesn't make it the best way to insert big batches of new data.

Second, while LWP and especially WWW::Mechanize are great tools for automating the behavior of a web client, sometimes wget or curl in a shell script is quicker, easier to parallelize, and more robust.

(As a bonus, consider also that if you're parsing semi-structured data out of HTML, removing all of the HTML is sometimes even easier than using a real HTML parser or even CSS selectors. Sure, semantic markup helps when you can rely on it, and sure, using a regex to remove HTML tags is a bad idea, but there are ways to turn HTML into plain text quickly and easily without doing anything on your own.)

Perlbuzz news roundup for 2012-01-02

These links are collected from the Perlbuzz Twitter feed. If you have suggestions for news bits, please mail me at andy@perlbuzz.com.

Interested in "The Year in Perl"?

A lot can happen in a year. Think back to 2005 and what we had and didn't have in Perl compared to now.

In previous jobs, I collected "The Year In Perl" a couple of times for Perl.com. This required a significant investment of time over a couple of days for the research and writing.

Perl.com these days is easier to update and to manage (though carving out editing time is more difficult for me). What interest exists in putting together a document about the interesting developments in the Perl world in 2011?

In particular, we can concentrate on:

  • Community Events (especially significant developments such as the first or second occurrence of an event)
  • Important releases (5.14 counts, as well as big new improvements of existing projects)
  • Plans and announcements (Jesse Vincent's "Perl 5.16 and Beyond" stake in the ground, for example)
  • Products (development products, books, et cetera)

I have a small list of my own and will refine it if there's further interest. Feel free to reply here as a comment or contact me (chromatic at cpan dot org) as you prefer.

Perl Documentation in Terms of Tasks

The core Perl community—if you care to draw lines around a group of people who use Perl seriously and call that a community—is like many other core F/OSS communities. Real work happens on mailing lists and IRC. I unsubscribed from several mailing lists and deliberately spent as little time on IRC as possible this year, for various uninteresting reasons. (I haven't even made it to the Portland Perl Mongers meetings for several months.)

While that's been good for my productivity, it's also produced an interesting sense of disconnect, and that makes me wonder. Consider a thought experiment. Suppose you have six months to build a new green-field project. Your primary language is Perl. You're the only developer on the project, but you do have coworkers to do some of the non-coding work. You don't have access to IRC or mailing lists, but you do have access to the whole of the CPAN. In other words, your social connections are limited but your technical decisions are not.

In this situation, how do you find the best libraries and techniques to use for your requirements and how do you solve problems and get your questions answered?

Assume you have access to web forums such as PerlMonks and Stack Overflow and, of course, Duck Duck Go.

I can answer this partially for myself: thank goodness for the degree of maturity the CPAN and its ecosystem encourage among their best projects. I have a lot of confidence in the stack I've chosen of Moose, Plack, DBIx::Class, and Catalyst, sprinkled liberally with great new tools such as perlbrew, cpanm, and Try::Tiny. Even so, the documentation and community support available without real-time discussion with contributors and developers aren't always sufficient to solve problems quickly.

(How interesting to note that all of these tools hail from a post-Perl 6 world, and how any Perl 6 implementation as it stands now only barely obviates the need for part of two of the named projects and deigns even to consider the others.)

For example, what's the best way to manage passwords and authentication in a Perl-based web application? Do you handle it at the Plack level or the Catalyst level? What if your user table doesn't match the example in the Catalyst authentication plugin example? How much better is bcrypt than SHA-1 or SHA-256? What if your business requirements mandate that users verify their accounts before they can log in? How do you modify/subclass/extend/advise the plugin you use to meet this requirement?

Anyone who's done a few projects with this stack should be able to give a good answer to these questions, as should anyone who's spent a few weeks in the relevant IRC channels or a couple of months reading the right mailing lists. They're not difficult questions, but they are detailed questions. You could ask the same questions about the right way to manage DBIC schemas you expect to deploy frequently while allowing for schema updates and changes.

The interesting question isn't how to accomplish these things; it's how someone finds this information without requiring access to IRC or the mailing lists.

I make the assumption that it's valuable to have multiple sources of information. We write copious documentation including ::Manual and ::Tutorial PODs in our top-level distribution namespaces, after all. We do an admirable job of producing Perl Advent Calendars (thanks, Andrew Grangaard!), but I'm very glad to see Catalyst retiring its calendar in favor of monthly articles. Publishing on a schedule is difficult, but the need for current information is present the other eleven months of the year.

I wish I could say that Perl and project wikis were more useful, but they seem neither popular nor currently useful to me. Maybe I looked in the wrong places. (I know I promised to give Catalyst a list of questions about things that weren't screechingly obvious; I have a list, but I haven't shown it yet. I have patched a few parts of the Plack documentation.) Yet it seems to me that for all of the energy and output of the core Perl community, the practical non-code results tend to be directed in ephemeral directions. In the past couple of months, people such as Gabor Szabo and Christian Walde spent a lot of time improving the results for searching for "Perl tutorial" by creating a central place to list and evaluate Perl tutorials.

Again, maybe I looked in the wrong places, but I'd like to see a 2012 focus on making the knowledge and experience of core project members more widely available, in many other media. Perl.com always welcomes your submissions of course, but that's not the only persistent and updated medium for project knowledge.

If we want people to use our code and projects for real work, to solve real problems, and to accomplish real tasks, we need to continue to provide practical code and useful documentation at or above the high quality level we currently enjoy. Yet we also have to work to approach this audience from their point of view: in particular, in terms of the tasks they want to accomplish.

That is the resolution I suggest for the Perl community in 2012.

Don’t TSA That Data!

A Vanity Fair article asks "Does Airport Security Really Make Us Safer?" Fortunately, the writer of the article used Bruce Schneier as a source. (If you've been to an airport in the US, you know that the answer is "No; why would you even ask?")

The article's penultimate paragraph makes what should be an obvious point. (At least, it's obvious if you want to prevent terrorism as much as possible. If your goal is to spend lots of taxpayer money in a very flashy, showy way without worrying about efficacy, please continue.) In particular:

What the government should be doing is focusing on the terrorists when they are planning their plots. "That's how the British caught the liquid bombers," Schneier says. "They never got anywhere near the plane. That's what you want--not catching them at the last minute as they try to board the flight."

I read this article moments after sending an email commiserating about the silly (lack of) Unicode handling in a programming language which isn't Perl. Then something clicked.

One of my persistent desires for Parrot was to simplify the internals by reducing the amount of complexity and genericity in the core. In terms of Unicode, this means knowing the encoding of incoming data and the desired encoding of outgoing data, then transcoding to and from a single internal encoding. This way the core could operate on a single encoding and push the complexity of transcoding to the edges.

If Parrot hasn't changed this since I looked at it most recently, its string system requires each string to carry information about its encoding (which makes each string structure that much larger, increasing memory pressure) and each string operation to check for the need to transcode strings to mutually compatible encodings (which takes time for the comparison in every case, as well as time and memory for the transcoding in other cases).

Worse yet, string literals encoded in the source code of Parrot itself tend to have a specific encoding (ASCII or at least Latin-1 in the case of literals in the C code) and they ought to be constant, so transcoding in place isn't an option and, if you're working primarily with another encoding, that means always performing transcoding from that incompatible encoding.

It's not free to perform encoding at the edges, and you sometimes notice this when working with large chunks of data (though if you're processing multi-terabyte satellite images, treat them as binary and skip this encoding altogether), but it's the right thing to do.
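
In Perl terms (not Parrot's), transcoding at the edges can be sketched with the core Encode module. The sample bytes and the choice of Latin-1 in and UTF-8 out are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: "cafe" with an
# e-acute, encoded as Latin-1 (one byte per character).
my $incoming_bytes = "caf\xE9";

# Edge in: decode once, from the known input encoding, to Perl characters.
my $text = decode('ISO-8859-1', $incoming_bytes);

# Core: everything in the middle works on characters and never asks
# which encoding the data arrived in.
my $characters = length $text;

# Edge out: encode once, to the desired output encoding.
my $outgoing_bytes = encode('UTF-8', $text);

printf "%d characters in, %d bytes out\n",
    $characters, length $outgoing_bytes;
```

The middle of the program never checks or converts encodings; only the two edge lines know that encodings exist at all.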

The same principle applies for trusting incoming data. Secure it at the borders of the application. Don't spread those checks throughout the system. Harden the edges and don't let nonsense through. Fail early for suspicious things.
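
A minimal sketch of checking at the border might look like the following. The username rule and the function name are hypothetical; the point is only that validation happens once, on the way in, and fails early.

```perl
use strict;
use warnings;

# Hypothetical rule: usernames are 3 to 20 ASCII letters, digits, or
# underscores. Anything else dies before the rest of the code sees it.
sub validated_username {
    my ($raw) = @_;
    die "Missing username\n" unless defined $raw;
    die "Invalid username\n" unless $raw =~ /\A([A-Za-z0-9_]{3,20})\z/;
    return $1;
}

# The edge: every incoming value passes through the check exactly once.
for my $input ('chromatic', 'x', "robert'); DROP TABLE users;--") {
    my $name = eval { validated_username($input) };
    print defined $name ? "accepted: $name\n" : "rejected: $@";
}
```

Code past this point can assume the value is well formed and does not need to repeat the check.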

Otherwise you'll go mad trying to track down all of the possible interactions and malicious tricks that people could perpetrate if you lack a sane sanity policy. In other words, stop doing a lot of busy work to make it look like you know what you're doing. Do it right.

How Would You Track User Behavior with Plack and Catalyst?

One of the persistent questions which keeps entrepreneurs on edge is "Are we building the right thing?"

In the first web bubble, the Silly side of Silicon Valley chased vanity metrics such as "the number of eyeballs on the site" and "brand awareness" and "unique visitors". Those numbers are only interesting when you can correlate them to producing value for customers and bringing in real cash in the form of revenue.

I've enjoyed the book The Lean Startup by Eric Ries because he offers a much better mechanism to track the success or failure of any attempt to produce real value for customers. While split testing (or A/B testing) is useful to see how small changes lead to different customer behaviors, Ries recommends cohort analysis, where you can see the behavior of real customers through the sales funnel and correlate the X-axis with individual changes to your business or product.

That means tracking customer behavior. If you're building some sort of software as a service product, and if the mechanism of delivery of that product is primarily a web site, you probably already know the punchline.

Assume I already know how to identify and log events for each salient customer action type. (I've built that kind of system before.) Assume I don't want to collect personally identifiable information (I don't). Assume I'm using Plack and its middleware heavily, and assume I'm happy using Catalyst as a web framework.

How can I identify unique users (with and without accounts) on a daily basis, anonymize them, but group their actions across the site such that my automated daily cohort graphs correspond with reality?

So far I've identified a few points of possible contention. I can rely on browser cookies for unique identification of users if I know that user sessions have unique identifiers within a 24 hour period. (I could generate GUIDs for this, but that may be overdoing things.) I think I also have to track the transition from anonymous visitor to authenticated user, but I might be able to convince myself that either replacing the current session or simple subtraction of successful login events from the total number of unique anonymous visitors would give the right numbers.
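
One possible shape for the anonymizing and subtraction ideas, as a sketch only: hash each session identifier with a secret that changes daily, then count. The secret, the event list, and the function name are all hypothetical; only hmac_sha256_hex from the core Digest::SHA module is real.

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256_hex);

# Hypothetical secret. Generating a new one each day means the resulting
# IDs group actions within a day but cannot be linked across days.
my $daily_secret = 'replace-with-random-bytes-generated-each-day';

# Turn a session cookie value into an opaque per-day visitor ID.
sub visitor_id {
    my ($session_id, $secret) = @_;
    return hmac_sha256_hex($session_id, $secret);
}

# Hypothetical event log for one day: [ session ID, event type ].
my @events = (
    [ 's1', 'visit' ], [ 's2', 'visit' ], [ 's3', 'visit' ],
    [ 's1', 'visit' ],    # repeat visit, same session
    [ 's2', 'login' ],    # s2 stops being anonymous
);

my (%seen, %logged_in);
for my $event (@events) {
    my ($session_id, $type) = @$event;
    my $id = visitor_id($session_id, $daily_secret);
    $seen{$id}      = 1;
    $logged_in{$id} = 1 if $type eq 'login';
}

my $unique    = scalar keys %seen;
my $anonymous = $unique - scalar keys %logged_in;

print "unique visitors: $unique, still anonymous: $anonymous\n";
```

This assumes the session identifier survives the login; if logging in replaces the session, the subtraction no longer lines up and the replacement itself has to be logged.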

(I also haven't dived much into how Catalyst 5.9 and Plack interact in terms of session and cookie handling. Everything's just worked, so I've ignored the details until now.)

I don't mind building such a system if necessary, but if all of the pieces are out there and available—or if someone's already built this and can give guidance—so much the better.

Have you solved this problem? If so, how did you do it? If not, how would you do it? Would you handle logging at the Plack level or the application level? Would you worry about tracking session changes? Does Catalyst need to know about this?
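
For comparison, logging at the Plack level could be as small as the sketch below. It is written against the bare PSGI calling convention (a code reference that takes $env and returns a three-element array reference) so it runs without Plack installed. The in-memory @events list is a stand-in for a real log, and reading the session ID from $env->{'psgix.session.options'}{id} is an assumption about how the session middleware exposes it.

```perl
use strict;
use warnings;

my @events;    # stand-in for a real event log

# Wrap any PSGI application and record one event per request.
# Only handles applications that return a plain array reference,
# not delayed or streaming responses.
sub track_requests {
    my ($app) = @_;
    return sub {
        my ($env) = @_;
        my $response = $app->($env);
        my $session  = ($env->{'psgix.session.options'} || {})->{id};
        push @events, {
            path    => $env->{PATH_INFO},
            session => defined $session ? $session : 'anonymous',
            status  => $response->[0],
        };
        return $response;
    };
}

# A trivial application and two hypothetical requests.
my $app     = sub { [ 200, [ 'Content-Type' => 'text/plain' ], ['ok'] ] };
my $tracked = track_requests($app);

$tracked->({ PATH_INFO => '/stocks/AA/view_analysis',
             'psgix.session.options' => { id => 'abc123' } });
$tracked->({ PATH_INFO => '/' });

print "$_->{session} $_->{status} $_->{path}\n" for @events;
```

The application underneath, Catalyst or otherwise, does not need to know the wrapper exists, which is the main argument for doing this at the Plack level.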

Perlbuzz news roundup for 2011-12-19

These links are collected from the Perlbuzz Twitter feed. If you have suggestions for news bits, please mail me at andy@perlbuzz.com.

When Print Debugging Fails

I have a medium-sized project which is effectively a state machine. While I keep promising to write a reusable modular system which lets you specify the states and the transitions between them and lets behavior manage itself, I haven't done that yet.

This means that occasionally I have to debug the transition logic.

Suppose I have a series of articles in a publication queue, and suppose each article has a state() method accessor/mutator. Moving an article between states (from SOLICIT to EDIT to PREVIEW to PUBLISHED) means calling state() and passing a token which represents the appropriate state.

Because I haven't yet consolidated all of the transitions into a single place, an article's state may change in any of half a dozen places in the entire codebase. That's not awful, but if state transitions are not occurring as I expect, that's multiple places to watch as I debug.
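
Consolidating the transitions into a single place might look something like this sketch. It uses a plain hash in place of the real article class, and it assumes the four states only ever move forward in the order listed above; both are simplifications.

```perl
use strict;
use warnings;

# The one place that knows which transitions are legal.
my %allowed_next = (
    SOLICIT   => ['EDIT'],
    EDIT      => ['PREVIEW'],
    PREVIEW   => ['PUBLISHED'],
    PUBLISHED => [],
);

sub can_transition {
    my ($from, $to) = @_;
    return scalar grep { $_ eq $to } @{ $allowed_next{$from} || [] };
}

# Every state change goes through here, so there is one place to log
# and one place to set a breakpoint.
sub transition {
    my ($article, $to) = @_;
    my $from = $article->{state};
    die "Cannot move '$article->{title}' from $from to $to\n"
        unless can_transition($from, $to);
    $article->{state} = $to;
    return $article;
}

my $article = { title => 'Example', state => 'SOLICIT' };
transition($article, 'EDIT');
transition($article, 'PREVIEW');
print "$article->{title} is now $article->{state}\n";
```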

I rarely use the Perl debugger. (I'm a fan of debuggers for compiled languages such as C, and I've used debuggers in IDEs, to great success, for languages which require IDEs, but I've never found Perl's debugger productive.) I usually annotate my code with log messages and bisect problems that way.

This seemed easy today: use Moose advice to surround the state() method and display some logging information. (Shouldn't this be a pattern already? Certainly there must be something on the CPAN to accomplish this.)

around state => sub
{
    my ($orig, $self, @values) = @_;

    return $self->$orig() unless @values;
    my $original = $self->$orig();
    my $title    = $self->title;
    my @caller   = caller(2);

    print STDERR "Setting '$title' from $original to $values[0] " .
                 "from $caller[1]:$caller[2]\n";
};

If you already see the bug, you're doing better than I am today. After five minutes of head scratching and looking elsewhere, I figured out why my logs showed the first transition happening successfully and then nothing else.

The moral of the story is to be very careful what you measure, lest you change that which you observe... or in my case, fail to allow that change to occur.
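
For anyone who didn't spot it: the wrapper logs the new value but never passes it on to the original method, so the state never changes. A sketch of the fix follows. To keep it runnable without Moose installed, it wraps the method by hand, and the Article class here is a hypothetical stand-in. In the Moose version the only change needed is the final line of the wrapper, return $self->$orig(@values);.

```perl
use strict;
use warnings;

package Article;    # hypothetical stand-in for the real class

sub new {
    my ($class, %args) = @_;
    return bless { state => 'SOLICIT', %args }, $class;
}

sub title { $_[0]{title} }

sub state {
    my $self = shift;
    $self->{state} = shift if @_;
    return $self->{state};
}

# Hand-rolled equivalent of Moose's "around state => sub { ... }".
{
    no strict 'refs';
    no warnings 'redefine';
    my $orig = \&Article::state;
    *{'Article::state'} = sub {
        my ($self, @values) = @_;

        return $self->$orig() unless @values;
        my $original = $self->$orig();
        my $title    = $self->title;
        my @caller   = caller(0);    # no Moose frames to skip here

        print STDERR "Setting '$title' from $original to $values[0] "
                   . "from $caller[1]:$caller[2]\n";

        # The missing line: actually perform the change being logged.
        return $self->$orig(@values);
    };
}

package main;

my $article = Article->new(title => 'Example');
$article->state('EDIT');
$article->state('PREVIEW');
print $article->state, "\n";
```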

If We Could Resolve Predicates at Compile Time

The Catalyst web framework uses Perl 5 function attributes effectively—I've seen few more effective uses of attributes.

Any modern web framework has to deal with the idea of routes and request routing somehow. Given a request path (such as /stocks/AA/view_analysis), how does your application know what to do?

Catalyst solves this elegantly with a feature known as chained actions. Controller methods can consume zero or more parts of the path but, when explicitly chained, can combine. Consider the example request path. The controller is Stocks.pm. The second component of the path (/AA) is the identifier for a stock (Alcoa, to be specific. I'm neither long nor short on Alcoa itself, though I probably own some shares as part of a fund somewhere.) The final component of the path, /view_analysis, is an action—a verb representing an action the controller should take on the object representing Alcoa in the system.

You can probably start to see the idea of the chain right away.

The Stocks controller has a method called get_stock which grabs the stock symbol from the request path, looks it up in the database, and stores the object representing that stock for further processing. If no such symbol exists, it throws an exception.

The view_analysis method chains off of the get_stock method such that Catalyst will only dispatch to view_analysis when it's already successfully dispatched to get_stock. Unless you write a custom dispatch system which bypasses the dispatch rules, users will never be able to call view_analysis without a valid stock object available.

(Further, these methods are part of a chain which requires that users have successfully logged into the system; they chain off of a user authentication system.)

In code terms, the relevant attributes look something like:

sub authorized :Chained('/login/required') :PathPart('stocks') :CaptureArgs(0);

sub get_stock :Chained('authorized') :PathPart('') :CaptureArgs(1);

sub view_analysis :Chained('get_stock') :PathPart('view_analysis') :Args(0);

The :Chained attribute is most relevant here. :PathPart governs how Catalyst's dispatcher makes each method visible to user requests (get_stock doesn't consume a part of the path on its own, while authorized consumes the name of the controller and view_analysis consumes its own name). :CaptureArgs and :Args control how many other pieces of the path the methods consume; in the case of get_stock, it's the single path element between /stocks and any subsequent chained actions—in this case, /AA. As view_analysis is the end point of a chain, you use :Args instead of :CaptureArgs.

With that all explained, request method chaining is fantastic. I can reuse get_stock() for other request methods and get all of its benefits, including the fact that only authorized users can even reach this point.

Yet I want to prove these characteristics of my application.

I want to prove these features so definitively that I don't want to write tests for them. I want my program to fail to compile if these characteristics are untrue.

I see chaining from get_stock() as supplying an invariant precondition to view_analysis() such that it proves, to my satisfaction, that I can always rely on a valid stock object being available within the analysis method. Always. Similarly, I can always rely on a valid user being available within both methods. Always always.

The problem comes in that it's easy to make a typo in the name of a chain or a method, or to use :CaptureArgs instead of :Args or vice versa.

Here's the thing: all of this metadata is metadata. All of this information is available at compile time, before Perl has to execute anything.

If I had a really good and extensible type system in Perl 5, I could write a couple of pieces of predicate logic to say that every chained method should be a starting point or have a valid predecessor. These are trivial properties of my program (no matter how large it gets) and they're resolvable with the information available at the point of compilation. Even with complex controller construction through the use of roles and parametric roles, this information is available.

I know how to emulate this behavior by injecting some sort of CHECK block into the code and schlepping through the symbol table and inspecting attributes myself, but that's emulating a useful feature we could exploit in a lot of ways.
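
A rough sketch of that emulation, using only core Perl, might look like this. It is not how Catalyst handles attributes. The package name is hypothetical, the check covers a single package, and the two predicates (every relative :Chained target must exist, and must declare :CaptureArgs) are my own reading of the chaining rules described above.

```perl
use strict;
use warnings;

package MyApp::Controller::Stocks;    # hypothetical controller name

our %ATTRS;    # code reference => attribute strings, filled at compile time

# Perl calls this hook while compiling each sub declared with attributes.
sub MODIFY_CODE_ATTRIBUTES {
    my ($package, $code, @attrs) = @_;
    $ATTRS{$code} = [@attrs];
    return;    # returning an empty list accepts every attribute
}

sub authorized    :Chained('/login/required') :PathPart('stocks') :CaptureArgs(0) { }
sub get_stock     :Chained('authorized') :PathPart('') :CaptureArgs(1) { }
sub view_analysis :Chained('get_stock') :PathPart('view_analysis') :Args(0) { }

# Return one message per broken chain link found in this package.
sub chain_errors {
    no strict 'refs';
    my $package = __PACKAGE__;

    # Walk the symbol table to map sub names to their attributes.
    my %attrs_of;
    for my $name (keys %{"${package}::"}) {
        next unless defined &{"${package}::$name"};
        my $code = \&{"${package}::$name"};
        $attrs_of{$name} = $ATTRS{$code} if $ATTRS{$code};
    }

    my @errors;
    for my $name (sort keys %attrs_of) {
        my ($parent) = map { /\AChained\('([^']*)'\)\z/ ? $1 : () }
            @{ $attrs_of{$name} };

        # Absolute targets live in other controllers; not checked here.
        next if !defined $parent || $parent =~ m{\A/};

        my $parent_attrs = $attrs_of{$parent};
        if (!$parent_attrs) {
            push @errors, "$name chains from unknown action '$parent'";
        }
        elsif (!grep { /\ACaptureArgs\(/ } @$parent_attrs) {
            push @errors, "$name chains from '$parent', which lacks :CaptureArgs";
        }
    }
    return @errors;
}

# Runs once the main program has compiled, before any of it executes.
CHECK {
    my @errors = chain_errors();
    die join '', map { "$_\n" } @errors if @errors;
}

package main;

print "chains look consistent\n";
```

Changing 'get_stock' to 'get_stocks' in the last attribute list makes the program die in the CHECK block, before anything runs. That is as close to "fail to compile" as this approach gets, and only when the file is run as the main program.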

Forget the talk about making Perl into Java or C++ by adding a silly manifest static type system. We could find and fix real errors in logic—trivial errors, trivially discoverable—if we had an extensible type system which let us define our own simple predicates.

(Implementing such is left as an exercise for a small army of readers cloned from a very small army of brilliant p5p hackers with copious spare time and a habit of reading ACM papers before breakfast.)