Character Sets are Important™

(Note: since this article is about a character that shouldn’t have been able to appear on my screen, I’ve used that character several times to demonstrate.  If you can’t see it, it’s the trademark character, an elevated TM.)

A few days ago I implemented an “email this product to your friend” feature for my new employer, Reusable Bags. It all went smoothly until I tested it with products like “ACME Bags™ Workhorse Style 1500”. The ™ in that name caused me endless problems, all related to one of the least known aspects of computing (at least for English speakers): character encoding.

I had read Joel Spolsky’s article on character encoding, so I knew just enough to recognize that my problem had to do with encodings, but not enough to know how to fix it. I found out that on our website, where the ™ displayed fine, the charset was “ISO-8859-1”, a.k.a. Latin1. This is used without problems all over the place. The curiosity here is that ™ is not in that charset. Somehow Firefox had translated a sequence of bits from the web page into a character that shouldn’t even exist in the declared encoding. I couldn’t wrap my head around that, so I assumed the character was being expressed in some other way I didn’t know about and kept going. In the emails I was sending, the character displayed as a sequence of 3 unusual characters, meaning the bytes were being interpreted wrong. The charset in the email was also Latin1, so garbage like that was what I would have expected from the browser too. Since it was 3 chars, that reinforced my idea that the character was being encoded in some other unusual way (with multiple bytes), and I kept looking.
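To make the byte counts concrete, here is a minimal PHP sketch of how the same ™ looks in two encodings. It assumes the mbstring extension is available, and it is not a claim about what the site’s data actually contained; it only shows one way a single character can turn into three when bytes are read with the wrong charset. The variable names are made up for illustration.

```php
<?php
// UTF-8 bytes for the trademark sign (U+2122), written as escapes so the
// result does not depend on how this file itself is saved.
$tmUtf8 = "\xE2\x84\xA2";

// The same character converted to Windows-1252, where it is a single byte.
$tmCp1252 = mb_convert_encoding($tmUtf8, 'Windows-1252', 'UTF-8');

echo strlen($tmUtf8), "\n";     // byte length of the UTF-8 form
echo bin2hex($tmCp1252), "\n";  // hex value of the Windows-1252 byte

// Misreading the UTF-8 bytes as Windows-1252 turns each byte into its own
// character, so one ™ shows up as three unrelated characters.
$misread = mb_convert_encoding($tmUtf8, 'UTF-8', 'Windows-1252');
echo mb_strlen($misread, 'UTF-8'), "\n";
```

The point is only that “how many characters do I see” depends on which encoding the reader assumes, not on the bytes alone.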

I tried everything I could think of to make some headway on this bug. I used every English charset I could find, everywhere I could set one, to see if I was inputting the character in one set and interpreting it with another, but nothing worked. I would recount everything I tried, but there was so much that I don’t remember it all. I probably spent half a day just switching charsets and retrying things.

Eventually we gave up on representing the character properly and just wanted to strip it out, so I threw in a str_replace("™", "", $string) call. This didn’t work either! I could replace anything else in the string, but not that blasted ™! This problem was preposterous. There was no way PHP wasn’t recognizing this character. I wrote a testing script to verify the problem in the absence of the rest of the page, and there it was recognized and replaced just fine. So what was the difference between the two scripts?

The difference was the source of the text being searched. In my testing script, I typed both the needle and the haystack, so both were in the same encoding. In the real page, the haystack came out of the database. I don’t think the database pays much attention to the character encoding; it just stores whatever sequence of bytes you enter. So the encoding used on that string depends on who entered it. Who did enter it? A Windows user. Therefore, the encoding was undoubtedly Windows-1252, which is one of the only encodings I found that includes the ™ character. If I had been smart about it earlier, I would have realized that must be the case, because someone obviously entered the character, and Windows-1252 is the only encoding that contains it in a way that’s easy to enter.
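The mismatch can be reproduced in a few lines. This is a sketch under two assumptions that the post does not confirm: that the needle was typed into a source file saved as UTF-8, and that the database row held the ™ as the single Windows-1252 byte 0x99. str_replace() compares raw bytes, so a needle in one encoding never matches a haystack in another.

```php
<?php
// Needle: ™ as it would be stored in a UTF-8 source file (assumption).
$needleUtf8 = "\xE2\x84\xA2";

// Haystack: product name as it might come from the database, with ™
// stored as the Windows-1252 byte 0x99 (assumption).
$haystack = "ACME Bags\x99 Workhorse Style 1500";

// str_replace() works on bytes, so the three-byte needle is never found
// and the string comes back untouched.
$unchanged = str_replace($needleUtf8, '', $haystack);

var_dump(strpos($haystack, $needleUtf8)); // false: needle bytes not present
var_dump($unchanged === $haystack);       // true: nothing was replaced
```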

So how do I type that character in our code files, which aren’t Windows-1252? Well, in that encoding ™ is represented by the number 153 (0x99). That means I can get PHP to give it to me with the call chr(153). I put that into my str_replace call from earlier and it worked perfectly: it detected the ™ and stripped it out, no problem. Originally I was going to berate the PHP developers for assuming the Windows-1252 charset in the chr() function, but I subsequently realized that chr() doesn’t assume any charset at all. It doesn’t matter what little picture is associated with byte 153 in any encoding; the binary is still the same.
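A minimal sketch of that fix, using a made-up product string in place of the real database value. In Windows-1252 the trademark sign is byte 0x99 (decimal 153), and chr() simply returns the one-byte string for the number it is given. The last step is an optional alternative, assuming the mbstring extension is installed: convert the text to UTF-8 rather than stripping the character.

```php
<?php
// Hypothetical product name with ™ stored as Windows-1252 byte 0x99.
$name = "ACME Bags\x99 Workhorse Style 1500";

// chr(153) is the single byte 0x99, regardless of any charset.
$tmByte = chr(153);

// Strip the byte out, as described above.
$stripped = str_replace($tmByte, '', $name);
echo $stripped, "\n"; // ACME Bags Workhorse Style 1500

// Alternative: keep the character by converting the string to UTF-8.
$asUtf8 = mb_convert_encoding($name, 'UTF-8', 'Windows-1252');
echo bin2hex(substr($asUtf8, 9, 3)), "\n"; // e284a2
```

Whether to strip or convert depends on what charset the outgoing email declares; converting only helps if the email is then sent as UTF-8.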

So the lesson here is to not assume something quasi-magical is happening when two facts seem to conflict, like when I assumed the ™ was encoded in some multi-byte extension to Latin1. It can’t be; that’s not possible. The only common encoding in the English world that includes it is Windows-1252, so that had to be what I was seeing, despite Firefox reporting otherwise. If I had realized and accepted that earlier, I would have saved myself a lot of shotgun debugging. Why Firefox did that is a separate question that I don’t really care enough to answer, but IE does some auto-detecting of character encodings and displays whatever it thinks will work best. Maybe Firefox did the same thing, ignoring the encoding specified in the document, and forgot to update the page info? That’s all I can figure.

PHP best practices

I’m currently working for a company that uses the LAMP application stack. They have only had one full-time programmer since they started, and he’s a cowboy. They don’t use much of a database abstraction layer, they mix their display code with their business logic, they don’t do any testing, and, even worse than not using source control at all, they sometimes use source control.

I’m starting a new project that will be fairly large and independent from the rest of the site, so I’d like to introduce some better development practices. I don’t know much about PHP frameworks and the like, so if you have any suggestions for what I should use, please post them in the comments. The only major requirement is that it can be used alongside the existing code. So what do you suggest?