Filenames with non-ASCII letters

Let’s start off with a quick question!
Can you spot the difference between the two rows below?

/images/räksmörgås.jpg
/images/räksmörgås.jpg

I couldn’t.

My browser, however, insisted that there was no “räksmörgås.jpg” on the web server, even though from my point of view the file was clearly there.

Since the error only occurred with filenames containing the letters å, ä and ö, I first suspected a mix-up between UTF-8 and ISO-8859-1. That wasn’t the case, however.

My next course of action was to urlencode the requested filename and the filename from the server, and this is when I found something interesting!

ra%CC%88ksmo%CC%88rga%CC%8As
r%C3%A4ksm%C3%B6rg%C3%A5s

Now you see the difference, right?
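
To reproduce the comparison, here is a minimal sketch. It assumes PHP 7 or later for the \u{...} escapes, and the variable names are only illustrative. Both strings are written with explicit escapes so that an editor or a copy-paste cannot silently normalize them.

```php
<?php
// Precomposed form: each letter is a single code point.
$composed = "r\u{00E4}ksm\u{00F6}rg\u{00E5}s";
// Decomposed form: a base letter followed by a combining mark.
$decomposed = "ra\u{0308}ksmo\u{0308}rga\u{030A}s";

echo rawurlencode($decomposed), "\n"; // ra%CC%88ksmo%CC%88rga%CC%8As
echo rawurlencode($composed), "\n";   // r%C3%A4ksm%C3%B6rg%C3%A5s

// The strings look identical on screen but differ byte for byte.
var_dump($composed === $decomposed);  // bool(false)
```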

The reason behind the difference is that there are multiple ways to represent the common Swedish letters å, ä and ö (and other non-ASCII letters as well, but for readability, let’s keep it short).

Let’s look at the character codes (decimal Unicode code points) for the three letters that were causing trouble in my case:

Letter | Mac OS X  | Linux
-------+-----------+------
å      | 97 + 778  | 229
ä      | 97 + 776  | 228
ö      | 111 + 776 | 246

Notice the pattern?
The Mac uses “a” (97) and “o” (111) and then adds the ring (778) or the dots (776) as a separate combining character. Linux, however, uses a different, single character for each letter.
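
The numbers in the table are decimal Unicode code points, and you can list them yourself. The following is only a sketch: it assumes the mbstring extension and PHP 7.4 or later (for mb_str_split and arrow functions), and codePoints is a hypothetical helper name, not a built-in.

```php
<?php
// Hypothetical helper: returns the decimal code points of a UTF-8 string.
function codePoints(string $s): array {
    return array_map(
        fn(string $ch) => mb_ord($ch, 'UTF-8'),
        mb_str_split($s, 1, 'UTF-8')
    );
}

echo implode(' + ', codePoints("a\u{030A}")), "\n"; // 97 + 778  (decomposed å)
echo implode(' + ', codePoints("\u{00E5}")), "\n";  // 229       (precomposed å)
echo implode(' + ', codePoints("o\u{0308}")), "\n"; // 111 + 776 (decomposed ö)
echo implode(' + ', codePoints("\u{00F6}")), "\n";  // 246       (precomposed ö)
```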

Unicode defines several normalization forms for representing the same text. The two competing forms here are “Canonical Decomposition” (NFD) and “Canonical Composition” (NFC), and I needed to convert from one to the other.
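
Converting between the two forms can be sketched with PHP’s Normalizer class, assuming the intl extension is installed. The variable names are illustrative only.

```php
<?php
$nfd = "ra\u{0308}ksmo\u{0308}rga\u{030A}s"; // decomposed (NFD)
$nfc = "r\u{00E4}ksm\u{00F6}rg\u{00E5}s";    // precomposed (NFC)

var_dump($nfd === $nfc); // bool(false): same letters on screen, different bytes

// Normalizing to a common form makes the two strings comparable.
var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // bool(true)
var_dump(Normalizer::normalize($nfc, Normalizer::FORM_D) === $nfd); // bool(true)

// isNormalized() reports whether a string is already in the given form.
var_dump(Normalizer::isNormalized($nfd, Normalizer::FORM_C)); // bool(false)
```

Comparing filenames only after normalizing both sides to the same form avoids this kind of invisible mismatch.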

My solution

I had this error on a server where the files had been stored on a Mac and then re-uploaded to a Linux server. I didn’t have shell access to the server, so I fixed it with the following PHP code, which loops through all affected files and updates their names:

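<?php
// Normalizes all filenames in the current folder to NFC (needs the intl extension).
// GLOB_MARK is the named constant for the flag value 2 on Linux; it appends "/" to directories.
foreach (glob("*", GLOB_MARK) as $file) {
  $after = Normalizer::normalize($file, Normalizer::FORM_C);
  // normalize() returns false for invalid UTF-8: skip those names,
  // and never overwrite a file that already has the normalized name.
  if ($after !== false && $file !== $after && !file_exists($after)) {
    rename($file, $after);
  }
}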
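
Before renaming anything on a live server, it may be worth listing what would change first. The following dry-run sketch is an addition to the script above, not part of the original fix; planRenames is a hypothetical helper name, and the intl extension is assumed.

```php
<?php
// Hypothetical helper: maps each name that is not already NFC to its NFC form.
function planRenames(array $names): array {
    $plan = [];
    foreach ($names as $name) {
        $nfc = Normalizer::normalize($name, Normalizer::FORM_C);
        // normalize() returns false for invalid UTF-8; leave such names alone.
        if ($nfc !== false && $nfc !== $name) {
            $plan[$name] = $nfc;
        }
    }
    return $plan;
}

// Print the planned renames for the current folder without touching any file.
// URL-encoding the names makes the otherwise invisible difference visible.
foreach (planRenames(glob("*") ?: []) as $from => $to) {
    echo rawurlencode($from), " -> ", rawurlencode($to), "\n";
}
```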

If you’ve got shell access (or PHP’s exec is enabled), you could probably achieve the same thing more easily with iconv or a similar tool; convmv, for example, has an --nfc option for exactly this kind of renaming.

Fun fact: “räksmörgås” is a Swedish word commonly used for testing that the non-ASCII letters Å, Ä and Ö are handled correctly.