Filenames with non-ASCII letters
Let’s start off with a quick question!
Can you spot the difference between the two rows below?
/images/räksmörgås.jpg
/images/räksmörgås.jpg
I couldn’t.
My browser, however, insisted that there was no “räksmörgås.jpg” on the web server, even though from my point of view the file was clearly there.
Since the error only occurred with filenames containing the letters å, ä and ö, I first suspected a mix-up between UTF-8 and ISO-8859-1. That wasn’t the case.
My next course of action was to urlencode the requested filename and the filename from the server, and this is when I found something interesting!
ra%CC%88ksmo%CC%88rga%CC%8As
r%C3%A4ksm%C3%B6rg%C3%A5s
Now you see the difference, right?
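To reproduce the comparison, here is a minimal sketch that builds both spellings from explicit code points and URL-encodes them. The variable names are my own, and the `\u{...}` escapes assume PHP 7 or later.

```php
<?php
// Both strings render as "räksmörgås" but consist of different code points.
$decomposed = "ra\u{0308}ksmo\u{0308}rga\u{030A}s"; // base letter + combining mark
$composed   = "r\u{00E4}ksm\u{00F6}rg\u{00E5}s";    // single precomposed letters

echo urlencode($decomposed), "\n"; // ra%CC%88ksmo%CC%88rga%CC%8As
echo urlencode($composed), "\n";   // r%C3%A4ksm%C3%B6rg%C3%A5s

// A byte-wise comparison, which is what a filename lookup does, fails.
var_dump($decomposed === $composed); // bool(false)
```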
The reason behind the difference is that there are multiple ways to represent the common Swedish letters å, ä and ö (and other non-ASCII letters as well, but for readability, let’s keep it short).
Here are the code points for the three letters that were causing trouble in my case:
Letter | Mac OS X  | Linux
-------+-----------+------
å      | 97 + 778  | 229
ä      | 97 + 776  | 228
ö      | 111 + 776 | 246
Notice the pattern?
The Mac uses “a” (97) or “o” (111) and then adds a combining ring (778) or combining dots (776). Linux, however, uses a single, entirely different character for each letter.
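The numbers in the table can be checked with a few lines of PHP. This is only a sketch: `codePoints()` is a hypothetical helper of my own, and it assumes the mbstring extension and PHP 7.4 or later (for `mb_str_split()` and arrow functions).

```php
<?php
// Requires the mbstring extension; mb_str_split() needs PHP 7.4 or later.

// Hypothetical helper: decimal code points of a UTF-8 string.
function codePoints(string $s): array
{
    return array_map(
        fn(string $char) => mb_ord($char, 'UTF-8'),
        mb_str_split($s, 1, 'UTF-8')
    );
}

$decomposed = "a\u{030A}"; // "a" followed by a combining ring above
$composed   = "\u{00E5}";  // the single precomposed letter "å"

echo implode(' + ', codePoints($decomposed)), "\n"; // 97 + 778
echo implode(' + ', codePoints($composed)), "\n";   // 229
```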
Unicode defines several normalization forms for representing such characters. The two competing forms here are “Canonical Decomposition” (NFD) and “Canonical Composition” (NFC), and I needed to convert between the two.
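As a quick illustration of converting between the two forms, the sketch below round-trips the word through PHP’s `Normalizer` class. It assumes the intl extension is installed, and the variable names are mine.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.

$nfc = "r\u{00E4}ksm\u{00F6}rg\u{00E5}s";               // composed form
$nfd = Normalizer::normalize($nfc, Normalizer::FORM_D); // decomposed form

// Same visible text, different byte sequences.
var_dump($nfd === $nfc); // bool(false)
var_dump(strlen($nfc));  // int(13)
var_dump(strlen($nfd));  // int(16)

// Converting back to NFC restores the original bytes.
var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // bool(true)

// isNormalized() reports whether a string is already in a given form.
var_dump(Normalizer::isNormalized($nfd, Normalizer::FORM_C)); // bool(false)
```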
My solution
I ran into this error on a Linux server whose files had first been stored on a Mac and then re-uploaded. I didn’t have shell access to the server, so I fixed it with the following PHP code, which loops through all affected files and updates their names:
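<?php
// Normalizes all filenames in the current folder to NFC.
// Requires the intl extension, which provides the Normalizer class.
// GLOB_MARK (the value 2 on Linux) appends a slash to directory names.
foreach (glob("*", GLOB_MARK) as $file) {
    $after = Normalizer::normalize($file, Normalizer::FORM_C);
    // normalize() returns false on failure, e.g. for invalid UTF-8; skip those.
    if ($after !== false && $file !== $after) {
        rename($file, $after);
    }
}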
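Before renaming anything on a live server, it may be worth doing a dry run first. The sketch below is one way to do that: `planRenames()` is a hypothetical helper of my own that only computes which names would change, and the loop prints them URL-encoded so the two forms can be told apart. It assumes the intl extension.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.

// Hypothetical helper: map each name that is not already in NFC to its
// NFC form, without touching the filesystem.
function planRenames(array $names): array
{
    $plan = [];
    foreach ($names as $name) {
        $normalized = Normalizer::normalize($name, Normalizer::FORM_C);
        // normalize() returns false on failure; leave such names alone.
        if ($normalized !== false && $normalized !== $name) {
            $plan[$name] = $normalized;
        }
    }
    return $plan;
}

// Dry run over the current folder: print what would change, rename nothing.
foreach (planRenames(glob('*') ?: []) as $from => $to) {
    $note = file_exists($to) ? ' (target already exists)' : '';
    echo urlencode($from), ' -> ', urlencode($to), $note, "\n";
}
```

Flagging targets that already exist matters because `rename()` may overwrite an existing file of the same name.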
You could probably use iconv or similar tools to achieve the same thing more easily if you’ve got shell access (or if PHP’s exec() is enabled).
Fun fact: “Räksmörgås” is a Swedish word commonly used to test that the non-ASCII letters Å, Ä and Ö work correctly.