Tag: encoding

  • Filenames with non-ascii letters

    Let’s start off with a quick question!
    Can you spot the difference between the two rows below?

    /images/räksmörgås.jpg
    /images/räksmörgås.jpg

    I couldn’t.

    My browser, however, insisted that there was no “räksmörgås.jpg” on the web server, even though the file was clearly there from my point of view.

    Since the error only occurred with filenames containing the letters å, ä and ö, I first suspected a mix-up between UTF-8 and ISO-8859-1. That wasn’t the case.

    My next course of action was to urlencode the requested filename and the filename from the server, and this is when I found something interesting!

    ra%CC%88ksmo%CC%88rga%CC%8As
    r%C3%A4ksm%C3%B6rg%C3%A5s

    Now you see the difference, right?
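
    The comparison above can be reproduced with a minimal sketch like the one below. The two strings are written with explicit code point escapes (PHP 7+) so that the result does not depend on how the source file itself is normalized. The variable names are my own, and rawurlencode is only one way to make the bytes visible.

```php
<?php
// The same word in two representations, spelled out with \u{...} escapes.
$composed   = "r\u{00E4}ksm\u{00F6}rg\u{00E5}s";      // one code point per letter
$decomposed = "ra\u{0308}ksmo\u{0308}rga\u{030A}s";   // base letter + combining mark

echo rawurlencode($decomposed), "\n"; // ra%CC%88ksmo%CC%88rga%CC%8As
echo rawurlencode($composed), "\n";   // r%C3%A4ksm%C3%B6rg%C3%A5s

// The strings look identical on screen but are different byte sequences.
var_dump($composed === $decomposed);  // bool(false)
```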

    The reason behind the difference is that there are multiple ways to represent the common Swedish letters å, ä and ö (and other non-ASCII letters as well, but for readability, let’s keep it short).

    Here are the char codes for the three letters that were causing trouble in my case:

    Letter | Mac OS X  | Linux
    -------+-----------+------
    å      | 97 + 778  | 229
    ä      | 97 + 776  | 228
    ö      | 111 + 776 | 246
    

    Notice the pattern?
    Mac uses “a” (97) and “o” (111) and then adds a combining circle (778) or combining dots (776). Linux, however, has a single, different char for each letter.
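
    To check char codes like these yourself, something along the following lines should work. It is only a sketch: codePoints is a hypothetical helper of mine, and it assumes PHP 7.4+ with the mbstring and intl extensions installed.

```php
<?php
// Hypothetical helper: the decimal code points that make up a UTF-8 string.
function codePoints(string $s): array
{
    return array_map(
        fn(string $ch) => mb_ord($ch, 'UTF-8'),
        mb_str_split($s, 1, 'UTF-8')
    );
}

// Start from the single-code-point letters and decompose each one.
foreach (["\u{00E5}", "\u{00E4}", "\u{00F6}"] as $letter) {
    $decomposed = Normalizer::normalize($letter, Normalizer::FORM_D);
    echo $letter, ': ',
        implode(' + ', codePoints($decomposed)), ' vs ',
        implode(' + ', codePoints($letter)), "\n";
}
// å: 97 + 778 vs 229
// ä: 97 + 776 vs 228
// ö: 111 + 776 vs 246
```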

    Unicode defines several normal forms for representing such characters. The two competing ones here are “Canonical Decomposition” (NFD) and “Canonical Composition” (NFC), and I needed to convert between the two.
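
    As a rough illustration of converting between the two forms, here is a small round trip using the Normalizer class from PHP’s intl extension. The filename is just the example from this post, and the variable names are mine.

```php
<?php
// Requires the intl extension, which provides the Normalizer class.
$nfc = "r\u{00E4}ksm\u{00F6}rg\u{00E5}s.jpg";            // composed (NFC)
$nfd = Normalizer::normalize($nfc, Normalizer::FORM_D);  // decomposed (NFD)

var_dump($nfc === $nfd);               // bool(false)
var_dump(strlen($nfc), strlen($nfd));  // int(17) and int(20): NFD needs 3 more bytes

// isNormalized() tells whether a string is already in the given form.
var_dump(Normalizer::isNormalized($nfd, Normalizer::FORM_C)); // bool(false)

// Converting back to NFC gives the original bytes again.
var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // bool(true)
```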

    My solution

    I had this error on a server where files had been stored on a Mac and then re-uploaded to a Linux server. I didn’t have shell access to the server, so I fixed it with the following PHP code, which loops through all affected files and updates their names:

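    <?php
    // Normalizes all filenames in the current folder to NFC.
    // Requires the intl extension, which provides the Normalizer class.
    // GLOB_MARK (the value 2 on Linux) appends a slash to directory names.
    foreach (glob("*", GLOB_MARK) as $file) {
      $after = Normalizer::normalize($file, Normalizer::FORM_C);
      // normalize() returns false on failure, e.g. for invalid UTF-8.
      if ($after !== false && $file !== $after) {
        rename($file, $after);
      }
    }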
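
    Before renaming anything for real, a dry run can be useful. The sketch below is not what I ran on the server. It is a hypothetical variant (planRenames is a made-up helper name) that only prints what would be renamed and skips any file whose normalized name is already taken, so nothing gets overwritten.

```php
<?php
// Requires the intl extension.

// Hypothetical helper: maps each name that is not yet NFC to its NFC form.
// Names whose NFC form already exists in the list are skipped.
function planRenames(array $names): array
{
    $taken = array_flip($names);
    $plan = [];
    foreach ($names as $name) {
        $target = Normalizer::normalize($name, Normalizer::FORM_C);
        if ($target === false || $target === $name || isset($taken[$target])) {
            continue;
        }
        $plan[$name] = $target;
        $taken[$target] = true;
    }
    return $plan;
}

// Dry run: print the plan (urlencoded, so the difference is visible)
// instead of calling rename().
foreach (planRenames(glob('*') ?: []) as $from => $to) {
    echo rawurlencode($from), ' -> ', rawurlencode($to), "\n";
}
```

    Once the printed plan looks right, the echo line could be swapped for a rename($from, $to) call.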
    

    If you’ve got shell access (or PHP exec is enabled), a tool such as convmv, or iconv on a Mac with its UTF-8-MAC encoding, could probably achieve the same thing more easily.

    Fun fact: “räksmörgås” (shrimp sandwich) is a Swedish word commonly used for testing that the non-ASCII letters Å, Ä and Ö work correctly.