Unicode Normalization
Unicode is a standard for representing characters from the world's languages in a single machine-readable format. Files.com, like most cloud services, uses Unicode when working with text. Because paths are particularly sensitive in a filesystem, Files.com follows a specific pattern for normalizing Unicode values in paths.
Files.com uses the "NFKC" (Normalization Form Compatibility Composition) algorithm for normalizing Unicode as part of path comparison.
Although Files.com normalizes Unicode for path comparison, Files.com is Unicode-preserving: a path name is stored using the exact Unicode representation supplied when the file or folder was first created.
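To make the distinction concrete, here is a minimal Python sketch using the standard library's unicodedata module. The helper name same_after_nfkc is hypothetical, and the sketch illustrates only the NFKC step, not the full comparison that Files.com performs.

```python
import unicodedata


def same_after_nfkc(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFKC normalization only."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


composed = "caf\u00e9.txt"     # "é" as a single code point (U+00E9)
decomposed = "cafe\u0301.txt"  # "e" followed by a combining acute accent (U+0301)

# The raw strings differ, so a Unicode-preserving store keeps each as written...
raw_equal = composed == decomposed

# ...but both normalize to the same NFKC form.
nfkc_equal = same_after_nfkc(composed, decomposed)

# NFKC also folds compatibility characters, such as the "fi" ligature (U+FB01).
ligature_folded = unicodedata.normalize("NFKC", "\ufb01le.txt")
```

In this sketch the two spellings of the same name are different strings, yet they are equal once normalized, which is why normalization is used for comparison rather than for storage.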
Exact Algorithm For Path Normalization
Files.com uses two algorithms for path normalization. Our Normalize algorithm is applied to every path provided to the Files.com service to remove anything that does not comply with our path requirements. If you are building an SDK or a manual API integration with Files.com, we recommend implementing this algorithm and applying it before sending any path to the Files.com API, so that the server treats each path exactly as you provided it.
Additionally, our Normalize For Comparison algorithm is used to compare two paths to determine whether they are the same. If you are building an SDK or a manual API integration with Files.com that needs to determine whether two file paths are the same, we recommend that you also implement this algorithm.
The official Files.com SDKs implement both algorithms natively and we encourage the use of our SDKs rather than implementing either of these algorithms by hand. For completeness, we describe the algorithms here. Sample code for the following algorithms can be found in our SDKs.
Normalize Algorithm
Convert the path to UTF-8
Remove any characters with byte value of 0
Convert any backslash (\) characters to forward slashes (/)
Remove any trailing or leading slashes
Remove any path parts that are . or ..
Replace any duplicate forward slashes (such as ///) with a single forward slash (/)
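The steps above can be sketched in Python as follows. This is an illustrative reading of the list, not the code from the official SDKs: the function name normalize_path is hypothetical, and decoding invalid UTF-8 bytes with replacement characters is an assumption. Splitting on the slash and dropping empty parts covers the leading, trailing, and duplicate slash steps in one pass.

```python
from typing import Union


def normalize_path(path: Union[str, bytes]) -> str:
    """Hypothetical sketch of the Normalize algorithm described above."""
    # Step 1: work with Unicode text. Python str is already Unicode, so only
    # raw bytes need decoding (replacing invalid bytes is an assumption).
    if isinstance(path, bytes):
        path = path.decode("utf-8", errors="replace")

    # Step 2: remove NUL characters (byte value 0).
    path = path.replace("\x00", "")

    # Step 3: convert backslashes to forward slashes.
    path = path.replace("\\", "/")

    # Steps 4-6: splitting on "/" and discarding empty parts removes leading,
    # trailing, and duplicate slashes; "." and ".." parts are dropped outright
    # (they are removed, not resolved against the parent folder).
    parts = [part for part in path.split("/") if part not in ("", ".", "..")]
    return "/".join(parts)


example = normalize_path("\\folder\\.\\sub//file.txt/")  # "folder/sub/file.txt"
```

Note that under this reading a path such as a/../b becomes a/b, because the .. part is simply removed.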
Normalize For Comparison Algorithm
Run the path through the Normalize Algorithm
Normalize the path using the Unicode NFKC algorithm
Transliterate and remove accent marks by using the official Files.com transliteration map specified below. Each entry in the map begins with the character to match; replace every instance of that character in the path with the remaining characters of the entry.
Convert the path to lowercase using the case mapping found in Unicode 9.0. (Note: we are aware that this version of Unicode is fairly old and that many modern programming languages now implement Unicode 15.0. The differences affect only two very rare languages, and we have never seen them cause any actual issues in practice at Files.com. We suggest using whichever version of Unicode your environment supports, as that will most likely be fine.)
Remove any trailing whitespace (\r, \n, \t, or the space " " character)
Any two paths with the same resulting string from this algorithm are considered the same file on Files.com.
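As a rough illustration, the comparison steps might be sketched in Python like this. The names normalize_path, normalize_for_comparison, and paths_match are hypothetical, and the official transliteration map is not reproduced here: SAMPLE_MAP is a one-entry placeholder that only shows the expected shape of the data. Python's str.lower() uses the case mapping of the Unicode version bundled with the interpreter rather than Unicode 9.0.

```python
import unicodedata
from typing import Mapping, Optional


def normalize_path(path: str) -> str:
    """Condensed, hypothetical sketch of the Normalize algorithm."""
    path = path.replace("\x00", "").replace("\\", "/")
    return "/".join(p for p in path.split("/") if p not in ("", ".", ".."))


def normalize_for_comparison(
    path: str, translit_map: Optional[Mapping[str, str]] = None
) -> str:
    """Hypothetical sketch of the Normalize For Comparison algorithm."""
    result = normalize_path(path)                   # run the Normalize algorithm
    result = unicodedata.normalize("NFKC", result)  # Unicode NFKC normalization
    if translit_map:
        # Replace each mapped character with its replacement string.
        result = "".join(translit_map.get(ch, ch) for ch in result)
    result = result.lower()                         # interpreter's Unicode version
    return result.rstrip("\r\n\t ")                 # trailing \r, \n, \t, space only


def paths_match(a: str, b: str, translit_map: Optional[Mapping[str, str]] = None) -> bool:
    """Two paths match when their comparison forms are identical."""
    return normalize_for_comparison(a, translit_map) == normalize_for_comparison(b, translit_map)


# Placeholder entry for illustration only; NOT the official Files.com map.
SAMPLE_MAP = {"\u00e9": "e"}

same = paths_match("/Folder/Report.TXT", "folder\\report.txt\t")
```

With the placeholder map, both the composed and decomposed spellings of Café reduce to cafe, since NFKC composes the accent before the map is applied.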