Packed URL

Shortening services like TinyURL exist because short strings are hard to “compress”. I recently needed to compare a URL. I couldn’t use a URL shortener, because each URL is only used once, and they key changes on each one.

So I built my own algorithm which others will be able to use. This could also be used for QR Code data, and for Web Page Proxies which usually contain the URL as a parameter.

Let’s work backwards from the results:

We use the following input string which is 62 characters:

https://stackoverflow.com/questions/1192732/really-simple-short-string-compression

With no packing and then converting to base64 encoding it’s 112 characters and like this:

aHR0cHM6Ly9zdGFja292ZXJmbG93LmNvbS9xdWVzdGlvbnMvMTE5MjczMi9yZWFsbHktc2ltcGxlLXNob3J0LXN0cmluZy1jb21wcmVzc2lvbg__

With GZip, it needs to include a header, so on short strings you actually go backwards encoding to 124 base64 characters:

H4sIAAAAAAAEAA3L0Q5AMAwF0C-aZjwIf7MsxaLW6S3i73k.Z3NvmIngKe96sy2iT5f1oPNieNEKinHqx6En4yTyBpSjCQdsah7gVuoa.tCMgd9.dGQz6FIAAAA_

So what I achieved is significant and specialised for URLs being only 64x base64 characters, that’s 57% the size of the original base64 string:

uU0ZIOUqiVS7lkSXoZCUhbPphsOceFkkQOuxkzBek524i3mnHIVlGznjI0lJOAs_

The fact that my scheme is “specialised” is important. That’s why mine beats the general GZip compression. My probably wouldn’t beat GZip on larger code files, but that’s not what it’s for.

How it works

It works by digesting a URL in a predictable way. URLs are predictable, they’re like a specific English sentence structure. Here’s a breakdown of the example URLs:

[https://] [] [stackoverflow] [.] [com] [] [/] [questions] [/] [1192732] [/] [really-simple-short-string-compression] []

  • [https://] – The URL scheme is either HTTP or HTTPS
    • 1 bit instead of 56 bits
  • [] – There’s no @, so no username or password – 1 bit = 0
  • [stackoverflow] they are all lower case characters so a 5-bit encoding is used (65 bits), plus header (3-bits) to indicate it’s a 5-bit string, plus 3-bit-block number length (6-bits)
    • 74 bits instead of 104 ~ Running total: 76
  • [.] – In the host section, we expect multiple dots. So after every string section we encode whether there is a dot with a single bit – 1 bit down from 8
  • [com] is in our “host” dictionary which has 16 values so 4-bit, plus the header (3-bit) to indicate it was a dictionary lookup
    • 7-bits instead of 24 bits ~ Running total: 84
  • There are no more dots – 1 bit set to 0
  • [] There’s no :, so no port number override
  • [/] Now we expect folder forward slashes –
    • 1 bit set to 1 instead of 8 bits ~ Running total: 87
  • [questions] is encoded as a 5-bit string –
    • 45 bits + 3 + 6 – 54 bits instead of 72 bits ~ Running total: 141
  • [/] – 1 bit set to 1 instead of 8 bits
  • [1192732] is a number which is encoded as an 8-bit-block number needing 3 blocks + 3bit header – 27bits instead of 56 bits
  • [/] – 1 bit set to 1 instead of 8 bits
    • ~ Running total: 170
  • [really-simple-short-string-compression] – 190bits + 3 + 9 – 202 bits instead of 304 bits ~ Running total: 372
  • There’s no dot – 1 bit = 0
  • [] There’s no question mark for params – 1 bit = 0
  • total: 374

It’s possible to add your own custom values into the segment dictionaries, this would be handy if you can create a custom implementation for your application, you can have common api folder paths across multiple hosts, common subdomain names, common parameter keys, and working with specific TLDs.

Applications

Data Types

  • Expected http scheme – 1 bit
  • Expected dot – 1 bit
  • Expected folder – 1 bit
  • 4-bit string – most popular lowercase English language characters
  • 5-bit string – lower case English alphabet characters, and also – _ + which are common in URLs
  • 6-bit string – base64 using – _ and + for padding
  • GUID – 128-bit
  • Number – 8-bit block (with 1-bit extension)
  • Expected port number – 16-bit ushort
  • (Reserved) – Hex? Base-7 for special chars? Mixed (Dictionary Text)?

Future

It would be nice if it could be further reduced to around 25% of the size 80% of the time, but I don’t think that is possible at all with the current scheme and it is likely theoretically impossible. 50% is likely close to the maximum compression limit.

  • Connect with other people and groups to help contribute. see https://whatwg.org/ (Thanks Paul Brooks)
  • IP addresses as hosts (thanks Paul Brooks!) – IPv4 and IPv6
  • Punycode Support (thanks Paul Brooks!) – for unicode support
  • Fill 3 remaining unassigned characters for the 5-bit text scheme
  • 7-bit base-128 to support special characters that are allowed in URL, ie. { , ! } and more
    • Or, perhaps remove 4-bit text strings replacing that with special characters only, AND then ADD a mixed-type segment type which can include Dictionary, String, Data
  • Distinct URL Text to URL Data Model step – so we can leverage a mature URL library to ensure we don’t mis-parse a URL. Then the URL Data Model can be binary (bitstream) serialised. see https://url.spec.whatwg.org/ and https://www.urlencoder.org/
  • Release the code opensource for others to learn and expand on.
  • PackingHashes (bookmarks) – technically not needed, because servers don’t get sent these values nor process them.
  • Decoder – this is only academic so far with a prototype encoder.
  • My system already includes a 4-bit (16 letters) string encoder for when there’s only lowercase characters and the less common alphabet characters are not used.
  • I intend to ignore uppercase characters when selecting the 5-bit encoding except for parameter values. But some web servers and underlying file systems are case-sensitive so this would be an optional configuration of the encoder.
  • Mixed Dictionary and Letters – allow a match on the dictionary as well as text encoding inside a URL segment. This mode would be run alongside the normal text-only mode and the shorter solution used.
  • Include an English dictionary for common partial word segments like ‘ies’. This might only be useful for longer word segments given the overhead of Mixed Dictionary and Letters.
  • Hardcoded and common dictionaries with a header number to indicate which dictionary set to use. For example they could be regional – Australia has .net.au, .com.au, etc.. for Hosts. Also, popular hostnames such as facebook, google, bing, instagram, etc..
  • ParamsOnly mode – where the prefix is infered by the application context, but the params are the variable part of the URL string.
  • Javascript Version – currently it’s written only in C#.

Leave a Reply

Your email address will not be published. Required fields are marked *