In my quest to write a fast IPv4+6 parser, I have written a slow-but-I-think-correct parser, to use as a base of comparison.

In doing so, I have discovered more cursed IP address representations that I was previously unaware of.

A thread!

We start out simple, with IPv4 and IPv6 in what I'll call their "canonical form":

192.168.0.1
1:2:3:4:5:6:7:8

Various specs call these "dotted quad", dot-separated fields each representing 1 byte; and "colon-hex", colon-separated fields each representing 2 bytes.
The first bits of complexity come from IPv6. In canonical form, common addresses would end up with long runs of zeros in the middle. So, "::" allows you to elide 1 or more 16-bit blocks of zeros:

1:2::3:4 means 1:2:0:0:0:0:3:4
Next up, for cursed historical reasons, IPv6 permits you to write the final 32 bits of the address in dotted quad form. Effectively, you can splat an IPv4 address onto the end of IPv6 addresses!

1:2:3:4:5:6:77.77.88.88 means 1:2:3:4:5:6:7777:8888
And of course, you can combine the two!

fe80::1.2.3.4 means fe80:0:0:0:0:0:102:304
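
The conversion is just "pack each pair of bytes into one 16-bit field and print it in hex". A minimal Python sketch, assuming plain decimal fields; `quad_to_hex_fields` is a made-up name for illustration, not something from my parser:

```python
def quad_to_hex_fields(quad):
    """Turn a trailing dotted quad into the two colon-hex fields it stands for."""
    a, b, c, d = (int(part) for part in quad.split("."))
    for byte in (a, b, c, d):
        if not 0 <= byte <= 255:
            raise ValueError("each dotted quad field must fit in one byte")
    # Two bytes per 16-bit field, printed in hex without leading zeros.
    return f"{(a << 8) | b:x}:{(c << 8) | d:x}"


# quad_to_hex_fields("1.2.3.4") returns "102:304", as in fe80::1.2.3.4 above.
```
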
The existence of :: also introduces an annoying edge case in parsing: the "::" can be at the start or end of the address, and the "empty" side of the address is not one of the 16-bit fields.

::1 means 0:0:0:0:0:0:0:1
1:: means 1:0:0:0:0:0:0:0
:: means 0:0:0:0:0:0:0:0
That's a natural consequence of the :: rule, but it makes the parsers slightly more annoying to write.
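
Here's roughly what handling :: looks like, as a hedged Python sketch rather than my actual parser. `expand_double_colon` is a hypothetical helper; it only deals with the field count, and does not check that the fields are valid hex or handle a trailing dotted quad.

```python
def expand_double_colon(text):
    """Return the 8 fields of a colon-hex address, expanding at most one '::'."""
    if text.count("::") > 1:
        raise ValueError("'::' may appear at most once")
    if "::" in text:
        left, right = text.split("::")
        # An empty side of '::' contributes no fields at all.
        head = left.split(":") if left else []
        tail = right.split(":") if right else []
        missing = 8 - len(head) - len(tail)
        if missing < 1:
            raise ValueError("'::' must stand for at least one field")
        fields = head + ["0"] * missing + tail
    else:
        fields = text.split(":")
    if len(fields) != 8 or any(field == "" for field in fields):
        raise ValueError("expected 8 non-empty fields")
    return fields


# ":".join(expand_double_colon("1:2::3:4")) returns "1:2:0:0:0:0:3:4"
```

The `if left else []` part is the annoying edge case: splitting an empty string on ":" would otherwise produce one bogus empty field.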
One final rule for IPv6: technically, each colon-hex field is 4 hex digits, but you can elide leading zeros.

Fully canonically, :: is 0000:0000:0000:0000:0000:0000:0000:0000

But we allow the compacted forms.

My apologies to my trypophobic followers.
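
If you want to see the fully padded form of any address, Python's standard `ipaddress` module will print it for you. This is a small sketch using that library, not my parser; `pad_fields` is just an illustrative wrapper name.

```python
import ipaddress


def pad_fields(address):
    """Write every colon-hex field with all 4 hex digits and no '::'."""
    return ipaddress.IPv6Address(address).exploded


# pad_fields("1:2::3:4") returns "0001:0002:0000:0000:0000:0000:0003:0004"
```
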
That's it for IPv6, mostly. Now, on to IPv4!

Fun fact, the textual representation of IPv4 was never standardized in any document before IPv6 needed a grammar for its weirdo "trailing dotted quad" notation.
So, it's a de-facto standard that boils down to mostly "what did 4.2BSD understand?"

And hoo boy, strap yourselves in, because 4.2BSD sure had some wacky opinions!
Let's use 192.168.140.255 as an example. That's an IPv4 address that people would look at and go "yes, that sure is an IPv4 address."

How else can we write that exact same address?
This is the same IP address: 3232271615

You get that by simply interpreting the 4 bytes of the IP address as a big-endian unsigned 32-bit integer, and printing that.

If you visit http://3232271615 , Chrome will attempt to load http://192.168.140.255.
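
The arithmetic, as a quick Python sketch (the function names are made up for illustration, and it assumes a plain decimal dotted quad as input):

```python
def quad_to_uint32(quad):
    """Read the 4 bytes of a dotted quad as one big-endian unsigned integer."""
    octets = bytes(int(part) for part in quad.split("."))  # rejects values > 255
    if len(octets) != 4:
        raise ValueError("expected exactly four fields")
    return int.from_bytes(octets, "big")


def uint32_to_quad(value):
    """Inverse direction: split a 32-bit integer back into 4 bytes."""
    return ".".join(str(octet) for octet in value.to_bytes(4, "big"))


# quad_to_uint32("192.168.140.255") returns 3232271615
# uint32_to_quad(3232271615) returns "192.168.140.255"
```
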
Okay, but that's sort-of sensible, right? An IPv4 address is 4 bytes, so printing it as a single number is a bit human-unfriendly, but broadly plausible, right?

Okay, how about 0300.0250.0214.0377 ?
Yup, that's the same address. Dotted quad, except each field is written out in octal.
If octal is supported, you might wonder about hex.

And you'd be right! 192.168.140.255 is also 0xc0.0xa8.0x8c.0xff , according to 4.2BSD.
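
In code, the per-field rule looks something like this. It's a hedged Python sketch of the behavior described in this thread, not a faithful port of the 4.2BSD code: `parse_bsd_field` is a hypothetical name, and the assumption is "0x prefix means hex, any other leading 0 means octal, otherwise decimal".

```python
def parse_bsd_field(field):
    """Parse one dotted field, picking the base from its prefix."""
    lowered = field.lower()
    if lowered.startswith("0x"):
        digits, base = lowered[2:], 16
    elif len(lowered) > 1 and lowered.startswith("0"):
        digits, base = lowered[1:], 8
    else:
        digits, base = lowered, 10
    # Only accept digits that are legal in the chosen base.
    allowed = "0123456789abcdef"[:base]
    if not digits or any(ch not in allowed for ch in digits):
        raise ValueError(f"bad field: {field!r}")
    return int(digits, base)


def normalize(text):
    """Rewrite a dotted address with every field in plain decimal."""
    return ".".join(str(parse_bsd_field(field)) for field in text.split("."))


# normalize("0300.0250.0214.0377") returns "192.168.140.255"
# normalize("0xc0.0xa8.0x8c.0xff") returns "192.168.140.255"
```
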
Now, remember the time before we had CIDR (Classless Inter-Domain Routing)? IPv4 addresses were Class A, Class B or Class C. It was a weird time.

And that weird time made it into IP addresses!

192.168.140.255 is the "Class C" notation.
192.168.36095 is the "Class B" notation.
192.11046143 is the "Class A" notation.

Basically, coalesce the final fields into either a 16-bit or a 24-bit integer field, because why not.
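
Put differently: the last field soaks up however many bytes the earlier fields didn't claim. A minimal Python sketch of that rule, assuming decimal-only fields (`parse_classful` is an illustrative name, not a real library function):

```python
def parse_classful(text):
    """Expand 1 to 4 dotted fields into a normal dotted quad."""
    raw = text.split(".")
    if not 1 <= len(raw) <= 4:
        raise ValueError("expected between one and four fields")
    if not all(field.isascii() and field.isdigit() for field in raw):
        raise ValueError("fields must be plain decimal numbers")
    *leading, last = (int(field) for field in raw)
    # The final field covers every byte not claimed by the leading fields.
    last_bytes = 4 - len(leading)
    if any(value > 255 for value in leading) or last >= 256 ** last_bytes:
        raise ValueError("field out of range")
    octets = bytes(leading) + last.to_bytes(last_bytes, "big")
    return ".".join(str(octet) for octet in octets)


# parse_classful("192.168.36095") returns "192.168.140.255"
# parse_classful("192.11046143") returns "192.168.140.255"
```

With a single field, the same rule gives you the "one big integer" form from earlier in the thread.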
And finally, we come to one last bit of unspecified behavior: do IPv4 addresses permit an unlimited number of leading zeros in each quad? Or is there a maximum of 3 digits?

001.002.003.004 is universally recognized as valid. What about 0000000001.0000000002.0000000003.0000000004?
You might be wondering if either of these numbers should be read in as octal, per one of the previous bits of this thread.

It depends! There are implementations that do either, but _most_ modern implementations have abandoned the octal notation and treat leading 0s as decimal.
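
To make the disagreement concrete, here's a small Python sketch with a switch for the two behaviors. The names and the `leading_zero_means_octal` flag are invented for illustration; they don't correspond to any particular implementation.

```python
def read_field(field, leading_zero_means_octal):
    """Read one field under either interpretation of a leading zero."""
    if leading_zero_means_octal and len(field) > 1 and field.startswith("0"):
        return int(field, 8)
    return int(field, 10)


def read_quad(text, leading_zero_means_octal):
    """Normalize a dotted quad to decimal fields without leading zeros."""
    fields = text.split(".")
    return ".".join(str(read_field(f, leading_zero_means_octal)) for f in fields)


# read_quad("010.010.010.010", True) returns "8.8.8.8"
# read_quad("010.010.010.010", False) returns "10.10.10.10"
```

Same text, two different hosts, depending on who parses it.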
Oh, and the leading zero debate also infects IPv6, to some extent! The specs tried to specify the textual representation of IPv6, but they failed to be complete. So it's unclear if 000001::00001.00002.00003.00004 is a valid IPv6 address ("common" form 1::1.2.3.4, or 1::102:304).
Most modern parsers seem to allow an unlimited number of leading zeros in their representations, probably because they're leaning on some "parse integer" library that implements that behavior.
And so, we reach the bitter end. If you want to _truly_ parse IP addresses, this is the bullshit you have to put up with.

Currently, my slow reference parser jettisons a lot of old baggage, and sticks to what I think is a sensible subset of these possibilities.
My parser understands classic v4 dotted quad, with any number of leading zeros. It does not process Class A/B notation, or hex or octal notation. It does not process the "uint32 to the knee" representation.
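
For the IPv4 half, that subset is small enough to sketch in a few lines of Python. To be clear, this is an illustration of the rules just described, not my actual parser, and `parse_ipv4_subset` is a made-up name.

```python
def parse_ipv4_subset(text):
    """Accept only 4 decimal fields, each 0 to 255, with any number of leading zeros."""
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError("expected exactly four dot-separated fields")
    octets = []
    for field in fields:
        # ASCII decimal digits only: no 0x prefix, no sign, no empty field.
        if not (field.isascii() and field.isdigit()):
            raise ValueError(f"bad field: {field!r}")
        value = int(field, 10)  # leading zeros are read as decimal, never octal
        if value > 255:
            raise ValueError(f"field out of range: {field!r}")
        octets.append(value)
    return bytes(octets)


# parse_ipv4_subset("001.002.003.004") returns b"\x01\x02\x03\x04"
```

One side effect of reading leading zeros as decimal: 0300.0250.0214.0377 gets rejected, because 300 doesn't fit in a byte.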
For IPv6, it understands canonical colon-hex form, as well as :: and trailing-IPv4 style (where the trailing IPv4 follows the same rules as the previous tweet). Each field is allowed any number of leading zeros.
And as @alanjmcf noticed, I messed up one of the representations above.

1:2:3:4:5:6:77.77.88.88 means 1:2:3:4:5:6:4d4d:5858, not 1:2:3:4:5:6:7777:8888. I missed out a decimal-to-hex conversion in there.
This thread is now https://t.co/PlJgkNlOqz , with a little rewording, but same content, if you want to handily link it beyond twitter.
