Stories from a System Archaeologist: The DNS Golden Hammer

Background and cultural references

The popular Law of the Instrument is commonly simplified as: "I suppose it is tempting, if the only tool you have is a hammer, to treat everything as if it were a nail."
The Wikipedia article on that law nicely articulates how familiar this is within programming and technology, where it is known as the golden hammer. I quote the entire section here because it rings true and relates to what I am about to write:

"""Software developer José M. Gilgado has written that the law is still relevant in the 21st century and is highly applicable to software development. Many times software developers, he observed, "tend to use the same known tools to do a completely new different project with new constraints". He blamed this on "the comfort zone state where you don't change anything to avoid risk. The problem with using the same tools every time you can is that you don't have enough arguments to make a choice because you have nothing to compare to and is limiting your knowledge." The solution is "to keep looking for the best possible choice, even if we aren't very familiar with it". This includes using a computer language with which one is unfamiliar. He noted that the product RubyMotion enables developers to "wrap" unknown computer languages in a familiar computer language and thus avoid having to learn them. But Gilgado found this approach inadvisable, because it reinforces the habit of avoiding new tools."""

General Experience

I believe anyone who has used physical tools can easily understand the hammer reference above. With woodworking and electronics I have often found myself "getting by" using the tools I have, even if they are not necessarily the most appropriate ones for the task at hand. Often this is for convenience: I already have a usable tool at home that I know how to use and it is already purchased. Very often there is pride in the ideology of "Use what you've got" rather than relying on constant purchasing and learning of new tools - especially for potentially one-time tasks or smaller projects.

But what if...

I'll admit here I've only read the above excerpt from Gilgado and am not familiar with the whole body of work or the original source of that quote. In general principle I agree with his assessment - that one should use the correct tool for the job even at the cost of time spent learning. Unfortunately, I have to disagree with the implication that it is the developer's failing when this is not done. Office bureaucracy, company requirements, finances and licensing, peers and other staff who must support something - all play heavily into the decisions about "what tool" is used in the modern technology arena.

My main trouble with this is two-fold. I do not want to be guilty of using the what-I-know instead of the what-is-best, while also recognizing that the choice is not always up to me.

The company's situation

My story begins with starting at a company which heralded itself as having a "startup culture". While I expected this to mean it was still developing rapidly and making large changes, it instead meant that they still had nothing documented and everything was a mess. Some of my initial tasks were getting machines set up that were actually built by code (at the time, via Puppet) to replace the existing machines which had been built manually by former staff. While I encountered numerous problems at this company, I will try to stick to just the big ones which relate to the topic at hand.

Putting a machine on the network

To provision a replacement, I first needed to add the machine to DHCP so that we could use kickstart to at least get a base operating system installed. The company used static IPs for everything, but still used DHCP for the handling of the kickstart procedure. So you just get an IP and do it, right?

Problem 1: How do I get an IP?

The company used a very loose network configuration, but certain portions of the datacenter were at least isolated into /24 networks. I asked around, "How do I make sure the IP I used is correct within the company standards?" The answer? "Use anything that is available." You can see where this is going. "What's available?" This was met with a shrug by the senior staff.

The company had absolutely no IP-address/network space management or inventory at all. Provisioning a machine via DHCP and assigning it an IP and name into DNS was a total crapshoot. Sure, there are certain precautions that one could do. This feels like an entry-level interview question filled with "Gotchas" - but this was my new reality.

You could ping all IPs within a network range and eliminate those which answered from being available.
You could nmap those which did not answer.
You could check the monitoring system to find out whether - regardless of no ping answer and a dead-end nmap - the IP was being monitored for anything.
You could check the DNS forward zone for that IP.
You could check the DNS PTRs for references to the IP/a name assigned to it.
You could (if you had access) check the route to those IPs and see if there was traffic to/from them on any of the network switches, check firewall configurations, etc.
You could reference any configuration/source-code to see if there were any network addresses referenced as in-use.
You could search the issue/ticketing system to try to identify IPs which were not in use or perhaps had been decommissioned and thus were available.
You could search logs to see which machines have been active for any type of traffic - which could at least eliminate them from the available lists.
You could set tcpdump to listen and check for requests for those IPs.
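Every check in that list works by elimination: an address only stays a candidate if no source of evidence has seen it. A minimal Python sketch of that bookkeeping, assuming each check has already been boiled down to a plain set of address strings (the function name, the source labels, and the sample data are all hypothetical):

```python
import ipaddress


def candidate_free_ips(network, evidence):
    """Return host addresses in `network` that no evidence source has seen.

    `evidence` maps a source label (ping, DNS, monitoring, ...) to the set
    of addresses that source reported as in use.
    """
    seen = set()
    for addresses in evidence.values():
        seen.update(ipaddress.ip_address(a) for a in addresses)
    return [ip for ip in ipaddress.ip_network(network).hosts() if ip not in seen]


# Hypothetical findings for a tiny /29 so the result is easy to read.
evidence = {
    "ping": {"192.168.100.1", "192.168.100.2"},
    "forward_zone": {"192.168.100.2", "192.168.100.4"},
    "monitoring": {"192.168.100.5"},
}
free = [str(ip) for ip in candidate_free_ips("192.168.100.0/29", evidence)]
print(free)  # ['192.168.100.3', '192.168.100.6']
```

An address that survives every filter is still only a candidate, not a guarantee; the result is only as trustworthy as the sources feeding it.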

Stepping back and problems with integrity

To continue being non-linear in my story here - it is a personal problem to me to see something that is a problem and not try to fix it. The "not my job" mindset is extremely hard for me to grasp. If my job is to deliver something (a machine onto a network, or perhaps a load of gravel in a pickup truck) and I notice a drastic problem (whether it be having no management of IP-space or maybe all 4 tires on the pickup truck are flat) - I believe that I should either fix it, or ensure that it will be fixed by the appropriate party. Remember that "startup culture" thing I mentioned earlier? It provided another great hurdle here: There was no appropriate party. So, if you want something done - sometimes you must do it yourself.

Experiencing Problem #1

I would love to say that the myriad of methods to decipher available-IPs above worked swimmingly and efficiently. It did not. The monitoring team had tens of thousands of IPs and hostnames in their monitoring system which were in down/red status (but not removed from monitoring) because they did not housekeep the monitoring. (The monitoring mess here is a tale for another time.) Needless to say, that meant monitoring was no help in finding available addresses. The DNS system was ignored and only updated during additions of new machines. Absolutely no effort was ever made to remove anything. Thus, the forward and PTR zones (which rarely matched) were absolutely useless references. Ping? Disabled by the security team (OK). nmap? Worked from one or two machines on the network, but for the most part was absent and we had no privileges to install it. The same went for nc & tcpdump, which would also have been potentially useful to a similar effect. Network configuration was all hand-managed on the switches, so there was no repository to check. Logs did not exist (for anything, really, at all) so they could not be referenced. Issue system? Filled mostly with things like "Fix X", which would then be marked closed months later by someone with no further information.

Eventually, after exhausting all of the above methods (which unfortunately required many cross-department and other meetings to even confirm if we could/couldn't trust the available data) I ended up just having to "do my best" to make educated guesses based upon the fragments of information I could find regarding what was available to use.

Make it better

Later I'll get into the long-term fix for this, but at that point the short-term was just to clean up what I could find as I found it. When I found something which appeared to be available, I would request the monitoring team first remove all references they had to it (since it was likely showing as down/dead anyways). The same towards the network team in their switch/firewall configurations. I'd remove any references to it I could find in code (including the DNS forward/reverse zones) and make sure that I had as clean of a slate as possible before beginning. A few times when things were removed-first in this manner -- magically a machine/owner of the machine (more often just a dependency rather than a real "owner") bubbled up. The old System Administrator "Unplug it from the network and see what breaks" method was no longer just a joke.

Once that was all done, I'd put the address into the DHCP configuration file noting that it was available. I'd also put it into DNS marking it as such. Life got a little bit better. If I took the time to research and identify multiple IPs at a time then I'd be able to mark 10+ as available in the file templates, so that I and others would not have to repeat the research and be delayed next time we were looking for just one or two available addresses.

Problem 2: Code Credit

At the time, the DHCP files were individually managed directly on the kickstart servers. Source-control? Privilege-separation? I'd laugh if it weren't so sad.

The accepted (but naturally, undocumented) process was to log in by sshing as root directly to the kickstart server, then vi the dhcp files and restart the service.

Stopping complete insanity: The dawn of written history

The above method had to stop. Unfortunately everyone in the company had both the root password and full sudo privileges for 'ALL' to every machine. I couldn't stop root access because upper management was too concerned about "breaking people's workflows". The least I could do was at least somehow track what was getting changed in these files. Again though, I had no authority to enforce this at all. I was responsible for making sure DHCP worked, but with no allowance to modify the procedure where everyone had unlimited access and would not follow any modification steps.

With only RCS available - I started ensuring that all of my own changes were logged. I checked in the dhcp files just locally on the kickstart server and made sure that all of my own edits were done by:

Using sudo to RCS check-in whatever was on the filesystem. Since everyone could edit, usually I would find uncommitted changes sitting on the disk. I would note in the commit message something akin to "Found changes, committing before modification." to make it clear that the changes were not mine. Unfortunately, I could only see that they were done by "root" from a root session, and even if I could tell what machine root came from - it was impossible to know who within the company had done it.
Next, I'd make my own changes to the file. I was particular about using sudoedit so that I could at least defer to the security log for my own changes. Also my own little method of "cover your ass". But remember how I mentioned that we didn't have logging anywhere, for anything? Yeah, I ended up making sure to use sudo on my own workstation combined with log_input and log_output to record all of my own ssh sessions.
I'd check in my changes through sudo and RCS also. Habits also made me note the current issue reference I was working on within the check-in message so I could later easily reference it (or others could).
I'd restart the service with sudo, for the same accountability reasons as above.
I documented this simple procedure and put it into the company wiki. I linked that document to every issue I did where I was making modifications to DHCP/kickstart (so that other staff - in finding that "type" of work - could also find the procedure for it).
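As a rough sketch of the order of operations above (not the exact commands from the wiki page), the procedure can be written down as a list of commands. The file path, service name, issue reference, and message wording below are placeholders, and `ci -l` is assumed here as the RCS check-in that keeps the working file in place and locked:

```python
def dhcp_change_plan(path, service, issue):
    """Return the commands for one tracked edit, in order, as argv lists."""
    return [
        # 1. Capture whatever someone else left uncommitted on disk.
        ["sudo", "ci", "-l", "-mFound changes, committing before modification.", path],
        # 2. Edit through sudoedit so the change is tied to a named user.
        ["sudoedit", path],
        # 3. Check in the real change, naming the issue it belongs to.
        ["sudo", "ci", "-l", f"-m{issue}: update DHCP entries", path],
        # 4. Restart the service through sudo for the same accountability.
        ["sudo", "service", service, "restart"],
    ]


# Placeholder values purely for illustration.
plan = dhcp_change_plan("/etc/dhcp/dhcpd.conf", "dhcpd", "ISSUE-123")
for argv in plan:
    print(" ".join(argv))
```

Nothing here executes anything; the point is only that the "found changes" check-in always comes before the edit, and the restart always comes last.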

The same procedure was documented for the DNS files. Just like DHCP - they were stored directly on one main server itself and always modified by root directly with no sudo even. Fortunately, there was at least RCS already in place for some of the zone files. Unfortunately, the log messages for it were usually completely blank, so they gave no benefit of being able to find out who made the change or why.
Finding IPs was still hell, but at least we could mark some as usable and know what was changing - we finally had some minor accountability. While I could still not enforce this, most of my peers were overjoyed to have a procedure clearly documented and understood the benefits of accountability and avoiding the use of root.

Problem 3: Hardware/Network inconsistency

A company that was this disorganized naturally had multiple different base images available to kickstart against. The hardware deployed in the datacenter was inconsistently built and often required different partitioning schemes, device drivers, network settings, etc. Since this had to be defined in the DHCP/kickstart configurations it was hellish to get a machine to provision properly on the first try without being aware of exactly what it needed.

Another minor win: When hardware was received (either new or from some other machine use that was decommissioned) it began to be entered into DNS and DHCP. Entries like "SpareSerialNumber8139813" began to appear in both DHCP and DNS. Now this hostname could tell us that a machine was available and what the hardware was without looking elsewhere.

Marking available things meant that now instead of hunting for an IP and/or hardware, you could check DNS. You could look within a PTR zone-file for 'available-192-168-100-100'. You could check for types of hardware in the DHCP or DNS source files. Things were improving - slowly.

Problem 4: Network devices

So an IP seems available by all means of inspection... until it gets assigned to a machine and it is later found that the network team reported improperly and, oops, that unpingable, unmonitored, un-nmappable, un-logged IP happens to be the management interface of a switch. Enter the rule that everything on the network must be in DNS. This most simplistic idea was a major struggle as it required the Network team to be accountable and work with the System Administrators regarding what IPs were assigned where and how they were used. Eventually it happened though - and in this new era the Network Team's devices (even those which never would need to be accessed via a DNS name) were all required to be in DNS.

Still I lacked the authority to enforce these changes; however, ultimately after having a few postmortems after "Why did this entire network segment die" where the answer was, "After extreme due diligence and even confirming with the network team that this IP was available to be assigned, once we assigned it to a server it was found to have already been in use" it did not take long to persuade upper management that this crazy idea of putting all IPs into DNS was not only useful, but necessary.

The new situation

I imagine that the above may be misleading and sound as if this all changed rapidly. In reality the above changes alone took more than 9 months to occur, and were still being cleaned up thereafter. The lack of any type of accountability or ownership for any portion of the company was terribly limiting, combined with the ideology that just because no one had been responsible for something did not mean that you could adopt it, take on all the responsibilities of it, and thus also gain any right of control over it. This was (is) still hard for me to understand. If there is something hazardous in a city, and all levels of government and community are asked about it - and none will claim ownership of it - who has the right to be mad when a citizen takes it upon themselves to resolve the hazard?

While some improvements were made, there were still tons of manual edits and research happening.

Problem 5: Hypervisors and VMs

While early on most of the machines were all 100% baremetal/hardware, there came a time where virtual machines began being deployed. Naturally, the disorganized nature meant that as long as a VM was in its correct datacenter, no other attention was paid to what hypervisor it was hosted on. Imagine the trouble when this company was relying on having "high availability" by having just two servers for a particular service - say, mailservers, or DNS servers. While that is already laughable - (I hope you can see where this is going) - what if I told you that they put both of those servers into VMs and then put them on the same hypervisor? Yes, really.

Problem 6: VMs, kickstart/DHCP, & MAC addresses.

With the VM lack-of-design here, the VMs were also being added to DHCP with fictitious MAC addresses (which sure, is common). However, I believe you can imagine the problems that arise when you have multiple VMs within the same subnet and datacenter which are using the exact same MAC addresses for their DHCPDISCOVER... Yet another problem that needed to be fixed.

I came to fix DNS. I stayed to fix the company.

Amongst other reasons I was a good candidate for the position I took - one of the major ones was my experience with DNS. The company knew that their DNS was a mess and saw me as a person to help fix it. Very unfortunately I was not included in doing the design of the new/replacement system - I was just left with implementing someone else's design - however, where I saw problems along the way - I fixed them as best I could.

Likewise though, since the only thing I could mostly control was DNS - DNS began to look like my very own golden hammer - the tool I know which I can wield to solve many problems. This is where I have trouble the most. Yes - I chose the tool-I-know; however, the company also chose it for me - by giving me no ability to improve or enforce procedures while also holding me accountable for problems outside of my control.

"DNS is down"

The company's former (at my hiring "current") DNS system was hand-managed by root on 1 machine in 1 datacenter. Instead of having other datacenters slave zones or anything else - they had a script to be run by root after each DNS edit which used scp to copy all the zone files to all other DNS servers - and then restart bind/named. All the servers thought they were the master for all of the zones.

Let's treat this like another interview question: how many ways is this bad and dangerous?

The service was always restarted after an scp. So if the scp did not completely transfer a file, the service would still try to restart.
No syntax checking was performed at any level. No named-checkconf/named-checkzone or anything. So a restart would be attempted even if invalid zone files were sent.
Manual editing meant no increasing of serial numbers (unless someone remembered), so changes were often pushed with no new serials. Since every machine thought it was the master, the internal DNS would still reflect the change; however, third-party and public DNS would not necessarily pick up the new changes, since with no serial change seen they would not do an AXFR/IXFR.
Accessible as root via ssh.
Editing as root
Editing on the machine itself
scp from root to root
Since the script went to all of the DNS servers - and did not check for success of scp or of the restart before proceeding to the next - a zone-file or named.conf error would rather immediately (ok, about 1-2 minutes for all the syncs to finish) break DNS in every datacenter all at once.
Since the script used DNS for its syncing - once DNS was broken, even if you tried to revert a change, the main "editor" machine could not find the other DNS servers to try to scp the fix/revert.
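Most of that list comes down to three missing safeguards: validate before touching anything, stop at the first failure, and do not depend on DNS to fix DNS. A minimal Python sketch of that control flow, assuming the validation, copy, and reload steps are passed in as callables that return True on success (the function, its parameters, and the documentation-range addresses are hypothetical, not the old script or its replacement):

```python
def push_zones(servers, validate, copy, reload):
    """Push to servers one at a time, stopping at the first failure.

    Returns (servers_updated, failed_server_or_None).
    """
    if not validate():
        raise ValueError("zone data failed validation; nothing was pushed")
    updated = []
    for server in servers:
        # Only reload a server whose copy succeeded, and never move on
        # to the next server after a failure.
        if not copy(server) or not reload(server):
            return updated, server
        updated.append(server)
    return updated, None


reloaded = []
result = push_zones(
    # Addresses rather than names, so a DNS outage cannot block the fix.
    ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    validate=lambda: True,
    copy=lambda server: server != "192.0.2.11",  # simulate a failed transfer
    reload=lambda server: reloaded.append(server) or True,
)
print(result)    # (['192.0.2.10'], '192.0.2.11')
print(reloaded)  # ['192.0.2.10']
```

With this shape a bad transfer leaves one datacenter stale instead of taking every datacenter down within a couple of minutes.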

New DNS

Despite not designing the company's replacement for that old DNS system, the design I was provided with was missing so much detail that by being the implementer I got to fix a lot of things. Since the company was rigidly against using dynamic zones for records and was already set up using static zones - the first thing I did was move the zones off of the one server itself and out of RCS - and put the whole RCS history into git.

Fortunately the company was just then getting OK with the idea of using git. Unfortunately, I still had to support DNS even with people who knew nothing about it directly editing it. Even after going to git and having the DNS zone files placed by puppet (and later by SaltStack), we still had problems.

In the salt code, I'd made it place the git-sourced files for each zone appropriately. When new zones were added or removed it was a modification through pillar code and it changed named.conf as needed. Before restarting the service, it would always check with named-checkconf that the service would be able to reload. This was a major improvement already.

Except...

Except that other teams could and would still modify DNS and break it. The company still had that whole "We have no logging of anything" problem - so while the named service would not break due to my salt-code protections -- it also would not update. Other teams would come calling, "DNS is broken! I merged a change hours ago!" Naturally, DNS was not broken. DNS was not down. It would just be refusing to update itself with bad information. It would be up to me and my team to login and see that salt refused to restart named because a named-checkconf was reporting some error - entry in wrong file, unknown type, etc.

"Well of course, you broke it." never goes over well - and so we had to protect DNS from other teams. At times it felt malicious how the other teams were using it - it was like a developer's nightmare of angry QA people shoving things in wrong just to see what happened. Fortunately, salt protected us from most - but we still had to waste time investigating each time a failure was introduced by users modifying the dataset.

Syntax checks are only so good.

Syntax checking is very useful. named-checkconf is very good at catching syntax breaks. But - what if something is syntactically valid, but still wrong? Consider:

$ORIGIN example.com.
myhost.example.com. A 192.168.0.1
another.example.com A 192.168.0.2

I know, I almost gave it away by including the $ORIGIN line. The above is completely syntactically valid for DNS. However, it probably doesn't do what the user wanted. The problem? Check out the lack of a final '.' after 'another.example.com'. This means the DNS that was just created is now served as:

myhost.example.com. A 192.168.0.1
another.example.com.example.com. A 192.168.0.2

So, even syntactically correct zone files can contain rather large errors that parse without complaint.
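The expansion rule behind that surprise is small enough to sketch in Python. This is only an illustration of how a relative owner name gets the origin appended; the function name is made up for the example:

```python
def qualify(name, origin):
    """Expand a zone-file owner name the way a name server reads it."""
    if name == "@":
        return origin          # "@" stands for the origin itself
    if name.endswith("."):
        return name            # trailing dot: already absolute, left alone
    return f"{name}.{origin}"  # no trailing dot: relative, origin appended


origin = "example.com."
print(qualify("myhost.example.com.", origin))  # myhost.example.com.
print(qualify("another.example.com", origin))  # another.example.com.example.com.
```

Both inputs are legal, which is exactly why a syntax checker has nothing to complain about.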

Getting strict

Early on in replacing DNS we were promised that an entire division of our company that is responsible for making front-ends/UIs and tools for other teams would manage how an end-user (staff) interacts with our system. A few weeks in it was clear that team would never prove useful. I began writing some simple code that I could run against pull-request branches to ensure basic levels of sanity of the zones before they were applied and found by the salt state failing to reload (or worse, reloading a potentially valid-but-wrong zone file).

The early things were very simple.

All records must be FQDN.

That resolved a lot of the problems of missing periods and such. This check simply made sure that column one of each submitted record matched a simple regex like:

'^[^ ]*\. .*'

This was intended to make sure that there were no spaces in the first word and that it ended with a . - catching the above example where a missing period can cause unintended consequences.
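Applied to the earlier example, that regex behaves as follows. This is a minimal Python sketch rather than the hook itself; skipping blank lines, comments, and $-directives is an assumption added so the sample zone can be fed in whole:

```python
import re

# The regex from the text: first word has no spaces and ends with a period.
FQDN_OWNER = re.compile(r"^[^ ]*\. .*")


def non_fqdn_lines(lines):
    """Return the record lines whose first column is not fully qualified."""
    bad = []
    for line in lines:
        if not line.strip() or line.startswith((";", "$")):
            continue  # assumption: blanks, comments, and directives are exempt
        if not FQDN_OWNER.match(line):
            bad.append(line)
    return bad


zone = [
    "$ORIGIN example.com.",
    "myhost.example.com. A 192.168.0.1",
    "another.example.com A 192.168.0.2",
]
rejected = non_fqdn_lines(zone)
print(rejected)  # ['another.example.com A 192.168.0.2']
```

Note that the pattern expects a literal space after the name, so a tab-separated record would also be rejected by it.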

Soon it was seen that using entire departments as essentially QA found more weird ways people would insist on breaking DNS just for lack of attention or just not knowing better. The checks expanded. Up until this time my 'validation' script was being manually run by me whenever someone filed a pull-request. Around this time I started distributing it also as a pre-commit hook that others could use on their workstations so that they wouldn't have to wait for me to mark their PR as "needs improvement" when they could have just run the same check themselves. Later, I finally got approval to add this as a server-side pre-receive hook (that took some effort though, as the company was barely using git and did not use hooks anywhere).

The checks grew:

All records must be FQDN
The target of all NS, MX, & CNAME records must also be FQDN.
All IPs that are targets of A records must be a valid IP.
All records in a zone must be aligned (strict columns of whitespace).
All forward zone files are sorted alphanumerically
All reverse (PTR) zone files are sorted in "normal" (humanized) numeric order (e.g., 0, 1, 2, 10, 11 as opposed to 0, 1, 10, 11, 2).
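Two of those checks, the valid-IP test and the humanized sort order, are sketched below in Python. The helper names are hypothetical and this is one plausible way to write them, not the hook's actual code:

```python
import ipaddress
import re


def is_valid_ipv4(text):
    """True if `text` is a well-formed IPv4 address such as 192.168.0.1."""
    try:
        ipaddress.IPv4Address(text)
    except ValueError:
        return False
    return True


def natural_key(text):
    """Sort key that compares runs of digits as numbers, not as strings."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", text)]


def is_naturally_sorted(names):
    return names == sorted(names, key=natural_key)


print(is_valid_ipv4("192.168.0.129"))  # True
print(is_valid_ipv4("192.168.0.256"))  # False
print(sorted(["0", "1", "10", "11", "2"], key=natural_key))
# ['0', '1', '2', '10', '11']
```

Splitting on a captured digit group always yields alternating text and number pieces, so the key never ends up comparing a string against an integer.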

Auto serials

Another thing that was often forgotten was to increase or modify serial numbers in these files. It was tedious and silly - especially when a computer was ultimately placing the files on the destination. I wanted to just modify the serial anytime the zone was modified. For this - within salt - I made two file templates: (1) The zone.head file - consisting of only the SOA segment of the zone and an $INCLUDE line which referred to (2) the zone.data file.

Using 'watch', anytime the zone.data file was modified (e.g., masters/example.com.data) the zone.head template would be applied for the corresponding head file (masters/example.com.head). The ".head" file template would use python's "time.strftime" to create the serial as "%y%m%d%H%M" format; thus, like "2102191434" for February 19 2021 @ 14:34.

This solved a few things. (1) No one had to remember to update serials anymore (2) the serial would always be valid (as there was occasion where people would accidentally put in 11 digits or other things) (3) we could easily tell when a zone last was updated by checking its serial.

The purpose for the separation between the two files was that I did not want salt to always apply the template which changed the serial - as I did not want to increase the serial unless there was an actual change to the data the zone was serving.
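The serial format itself can be reproduced in a couple of lines of Python. This is a sketch outside of salt, with a hypothetical function name and a fixed timestamp so the result is repeatable:

```python
import time
from datetime import datetime


def zone_serial(moment):
    """Build a serial in the "%y%m%d%H%M" format from a datetime."""
    return time.strftime("%y%m%d%H%M", moment.timetuple())


serial = zone_serial(datetime(2021, 2, 19, 14, 34))
print(serial)  # 2102191434
```

A ten-digit serial in this format stays within the unsigned 32-bit SOA serial field for any date through 2042.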

Even stricter

Since all reverse zones were now managed, I could also easily see every 0-255 record that existed. This tied back in with the 'available' entries above. If an IP didn't have anything assigned to it - it would have an "available" name put in. Now looking for available IPs for provisioning was just a matter of grepping or inspecting the intended subnet's PTR/source zone file. But we still had that problem about As and PTRs not always matching.

I kept the source files as flat zone-looking files - anyone familiar with DNS will recognize something like this:

host1.example.com.                        A         192.168.0.1
host39.example.com.                       A         192.168.0.29
host90.example.com.                       A         192.168.0.129
;; or:
1.0.168.192.in-addr.arpa.                 PTR       host1.example.com.
29.0.168.192.in-addr.arpa.                PTR       host39.example.com.
129.0.168.192.in-addr.arpa.               PTR       host90.example.com.

Now I needed to also check that the A and PTR matched (so that when people were adding/removing things they were keeping both parts of the accounting up to date). I changed the hook to using a python dictionary for storage, and cross-correlated A records with PTRs. While this was a little bit of a headache, it made a lot of future things much easier. (1) Serving the zone files to the targeted salt-minions was simpler, because now instead of indexing across a large amount of static/flat files, I could use a simple zone_data.py template file to evaluate and create each zone.data file by using salt's external pillars. (2) It meant that I could use the same hook code to pre-commit, pre-receive, & generate external pillars. (3) Once all the data was built into a simply formatted python dictionary I could greatly extend other checks - which I'll address soon.
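A minimal Python sketch of that cross-correlation follows. The dictionary layout, function names, and messages are assumptions made for illustration; the real hook's structure is not shown in this post:

```python
import ipaddress


def parse_records(lines):
    """Split flat zone-style lines into {owner: value} dicts for A and PTR."""
    a_records, ptr_records = {}, {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3 or fields[0].startswith(";"):
            continue  # skip comments, blanks, anything not "owner type value"
        owner, rtype, value = fields
        if rtype == "A":
            a_records[owner] = value
        elif rtype == "PTR":
            ptr_records[owner] = value
    return a_records, ptr_records


def cross_check(a_records, ptr_records):
    """Report every A without its PTR and every PTR without its A."""
    problems = []
    for host, ip in a_records.items():
        ptr_owner = ipaddress.ip_address(ip).reverse_pointer + "."
        if ptr_records.get(ptr_owner) != host:
            problems.append(f"{host} A {ip}: no matching PTR")
    for ptr_owner, host in ptr_records.items():
        ip = a_records.get(host)
        if ip is None or ipaddress.ip_address(ip).reverse_pointer + "." != ptr_owner:
            problems.append(f"{ptr_owner} PTR {host}: no matching A")
    return problems


source = [
    "host1.example.com.           A    192.168.0.1",
    "host39.example.com.          A    192.168.0.29",
    "host90.example.com.          A    192.168.0.129",
    "1.0.168.192.in-addr.arpa.    PTR  host1.example.com.",
    "29.0.168.192.in-addr.arpa.   PTR  host39.example.com.",
]
problems = cross_check(*parse_records(source))
print(problems)  # ['host90.example.com. A 192.168.0.129: no matching PTR']
```

Keying the A dictionary by hostname means one name can hold only one address, which is the strict 1:1 rule in miniature.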

But what about round-robin A records?

I'm glad you asked. It can be troublesome to require A and PTR to always match since the broad-use of PTRs usually only evaluates the first answer and thus most places/companies treat it as if a PTR may only have one answer. But if I required A+PTR to match, I couldn't do this quite as easily:

roundrobin.example.com.                   A         192.168.0.1
roundrobin.example.com.                   A         192.168.0.29
roundrobin.example.com.                   A         192.168.0.129
;; or:
1.0.168.192.in-addr.arpa.                 PTR       roundrobin.example.com.
29.0.168.192.in-addr.arpa.                PTR       roundrobin.example.com.
129.0.168.192.in-addr.arpa.               PTR       roundrobin.example.com.

I checked that the A of a hostname matched the PTR of the IP the A points to (that's a mouthful!), and I also checked the reverse. Enter my first fictional DNS record type: "ALIAS".

Since I was already building my zone data using a template, when the template evaluates/iterates over the external pillar dictionary, I made it evaluate the "record_type" of ALIAS in two ways: (1) Ensure that the target of the alias is valid - and find its IP (2) Create the record as a regular A-record. I greatly prefer this because then when debugging interactions with round-robin records - it is easier to identify the actual machine/PTR/IP that may be having some type of problem.

Thus, a source zone file may now look like this:

robinresource1.example.com.               A         192.168.0.1
robinresource2.example.com.               A         192.168.0.29
robinresource3.example.com.               A         192.168.0.129
roundrobin.example.com.                   ALIAS     robinresource1.example.com.
roundrobin.example.com.                   ALIAS     robinresource2.example.com.
roundrobin.example.com.                   ALIAS     robinresource3.example.com.
;; or:
1.0.168.192.in-addr.arpa.                 PTR       robinresource1.example.com.
29.0.168.192.in-addr.arpa.                PTR       robinresource2.example.com.
129.0.168.192.in-addr.arpa.               PTR       robinresource3.example.com.

So when the zone.data template found the record "roundrobin.example.com. ALIAS robinresource1.example.com." it would look within its own dictionary for a record_type of "A" where the hostname was "robinresource1.example.com." and use that to create:

robinresource1.example.com.               A         192.168.0.1
robinresource2.example.com.               A         192.168.0.29
robinresource3.example.com.               A         192.168.0.129
roundrobin.example.com.                   A         192.168.0.1
roundrobin.example.com.                   A         192.168.0.29
roundrobin.example.com.                   A         192.168.0.129
;; or:
1.0.168.192.in-addr.arpa.                 PTR       robinresource1.example.com.
29.0.168.192.in-addr.arpa.                PTR       robinresource2.example.com.
129.0.168.192.in-addr.arpa.               PTR       robinresource3.example.com.
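The ALIAS expansion shown above can be sketched in Python as follows. The tuple layout and function name are assumptions for the example; the real zone_data.py template worked from the external pillar dictionary instead:

```python
def expand_aliases(records):
    """Turn each (owner, "ALIAS", target) into (owner, "A", ip of target)."""
    a_by_host = {owner: value for owner, rtype, value in records if rtype == "A"}
    expanded = []
    for owner, rtype, value in records:
        if rtype == "ALIAS":
            if value not in a_by_host:
                # The target must exist as a real A record.
                raise ValueError(f"ALIAS target {value} has no A record")
            expanded.append((owner, "A", a_by_host[value]))
        else:
            expanded.append((owner, rtype, value))
    return expanded


records = [
    ("robinresource1.example.com.", "A", "192.168.0.1"),
    ("robinresource2.example.com.", "A", "192.168.0.29"),
    ("robinresource3.example.com.", "A", "192.168.0.129"),
    ("roundrobin.example.com.", "ALIAS", "robinresource1.example.com."),
    ("roundrobin.example.com.", "ALIAS", "robinresource2.example.com."),
    ("roundrobin.example.com.", "ALIAS", "robinresource3.example.com."),
]
served = expand_aliases(records)
for owner, rtype, value in served:
    print(f"{owner:<42}{rtype:<10}{value}")
```

Because the round-robin name is never stored with an address of its own, the A/PTR matching check only ever sees the three real machines.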

What good is that?

This enforced strict 1:1 between A & PTR while also ensuring all network resources were unique - while still allowing us to make round-robin DNS records. This had the additional benefit that when a machine was retired (say, "robinresource3.example.com.") it forced the removal of the matching PTR - and since now the target of the ALIAS no longer existed, also made sure that the machine was removed from the round-robin. So a removal would actually result in something like this:

available-192-168-0-129.ips.example.com.  A         192.168.0.129
robinresource1.example.com.               A         192.168.0.1
robinresource2.example.com.               A         192.168.0.29
roundrobin.example.com.                   ALIAS     robinresource1.example.com.
roundrobin.example.com.                   ALIAS     robinresource2.example.com.
;; or:
1.0.168.192.in-addr.arpa.                 PTR       robinresource1.example.com.
29.0.168.192.in-addr.arpa.                PTR       robinresource2.example.com.
129.0.168.192.in-addr.arpa.               PTR       available-192-168-0-129.ips.example.com.
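
For the curious, a check along these lines is enough to enforce the 1:1 rule on source files like the one above. This is only a sketch under my own assumptions: the function name and the (name, ip) / (owner, target) pair layout are hypothetical, and the real hook reported far more detail.

```python
import ipaddress


def check_a_ptr_pairs(a_records, ptr_records):
    """Return a list of problems; an empty list means A and PTR are strictly 1:1.

    a_records is a list of (hostname, ip); ptr_records is a list of (owner, target).
    """
    problems = []
    seen_ips = {}
    for name, ip in a_records:
        if ip in seen_ips:
            problems.append(f"{ip} is assigned to both {seen_ips[ip]} and {name}")
        seen_ips[ip] = name
    ptr_by_owner = dict(ptr_records)
    # Every A record needs a PTR pointing straight back at it.
    for ip, name in seen_ips.items():
        owner = ipaddress.ip_address(ip).reverse_pointer + "."
        if ptr_by_owner.get(owner) != name:
            problems.append(f"{name} ({ip}) has no matching PTR at {owner}")
    # Every PTR needs to point at a hostname that still has an A record.
    a_names = {name for name, _ in a_records}
    for owner, target in ptr_records:
        if target not in a_names:
            problems.append(f"{owner} points at {target}, which has no A record")
    return problems


a_records = [
    ("available-192-168-0-129.ips.example.com.", "192.168.0.129"),
    ("robinresource1.example.com.", "192.168.0.1"),
    ("robinresource2.example.com.", "192.168.0.29"),
]
ptr_records = [
    ("1.0.168.192.in-addr.arpa.", "robinresource1.example.com."),
    ("29.0.168.192.in-addr.arpa.", "robinresource2.example.com."),
    ("129.0.168.192.in-addr.arpa.", "available-192-168-0-129.ips.example.com."),
]
problems = check_a_ptr_pairs(a_records, ptr_records)
```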

New DHCP

How does this help with DHCP records though? Wasn't that a big part of the original problem? Yes, it was. Once I realized I could make up my own DNS record types and enforce strict standards on them, a new record type (in our source files) was created: "KS" - for "kickstart". It was checked against strict regular expressions and had requirements like:

You can only have a KS record if there is already a corresponding A record (No use booting a machine w/o ability to have its own IP)
The KS record must hold the MAC of the machine (This will become useful for kickstart/DHCP)
The KS record mentions any custom hardware, partitioning, drivers, etc. that the machine needs.

Thus, at this time you may have something in the DNS source file like:

simpleserver.example.com.                 A         192.168.0.88
simpleserver.example.com.                 KS        "hw=atom|part=bigdisk1|mac=00:22:4d:7c:1a:16|os=openbsd"

The git hook would find the 'KS' record type and split apart each value to make sure all the key=value arguments present were valid and reject it otherwise. When zone_data.py evaluated a KS record it simply changed it into a TXT record. When a DHCP/kickstart machine queried for the same salt external pillar that built the DNS zones - it instead received all the information necessary to build its own dhcpd.conf.
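
A rough Python sketch of that validation follows. The allowed keys and their patterns here are my own stand-ins (the real hook's rules were longer and stricter), and parse_ks and ks_to_txt are hypothetical names.

```python
import re

# Hypothetical rules: which keys are allowed and what each value must look like.
KS_RULES = {
    "hw": re.compile(r"[a-z0-9]+"),
    "part": re.compile(r"[a-z0-9]+"),
    "mac": re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}"),
    "os": re.compile(r"[a-z0-9]+"),
}


def parse_ks(value, hostname, a_names):
    """Split a KS value into a dict, rejecting anything that breaks the rules."""
    if hostname not in a_names:
        raise ValueError(f"{hostname}: KS record without a corresponding A record")
    fields = {}
    for item in value.split("|"):
        key, sep, val = item.partition("=")
        rule = KS_RULES.get(key)
        if not sep or rule is None or not rule.fullmatch(val):
            raise ValueError(f"{hostname}: invalid KS argument {item!r}")
        fields[key] = val
    if "mac" not in fields:
        raise ValueError(f"{hostname}: KS record must include mac=")
    return fields


def ks_to_txt(hostname, value):
    """Render the KS record as the TXT record that actually lands in the zone."""
    return f'{hostname} TXT "{value}"'


ks_value = "hw=atom|part=bigdisk1|mac=00:22:4d:7c:1a:16|os=openbsd"
fields = parse_ks(ks_value, "simpleserver.example.com.", {"simpleserver.example.com."})
txt_line = ks_to_txt("simpleserver.example.com.", ks_value)
```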

So just like that, we no longer needed to modify both DNS and DHCP to get a machine online. A change to the source files for DNS would both make the machine have A & PTR records, as well as be added to the DHCP.
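
To give a feel for the DHCP side, here is a minimal sketch that turns one host's A record and KS fields into a dhcpd.conf host declaration. It assumes the KS value has already been split into a dict, the function name is hypothetical, and our real template emitted more than these two statements.

```python
def dhcp_host_entry(hostname, ip, ks_fields):
    """Build a dhcpd.conf host declaration from an A record and parsed KS fields."""
    # dhcpd host names are just labels; use the short name without the domain.
    short = hostname.rstrip(".").split(".")[0]
    return (
        f"host {short} {{\n"
        f"  hardware ethernet {ks_fields['mac']};\n"
        f"  fixed-address {ip};\n"
        f"}}\n"
    )


entry = dhcp_host_entry(
    "simpleserver.example.com.",
    "192.168.0.88",
    {"hw": "atom", "part": "bigdisk1", "mac": "00:22:4d:7c:1a:16", "os": "openbsd"},
)
```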

New DHCP & Hypervisors/VMs

Earlier I mentioned multiple problems with Hypervisors/VMs and DHCP. The worst is putting duplicate services onto the same hypervisor. Without any planning or inventory systems (and no power to create one) how could I help make sure duplicates aren't put on the same hypervisors? Why - DNS, of course!

myservertype59.example.com.             A         192.168.0.59
myservertype59.example.com.             KS        "hw=atom|part=bigdisk1|mac=00:22:4d:7c:1a:16|os=openbsd|hyp@hypervisor444.example.com."

Since this was all in dictionaries, I could now analyze how many VMs were hosted on "hypervisor444.example.com." Likewise, to prevent mismatches and bad names - I could reference within the dictionary to make sure that we actually had a real machine (A-record) for hypervisor444.example.com. It became trivial now to evaluate and protect -- all just with git hooks checking DNS source files -- whether or not anyone was trying to assign duplicate resources. Better yet - since they needed a valid KS record to be provisioned/booted by DHCP/kickstart - we could reject any change with a duplicate and stop the problem before it occurs.

Nice errors

As much as I realize this story gets ridiculous at times -- one tool being bent to fix many problems it may not be "the best" for -- one of the things I prided myself on throughout this was the extreme friendliness of this now-growing-massive git hook. I put in a lot of effort to make it very friendly to any user so that they could tell exactly what was expected of them to fix their problem. Here, if someone attempted to create another 'myservertype' with a KS saying it was hosted on hypervisor444, the error they received in the git failure/reject might be: "There is already a 'myservertype' VM hosted on hypervisor444.example.com. Please check your source files as it appears you are trying to assign the duplicate server-types of "myservertype59" and "myservertype96" to the same hypervisor." This would come along with a reference to the source file being analyzed at the time and the line numbers where the matches/problems were identified.
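
The hypervisor check and its error message can be sketched like this. I am assuming here that a server type is simply the short hostname with its trailing digits stripped, and that the VM-to-hypervisor mapping has already been pulled out of the KS records; the function name is hypothetical and the file/line reporting is left out.

```python
import re
from collections import defaultdict


def find_hypervisor_conflicts(vm_to_hypervisor, a_names):
    """Return friendly error strings for unknown hypervisors and duplicate server types."""
    errors = []
    placed = defaultdict(list)  # (hypervisor, server_type) -> short VM names
    for vm, hyp in sorted(vm_to_hypervisor.items()):
        if hyp not in a_names:
            errors.append(f"{vm} names hypervisor {hyp} which has no A record.")
            continue
        short = vm.split(".")[0]
        # Assumption: "myservertype59" is server type "myservertype".
        server_type = re.sub(r"\d+$", "", short)
        placed[(hyp, server_type)].append(short)
    for (hyp, server_type), vms in placed.items():
        if len(vms) > 1:
            names = " and ".join(f'"{v}"' for v in vms)
            errors.append(
                f"There is already a '{server_type}' VM hosted on {hyp} "
                f"Please check your source files as it appears you are trying to "
                f"assign the duplicate server-types of {names} to the same hypervisor."
            )
    return errors


errors = find_hypervisor_conflicts(
    {
        "myservertype59.example.com.": "hypervisor444.example.com.",
        "myservertype96.example.com.": "hypervisor444.example.com.",
        "otherthing1.example.com.": "hypervisor444.example.com.",
    },
    {"hypervisor444.example.com."},
)
```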

Solving other VM issues: Duplicate MACs

Another problem was with creating unique MAC addresses for our VMs. With no policy or procedure for that, we found a few times that the same MACs were being assigned to different machines. Again, the power of the dictionary that was already established became useful -- as KS records were read in, all MACs were put into a list. If a MAC already existed in the list, an error would be thrown by the hook - thus protecting us from finding the problem later, after it had caused more problems.

Naturally, it was also easy to ensure that typos within the mac= field were infrequent, by using regular expressions to match the field for only valid values.
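
Both MAC checks fit in a few lines. This is a sketch rather than the hook's actual code: the pattern is my guess at "valid", the function name is hypothetical, and I use a dict instead of a plain list so the error can name the machine that already owns the MAC.

```python
import re

# Assumed format: six colon-separated lowercase hex octets.
MAC_RE = re.compile(r"(?:[0-9a-f]{2}:){5}[0-9a-f]{2}")


def check_macs(host_macs):
    """host_macs is an iterable of (hostname, mac); returns a list of error strings."""
    errors = []
    seen = {}  # mac -> first hostname that claimed it
    for hostname, mac in host_macs:
        mac = mac.lower()
        if not MAC_RE.fullmatch(mac):
            errors.append(f"{hostname}: '{mac}' is not a valid MAC address")
        elif mac in seen:
            errors.append(f"{hostname}: MAC {mac} is already assigned to {seen[mac]}")
        else:
            seen[mac] = hostname
    return errors


mac_errors = check_macs(
    [
        ("simpleserver.example.com.", "00:22:4d:7c:1a:16"),
        ("myservertype59.example.com.", "00:22:4D:7C:1A:16"),
        ("typo.example.com.", "00:22:4d:7c:1a:1g"),
    ]
)
```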

Removing PTR

For a year or two I and staff did what I referred to as "double-entry-DNS" because it felt so much like double-entry accounting to me. All IPs were now in DNS - whether with an actual hostname or an "available" hostname in a .ips. subdomain. To add a new host, you'd replace the "available" A record with your hostname as an A record, and fix the PTR to point to the new host. Two new lines, two removed lines in your diff. Nice and balanced. But, annoying.

Originally I made sure all IPs in a subnet were accounted for by checking how many PTR records existed in the subnet; however, since they all also had A names this felt like too much. I modified the hook (remember, it is also the external pillar which generates the zone_data files) to automatically create PTR records based upon the A records. I did have to modify how I counted the total number of PTRs (to ensure each IP is accounted for), but not as much as I initially thought. Now, rather than counting them as the file was read in, I added a function to create the matching PTR records in the dictionary as the A records were read in. At the time, I believe dropping manual management of our PTR zones removed more than 500,000 records from our DNS management. This made edits simpler for humans and also improved the speed of the checks and pillar rendering.
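
Python's standard ipaddress module makes the PTR half of this nearly free. The sketch below is an illustration, not the original function: build_ptrs and unaccounted_ips are hypothetical names, and it assumes IPv4 A records held as (hostname, ip) pairs.

```python
import ipaddress


def build_ptrs(a_records):
    """Derive the PTR records from the A records, enforcing one hostname per IP."""
    ptrs = {}
    for name, ip in a_records:
        # reverse_pointer gives e.g. "1.0.168.192.in-addr.arpa" (no trailing dot).
        owner = ipaddress.ip_address(ip).reverse_pointer + "."
        if owner in ptrs:
            raise ValueError(f"{ip} is used by both {ptrs[owner]} and {name}")
        ptrs[owner] = name
    return ptrs


def unaccounted_ips(subnet, a_records):
    """List every address in the subnet that has no A record at all."""
    used = {ipaddress.ip_address(ip) for _, ip in a_records}
    return [str(ip) for ip in ipaddress.ip_network(subnet) if ip not in used]


a_records = [
    ("net.example.com.", "192.168.0.0"),
    ("robinresource1.example.com.", "192.168.0.1"),
    ("available-192-168-0-2.ips.example.com.", "192.168.0.2"),
]
ptrs = build_ptrs(a_records)
missing = unaccounted_ips("192.168.0.0/30", a_records)
```

Here the subnet check counts the network and broadcast addresses too, since the goal was for every single IP to have a name.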

Checking connections with KS

Another place where the ALIAS-style lookup (being able to check in the dict whether or not a hostname existed with an A record) came in useful was mounting remote filesystems, whether NFS, NAS, or Gluster. Added to the KS type now was "fs=hostname.example.com.". Since each modification to the dataset re-ran the entire set of source files through the git hook, if any FS server was being decommissioned it would now throw errors if any machines in DNS still referred to using it or having it mounted. Likewise, it allowed DHCP and kickstart to be aware of any remote filesystems that should be mounted during the initial provisioning of the machine. The KS type expands again, and DNS rules another thing.
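
The fs= check is the same dictionary lookup pointed at a different key. A minimal sketch, assuming the KS values are already parsed into one dict per host (the function name and error wording are mine):

```python
def dangling_fs_refs(ks_by_host, a_names):
    """Report every host whose fs= target no longer has an A record."""
    errors = []
    for host, fields in sorted(ks_by_host.items()):
        fs = fields.get("fs")
        if fs is not None and fs not in a_names:
            errors.append(
                f"{host} mounts {fs} which has no A record. "
                f"Remove or update its fs= entry first."
            )
    return errors


ks_by_host = {
    "simpleserver.example.com.": {"fs": "nas1.example.com."},
    "otherserver.example.com.": {"fs": "nas2.example.com."},
    "nomounts.example.com.": {},
}
# nas2 has been removed from the source files, nas1 is still there.
fs_errors = dangling_fs_refs(
    ks_by_host,
    {"simpleserver.example.com.", "otherserver.example.com.",
     "nomounts.example.com.", "nas1.example.com."},
)
```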

Security

Our security team could now also much better audit what was going on on our network. Instead of just showing IPs, tcpdumps and other traffic analyzers could see exactly what was talking where - with very nice/descriptive DNS names associated. Except, the security team was seeing traffic with hostnames like "available-192.168.100.100" go by. I'd be lying if I said I wasn't extremely pleased when our security team found out that we had rogue devices in our datacenter which were directly attached to switches with manually assigned IPs -- that they only noticed and tracked down because they now knew they could trust DNS. It's worth noting that the rogue devices turned out to be a laptop that a DC technician had attached in order to assist our Network Operators in debugging a situation; however, it did set a wonderful precedent of ensuring that unknown devices would be alerted upon.

Likewise, with all IPs accounted for, it meant that the /30 networks used by our staff when connected via OpenVPN were now accountable. Instead of having dynamic pools (remember I said root-SSHing?) -- "Oh, today Bob is .56? or is it Jeff? Either way - someone from .56 connected as root and broke serverX!" -- we now had statically assigned /30s for the VPN. Likewise, those entries got DNS.

dayid-tcp-bc.vpn.example.com.                     A         192.168.200.51
dayid-tcp-gw.vpn.example.com.                     A         192.168.200.50
dayid-tcp-net.vpn.example.com.                    A         192.168.200.48
dayid-tcp.vpn.example.com.                        A         192.168.200.49
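
The four names for a /30 follow mechanically from the subnet, which can be sketched with the ipaddress module. The function name, the proto argument, and the idea that these were generated at all (rather than typed by hand) are assumptions for illustration; the name-to-address mapping follows the records above.

```python
import ipaddress


def vpn_records(user, proto, subnet, domain="vpn.example.com."):
    """Return the four (hostname, ip) A records for one user's VPN /30."""
    net = ipaddress.ip_network(subnet)
    if net.prefixlen != 30:
        raise ValueError(f"{subnet} is not a /30")
    # A /30 has exactly two usable host addresses.
    first, second = net.hosts()
    base = f"{user}-{proto}"
    return [
        (f"{base}-net.{domain}", str(net.network_address)),
        (f"{base}.{domain}", str(first)),
        (f"{base}-gw.{domain}", str(second)),
        (f"{base}-bc.{domain}", str(net.broadcast_address)),
    ]


records = vpn_records("dayid", "tcp", "192.168.200.48/30")
```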

Not only did this make it easier to know what employees are doing - it made it much easier to detect if an employee's workstation were somehow compromised. We could analyze DNS requests, outbound traffic attempts, etc. which originated from the VPN'd staff member. We could also be concerned if we saw that the VPN connection assigned to 'dayid' was actually connecting to a lot of places as 'bob@' or otherwise - which also indicates malicious intent or problems.

VPN and DNS

Hey wait - if we're putting users with VPN access into the DNS - why not just use DNS as the control-point for whether or not a user has VPN access? They need the IP/30 for the access anyways - why track it in two places? That's right. Your DNS record for your internal VPN subnet would match your LDAP/AD name. This could be handled by a salt state on the VPN servers which would use the same external pillar/dictionary we are already generating. Want to enable a VPN account to a particular resource/datacenter/user? Create a DNS record for it. Someone leaves and we need to deactivate it? DNS.
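
As a sketch of what the VPN servers would need from the pillar: pull the usernames out of the VPN subdomain and treat that list as the set of enabled accounts. The label pattern (user, then "-tcp" or "-udp") is an assumption based on the records shown earlier, "udp" in particular is my guess, and vpn_users is a hypothetical name.

```python
import re

# Assumed label format: "<user>-tcp" or "<user>-udp"; the -gw/-net/-bc names are skipped.
LABEL_RE = re.compile(r"(?P<user>.+)-(?P<proto>tcp|udp)")


def vpn_users(a_records, domain="vpn.example.com."):
    """Return the sorted usernames that have a VPN A record, i.e. VPN access."""
    suffix = "." + domain
    users = set()
    for name, _ip in a_records:
        if not name.endswith(suffix):
            continue
        match = LABEL_RE.fullmatch(name[: -len(suffix)])
        if match:
            users.add(match.group("user"))
    return sorted(users)


allowed = vpn_users(
    [
        ("dayid-tcp-bc.vpn.example.com.", "192.168.200.51"),
        ("dayid-tcp-gw.vpn.example.com.", "192.168.200.50"),
        ("dayid-tcp-net.vpn.example.com.", "192.168.200.48"),
        ("dayid-tcp.vpn.example.com.", "192.168.200.49"),
        ("simpleserver.example.com.", "192.168.0.88"),
    ]
)
```

Deleting a user's records from the source files then removes them from this list on the next pillar render, which is the whole trick.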

A massive checkout

At this point we're a long way from where I started this story with no source-control, hand-made machines, and trusting ghosts to find IPs. By now the company:

Has source code in git to handle all of this.
Has clearly documented, searchable, and often-referenced materials for all procedures regarding booting, DHCP/DNS, etc.
Has monitoring that can trust DNS - and can correlate the hostnames in monitoring with DNS to ensure that everything is monitored.
Has every IP within its internal and external address spaces in DNS, even those not in use.
Can provision machines using kickstart and DHCP by just updating DNS records.
Can account for all hardware, mac addresses, etc by using DNS.
Is safer while adding and removing machines and addresses due to the cross-analyzing nature of the records.
Can more safely watch and analyze network traffic - being able to trust DNS hostnames are correct.
Can audit/create/remove VPN connections via DNS
Has no more crashes of the named/bind/DNS service due to errant mistakes in the DNS zone files - by utilizing a combination of good saltstack code and python-based git hooks.

This DNS system now:

Ensured all targets and hostnames were valid (characters, length, etc) and FQDN.
Controlled what OS/OS version a machine was configured with.
Enforced an IP could only have 1 hostname associated directly through A/PTR relationships.
Could handle round-robin DNS through a custom ALIAS type.
Would kindly overwrite and correct if anyone did try to modify it directly on the host (since, unfortunately, almost everyone still has root...)
Used that same ALIAS type-check to ensure that filesystem mounts, hypervisors, and other cross-service machines were present and that when removed all records could be cleaned first.
Utilized salt to pre-test named.conf changes before making modifications to the actual host system due to a complete lack of testing environment, laboratory, QA, or CI available within the company to ensure that named could not suffer due to flaws in source files or salt code.
Tracked all IPs used within the company and enforced continuous naming conventions for clarity.
Was no longer "all-master" and instead used a much-more-common set of master/slave relationships along with what we called "fuses" to allow purposeful or accidental breakage at many joints without harming the overall service.
Improved the company's salt code and was a continuous identifier of flaws in other services' code due to the "exit-upon-any-failure" methodology of writing the nameserver's state code.
Utilized one main code bit to handle the jobs of multiple git hooks, salt external pillars, local renderer, and validation tool.
Allowed the use of external/public and internal/intercorporate views along with custom views for our subsidiaries - along with allowing combined-views which would serve severed portions of different zones appropriately based upon the dictionary application of rulesets for those resources.
Controlled drivers, kickstarting, bonding of network interfaces, and MAC inventory management.
Prevented provisioning duplicate services on the same subnets and hypervisors.

Is this ridiculous? Yes. Should DNS ever be in control of this many things somewhere? Probably not. But this was still that company's story. It was my story of how - armed with only the ability to change DNS - I was able to take control of a company's network and improve security, inventory, auditing, and staff interactions with multiple different types of systems.

I still sometimes laugh and miss "the old days" and I'll still happily repeat war-stories to anyone who wants to hear about the days of having to map an entire network just to find an available IP. Has it been fun perverting a simple DNS system through git hooks and salt to make it control a myriad of values within the company? Yeah... yeah it has.

Thanks

If you read this far: wow, and thank you. I must say, though the above has a terrible-author mixture of "I" and "we", that this was not solely my own project. Yes, I pushed to take control of these things despite only truly having control of DNS - but I was not alone. A handful of other peers within the company were also driven absolutely nuts with the mess we found here. To them, a thank-you. They all have stickers from about 5 years before this project was finished (proclaiming its success) and know who they are.

Last

Sure, you can say I'm the bad guy because I used the wrong tool to solve a problem. I'll still think that when I was stranded and all I had was a hammer - I still made one hell of a village with it.

If you only have a hammer - is it a golden hammer?