Internet time server
Infoblox DNS servers
Domain controllers in master domain (let's call this D1)
PDC in subdomain (SD2, SD3, etc)
DC in SD2
Member server in SD2
One issue I found right away was that the PDC in SD2 was getting time from ntp.D1.com. This DNS entry pointed only at the PDC in D1, so the time source was not redundant, which caused issues when we took the D1 PDC down. I enabled round-robin DNS for this by adding a second DNS record with the same name (ntp.D1.com) pointing at the D1 DC's IP.
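A quick way to confirm that the round-robin change took effect is to resolve the name and count the distinct addresses behind it. This is only a rough sketch: `resolve_all` and `is_redundant` are helper names I made up for illustration, and ntp.D1.com is the placeholder name used in this post.

```python
import socket


def resolve_all(hostname, port=123):
    """Return the sorted distinct IPv4 addresses that a name resolves to."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_DGRAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


def is_redundant(addresses, minimum=2):
    """True when at least `minimum` distinct addresses back the name."""
    return len(set(addresses)) >= minimum
```

Calling `is_redundant(resolve_all("ntp.D1.com"))` from a machine that uses the same DNS servers as the PDCs should return True once the second record exists. Note that a local resolver cache may hide a record that was added recently.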
The next issue was that the D1 DC was not pointed at the D1 PDC, as it would be in a domain-based hierarchy. Instead it was pointing at the Infoblox server. I decided to ignore this for now.
Next, I put together a list of all the PDCs in every domain and verified that each had NTP access to the DCs in D1. I also audited the registry on each and found another issue. The ",0x1" flag for special polling intervals, which should be appended to "ntp.D1.com" in the NtpServer registry entry, was missing in many domains. I'm working through getting this added, but it's not a top priority.
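The registry audit can be partly automated once the NtpServer strings have been collected. The sketch below assumes the usual W32Time format, a space-delimited list of peers with each peer optionally followed by a comma and a hex flag value. The function names are hypothetical, and how the strings get collected from each PDC is left out.

```python
SPECIAL_INTERVAL = 0x1  # the ",0x1" flag discussed above


def parse_ntp_server(value):
    """Split an NtpServer string into (peer, flags) pairs.

    A peer with no ",0x..." suffix is reported with flags 0.
    """
    peers = []
    for entry in value.split():
        name, _, flags = entry.partition(",")
        peers.append((name, int(flags, 16) if flags else 0))
    return peers


def missing_special_interval(value):
    """Return the peers whose flags do not include the 0x1 bit."""
    return [name for name, flags in parse_ntp_server(value)
            if not flags & SPECIAL_INTERVAL]
```

For example, `missing_special_interval("ntp.D1.com")` returns `["ntp.D1.com"]`, while `missing_special_interval("ntp.D1.com,0x1")` returns an empty list. The check is done on the bit rather than on the literal text so that combined flag values such as 0x9 still count as having the flag set.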
Next, I audited the event logs in several domains (~5 servers, a PDC and a DC each) and determined that there was some correlation between reboots and time sync failing. I ruled out patch issues and change issues, since the affected servers had nothing in common there. There also seemed to be a strange inverse correlation between the PDC receiving time properly and the member servers receiving time properly. When one worked, the other didn't, and vice versa.
Last, I built an entire test domain mimicking a normal domain, turned on logging, and started testing. Things went wrong quickly, and by parsing the packet reception entries in the logs I eventually determined that my SD2 PDC was somehow at stratum 15. I got logging turned on in D1 and found that the Infoblox servers were at stratum 13! If you count the list above, the highest stratum we should see is 6, and MS caps stratum at 15. Stratum 15 is also an indicator flag that the stratum count may not be correct, but in this case it had been incremented correctly from the Infoblox servers, which are the root of our issues.
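The stratum a server advertises can also be read directly rather than dug out of the W32Time logs, since it is the second byte of every NTP reply. Here is a minimal sketch of that check plus the "count the list" arithmetic. It assumes the Internet time server is stratum 1 (the post does not say), and the function names and the `TIERS` list are mine, not part of any tool.

```python
import socket

# The hierarchy from the list at the top of this post, top to bottom.
TIERS = [
    "Internet time server",
    "Infoblox DNS servers",
    "Domain controllers in D1",
    "PDC in subdomain",
    "DC in SD2",
    "Member server in SD2",
]


def expected_strata(tiers, top=1):
    """Each tier should sit one stratum below its source (assumes top tier = `top`)."""
    return {tier: top + i for i, tier in enumerate(tiers)}


def build_request():
    """48-byte client request: LI=0, version 3, mode 3 (client); the rest is zero."""
    return b"\x1b" + 47 * b"\0"


def parse_stratum(packet):
    """The stratum is the second byte of an NTP packet."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    return packet[1]


def query_stratum(host, timeout=2.0):
    """Send one request to UDP port 123 and return the advertised stratum."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_request(), (host, 123))
        data, _ = s.recvfrom(512)
    return parse_stratum(data)
```

Under that assumption, `expected_strata(TIERS)` puts the Infoblox tier at 2 and the member server at 6, so comparing it against `query_stratum(...)` for each tier would show where the count jumps. In our case the jump is at the Infoblox servers, which report 13 and push the D1 DCs to 14 and the SD2 PDC to 15.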
We're opening a case with Infoblox to figure out why their servers are reporting such a high stratum. Once that's fixed, I'll know whether I can consider this case closed.