I ran into a rather bizarre issue recently. I’m currently in the process of replacing our existing 2012 R2 Domain Controllers with 2019 Domain Controllers. We have two domains in the forest with a parent / child relationship.
I completed the upgrade of the parent domain without a problem. This obviously included running forest prep and domain prep behind the scenes during the DCPromo process. I verified the forest schema version afterwards and it showed the expected value of 88:
Get-ADObject (Get-ADRootDSE).schemaNamingContext -Properties objectVersion
The domain version showed the expected value of 16 as well:
Get-ADObject -LDAPFilter '(&(objectClass=container)(cn=ActiveDirectoryUpdate))' -Properties * | Select-Object Name, CanonicalName, revision
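In a parent / child forest it can be handy to run both checks against each domain explicitly. Here is a minimal sketch of how that might look. The function names `ConvertTo-SchemaRelease` and `Get-ADPrepVersion` are made up for illustration, the domain name is a placeholder, and the lookup table only covers the releases relevant to this upgrade.

```powershell
# Hypothetical helper: translate a schema objectVersion into a Windows Server release.
function ConvertTo-SchemaRelease {
    param([int]$ObjectVersion)
    switch ($ObjectVersion) {
        56 { 'Windows Server 2012' }
        69 { 'Windows Server 2012 R2' }
        87 { 'Windows Server 2016' }
        88 { 'Windows Server 2019' }
        default { "Unknown ($ObjectVersion)" }
    }
}

# Hypothetical wrapper: requires the ActiveDirectory module and a reachable DC
# in the target domain. Runs the same two queries shown above against -Server.
function Get-ADPrepVersion {
    param([string]$Domain)   # placeholder, e.g. 'child.example.com'
    $rootDse = Get-ADRootDSE -Server $Domain
    $schema  = Get-ADObject $rootDse.schemaNamingContext -Server $Domain -Properties objectVersion
    $update  = Get-ADObject -Server $Domain -Properties revision `
                   -LDAPFilter '(&(objectClass=container)(cn=ActiveDirectoryUpdate))'
    [pscustomobject]@{
        Domain         = $Domain
        SchemaVersion  = $schema.objectVersion
        SchemaRelease  = ConvertTo-SchemaRelease $schema.objectVersion
        DomainRevision = $update.revision
    }
}
```

Calling `Get-ADPrepVersion` once per domain would give a side-by-side view of the schema version and the domain prep revision.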
So using the same process, I went ahead and deployed the first 2019 domain controller in the child domain. Server build was fine, updated fine, rebooted fine. Everything was fine. I added the ADDS and DNS roles to the server, that seemed to go fine.
And then I promoted it to a domain controller.
Initial replication seemed to be fine. I gave it maybe 10 minutes or so until repadmin showed everything was happy and replicating. But then, the RDP session was suddenly black. I wiggled the mouse, and nothing happened. I shrugged and forced a reboot because whatever, it’ll come right back, right?
So apparently Windows crapped the bed. I tried a couple of things: Safe Mode, Last Known Good Configuration, and so on, but no dice. If I use Remote Desktop, the screen just stays black. Attempts to launch Task Manager with the shortcut CTRL+SHIFT+ESC were unsuccessful. If I connect to the VMware console, I just see the spinning white dots of the boot screen. I figured eh, whatever, it’s a domain controller. I’ll just force remove this one from AD and build it again.
So I shut down the bad one, ran through cleaning up AD, and got to work building a new one. After fully patching and adding the roles, I promoted it to a domain controller again. Initial replication looked ok, stuff seemed to be working.
And then the screen went black again.
At this point, after thoroughly “wtf-ing”, I started diving deeper. It became apparent that the server itself was actually technically functioning. Repadmin even showed it was replicating! UNC paths worked, remote registry worked, PowerShell remoting worked. EVERYTHING WORKED but I couldn’t actually get to the desktop of the server with RDP or with the VMware console.
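If you land in the same spot, a few of those remote checks can be bundled together and run from another machine. This is only a rough sketch under my own assumptions: `Test-DCRemotely` is a hypothetical function name, the computer name is a placeholder, and it expects `repadmin` (from the AD DS tools) to be available on the machine you run it from.

```powershell
# Hypothetical helper: probe a DC that has no usable desktop, from another box.
function Test-DCRemotely {
    param([string]$ComputerName)   # placeholder, e.g. 'DC03'
    [pscustomobject]@{
        ComputerName = $ComputerName
        # Is the SYSVOL share reachable over a UNC path?
        SysvolShare  = Test-Path "\\$ComputerName\SYSVOL"
        # Does WinRM (PowerShell remoting) answer?
        PSRemoting   = [bool](Test-WSMan -ComputerName $ComputerName -ErrorAction SilentlyContinue)
        # Raw replication status text as reported by repadmin
        Replication  = (repadmin /showrepl $ComputerName) -join "`n"
    }
}
```

If all of these come back healthy while the console is still stuck, the problem is more likely in the interactive logon path than in AD itself.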
I figured OK, let’s see what happens if I DCPromo it again and demote it. I connected to the server via Server Manager running on another box and was able to successfully demote it. Lo and behold, after a reboot I suddenly had a desktop again.
I promoted and demoted it a few times and the process was 100% repeatable. I built another server from scratch and repeated it again and it continued happening. And c’mon now, of course I tried rebooting it. I left it with the spinning dots for over 24 hours to see if it would sort things out. Nope.
At this point I was on the phone with Microsoft. After forcing a BSOD and providing a complete memory dump for analysis, they tracked it down to the settings for “Bypass traverse checking” being set incorrectly.
To find out what policies were being applied to this domain controller, I ran RSOP.msc from an administrative command prompt on a different domain controller that was still working (because all the DCs should be inheriting the same policies) and then navigated to the “Bypass traverse checking” setting under:
Computer Configuration > Windows Settings > Security Settings > Local Policies > User Rights Assignment
In my environment, apparently the Default Domain Controllers Policy had been changed to remove “Everyone” from this setting. In fact, my environment only had entries for “BUILTIN\Pre-Windows 2000 Compatible Access” and “Domain Users”.
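Since PowerShell remoting still worked on the broken server, the same setting can also be checked without a desktop. The sketch below is one way to do it, not what Microsoft or I actually used: it parses a `secedit /export` of the user rights area and lists the accounts holding `SeChangeNotifyPrivilege` (the internal name for “Bypass traverse checking”). The function name `Get-UserRightEntry` and the server name are hypothetical.

```powershell
# Hypothetical helper: pull the account list for one user right out of a
# secedit export. Lines look like:
#   SeChangeNotifyPrivilege = *S-1-1-0,*S-1-5-32-554
function Get-UserRightEntry {
    param(
        [string[]]$ExportLines,
        [string]$Right = 'SeChangeNotifyPrivilege'   # "Bypass traverse checking"
    )
    $line = $ExportLines | Where-Object { $_ -match "^\s*$Right\s*=" } | Select-Object -First 1
    if (-not $line) { return @() }
    # Take everything after the first '=', split on commas, drop the leading '*'
    ($line -split '=', 2)[1].Split(',') | ForEach-Object { $_.Trim().TrimStart('*') }
}

# Usage over PowerShell remoting ('DC03' is a placeholder):
# $lines = Invoke-Command -ComputerName DC03 -ScriptBlock {
#     $tmp = Join-Path $env:TEMP 'rights.inf'
#     secedit /export /cfg $tmp /areas USER_RIGHTS | Out-Null
#     Get-Content $tmp
# }
# (Get-UserRightEntry $lines) -contains 'S-1-1-0'   # S-1-1-0 is the Everyone SID
```

In an environment like mine, the result would have been missing `S-1-1-0` and would have shown only `S-1-5-32-554` (Pre-Windows 2000 Compatible Access) plus the domain’s Domain Users SID.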
I added “Everyone” to this GPO setting and forced the “bad” domain controller to update its policy with PowerShell.
Invoke-GPUpdate -Computer <ServerName> -RandomDelayInMinutes 0 -Force
I then restarted the bad domain controller and we’re back in business!
Microsoft went on to say that “this is a known issue and is currently being worked on” but they do not have an official fix for it at this time. Adding “Everyone” to the Bypass traverse checking setting does fix it for me. Then again, the domain was working just fine with the “bad” settings in place before I deployed the 2019 server, so it does still appear to be somewhat of a bug.