There are infrastructure problems that look trivial from the outside.
“Just add a DNS entry.”
“Just create a firewall rule.”
“Just use the router’s DNS resolver.”
And technically, yes, that works — until the network grows a little, until VLANs appear, until services live on multiple interfaces, until you want reproducibility, and until every “just one more override” becomes one more small piece of hidden, non-idempotent state in a firewall web UI.
This is the story of one of those deceptively small problems: local DNS in a segmented LAN.
More precisely:
How do you run clean, fast, deterministic DNS for internal services across multiple VLANs, without manually maintaining pfSense host overrides forever?
The final solution was not exotic. No Kubernetes, no Consul, no enterprise appliance, no service mesh. Just a Debian server, CoreDNS, pfSense/Unbound, Ansible, and some careful thinking about how DNS resolution actually behaves.
And as usual with infrastructure, the hard part was not installing the software. The hard part was understanding the control flow.
The starting point: pfSense DNS worked, but not well enough
The network already used pfSense as the central firewall and DNS resolver. That is a fairly common and reasonable setup.
pfSense runs Unbound as its DNS resolver, and for many environments that is good enough:
- DHCP integration
- local host overrides
- domain overrides
- forwarding or recursive resolution
- DNS over TLS support
- a UI that is approachable enough
For a small LAN, this is perfectly fine.
But over time, the local DNS setup had grown beyond “a few host overrides.”
There were many internal services:

```
prometheus.myhost.lan
grafana.myhost.lan
git.myhost.lan
dev.myhost.lan
vpn.myhost.lan
```
Some entries were internal only. Some belonged to a public domain used internally. Some needed different answers depending on the client VLAN. And all of it was managed through the pfSense UI.
That created a few problems:
- manual changes
- no proper version control
- no easy rollback
- no idempotency
- no reusable deployment logic
- no real infrastructure-as-code workflow
This was not really a DNS performance problem. It was a DNS control plane problem.
pfSense remained excellent as a firewall and as a central resolver/cache, but it was not the right place to maintain a growing, structured, local DNS authority.
The architectural decision: separate resolver and authority
The key mental shift was this:
The firewall should not necessarily be the source of truth for all internal DNS records.
A cleaner architecture is to separate responsibilities:
```
Clients
  ↓
CoreDNS (local authoritative records, VLAN-aware logic)
  ↓
pfSense / Unbound (resolver, cache)
  ↓
Public upstream DNS
```
Or more specifically:
- CoreDNS = local DNS authority / overlay / VLAN-aware logic
- pfSense Unbound = recursive resolver, cache, upstream forwarding, firewall integration
- Public DNS = global resolution
This gives us a layered model:
| Layer | Responsibility |
|---|---|
| CoreDNS | Internal records, local overlays, VLAN-aware answers |
| pfSense / Unbound | General DNS resolution, cache, upstream forwarding |
| Public resolvers | External DNS resolution |
The important part: CoreDNS does not need to replace pfSense entirely.
It can simply become the part of the system that handles local DNS logic in a declarative way.
Why CoreDNS?
CoreDNS is small, fast, plugin-based, and configured through a simple text file called a Corefile.
That makes it ideal for this kind of task:
- easy to deploy on Debian
- easy to template with Ansible
- easy to version in Git
- supports static host records
- supports forwarding
- supports Prometheus metrics
- supports client-aware views
- simple enough to reason about
The goal was not to build an overcomplicated internal DNS platform. The goal was to replace hidden UI state with a clean, reproducible system.
The basic CoreDNS model
A minimal CoreDNS setup for local records looks like this:
```
myhost.lan:53 {
    errors
    cache 300
    hosts {
        192.168.10.20 prometheus.myhost.lan
        192.168.10.21 grafana.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}
```
The logic is:
- If CoreDNS knows the hostname, return the local IP.
- If not, fallthrough.
- Forward the unresolved query to pfSense/Unbound.
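That chain can be sketched in a few lines of Python. The records are hypothetical and the sketch is purely illustrative:

```python
# Toy model of the plugin chain: answer from local records if known,
# otherwise fall through and forward upstream. Records are hypothetical.
LOCAL_RECORDS = {
    "prometheus.myhost.lan": "192.168.10.20",
    "grafana.myhost.lan": "192.168.10.21",
}

def resolve(qname: str) -> str:
    answer = LOCAL_RECORDS.get(qname)
    if answer is not None:         # hosts plugin: local hit
        return answer
    return "forward:192.168.10.1"  # fallthrough, then forward to pfSense/Unbound

print(resolve("grafana.myhost.lan"))  # 192.168.10.21
print(resolve("www.example.com"))     # forward:192.168.10.1
```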
For a public domain used internally, the same idea works:
```
example.org:53 {
    errors
    cache 300
    hosts {
        192.168.10.30 internal-api.example.org
        192.168.10.31 dev-only.example.org
        fallthrough
    }
    forward . 192.168.10.1
}
```
This is useful for split-horizon style DNS:
```
internal-api.example.org → local IP inside LAN
www.example.org          → public DNS via upstream
```
So far, so simple.
But the real problem was more interesting.
The multi-VLAN problem
One internal service, git.myhost.lan, was reachable from four VLANs.
The server itself had interfaces in all four VLANs:
```
VLAN 10 → 192.168.10.50
VLAN 20 → 192.168.20.50
VLAN 30 → 192.168.30.50
VLAN 40 → 192.168.40.50
```
The goal was simple:
```
Client in VLAN 10 → git.myhost.lan → 192.168.10.50
Client in VLAN 20 → git.myhost.lan → 192.168.20.50
Client in VLAN 30 → git.myhost.lan → 192.168.30.50
Client in VLAN 40 → git.myhost.lan → 192.168.40.50
```
Why?
Because then traffic stays local to the VLAN.
Without that, a client in VLAN 20 might resolve git.myhost.lan to the VLAN 10 IP, causing traffic to go through the firewall. Since the network did not yet have switches with proper inter-VLAN routing capabilities, that meant unnecessary firewall traversal and an ugly topology leak.
The workaround had been:
```
git1.myhost.lan → VLAN 10 IP
git2.myhost.lan → VLAN 20 IP
git3.myhost.lan → VLAN 30 IP
git4.myhost.lan → VLAN 40 IP
```
Functional, but not elegant.
The hostname should describe the service, not the network topology.
So the real requirement was:
Same hostname, different answer depending on the querying client subnet.
That is split-horizon DNS, or more precisely, client-subnet-aware DNS.
CoreDNS views: the promising mechanism
CoreDNS has a view plugin that can match queries based on client information.
Conceptually, this allows something like:
```
myhost.lan:53 {
    view vlan10 {
        expr incidr(client_ip(), '192.168.10.0/24')
    }
    hosts {
        192.168.10.50 git.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}
```
And for VLAN 20:
```
myhost.lan:53 {
    view vlan20 {
        expr incidr(client_ip(), '192.168.20.0/24')
    }
    hosts {
        192.168.20.50 git.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}
```
This was the right concept.
But getting the actual CoreDNS control flow right required several iterations.
Important lesson 1: CoreDNS must see the real client IP
For VLAN-aware DNS to work, CoreDNS needs to see the original client IP.
That sounds obvious, but it has an important implication.
If clients query pfSense, and pfSense forwards to CoreDNS, then CoreDNS may only see pfSense as the source:
Client → pfSense/Unbound → CoreDNS
From CoreDNS’ perspective, the client is then not 192.168.20.123, but the firewall.
That breaks view-based matching.
So for this setup, clients that need VLAN-aware DNS should query CoreDNS directly, usually via DHCP DNS server settings per VLAN.
Alternatively, the split-horizon logic would need to live in Unbound/pfSense itself.
For this setup, the clean path was:
Clients → CoreDNS → pfSense/Unbound → upstream
CoreDNS becomes the first resolver for the LAN clients. It handles local logic, then forwards everything else to pfSense.
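This constraint is easy to model. A small Python sketch, assuming the VLAN CIDRs from this setup, mirrors the view-matching expression: the view is chosen from the source address CoreDNS actually sees.

```python
import ipaddress

# Sketch: mimic the view plugin's incidr(client_ip(), ...) matching.
# CIDRs follow the article's VLAN layout; this is illustrative only.
VIEWS = {
    "vlan10": ipaddress.ip_network("192.168.10.0/24"),
    "vlan20": ipaddress.ip_network("192.168.20.0/24"),
}

def view_for(source_ip: str):
    """Return the first view whose CIDR contains the query's source IP."""
    addr = ipaddress.ip_address(source_ip)
    for name, net in VIEWS.items():
        if addr in net:
            return name
    return None

# Client queries CoreDNS directly: correct view.
print(view_for("192.168.20.123"))  # vlan20
# Query forwarded by pfSense (192.168.10.1): CoreDNS matches the
# firewall's address instead, so the client lands in the wrong view.
print(view_for("192.168.10.1"))    # vlan10
```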
Important lesson 2: binding CoreDNS correctly
The Debian server already had other DNS-related services, including dnsmasq on a Docker/libvirt-style subnet.
The initial CoreDNS attempt failed with:
```
listen tcp :53: bind: address already in use
```
At first, it looked like CoreDNS was ignoring explicit IPs. The real issue was a Corefile grammar detail: the part before the port in a server block key is a zone, not a listen address. So this does not mean "bind to 127.0.0.1":

```
127.0.0.1:53 {
    ...
}
```

It defines a server for the zone 127.0.0.1 on port 53, and CoreDNS still tries to listen on all interfaces.
The correct way to restrict listener interfaces is to use the bind plugin inside the server block:
```
.:53 {
    bind 192.168.10.53 192.168.20.53 192.168.30.53 192.168.40.53
    errors
    cache 300
    forward . 192.168.10.1
}
```
And crucially: every relevant server block must contain the same bind restriction.
Otherwise CoreDNS may still try to listen broadly on port 53.
The resulting pattern:
```
myhost.lan:53 {
    bind 192.168.10.53 192.168.20.53 192.168.30.53 192.168.40.53
    ...
}
```
This avoids conflicts with dnsmasq or other services bound to different local interfaces.
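The underlying collision is easy to reproduce outside CoreDNS. A minimal Python sketch, using an ephemeral port since binding port 53 requires root: once one socket holds a specific address on a port, a wildcard bind on the same port fails, which is what happens when a server block omits bind.

```python
import errno
import socket

# Reproduce the bind conflict: a wildcard listener collides with a
# socket already bound to a specific address on the same port.
# An ephemeral port stands in for :53, which would need root.
specific = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
specific.bind(("127.0.0.1", 0))       # stands in for dnsmasq on one IP
port = specific.getsockname()[1]

wildcard = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    wildcard.bind(("0.0.0.0", port))  # CoreDNS without bind: listen on all
    conflict = False
except OSError as exc:
    conflict = exc.errno == errno.EADDRINUSE
print(conflict)  # True

specific.close()
wildcard.close()
```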
Important lesson 3: CoreDNS does not cascade between server blocks
This was the most important logic trap.
The first structure looked roughly like this:

1. a VLAN-specific block for myhost.lan
2. a generic/global block for myhost.lan
The expectation was:
1. Try VLAN-specific hosts
2. If not found, fall through to global hosts
3. If not found, forward upstream
But CoreDNS does not work like that.
Once a request lands in a matching server block/view, it runs that plugin chain. It does not then continue into a later generic server block.
So this failed:
```
myhost.lan:53 {
    view vlan10 {
        expr incidr(client_ip(), '192.168.10.0/24')
    }
    hosts {
        192.168.10.50 git.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}

myhost.lan:53 {
    hosts {
        192.168.10.20 prometheus.myhost.lan
        192.168.10.21 grafana.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}
```
A VLAN 10 client asking for prometheus.myhost.lan would enter the VLAN 10 block, not find the record, then forward upstream. It would not check the later global hosts block.
That explained why the “global overlay” seemed to be ignored.
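The trap can be reproduced with a toy dispatcher in Python. The blocks below are a hypothetical model of the broken two-block layout; the point is that a query enters exactly one block and never falls into a later one.

```python
import ipaddress

# Hypothetical model of CoreDNS dispatch: the first matching server
# block wins, and the query runs only that block's plugin chain.
BLOCKS = [
    {"name": "vlan10",
     "cidr": ipaddress.ip_network("192.168.10.0/24"),
     "hosts": {"git.myhost.lan": "192.168.10.50"}},
    {"name": "global",
     "cidr": None,  # matches every client
     "hosts": {"prometheus.myhost.lan": "192.168.10.20"}},
]

def dispatch(qname: str, client: str) -> str:
    addr = ipaddress.ip_address(client)
    for block in BLOCKS:
        if block["cidr"] is None or addr in block["cidr"]:
            # The query stays in this block: local hosts, else forward.
            return block["hosts"].get(qname, "forwarded-upstream")
    return "forwarded-upstream"

# A VLAN 10 client asking for prometheus enters the vlan10 block and is
# forwarded upstream; the later global block is never consulted.
print(dispatch("prometheus.myhost.lan", "192.168.10.123"))  # forwarded-upstream
print(dispatch("git.myhost.lan", "192.168.10.123"))         # 192.168.10.50
```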
Important lesson 4: hosts can only be used once per zone block
The next idea was to place multiple hosts sections inside one block:
```
myhost.lan:53 {
    view vlan10 { ... }
    hosts {
        192.168.10.50 git.myhost.lan
        fallthrough
    }
    hosts {
        192.168.10.20 prometheus.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}
```
But that also failed, because the hosts plugin cannot simply be repeated like that in the same zone block.
The eventual solution was much simpler and more robust:
Materialize the global zone records into every matching VLAN-specific block.
In other words, each VLAN-specific block contains:

- VLAN-specific records
- global records for the same zone
- fallthrough
- forward
That keeps the plugin chain simple and deterministic.
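A sketch of that materialization step, with hypothetical data mirroring the Ansible logic: global zone records are copied into every VLAN block for the same zone.

```python
# Sketch of the materialization step: copy global zone records into
# every VLAN-specific block for the same zone. Data is hypothetical.
global_zones = {
    "myhost.lan": [
        "192.168.10.20 prometheus.myhost.lan",
        "192.168.10.21 grafana.myhost.lan",
    ],
}

vlan_blocks = [
    {"view": "vlan10", "zone": "myhost.lan",
     "records": ["192.168.10.50 git.myhost.lan"]},
    {"view": "vlan20", "zone": "myhost.lan",
     "records": ["192.168.20.50 git.myhost.lan"]},
]

for block in vlan_blocks:
    # VLAN-specific records first, then the shared global records.
    block["records"] += global_zones.get(block["zone"], [])

print(vlan_blocks[1]["records"])
# ['192.168.20.50 git.myhost.lan', '192.168.10.20 prometheus.myhost.lan',
#  '192.168.10.21 grafana.myhost.lan']
```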
The working CoreDNS pattern
A simplified final CoreDNS structure looks like this:
```
.:53 {
    bind 192.168.10.53 192.168.20.53 192.168.30.53 192.168.40.53
    errors
    cache 300
    prometheus 192.168.10.53:9153
    forward . 192.168.10.1
}

myhost.lan:53 {
    bind 192.168.10.53 192.168.20.53 192.168.30.53 192.168.40.53
    view vlan10 {
        expr incidr(client_ip(), '192.168.10.0/24')
    }
    errors
    cache 300
    hosts {
        192.168.10.50 git.myhost.lan
        # Global records injected into the VLAN view
        192.168.10.20 prometheus.myhost.lan
        192.168.10.21 grafana.myhost.lan
        192.168.10.22 alertmanager.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}

myhost.lan:53 {
    bind 192.168.10.53 192.168.20.53 192.168.30.53 192.168.40.53
    view vlan20 {
        expr incidr(client_ip(), '192.168.20.0/24')
    }
    errors
    cache 300
    hosts {
        192.168.20.50 git.myhost.lan
        # Same global records
        192.168.10.20 prometheus.myhost.lan
        192.168.10.21 grafana.myhost.lan
        192.168.10.22 alertmanager.myhost.lan
        fallthrough
    }
    forward . 192.168.10.1
}
```
Now the behavior is exactly what we want:
```
VLAN 10 client:
  git.myhost.lan        → 192.168.10.50
  prometheus.myhost.lan → 192.168.10.20
  google.com            → forwarded to pfSense

VLAN 20 client:
  git.myhost.lan        → 192.168.20.50
  prometheus.myhost.lan → 192.168.10.20
  google.com            → forwarded to pfSense
```
The service-specific topology is handled locally, while general records remain shared.
Making it idempotent with Ansible
The point of this was not just to make DNS work once. The point was to make it reproducible.
The role uses variables like this:
```yaml
coredns_upstreams:
  - "192.168.10.1"

coredns_bind_ips:
  - "192.168.10.53"
  - "192.168.20.53"
  - "192.168.30.53"
  - "192.168.40.53"

coredns_prometheus_enabled: true
coredns_prometheus_listen: "192.168.10.53:9153"

coredns_global_zones:
  - name: "myhost.lan"
    records:
      - "192.168.10.20 prometheus.myhost.lan"
      - "192.168.10.21 grafana.myhost.lan"
      - "192.168.10.22 alertmanager.myhost.lan"
  - name: "example.org"
    records:
      - "192.168.10.30 internal-api.example.org"
      - "192.168.10.31 dev-only.example.org"

coredns_views:
  - name: "vlan10"
    client_cidrs:
      - "192.168.10.0/24"
    zones:
      - name: "myhost.lan"
        records:
          - "192.168.10.50 git.myhost.lan"
  - name: "vlan20"
    client_cidrs:
      - "192.168.20.0/24"
    zones:
      - name: "myhost.lan"
        records:
          - "192.168.20.50 git.myhost.lan"
```
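Before templating, it can be worth validating this data model. A small Python sketch, a hypothetical helper rather than part of the role, checks that every CIDR and record line parses:

```python
import ipaddress

# Hypothetical sanity check for the role variables: every view CIDR must
# parse and every record line must be "IP hostname" within its zone.
# The data below mirrors a slice of the YAML above.
coredns_views = [
    {"name": "vlan10", "client_cidrs": ["192.168.10.0/24"],
     "zones": [{"name": "myhost.lan",
                "records": ["192.168.10.50 git.myhost.lan"]}]},
    {"name": "vlan20", "client_cidrs": ["192.168.20.0/24"],
     "zones": [{"name": "myhost.lan",
                "records": ["192.168.20.50 git.myhost.lan"]}]},
]

def validate(views) -> bool:
    for view in views:
        for cidr in view["client_cidrs"]:
            ipaddress.ip_network(cidr)      # raises on an invalid CIDR
        for zone in view["zones"]:
            for record in zone["records"]:
                ip, host = record.split()
                ipaddress.ip_address(ip)    # raises on an invalid IP
                assert host.endswith(zone["name"]), record
    return True

print(validate(coredns_views))  # True
```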
The key part in the Jinja template is where the global records are injected into matching view-specific zones:
```jinja
hosts {
{% for record in zone.records %}
    {{ record }}
{% endfor %}
{# Add global zone entries if they match the same domain #}
{% for gzone in coredns_global_zones %}
{% if gzone.name == zone.name %}
    # Global records for {{ gzone.name }}
{% for grecord in gzone.records %}
    {{ grecord }}
{% endfor %}
{% endif %}
{% endfor %}
    fallthrough
}
```
That gives every VLAN view a complete local view of the zone.
It is not the most theoretically elegant model, but it is explicit, robust, and easy to reason about — which is usually what I want in infrastructure.
Docker: the loopback trap
One extra issue appeared after switching DNS on the Debian host itself.
The host used a new /etc/resolv.conf, including:
```
nameserver 127.0.0.1
nameserver 192.168.10.1
```
That worked for the host.
But Docker containers have their own network namespace. Inside a container, 127.0.0.1 does not mean the Debian host; it means the container itself.
So containers could not use the host’s loopback DNS.
The fix was to configure Docker explicitly with a DNS server reachable from containers, in /etc/docker/daemon.json:

```json
{
  "dns": [
    "192.168.10.53",
    "192.168.10.1"
  ]
}
```
Then:

```sh
sudo systemctl restart docker
```
And test:

```sh
docker run --rm alpine nslookup git.myhost.lan
docker run --rm alpine nslookup google.com
```
That restored resolution inside containers as well.
Testing before switching production systems
This is the part that matters in real infrastructure work: never switch everything blindly.
The safe test process was:
```sh
dig @192.168.10.53 git.myhost.lan +short
dig @192.168.10.53 prometheus.myhost.lan +short
dig @192.168.10.53 google.com +short
```
From clients in different VLANs:
```sh
dig @192.168.10.53 git.myhost.lan +short
dig @192.168.20.53 git.myhost.lan +short
dig @192.168.30.53 git.myhost.lan +short
dig @192.168.40.53 git.myhost.lan +short
```
Expected:

```
VLAN 10 → 192.168.10.50
VLAN 20 → 192.168.20.50
VLAN 30 → 192.168.30.50
VLAN 40 → 192.168.40.50
```
CoreDNS request logging was useful during rollout; adding the log plugin alongside errors in a server block prints every query:

```
log
errors
```

And system logs:

```sh
journalctl -u coredns -f
```
The critical thing to verify was that CoreDNS saw real client IPs, not only the firewall.
Observability: Prometheus metrics
CoreDNS can expose Prometheus metrics with one line:
```
prometheus 192.168.10.53:9153
```
Then Prometheus can scrape:
```yaml
scrape_configs:
  - job_name: "coredns"
    static_configs:
      - targets:
          - "192.168.10.53:9153"
```
Useful metrics include:
- coredns_dns_requests_total
- coredns_dns_responses_total
- coredns_cache_hits_total
- coredns_cache_misses_total
- coredns_dns_request_duration_seconds
This is not just nice to have. DNS is foundational enough that visibility matters.
High NXDOMAIN rates, slow upstream responses, broken clients, excessive lookups — all of that becomes visible.
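As an illustration, the ratios behind such alerts are simple arithmetic over those counters; the numbers below are hypothetical samples, not measurements from this setup.

```python
# Hypothetical counter samples as scraped from the CoreDNS metrics
# endpoint; dashboards and alerts are built on derived ratios like these.
requests_total = 12_000  # coredns_dns_requests_total
nxdomain_total = 1_800   # responses with rcode NXDOMAIN
cache_hits = 9_450       # coredns_cache_hits_total
cache_misses = 2_550     # coredns_cache_misses_total

nxdomain_rate = nxdomain_total / requests_total
cache_hit_ratio = cache_hits / (cache_hits + cache_misses)
print(round(nxdomain_rate, 3), round(cache_hit_ratio, 4))  # 0.15 0.7875
```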
The final architecture
The resulting setup looks like this:
```
LAN clients
  ↓
CoreDNS on central Debian server
(local records / VLAN-aware views)
  ↓
pfSense Unbound
  ↓
public upstream DNS
```
For regular internal services:

```
prometheus.myhost.lan → shared local IP
grafana.myhost.lan    → shared local IP
```

For topology-aware services:

```
git.myhost.lan → VLAN-local IP depending on client subnet
```

For everything else:

```
external domains → pfSense/Unbound → upstream DNS
```
And all of it is deployed through Ansible.
That means:
- no manual pfSense host override editing
- no hidden state
- no one-off UI changes
- no topology-leaking hostnames like git1, git2, git3
- predictable local DNS behavior
- observable DNS metrics
- repeatable deployment
This is exactly the kind of infrastructure improvement that does not look glamorous, but immediately makes a network feel more coherent.
Why this matters beyond DNS
This is a local LAN topic, but the underlying pattern is much larger.
A lot of IT problems are not caused by missing tools. They are caused by fuzzy boundaries between responsibilities.
In this case:
Firewall UI
+ DNS resolver
+ local authority
+ topology routing
+ manual records
had all been collapsed into one place.
The solution was not “use a cooler DNS server.”
The solution was to separate concerns:
- pfSense = firewall and resolver/cache
- CoreDNS = local authority and DNS logic
- Ansible = desired state
- Prometheus = visibility
That is the difference between something that merely works and something that can be operated.
A small note from the consulting side
This is also the kind of work I enjoy doing professionally: taking a system that has grown organically, finding the hidden structural friction, and turning it into something cleaner, reproducible, and easier to operate.
At Neoground, we usually work on larger digital systems — AI consulting, web platforms, automation, SaaS architectures, infrastructure, and strategy — but the same principle applies everywhere:
Good technology work is not just implementation. It is structural clarification.
Sometimes that means designing a digital product. Sometimes it means untangling an internal workflow. Sometimes it means fixing DNS in a segmented LAN so the whole infrastructure finally behaves like it should.
If you have a system that works, but feels increasingly manual, fragile, or hard to reason about, that is usually a sign that it has outgrown its original control plane.
That is exactly where advisory work becomes valuable.
Final result
After the migration, DNS is now:
- fast
- local
- VLAN-aware
- reproducible
- observable
- no longer trapped in the pfSense UI
The satisfying part is not only that git.myhost.lan now resolves correctly from every VLAN.
The satisfying part is that the system now matches the mental model.
And in infrastructure, that is usually when things become calm again.
This blog post was written by me with the assistance of AI (GPT 5.5 Thinking) based on my long troubleshooting sessions, notes, and numerous tries.