I can empathize with the 500-mile email problem

Big thanks to Petra on Mastodon for this one.

Don’t discount us non-techs. When we come to you, we’ve often done a lot of testing before sounding the alarm. And often we are right.

Just like I was right with Vox when I could only get a compose window on select days over three months, or when Facebook forced an anti-malware program on us when we didn’t have malware. This one, however, takes the cake, and I have to agree with Petra on her assessment that it’s her favourite ‘non technical user with the absurd bug is right’ tale.

The author of the story, Trey Harris, has given me permission to republish it here. (Here’s the original.) I have only made slight alterations for style (e.g. using the right ellipses and dashes that weren’t available to him, and italicizing where he might have had all caps) but I have not changed any of Trey’s words. Please note the link to the FAQ at the end.

From: Trey Harris <[email protected]>
 
Here’s a problem that sounded impossible … I almost regret posting the story to a wide audience, because it makes a great tale over drinks at a conference. 🙂 The story is slightly altered in order to protect the guilty, elide over irrelevant and boring details, and generally make the whole thing more entertaining.

I was working in a job running the campus email system some years ago when I got a call from the chairman of the statistics department.

‘We’re having a problem sending email out of the department.’

‘What’s the problem?’ I asked.

‘We can’t send mail more than 500 miles,’ the chairman explained.

I choked on my latte. ‘Come again?’

‘We can’t send mail farther than 500 miles from here,’ he repeated. ‘A little bit more, actually. Call it 520 miles. But no farther.’

‘Um … Email really doesn’t work that way, generally,’ I said, trying to keep panic out of my voice. One doesn’t display panic when speaking to a department chairman, even of a relatively impoverished department like statistics. ‘What makes you think you can’t send mail more than 500 miles?’

‘It’s not what I think,’ the chairman replied testily. ‘You see, when we first noticed this happening, a few days ago …’

‘You waited a few days?’ I interrupted, a tremor tinging my voice. ‘And you couldn’t send email this whole time?’

‘We could send email. Just not more than …’

‘… Five hundred miles, yes,’ I finished for him, ‘I got that. But why didn’t you call earlier?’

‘Well, we hadn’t collected enough data to be sure of what was going on until just now.’ Right. This is the chairman of statistics. ‘Anyway, I asked one of the geostatisticians to look into it …’

‘Geostatisticians …’

‘… Yes, and she’s produced a map showing the radius within which we can send email to be slightly more than 500 miles. There are a number of destinations within that radius that we can’t reach, either, or reach sporadically, but we can never email farther than this radius.’

‘I see,’ I said, and put my head in my hands. ‘When did this start? A few days ago, you said, but did anything change in your systems at that time?’

‘Well, the consultant came in and patched our server and rebooted it. But I called him, and he said he didn’t touch the mail system.’

‘OK, let me take a look, and I’ll call you back,’ I said, scarcely believing that I was playing along. It wasn’t April Fool’s Day. I tried to remember if someone owed me a practical joke.

I logged into their department’s server, and sent a few test mails. This was in the Research Triangle of North Carolina, and a test mail to my own account was delivered without a hitch. Ditto for one sent to Richmond, and Atlanta, and Washington. Another to Princeton (400 miles) worked.

But then I tried to send an email to Memphis (600 miles). It failed. Boston, failed. Detroit, failed. I got out my address book and started trying to narrow this down. New York (420 miles) worked, but Providence (580 miles) failed.

I was beginning to wonder if I had lost my sanity. I tried emailing a friend who lived in North Carolina, but whose ISP was in Seattle. Thankfully, it failed. If the problem had had to do with the geography of the human recipient and not his mail server, I think I would have broken down in tears.

Having established that—unbelievably—the problem as reported was true, and repeatable, I took a look at the sendmail.cf file. It looked fairly normal. In fact, it looked familiar.

I diffed it against the sendmail.cf in my home directory. It hadn’t been altered—it was a sendmail.cf I had written. And I was fairly certain I hadn’t enabled the ‘FAIL_MAIL_OVER_500_MILES’ option. At a loss, I telnetted into the SMTP port. The server happily responded with a SunOS sendmail banner.

Wait a minute … a SunOS sendmail banner? At the time, Sun was still shipping Sendmail 5 with its operating system, even though Sendmail 8 was fairly mature. Being a good system administrator, I had standardized on Sendmail 8. And also being a good system administrator, I had written a sendmail.cf that used the nice long self-documenting option and variable names available in Sendmail 8 rather than the cryptic punctuation-mark codes that had been used in Sendmail 5.

The pieces fell into place, all at once, and I again choked on the dregs of my now-cold latte. When the consultant had ‘patched the server,’ he had apparently upgraded the version of SunOS, and in so doing downgraded Sendmail. The upgrade helpfully left the sendmail.cf alone, even though it was now the wrong version.

It so happens that Sendmail 5—at least, the version that Sun shipped, which had some tweaks—could deal with the Sendmail 8 sendmail.cf, as most of the rules had at that point remained unaltered. But the new long configuration options—those it saw as junk, and skipped. And the sendmail binary had no defaults compiled in for most of these, so, finding no suitable settings in the sendmail.cf file, they were set to zero.

One of the settings that was set to zero was the timeout to connect to the remote SMTP server. Some experimentation established that on this particular machine with its typical load, a zero timeout would abort a connect call in slightly over three milliseconds.

An odd feature of our campus network at the time was that it was 100% switched. An outgoing packet wouldn’t incur a router delay until hitting the POP and reaching a router on the far side. So time to connect to a lightly loaded remote host on a nearby network would actually largely be governed by the speed of light distance to the destination rather than by incidental router delays.

Feeling slightly giddy, I typed into my shell:
 
$ units
1311 units, 63 prefixes

You have: 3 millilightseconds
You want: miles
        * 558.84719
        / 0.0017893979
 
‘Five hundred miles, or a little bit more.’
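For anyone without units to hand, here is a minimal sketch of the same arithmetic in Python. It follows the story's own simplification: one-way distance at the speed of light in a vacuum, with the timeout rounded to exactly three milliseconds. The function names are mine, for illustration only, and the city distances are the ones quoted in the story.

```python
# Rough re-creation of the `units` conversion from the story.
# Assumption: the timeout is taken as exactly 3 ms (the story says
# "slightly over three milliseconds").

SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the metre
METRES_PER_MILE = 1609.344            # international mile, exact

TIMEOUT_SECONDS = 0.003


def light_travel_miles(seconds: float) -> float:
    """Distance in miles that light covers in a vacuum in `seconds`."""
    return seconds * SPEED_OF_LIGHT_M_PER_S / METRES_PER_MILE


def reachable(distance_miles: float, timeout: float = TIMEOUT_SECONDS) -> bool:
    """True if the destination lies inside the radius light covers in `timeout`."""
    return distance_miles <= light_travel_miles(timeout)


radius = light_travel_miles(TIMEOUT_SECONDS)
print(f"Radius: {radius:.5f} miles")

# Distances as quoted in the story.
story_cities = {"Princeton": 400, "New York": 420, "Providence": 580, "Memphis": 600}
for city, miles in story_cities.items():
    verdict = "works" if reachable(miles) else "fails"
    print(f"{city} ({miles} miles): {verdict}")
```

The radius comes out at roughly 558.85 miles, and the four cities land on the same side of the line as they did in Trey's tests.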

For those wanting to nitpick, Trey has written this FAQ with his answers.

I’m grateful for people like Trey, who actually investigate, even when what we say sounds totally implausible.


6 thoughts on “I can empathize with the 500-mile email problem”

  1. Perfect. I once got an admission (that I was, in fact, right) and an apology from IT. I’d theorized that a problem I was having was being caused by McAfee. About six months later… they determined that I was right. It didn’t seem to make any logical sense, so we had to do all kinds of convoluted troubleshooting and monitoring. (It had to do with randomly dropping connection to a remote server.) I think they thought I just hated McAfee and liked blaming it for things, by that time. They weren’t WRONG about that, but that wasn’t the case – that time.

  2. They really need to listen to us! We do troubleshoot first before alerting IT professionals, and what we uncover tends to be reasonably well researched …

  3. So true. And because we’re affected, we’re often persistent and determined to get to the root cause – we just can’t fix it ourselves. I’m STILL mad at Ashton-Tate over the MultiMate Advantage pointer bug they knew about but didn’t deem important enough to bother fixing. That was 30 years ago.

  4. Some of those software companies are pretty useless. And they never inform their tech support people, so all those folks have are copy-and-paste texts that do not address the problem. I mean, look at Bing right now. And I can think of countless others, as you can, too!

  5. Still no updates on Bing?

    It’s weird – lots of results, but they’re often completely useless or not really on point related to the search terms.

  6. I might do a post about it. No real updates, other than they are desperate to be the Wayback Machine.

    They’ve managed to add some of our pages to the index! Only thing is, they are from 1998 and 1999. (This is what happened in the interim.)
