Finding security.txt in Alexa Top 1M (part 3)

February 17, 2019

In past posts in this series, we wrote a naive script to check the Alexa Top 1M sites for use of the security.txt standard, and then optimized that script so it could run in a practical amount of time.

Now that the script has run, it's time to analyze the results.

Analyzing the results

A first look shows that 18403 domains host security.txt. That's about 1.8% of the one million sites scanned; not bad.

But let's dive a bit deeper into the data...

Tumblr domains

The first thing to point out is that there are a number of *.tumblr.com domains, which all serve Tumblr's security.txt.

If we eliminate all of the *.tumblr.com domains, 11014 remain.
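A minimal sketch of that filtering step follows. It assumes the scan results are held in a dict mapping each domain to its response body, and that Tumblr-hosted sites can be recognised by a `.tumblr.com` suffix; the names `is_tumblr_domain`, `drop_tumblr` and `sample` are hypothetical and not taken from the original script.

```python
def is_tumblr_domain(domain, suffix=".tumblr.com"):
    """Return True for subdomains of tumblr.com (the suffix is an assumption)."""
    # Normalise case and a possible trailing dot before comparing.
    return domain.lower().rstrip(".").endswith(suffix)


def drop_tumblr(results):
    """Keep only the entries whose domain is not a Tumblr subdomain."""
    return {d: body for d, body in results.items() if not is_tumblr_domain(d)}


# Hypothetical sample data: domain -> response body.
sample = {
    "example.tumblr.com": "Contact: mailto:security@example.com\n",
    "example.org": "Contact: mailto:security@example.org\n",
}
print(sorted(drop_tumblr(sample)))
```

Matching on the suffix with its leading dot avoids dropping unrelated domains whose names merely end in "tumblr.com".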

Soft 404s

Next, it seems that many of the hits are false positives: they are actually soft 404 pages. A soft 404 is when a web server responds with a 200 status code but sends a "we couldn't find that" page. When we scanned for security.txt documents, we kept every status 200 response with a short body, so these pages slipped through.

How can we find the soft 404s? Well, security.txt shouldn't include any HTML. Let's check for HTML using this function:

from bs4 import BeautifulSoup

def is_html(content):
    # find() returns the first tag the parser sees, or None for plain text
    return bool(BeautifulSoup(content, "html.parser").find())

After removing the soft 404s, only 4201 domains remain.
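For readers who would rather not install BeautifulSoup, here is a rough standard-library variant of the same idea, applied to a set of results. It is a sketch and not the check used above: `looks_like_html` and `drop_soft_404s` are hypothetical names, and the results are again assumed to be a dict mapping each domain to its response body.

```python
from html.parser import HTMLParser


class _TagDetector(HTMLParser):
    """Remembers whether the parser saw at least one start tag."""

    def __init__(self):
        super().__init__()
        self.found_tag = False

    def handle_starttag(self, tag, attrs):
        self.found_tag = True


def looks_like_html(content):
    """Return True if the text contains anything that parses as an HTML tag."""
    detector = _TagDetector()
    detector.feed(content)
    detector.close()
    return detector.found_tag


def drop_soft_404s(results):
    """Keep only the responses that contain no HTML tags."""
    return {d: body for d, body in results.items() if not looks_like_html(body)}


# Hypothetical sample data: domain -> response body.
sample = {
    "soft404.example": "<html><body>Page not found</body></html>",
    "good.example": "Contact: mailto:security@good.example\n",
}
print(sorted(drop_soft_404s(sample)))
```

Like the BeautifulSoup version, this treats any tag at all as a sign of HTML, so it errs on the side of discarding a response.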

Empty responses

Many of the responses have no content; let's clear those out.

After removing the empty responses, only 2617 domains remain.
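Dropping the empty responses could look something like the sketch below. Treating whitespace-only bodies as empty is an assumption on our part, and `drop_empty` and `sample` are hypothetical names; the results are assumed to be a dict mapping each domain to its response body.

```python
def drop_empty(results):
    """Keep only the responses whose body has some non-whitespace content."""
    return {d: body for d, body in results.items() if body and body.strip()}


# Hypothetical sample data: domain -> response body.
sample = {
    "empty.example": "",
    "blank.example": " \r\n",
    "good.example": "Contact: mailto:security@good.example\n",
}
print(sorted(drop_empty(sample)))
```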

Parsing out the remainder

Now, there still seem to be some garbage responses. For instance, some of the responses are soft 404s that do not include HTML (e.g. a plain-text response that just says "not found").

Let's try the opposite approach and look for what we would expect to see in a security.txt file. The valid responses found so far all contain a colon (:), and most likely a line break as well (\r\n, \r or \n). Let's keep only the responses that match that format.
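One way to express that check is sketched below. It takes the strict reading and requires both a colon and a line break; `looks_like_security_txt` and `LINE_BREAKS` are hypothetical names and not taken from the original script.

```python
# The line-break styles mentioned above. "\r\n" is already covered by the
# other two, but it is listed here to mirror the text.
LINE_BREAKS = ("\r\n", "\r", "\n")


def looks_like_security_txt(content):
    """Return True if the text has a colon and at least one line break."""
    has_colon = ":" in content
    has_line_break = any(nl in content for nl in LINE_BREAKS)
    return has_colon and has_line_break


print(looks_like_security_txt("Contact: mailto:security@example.org\n"))
print(looks_like_security_txt("not found"))
```

This is deliberately loose: any text with a colon and a newline passes, so some false positives are expected to survive it.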

At this point, we only have 1612 domains left (around 0.16% of the total).

Of those domains, it seems there are still some false positives. In a subsequent post, we'll look through the remaining data.