In the last post, we wrote a naive script to check which of the Alexa Top 1m sites were using the security.txt standard.
This script won't be used because it might take around 115 days to run. It needs to be optimized.
As a recap, the issue was that blocking network requests were being sent out one at a time.
How can we fix this? If we send out network requests in batches, we'll still block on the responses, but the execution time should drop by roughly a factor of the batch size.
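To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes two requests per domain and a 5-second timeout (both taken from the code later in this post), that every single request hits the timeout, and that a batch takes about as long as its slowest request. Under those assumptions the one-at-a-time worst case lands close to the 115-day estimate. The name worst_case_days is just for illustration.

```python
DOMAINS = 1_000_000
REQUESTS_PER_DOMAIN = 2   # /security.txt and /.well-known/security.txt
TIMEOUT_SECONDS = 5       # assumption: every request runs into the timeout
SECONDS_PER_DAY = 86_400


def worst_case_days(batch_size):
    """Estimate the worst-case run time in days for a given batch size."""
    total_requests = DOMAINS * REQUESTS_PER_DOMAIN
    # Ceiling division: a final partial batch still costs one timeout
    batches = -(-total_requests // batch_size)
    return batches * TIMEOUT_SECONDS / SECONDS_PER_DAY


print(round(worst_case_days(1), 1))   # 115.7 (one request at a time)
print(round(worst_case_days(50), 1))  # 2.3 (batches of 50)
```

Real runs should be faster than this, since most sites respond well before the timeout.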
Using grequests to send out batches of requests
grequests is a great library that will let us send out batches of requests.
grequests has a similar syntax to the requests library. However, you usually need to refactor your code as the style of usage is a bit different.
The main difference is you need to think in blocks. With requests, you can execute a request, then handle the response. With grequests, you need to build your block of requests, then execute your requests, then handle the responses.
Remember the three steps:
- Build your block of requests
- Execute your requests
- Handle the responses
import grequests
BATCH_SIZE = 50
requests = []
with open('./top-1m.csv') as f:
    for line in f:
        _, domain = line.strip().split(',')
        # step 1: Build your block of requests
        requests += build_requests(domain)
        if len(requests) >= BATCH_SIZE:
            # step 2: Execute your requests
            responses = grequests.map(requests)
            # step 3: Handle the responses
            for response in responses:
                handle_response(response)
            requests = []
# Send whatever is left over after the last full batch
for response in grequests.map(requests):
    handle_response(response)
grequests.map is the function that blocks on the network requests. It takes the list of requests and turns them into a list of responses.
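One detail worth knowing: as far as I can tell, grequests.map hands back the responses in the same order as the requests, with None in place of any request that failed (for example, one that timed out), unless you pass an exception handler that returns something else. Here's a minimal sketch of how that ordering can be used to pair each URL with its outcome. The helper pair_results is hypothetical, and the responses are simple stand-ins so the snippet runs without touching the network.

```python
from types import SimpleNamespace


def pair_results(urls, responses):
    """Pair each requested URL with its status code, or None if the request failed."""
    results = []
    for url, response in zip(urls, responses):
        status = response.status_code if response is not None else None
        results.append((url, status))
    return results


# Stand-ins for what grequests.map might return for two requests:
# the first succeeded, the second failed and came back as None.
urls = ['https://example.com/security.txt', 'https://example.org/security.txt']
fake_responses = [SimpleNamespace(status_code=200), None]

print(pair_results(urls, fake_responses))
# [('https://example.com/security.txt', 200), ('https://example.org/security.txt', None)]
```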
Building a request
Building a request has the same syntax as when you're using requests:
def build_requests(domain):
    kwargs = {'timeout': 5}
    return [
        grequests.get('https://{}/security.txt'.format(domain), **kwargs),
        grequests.get('https://{}/.well-known/security.txt'.format(domain), **kwargs)
    ]
Handling the response
The trick with handling the response is that you usually need the context of the request. We have that context via response.request.
Here's a quick sample:
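def handle_response(response):
    # grequests.map gives us None for requests that failed (e.g. timed out)
    if response is None:
        return False
    # Treat a small 200 response as a likely security.txt file
    return response.status_code == 200 and len(response.content) < 2000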
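If all we need from the request context is the domain and which of the two paths was requested, the URL is enough to recover both. Here's a minimal sketch using the standard library's urllib.parse rather than string replacement. Note that describe_url is a hypothetical helper name, not something provided by requests or grequests.

```python
from urllib.parse import urlsplit


def describe_url(url):
    """Return (domain, location) for a security.txt URL."""
    parts = urlsplit(url)
    # 'location' records which of the two paths the URL points at
    location = '.well-known' if parts.path.startswith('/.well-known/') else 'root'
    return parts.hostname, location


print(describe_url('https://example.com/.well-known/security.txt'))
# ('example.com', '.well-known')
print(describe_url('https://example.com/security.txt'))
# ('example.com', 'root')
```

It could be called with response.url inside the response handler.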
Putting it all together
Here's the final code that we'll run to check Alexa Top 1M for security.txt:
import grequests
import json

DATA_FILE = './data.json'
BATCH_SIZE = 50

def open_data():
    # Load previously saved results, or start fresh if there are none
    try:
        with open(DATA_FILE) as f:
            data = json.load(f)
    except (OSError, ValueError):
        data = {}
    return data

def save_data(data):
    with open(DATA_FILE, 'w') as f:
        json.dump(data, f)

def build_requests(domain):
    kwargs = {'timeout': 5}
    return [
        grequests.get('https://{}/security.txt'.format(domain), **kwargs),
        grequests.get('https://{}/.well-known/security.txt'.format(domain), **kwargs)
    ]

def handle_response(response, data):
    # response is None when the request failed (e.g. timed out)
    if response is not None and response.status_code == 200 and len(response.content) < 2000:
        domain = response.url.replace('https://', '').split('/')[0]
        if domain not in data:
            data[domain] = {}
        try:
            if '.well-known' in response.url:
                data[domain]['.well-known'] = response.content.decode('utf8')
            else:
                data[domain]['root'] = response.content.decode('utf8')
        except UnicodeDecodeError:
            # Skip bodies that are not valid UTF-8
            pass
        save_data(data)

current_line = 0
data = open_data()
requests = []
with open('./top-1m.csv') as f:
    for line in f:
        _, domain = line.strip().split(',')
        requests += build_requests(domain)
        if len(requests) >= BATCH_SIZE:
            responses = grequests.map(requests)
            for response in responses:
                handle_response(response, data)
            requests = []
            print('{}/1000000 ({}%)'.format(current_line, 100 * current_line / 1000000))
        current_line += 1
# Send whatever is left over after the last full batch
for response in grequests.map(requests):
    handle_response(response, data)
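One thing to watch with a loop like this is the tail end of the file: any requests that have been built but haven't yet filled a batch still need to be sent. A small generator is one way to make that hard to forget. This is only a sketch, and batches is a hypothetical helper name.

```python
from itertools import islice


def batches(items, size):
    """Yield lists of up to `size` items, including the final partial one."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


print(list(batches(range(7), 3)))
# [[0, 1, 2], [3, 4, 5], [6]]
```

Each yielded list of domains could then be turned into requests and passed to grequests.map in one go.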
Next steps
The main issue here is that we're making a couple of assumptions for the URL:
- We're assuming the site uses https
- We're assuming that the domain responds to a request to a non-www. domain
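One way to relax both assumptions would be to try every combination of scheme and www. prefix for each domain. Here's a rough sketch. The helper candidate_urls and the constants are hypothetical, and this quadruples the number of requests, so it trades run time for coverage.

```python
from itertools import product

SCHEMES = ('https', 'http')
PREFIXES = ('', 'www.')
PATHS = ('/security.txt', '/.well-known/security.txt')


def candidate_urls(domain):
    """List every scheme / prefix / path combination to try for a domain."""
    # Don't produce www.www.example.com for domains that already have the prefix
    prefixes = ('',) if domain.startswith('www.') else PREFIXES
    return ['{}://{}{}{}'.format(scheme, prefix, domain, path)
            for scheme, prefix, path in product(SCHEMES, prefixes, PATHS)]


for url in candidate_urls('example.com'):
    print(url)
```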
It might be a better idea to use domains from CommonCrawl (this would also provide a much larger list of domains).
The next step for the project is... analysis!