Extracting domain names from proxy logs with Python’s ‘urlparse’

During a malware investigation, it helps to be able to extract the domain portion of each URL in a web proxy log in order to identify communications between a compromised host and an external botnet command and control server. This assumes you know the URL being used for outbound communication, and that you either route all outbound HTTP traffic through a proxy under your organization’s control or otherwise have access to the logs from the proxy server.

There are plenty of complicated regular expressions for parsing URLs and extracting domains, but Python provides a much more elegant way to do this with its ‘urlparse’ module (found under ‘urllib.parse’ in Python 3). The following sample code takes a single proxy log file as input and extracts only the domain portion of each URL for further analysis.

[usage: python parseurl.py logfilename]

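#!/usr/bin/python
import re
import sys

try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3

# Read the proxy log named on the command line, one line at a time.
with open(sys.argv[1], "r") as f:
    for line in f:
        # Collect every http:// or https:// URL that appears on the line.
        urls = re.findall(r'(https?://\S+)', line)
        if urls:
            # Print only the host portion of the first URL found.
            print(urlparse(urls[0]).hostname)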

You can carry out further log reduction by piping the results through ‘sort’ and then ‘uniq’ (‘uniq’ on its own only collapses adjacent duplicate lines).

$ python parseurl.py proxylog-zeus-10.1.1.1-2011.02.16_15.54.csv

bits.wikimedia.org
upload.wikimedia.org
geoiplookup.wikimedia.org
en.wikipedia.org
bits.wikimedia.org
en.wikipedia.org
ad.doubleclick.net
s0.2mdn.net
ad.doubleclick.net
s0.2mdn.net
tools.google.com
library.municode.com
tools.google.com
15february.adina-blog.co.cc
freephoenixbirdspace.com
www.adb.cba.pl

[SNIP]

The last three domains look like they need further investigation.
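The log reduction mentioned earlier can also be done in Python rather than in the shell. The following is a minimal sketch, not part of the original script: it assumes the hostnames have already been collected into a list, and the names `count_hostnames` and `sample` are made up for illustration. It uses `collections.Counter` from the standard library to tally how often each hostname appears.

```python
from collections import Counter


def count_hostnames(hostnames):
    """Tally each hostname, skipping empty or None entries."""
    return Counter(h for h in hostnames if h)


# Illustrative input taken from the first few lines of the output above.
sample = [
    "bits.wikimedia.org",
    "upload.wikimedia.org",
    "geoiplookup.wikimedia.org",
    "en.wikipedia.org",
    "bits.wikimedia.org",
    "en.wikipedia.org",
    None,  # urlparse(...).hostname is None when no host can be parsed
]

# List hostnames from most to least frequent.
for host, count in count_hostnames(sample).most_common():
    print("%d %s" % (count, host))
```

Rarely seen hostnames end up at the bottom of the list, which is often where the interesting entries are in this kind of investigation.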

  1. N1XY
    May 11, 2011 at 3:26 am

    That’s a pretty nifty little Python script. It would be cool to expand on it so it can parse out directories and the eventual end file on the server, taking into consideration that command and control systems are evolving to evade us forensic guys.

    I’m not sure how you feel about this – the use of compromised servers is becoming quite common in C&C, not to mention the use of public sites (Twitter etc.). The APIs that public messaging and networking sites provide are quite extensive; imagine a bot that checks for commands via Twitter/Facebook and gets all its updates from RapidShare… or even Google Code! Crazy? I think not!

    Interesting little script that inspires me to go learn some Python. Thanks for the tutorial and for getting me thinking!

    Greetz

  2. May 16, 2011 at 5:24 pm

    Thanks for the comment. Very little time or real world need to expand on the script. It was something I needed to fulfill a particular purpose and I just left it at that.

  3. August 23, 2011 at 3:26 am

    Take care: you are getting the hostnames, not the domains.

    How are you going to tell google.com apart from google.co.uk?

    Or zeus.cn.cc from zeus.cc?

    This is a good challenge 😉
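
The hostname-versus-domain point raised in the comment above can be illustrated with a rough sketch. It is not part of the original script. It assumes a tiny hand-picked set of two-label suffixes: `KNOWN_SUFFIXES` is hypothetical and deliberately incomplete, and whether each entry is really a registry-style suffix is itself an assumption made for the example. A real tool would consult the full Public Suffix List instead. The helper name `registered_domain` is also made up here.

```python
# Hypothetical, deliberately incomplete suffix set, for illustration only.
# A real tool would load the full Public Suffix List instead.
KNOWN_SUFFIXES = {"co.uk", "co.cc", "cn.cc"}


def registered_domain(hostname):
    """Return the matching suffix plus one label to its left.

    Falls back to the (lower-cased) hostname when it has too few labels.
    """
    labels = hostname.lower().rstrip(".").split(".")
    # Try a two-label suffix from KNOWN_SUFFIXES first, then any one-label suffix.
    for suffix_len in (2, 1):
        if len(labels) > suffix_len:
            suffix = ".".join(labels[-suffix_len:])
            if suffix_len == 1 or suffix in KNOWN_SUFFIXES:
                return ".".join(labels[-(suffix_len + 1):])
    return ".".join(labels)


for host in ("www.google.com", "www.google.co.uk", "zeus.cn.cc", "zeus.cc"):
    print("%s -> %s" % (host, registered_domain(host)))
```

With this suffix set, `www.google.co.uk` reduces to `google.co.uk` while `www.google.com` reduces to `google.com`, so the two are no longer conflated by simply keeping the last two labels.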
