Using Big Data and Graph Theory to Map the WWW – CF016

This week Ashton and Christian kick off the new year by discussing plans to launch the first major cybersecurity research project on the development sandbox: mapping the internet with traceroute, distributed Nmap, and Hadoop. We discuss the strategies and design challenges that will arise early in development and the potential solutions that may be employed. We also tie the project back to the latest developments in the North Korea cyber fiasco with Sony Corporation.

Cyber Frontiers is all about Exploring Cybersecurity, Big Data, and the Technologies Shaping the Future Through an Academic Perspective! Christian Johnson, a student at the University of Maryland, will bring fresh and relevant topics to the show based on the current work he does.

Support the Average Guy Tech Scholarship Fund:

WANT TO SUBSCRIBE? We now have Video Large / Small and Video iTunes options at

You can contact us via email at or call in your questions or comments to be played on the show at (402) 478-8450

Full show notes and video at




Ideas for use of the Hadoop cluster

-Original idea: Set up a complete IPv4 scanner to ingest and analyze data. For Monday I could easily set up the distributed Nmap and load the data into HBase.

-Set up a real-time log management system, again using HBase to store multiple different log files. We could then set up one more small VM as a honeypot and track the different logs from it.

-Using a publicly available DNS dataset as training data, we could inject a bunch of malicious server names from a blacklist and try to use machine learning with Hadoop to identify malicious URLs just using the names. I have read research papers where this is done with decent success rates.

Any of the above could be accompanied by another article on cyberfrontierlabs with associated code etc. and could probably be completed in a few hours.
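For the third idea, a first step would be turning each server name into numeric features that a classifier can consume. The sketch below is only an illustration of that step, not the method from any particular paper: the feature set (length, label count, digit ratio, hyphen count, character entropy) and the function names are assumptions, and the hostnames are made up.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the characters in text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def name_features(hostname: str) -> dict:
    """Compute simple lexical features from the hostname alone (no DNS lookups)."""
    name = hostname.lower().rstrip(".")
    labels = name.split(".")
    digits = sum(ch.isdigit() for ch in name)
    return {
        "length": len(name),
        "label_count": len(labels),
        "digit_ratio": digits / len(name) if name else 0.0,
        "hyphen_count": name.count("-"),
        "entropy": shannon_entropy(name),
    }


if __name__ == "__main__":
    # Hypothetical hostnames, for illustration only.
    for host in ["example.com", "x9k2q7vz1p.example.net"]:
        print(host, name_features(host))
```

Each hostname becomes one row of numbers. Labels from the blacklist (malicious or not) would be attached to those rows before training, and whether these particular features separate the two classes well is something the experiment would have to show.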

General idea:

  0. Do a sample insertion into HBase so I understand the format

  1. Start distributed Nmap scanning, storing the information in XML format on multiple nodes
  2. Use some method to convert the XML into JSON or CSV
  3. Store the results in Hive (and thereby Hadoop)
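As one possible "method" for step 2, the sketch below flattens Nmap's XML output (as written by `nmap -oX`) into one CSV row per host and port using only the Python standard library. The sample document is a hand-trimmed, hypothetical fragment, and the column names and function names are assumptions chosen for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample shaped like `nmap -oX` output.
SAMPLE_XML = """<nmaprun>
  <host>
    <status state="up"/>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22">
        <state state="open"/>
        <service name="ssh"/>
      </port>
      <port protocol="tcp" portid="80">
        <state state="closed"/>
        <service name="http"/>
      </port>
    </ports>
  </host>
</nmaprun>"""

FIELDS = ["ip", "protocol", "port", "state", "service"]


def nmap_xml_to_rows(xml_text):
    """Return one dict per (host, port) pair found in the XML."""
    root = ET.fromstring(xml_text)
    rows = []
    for host in root.findall("host"):
        addr_el = host.find("address[@addrtype='ipv4']")
        if addr_el is None:
            continue  # skip hosts that have no IPv4 address entry
        for port in host.findall("ports/port"):
            state_el = port.find("state")
            service_el = port.find("service")
            rows.append({
                "ip": addr_el.get("addr"),
                "protocol": port.get("protocol"),
                "port": int(port.get("portid")),
                "state": state_el.get("state") if state_el is not None else "",
                "service": service_el.get("name") if service_el is not None else "",
            })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(nmap_xml_to_rows(SAMPLE_XML)))
```

A flat file like this could then be loaded into a Hive table defined over delimited text. A real scan would likely need more columns (scan time, OS guesses, hostnames) than this sketch keeps.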

Nmap: Various scans which can be used to identify hosts, services, operating systems etc.

Distributed Nmap: Can be used in a cluster to speed up network scanning with Nmap

Zenmap: The Nmap GUI, which gave us the idea for the network graph
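To make the network graph idea concrete, here is a minimal sketch of how traceroute results might be turned into a graph: each hop is a node, and consecutive hops on a path share an edge. The hop lists are hypothetical documentation addresses, the function names are assumptions, and unanswered hops (the `*` lines in real traceroute output) are assumed to have been filtered out beforehand.

```python
from collections import defaultdict, deque


def build_graph(paths):
    """Build an undirected adjacency map from traceroute-style hop lists."""
    graph = defaultdict(set)
    for hops in paths:
        # Pair each hop with the one that follows it on the same path.
        for a, b in zip(hops, hops[1:]):
            if a == b:
                continue  # ignore repeated hops
            graph[a].add(b)
            graph[b].add(a)
    return graph


def hop_distance(graph, start, goal):
    """Breadth-first search: edges on the shortest path, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Hypothetical hop lists from one vantage point to two targets.
PATHS = [
    ["10.0.0.1", "198.51.100.1", "203.0.113.5"],
    ["10.0.0.1", "198.51.100.1", "192.0.2.77"],
]

if __name__ == "__main__":
    g = build_graph(PATHS)
    print(sorted(g["198.51.100.1"]))
    print(hop_distance(g, "203.0.113.5", "192.0.2.77"))
```

An in-memory adjacency map like this only works for small samples. At internet scale the edge list would have to live in the cluster, which is where Hadoop comes in.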


Jim’s Twitter: @jcollison

Contact Christian:

Contact the show at

Find this and other great Podcasts from the Average Guy Network at

Music courtesy of Ryan King. Check out the Die Hard Cafe band and other original works at: