Using Big Data and Graph Theory to Map the WWW – CF016

Posted on Wed January 7, 2015 (updated Sat November 22, 2025) by Jim Collison

This week Ashton and Christian kick off the new year by discussing plans to launch the first major cybersecurity research project on the development sandbox: mapping the internet with traceroute, distributed Nmap, and Hadoop. We discuss the strategies and design challenges that will come up early in development and the potential solutions that may be employed. We also tie the project back to the latest developments in the North Korea cyber fiasco with Sony Corporation.

Cyber Frontiers is all about Exploring Cybersecurity, Big Data, and the Technologies Shaping the Future Through an Academic Perspective! Christian Johnson, a student at the University of Maryland, will bring fresh and relevant topics to the show based on the current work he does.

Support the Average Guy Tech Scholarship Fund: https://www.patreon.com/theaverageguy

WANT TO SUBSCRIBE? We now have Video Large / Small and Video iTunes options at http://theAverageGuy.tv/subscribe

You can contact us via email at jim@theaverageguy.tv or call in your questions or comments to be played on the show at (402) 478-8450

Full show notes and video at http://theAverageGuy.tv/cf016


Ideas for use of the Hadoop cluster

-Original idea: Set up a complete IPv4 scanner to ingest and analyze data. For Monday I could easily set up distributed Nmap and load the results into HBase.

-Set up a real-time log management system, again using HBase to store multiple different log files. We could then set up one more small VM as a honeypot and track different logs from it.

-Using a publicly available DNS dataset as training data, we could inject a set of malicious server names from a blacklist and try to use machine learning with Hadoop to identify malicious URLs from the names alone. I have read research papers where this is done with decent success rates.

Any of the above could be accompanied by another article on cyberfrontierlabs with associated code etc. and could probably be completed in a few hours.
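
As a rough illustration of the third idea (flagging server names from their text alone), here is a minimal Python sketch. Everything in it is an assumption for illustration: the training names are made up, and the choice of character bigrams with a small naive Bayes classifier is just one simple approach, not necessarily what the research papers use. A real run would train on the DNS dataset and blacklist, and on the cluster, not in a single script.

```python
import math
from collections import Counter


def bigrams(name):
    """Character bigrams of a host name, with start/end markers."""
    padded = "^" + name.lower() + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class NaiveBayesNames:
    """Tiny multinomial naive Bayes over character bigrams."""

    def __init__(self):
        self.counts = {}        # label -> Counter of bigram counts
        self.totals = {}        # label -> total number of bigrams seen
        self.docs = Counter()   # label -> number of training names
        self.vocab = set()      # every bigram seen in training

    def train(self, name, label):
        grams = bigrams(name)
        self.counts.setdefault(label, Counter()).update(grams)
        self.totals[label] = self.totals.get(label, 0) + len(grams)
        self.docs[label] += 1
        self.vocab.update(grams)

    def score(self, name, label):
        # log prior plus log likelihoods, with add-one (Laplace) smoothing
        logp = math.log(self.docs[label] / sum(self.docs.values()))
        v = len(self.vocab)
        for gram in bigrams(name):
            logp += math.log((self.counts[label][gram] + 1) / (self.totals[label] + v))
        return logp

    def predict(self, name):
        return max(self.docs, key=lambda label: self.score(name, label))


# Made-up training data (hypothetical names, not from any real blacklist).
BENIGN = ["mail.example.com", "www.example.org", "news.example.net"]
MALICIOUS = ["xq7zk9v2.example.biz", "k3j8x0qz.example.info", "z9q2xk7v.example.biz"]


def build_demo_classifier():
    clf = NaiveBayesNames()
    for name in BENIGN:
        clf.train(name, "benign")
    for name in MALICIOUS:
        clf.train(name, "malicious")
    return clf


if __name__ == "__main__":
    clf = build_demo_classifier()
    for name in ["www.example.com", "x7kq9z2v.example.biz"]:
        print(name, clf.predict(name))
```

With only six training names this proves nothing about accuracy; it only shows the shape of the pipeline: turn each name into features, count them per label, and score new names.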

General idea:

(0) Sample insertion into HBase so I understand the format
(1) Start distributed Nmap scanning, storing the information in XML format on multiple nodes
(2) Use some method to convert the XML into JSON or CSV
(3) Store the results in Hive (and thereby Hadoop)
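
The XML-to-CSV conversion step could look something like the Python sketch below. It is a minimal example under stated assumptions: the sample document is hand-written and heavily trimmed, the function names and the choice of columns are hypothetical, and the element and attribute names are the ones Nmap's XML output (the -oX option) normally uses. Loading the resulting CSV into Hive is not shown.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Columns chosen for illustration; a real table might keep more fields.
FIELDS = ["address", "protocol", "port", "state", "service"]


def nmap_xml_to_rows(xml_text):
    """Yield one dict per scanned port from Nmap XML output."""
    root = ET.fromstring(xml_text)
    for host in root.iter("host"):
        addr_el = host.find("address")
        address = addr_el.get("addr", "") if addr_el is not None else ""
        for port in host.iter("port"):
            state_el = port.find("state")
            service_el = port.find("service")
            yield {
                "address": address,
                "protocol": port.get("protocol", ""),
                "port": port.get("portid", ""),
                "state": state_el.get("state", "") if state_el is not None else "",
                "service": service_el.get("name", "") if service_el is not None else "",
            }


def rows_to_csv(rows):
    """Serialize the row dicts to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hand-written sample in the shape of Nmap XML (documentation-range address).
SAMPLE = """<nmaprun>
  <host>
    <status state="up"/>
    <address addr="192.0.2.10" addrtype="ipv4"/>
    <ports>
      <port protocol="tcp" portid="22"><state state="open"/><service name="ssh"/></port>
      <port protocol="tcp" portid="80"><state state="open"/><service name="http"/></port>
    </ports>
  </host>
</nmaprun>"""

if __name__ == "__main__":
    print(rows_to_csv(nmap_xml_to_rows(SAMPLE)))
```

One row per port keeps the file flat, which should make it easy to load as a delimited table; hosts with no port results simply produce no rows in this sketch.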

—

Nmap: Various scans that can be used to identify hosts, services, operating systems, etc.

http://nmap.org/

Distributed Nmap (dnmap): Can be used in a cluster to speed up network scanning with Nmap

http://dnmap.sourceforge.net/

Zenmap: The Nmap GUI, which gave us the idea for the network graph

http://nmap.org/zenmap/
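
To make the network graph idea a little more concrete, here is a small Python sketch of how traceroute results might be turned into a graph. It assumes each trace has already been parsed into a list of hop addresses, with None for hops that did not answer; the function names and sample paths are hypothetical, and the addresses come from the documentation ranges rather than real scans.

```python
from collections import defaultdict


def build_graph(paths):
    """Build undirected adjacency sets from lists of hop addresses.

    Only consecutive hops that both answered are linked, so a silent
    hop (None) breaks the chain instead of creating a guessed edge.
    """
    graph = defaultdict(set)
    for hops in paths:
        for a, b in zip(hops, hops[1:]):
            if a is None or b is None or a == b:
                continue
            graph[a].add(b)
            graph[b].add(a)
    return graph


def by_degree(graph):
    """Nodes sorted by number of neighbors, highest first."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))


# Made-up traces sharing the same first two hops.
SAMPLE_PATHS = [
    ["192.0.2.1", "198.51.100.1", "203.0.113.5"],
    ["192.0.2.1", "198.51.100.1", None, "203.0.113.9"],
    ["192.0.2.1", "198.51.100.1", "203.0.113.7"],
]

if __name__ == "__main__":
    graph = build_graph(SAMPLE_PATHS)
    for node in by_degree(graph):
        print(node, sorted(graph[node]))
```

Sorting by degree is the simplest graph-theory question to ask of such a map: nodes with many neighbors are the routers that many paths pass through. At the scale discussed on the show, the same edge-building step would presumably run as a job on the cluster, not in memory.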

Jim’s Twitter: https://x.com/jcollison

Contact Christian: christian@theaverageguy.tv

Contact the show at jim@theaverageguy.tv

Find this and other great Podcasts from the Average Guy Network at http://theaverageguy.tv

Music courtesy of Ryan King. Check out the Die Hard Cafe band and other original works at:
http://diehardcafe.bandcamp.com/ / http://cokehabitgo.tumblr.com/tagged/my-music