Finding similar user behavior in DNS using Apache Mahout

This topic sounds a bit scary and, actually, it is. DNS data from a BIND cache or a pcap file lets you track distinct users, no matter whether they use different IP addresses or safe browsing features. (And with pcap traces as the data source, it doesn't even matter if they use different DNS caches.)

How does this work? Think of DNS queries as vectors. Every source address is a vector whose dimension is the number of all observed DNS names. Every time an IP address asks for a domain name, this gives you a boolean preference from that address to that name. In the end, the result is a data matrix made up of many sparse vectors.


X: distinct src IP addresses
Y: distinct query names
Similar vertical vectors == potential similar users
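
A minimal sketch of this boolean model in Python, independent of Mahout. The function name, the input format (pairs of source address and query name) and the sample addresses are assumptions made for illustration; a set of names per address stands in for one sparse column of the matrix.

```python
from collections import defaultdict


def build_boolean_model(queries):
    """Map each source address to the set of names it asked for.

    `queries` is an iterable of (src_ip, qname) pairs. A name is either
    in the set (preference = true) or not, which is the boolean model.
    """
    model = defaultdict(set)
    for src_ip, qname in queries:
        # DNS names are case-insensitive; drop the trailing root dot.
        model[src_ip].add(qname.lower().rstrip("."))
    return dict(model)


# Hypothetical sample data, not taken from the real traces.
sample = [
    ("10.0.0.1", "www.example.com."),
    ("10.0.0.1", "mail.example.org."),
    ("10.0.0.2", "WWW.example.com."),
]
model = build_boolean_model(sample)
```

Storing only the names that were actually queried keeps the memory footprint proportional to the number of observed (address, name) pairs rather than to the full matrix.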

Now that we have a proper data model, we can start analyzing it with Apache Mahout, a scalable machine learning library. For the DNS analysis I used its recommender subsystem.

First, I started with a LogLikelihoodSimilarity on this simple boolean model. On several test data sets this approach found similarity patterns. However, a detailed analysis of the results shows some weaknesses. For example, I found several different addresses with similar user behavior that were in reality just different Debian machines. And the test case, one PC surfing with two different IP addresses, was not detected.

INFO: Read lines: 311713 (queries)
INFO: Processed 250 users (ip addresses)
neighborhood for 1: [(90:0.9946),(81:0.9946),]
neighborhood for 3: [(33:0.9949),]
neighborhood for 4: [(99:0.9939),]
neighborhood for 5: [(36:0.9988),(6:0.9933),]
neighborhood for 7: [(10:0.9995),(6:0.9999),(43:0.9959),]
neighborhood for 20: [(95:0.9977),(93:0.9956),(68:0.9976),]
neighborhood for 25: [(73:0.9989),(201:0.9989),(104:0.9987),(229:0.9989),(245:0.9989),]
neighborhood for 47: [(202:0.9949),(179:0.9948),]
neighborhood for 50: [(6:0.9981),]
neighborhood for 52: [(195:0.9946),]
neighborhood for 58: [(106:0.9937),(173:0.994),(87:0.9954),(101:0.9946),(191:0.9954),]
neighborhood for 64: [(65:0.9941),]
neighborhood for 76: [(89:0.9957),]
neighborhood for 84: [(121:0.9972),]
neighborhood for 102: [(108:0.9962),]
neighborhood for 140: [(121:0.9934),]
neighborhood for 175: [(184:0.9937),]
neighborhood for 178: [(188:0.9944),]
neighborhood for 205: [(68:0.9935),]
neighborhood for 206: [(221:0.9936),(224:0.9947),(225:0.9931),(243:0.9932),(227:0.9939),
                       (215:0.9939),(241:0.9962),(231:0.9935),]
neighborhood for 211: [(231:0.9936),(228:0.994),(219:0.9938),]
neighborhood for 237: [(228:0.9946),(219:0.9948),]
Found 22 similar user patterns within 3.2s
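
To give an idea of what a log-likelihood similarity computes on boolean data, here is a small Python sketch based on Dunning's log-likelihood ratio over a 2x2 contingency table. It is not the Mahout code; the mapping of the ratio to a 0..1 score via 1 - 1/(1 + LLR) is an assumption about how such a score can be normalized, and all function names are made up for this example.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    # Unnormalized entropy: N*log(N) - sum(k*log(k)).
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 table of counts."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(names_a, names_b, total_names):
    """Similarity of two addresses given the sets of names they queried.

    total_names is the number of distinct names in the whole data set.
    """
    both = len(names_a & names_b)
    only_a = len(names_a - names_b)
    only_b = len(names_b - names_a)
    neither = total_names - both - only_a - only_b
    llr = log_likelihood_ratio(both, only_a, only_b, neither)
    # Assumed normalization: squash the ratio into the range [0, 1).
    return 1.0 - 1.0 / (1.0 + llr)
```

Because the score only looks at which names overlap, machines that query the same update and mirror hosts end up very close to each other, which fits the Debian observation above.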

Of course, this data model is probably too simple. So I used a more advanced weighting known from text vectorization, TF-IDF (term frequency - inverse document frequency). This means: count the queries for a name per IP address (TF) and divide that count by the number of distinct IP addresses which asked for this name (IDF).

This will result in a matrix like the one above, but with calculated preferences for each query per address. The TF-IDF algorithm prefers "special" names and expresses a smaller preference for names of major web search engines.
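
The weighting described above can be sketched in a few lines of Python. This follows the simple description in the text (count divided by the number of distinct addresses); textbook TF-IDF usually takes a logarithm of the inverse document frequency, which is left out here on purpose. The function name and the sample names are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_preferences(queries):
    """Turn (src_ip, qname) pairs into a weighted preference per address.

    Returns {src_ip: {qname: weight}} with
    weight = queries for qname from src_ip / distinct addresses asking qname.
    """
    tf = defaultdict(Counter)  # src_ip -> qname -> number of queries
    for src_ip, qname in queries:
        tf[src_ip][qname] += 1

    df = Counter()  # qname -> number of distinct addresses asking for it
    for names in tf.values():
        df.update(names.keys())

    return {
        ip: {name: count / df[name] for name, count in names.items()}
        for ip, names in tf.items()
    }
```

A name that everybody asks for gets a large divisor and therefore a small weight, while a name asked repeatedly by a single address keeps its full count.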

Next, I tried the PearsonCorrelationSimilarity with the TF-IDF data. The result surprised even me! This time, the model detected similar user behavior across three different IP addresses. It was the same user!
But the test case was surfing with one computer using two different addresses, not three. So why did the model find three different addresses? I had forgotten that my mobile phone was on the wireless LAN. (My mobile checked the same three mail accounts and some similar news websites.)

INFO: Read lines: 454 (distinct names)
INFO: Processed 7 users (ip addresses)
neighborhood for 1: [(5:1.0),(7:1.0),]
Found 1 similar user pattern within 224ms
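
For reference, a Pearson correlation between two addresses can be sketched like this in Python. The sketch compares only the names that both addresses asked for and returns NaN when the correlation is undefined; both choices are assumptions for this example and not a statement about Mahout's internals.

```python
import math


def pearson_similarity(prefs_a, prefs_b):
    """Pearson correlation of two {qname: weight} preference dicts."""
    common = sorted(set(prefs_a) & set(prefs_b))
    n = len(common)
    if n < 2:
        return float("nan")  # not enough overlap to correlate

    xs = [prefs_a[name] for name in common]
    ys = [prefs_b[name] for name in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # constant preferences, correlation undefined
    return cov / math.sqrt(var_x * var_y)
```

A score of 1.0, as in the output above, means the weights of the shared names rise and fall together perfectly, even if one address issues more queries overall.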

Oh, I forgot to mention that after the data model was initially precalculated, I managed to correlate DNS events (all DNS queries from one address within 2 min) in real time! For this approach I used the Mahout PlusAnonymousUserDataModel.

By the way, one neat feature of the system is the recommender part. Not only can you detect similar users, you can also recommend other DNS queries to similar users. Perhaps we should extend the DNS protocol with a "recommended section" :-):
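
One possible way to cut a query stream into such DNS events is sketched below in Python. It assumes that an event starts with the first query of an address and collects everything that address asks within the following 120 seconds; the input format and the function name are invented for this example. Each finished event could then be compared against the precalculated model like a temporary, anonymous user.

```python
def group_events(queries, window=120.0):
    """Group time-ordered (timestamp, src_ip, qname) tuples into events.

    Returns a list of (src_ip, start_time, set_of_names). A new event is
    opened for an address once its current event is older than `window`
    seconds.
    """
    open_events = {}  # src_ip -> (start_time, names)
    finished = []
    for ts, src_ip, qname in queries:
        current = open_events.get(src_ip)
        if current is not None and ts - current[0] > window:
            # The window of the running event has expired: close it.
            finished.append((src_ip, current[0], current[1]))
            current = None
        if current is None:
            current = (ts, set())
            open_events[src_ip] = current
        current[1].add(qname)
    # Flush events that are still open at the end of the stream.
    finished.extend(
        (ip, start, names) for ip, (start, names) in open_events.items()
    )
    return finished
```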

$ dig www.topic1.com

; <<>> DiG 9.7.3 <<>> www.topic1.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48087
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 2, ADDITIONAL: 0, RECOMMENDED: 3

;; QUESTION SECTION:
;www.topic1.com.                 IN      A

;; ANSWER SECTION:
....

;; AUTHORITY SECTION:
....

;;RECOMMENDED SECTION:
topic2.com.			IN	A ....
topic6.net.			IN	A ....
a-topic-you-may-interested.com.	IN	A ....

;; Query time: 163 msec
;; SERVER: ::1#53
;; WHEN: Sat Jul  9 12:25:04 2011
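
Joking aside, the recommender idea itself is easy to sketch in Python: names queried by similar addresses, but not yet by the address in question, are ranked by the summed similarity of the neighbours that asked for them. The scoring rule, the function name and the data layout are assumptions for this example, not the Mahout recommender.

```python
def recommend_names(target, neighbours, model, how_many=3):
    """Recommend names for `target` based on its neighbourhood.

    model:      {address: set of queried names}
    neighbours: list of (address, similarity) pairs for `target`
    """
    seen = model[target]
    scores = {}
    for address, similarity in neighbours:
        # Only names the target has not asked for are candidates.
        for name in model[address] - seen:
            scores[name] = scores.get(name, 0.0) + similarity
    # Highest score first; ties are broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:how_many]]
```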

DNSSEC and broken signatures

How can I test my validating resolver? Are there any broken signatures on the Internet under a correctly delegated domain?

Have a look at my test domain "0x7e.ch":

$ dig www.0x7e.ch. CNAME +dnssec
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4920
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

$ dig badsig.0x7e.ch. CNAME +dnssec
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 39688
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

For A records use: ok-switch.0x7e.ch. A / bad-switch.0x7e.ch. A
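
If you want to script this check, the two dig headers can be compared automatically. The Python sketch below only parses the header text that dig prints; it expects NOERROR with the "ad" flag for the good name and SERVFAIL for the broken one, as in the output above. The function names are made up for this example, and running dig itself is left to the caller.

```python
import re


def classify_dig_header(output):
    """Return (status, ad_flag_set) parsed from dig's header lines."""
    status = re.search(r"status: (\w+)", output)
    flags = re.search(r"flags: ([a-z ]*);", output)
    if status is None or flags is None:
        raise ValueError("no dig header found")
    return status.group(1), "ad" in flags.group(1).split()


def looks_like_validating(good_output, bad_output):
    """True if the resolver authenticated the good name and refused the bad one."""
    good_status, good_ad = classify_dig_header(good_output)
    bad_status, _ = classify_dig_header(bad_output)
    return good_status == "NOERROR" and good_ad and bad_status == "SERVFAIL"
```

A non-validating resolver would typically answer both names with NOERROR and without the "ad" flag, so the check returns False.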

First DNSSEC validating name server for .CH and .LI

My Freerunner is the first DNSSEC validating name server for .CH and .LI and probably one of the first mobile phones worldwide using DNSSEC. (I configured the trusted-keys even before the first signatures were created with these keys.)

Gpsylon SwissGrid on Freerunner

The Gpsylon SwissGrid map application runs on an Openmoko Freerunner!

 

Openmoko alarm clock with cron

The perfect alarm clock!
A cron job running on my Freerunner plays some music with alsaplayer and stops when you touch the screen.

$ cat alarm.sh
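#!/bin/sh
# Switch the vibrator on while the alarm is ringing.
echo 100 > /sys/class/leds/neo1973\:vibrator/brightness
# Play the alarm sound in the background and remember the player's PID.
alsaplayer -i text /root/sounds/A-Path-To-Solitude.mp3 &
pid=$!
# Wait for a touch, then stop the player and always switch the vibrator off.
(/root/bin/waitclick.sh; kill $pid; echo 0 > /sys/class/leds/neo1973\:vibrator/brightness) &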

$ cat waitclick.sh
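#!/bin/sh
# Watch input device 1 (timeout 300 s); the first "released" event kills this script and ends the wait.
input-events -t 300 1 2>&1 | ( grep -q -m 1 released && kill $$ )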

Creating JPEG with Matlab

How does JPEG work? The following document explains how you can create (calculate) JPEG images with Matlab.

JPEG-mit-Matlab.pdf (German) (806K)

ECG

What can you do if you think that something about your heartbeat is strange? Build an ECG device and have a look at it!

Use your preferred search engine to find a design that already does something like this. Take a handful of op-amps and an oscilloscope:

 

Create a Perl script to extract the data and use gnuplot to visualize the results:

Electronic Lab

Have a look at my small collection of old electronic laboratory reports. They may be useful for new students.