An Interesting solution to a (not so) tricky problem

Once upon a time (a.k.a a couple of weeks ago)  in a far away land (a.k.a at work), We were asked to solve a problem :

We had a mapping of the number of times a user clicked on an Ad from a given ip addresses,

We also  had an initial black/white-list  of ip addresses and usernames.

We needed to use the initial blacklist(s) and whitelist(s) to assign a score (of 0 to 100) to all the remaining ip addresses and users we had – 0 being the score of a blacklisted node and  100 being that of a white-listed node.

So we had to write a program which takes the above data as input and outputs:

Phase 1: Tab-Separated output with (Username, Score) where Score is a quality score between say 0 and 100 (100 is best quality — 0 is worst)

Phase 2: Tab-Separated output with (IP, Score) where Score is a quality score between say 0 and 100 (100 is best quality — 0 is worst)

This seemed like a too trivial story. Isn’t it? It probably is, for those who have studied a bit of graph theory..  Unfortunately that’s not the case with me.. Yet, for some reason, i thought the problem to be a trivial one to solve….. until I got stuck with it for a few days.. and then i somehow came up with a nice little solution (which i really like).

So before giving my solution away.. let me first discuss the problem…

We have a white-list of usernames and ip addresses (Basically the people we trust), which more or less looks like:

*The Good IPs and Usernames*
userA
192.168.1.1
192.168.1.2
...

and a black-list of usernames and ip addresses, (The people who made a lot of fradulent clicks?…),

*The Bad IPs and Usernames*
192.168.1.100
userB
255.255.255.255
...

Which more or less looks like,

The Good and The Bad nodes

And then we have the mapping b/w each username and an ip address, i.e the number of clicks of a user we received from each ip address


(ip address , total clicks, total users, {username:clicks})
192.168.1.2     5     2     {"userC":3,"userD":2}
192.168.1.3     1     1     {"userE": 1}
192.168.1.100     3     2      {"userF":1,  "userG": 2}
...

Which looks like:

Now The other Nodes

Now what about the scores for this trivial case?

Here, Let me solve that for you….

The Answer for the trivial case

So you might ask me whats so tricky in this problem, well.. let me ask you a question first! what if we add another entry to the ip-user name map like this:


192.168.1.5 3   3       {"userF":1,  "userD": 2}

i.e what if UserF clicks once on an Ad from a different ip address(192.168.1.5) and userD clicks on the same ad twice?

The Twist in the story!

Now time to think about it…… how would you score the nodes in the graph?

Its obvious to guess that :

  1. Only nodes: {userF, 192.168.1.5,  userD} will get new scores..
  2. Since userF is indirectly related to a good node, it’s score will probably have to increase
  3. The score of userD should now decrease for a similar reason

By now you probably have realized that you will have to answer these 3 questions to go further:

  1. what should the score of the ip address 192.168.1.5 be then?
  2. what should the score of userF and userD then be?
  3. how would you generalize this solution of yours for a million nodes?

So ……these are the questions that i was stuck at for a couple of days….. and it was a really tricky thing for me. And then, one fine morning (well i spent the night at office just trying to find a solution),

The EUREKA moment!!

I came up with what seemed like a really cool analogy for the problem, something from my high school days…. and without any further delay, let me present the analogy to you:

  • The Good IP Addresses and usernames can be compared to a Voltage sources of potential 100V.
  • The Bad IP Addresses and usernames can be compared to Voltage sources of potential 0V.
  • The Connection b/w 2 nodes ( in our case it is just a connection b/w IP Address and a Username) can be represented by a resistor of conductance = number of clicks(=1/Resistance).(The Greater the number of clicks the user made from the given ip address, the closer he will be to the node, hence the lower resistance and the lesser the potential difference b/w the two nodes.)
  • The “Good ness”, like current, will have to flow from the good node to the bad node.
  • Ohm’s law and Kirchoff’s Laws would apply (obviously, voltage is not leaking away at nodes or in the resistors)=> Nodal analysis on this circuit = happily ever after!! 😀
Tada!!

But what happens after happily ever after is that:

  1. I had to read up about the SPICE circuit solver
  2. Weed out the stray nodes (the nodes that arent directly or indirectly connected to the voltage sources) from the given input using a simple Graph traversal algorithm. (Thank YOU prof. Sebastien Thrun!)
  3. Write a little python script to convert the processed input to a SPICE circuit file.

So for a nice little  example, lets look at the circuit:

A historic battle

The spice input file for this problem:


* Circuit simulating fraud detection

Vip::192.168.1.2 ip::192.168.1.2 0 DC 100
Vip::0.0.0.0 ip::0.0.0.0 0 DC 0
Ruser::a~ip::192.168.1.2 user::a ip::192.168.1.2 1
Ruser::a~ip::0.0.0.0 user::a ip::0.0.0.0 1
Ruser::a~ip::8.8.8.8 user::a ip::8.8.8.8 1
.op
.end

 and the (partial) spice output:
  Node                 Voltage
 -------------------------------
 ip::8.8.8.8          5.000000e+01
 user::a              5.000000e+01
 ip::0.0.0.0          0.000000e+00
 ip::192.168.1.2      1.000000e+02

SO, this solution was more than interesting for me, not just because of the analogy, but because i was able to apply my knowledge of the SPICE software to solve this… and you have no idea how boring the SPICE lab sessions used to be, and this was the first time i actually dug the internet up for more tutorials for such a boring software!!!

Cheers,

:Dinesh

P.S If you find any mistake in this solution, please let me know 🙂

P.P.S For some weird reason, this solution reminds me of http://myonlinescratchpad.blogspot.in/2009/02/my-calculus-assignment.html

An Interesting solution to a (not so) tricky problem