An Interesting solution to a (not so) tricky problem

Once upon a time (a.k.a a couple of weeks ago)  in a far away land (a.k.a at work), We were asked to solve a problem :

We had a mapping of the number of times a user clicked on an Ad from a given ip addresses,

We also  had an initial black/white-list  of ip addresses and usernames.

We needed to use the initial blacklist(s) and whitelist(s) to assign a score (of 0 to 100) to all the remaining ip addresses and users we had – 0 being the score of a blacklisted node and  100 being that of a white-listed node.

So we had to write a program which takes the above data as input and outputs:

Phase 1: Tab-Separated output with (Username, Score) where Score is a quality score between say 0 and 100 (100 is best quality — 0 is worst)

Phase 2: Tab-Separated output with (IP, Score) where Score is a quality score between say 0 and 100 (100 is best quality — 0 is worst)

This seemed like a too trivial story. Isn’t it? It probably is, for those who have studied a bit of graph theory..  Unfortunately that’s not the case with me.. Yet, for some reason, i thought the problem to be a trivial one to solve….. until I got stuck with it for a few days.. and then i somehow came up with a nice little solution (which i really like).

So before giving my solution away.. let me first discuss the problem…

We have a white-list of usernames and ip addresses (Basically the people we trust), which more or less looks like:

*The Good IPs and Usernames*

and a black-list of usernames and ip addresses, (The people who made a lot of fradulent clicks?…),

*The Bad IPs and Usernames*

Which more or less looks like,

The Good and The Bad nodes

And then we have the mapping b/w each username and an ip address, i.e the number of clicks of a user we received from each ip address

(ip address , total clicks, total users, {username:clicks})     5     2     {"userC":3,"userD":2}     1     1     {"userE": 1}     3     2      {"userF":1,  "userG": 2}

Which looks like:

Now The other Nodes

Now what about the scores for this trivial case?

Here, Let me solve that for you….

The Answer for the trivial case

So you might ask me whats so tricky in this problem, well.. let me ask you a question first! what if we add another entry to the ip-user name map like this: 3   3       {"userF":1,  "userD": 2}

i.e what if UserF clicks once on an Ad from a different ip address( and userD clicks on the same ad twice?

The Twist in the story!

Now time to think about it…… how would you score the nodes in the graph?

Its obvious to guess that :

  1. Only nodes: {userF,,  userD} will get new scores..
  2. Since userF is indirectly related to a good node, it’s score will probably have to increase
  3. The score of userD should now decrease for a similar reason

By now you probably have realized that you will have to answer these 3 questions to go further:

  1. what should the score of the ip address be then?
  2. what should the score of userF and userD then be?
  3. how would you generalize this solution of yours for a million nodes?

So ……these are the questions that i was stuck at for a couple of days….. and it was a really tricky thing for me. And then, one fine morning (well i spent the night at office just trying to find a solution),

The EUREKA moment!!

I came up with what seemed like a really cool analogy for the problem, something from my high school days…. and without any further delay, let me present the analogy to you:

  • The Good IP Addresses and usernames can be compared to a Voltage sources of potential 100V.
  • The Bad IP Addresses and usernames can be compared to Voltage sources of potential 0V.
  • The Connection b/w 2 nodes ( in our case it is just a connection b/w IP Address and a Username) can be represented by a resistor of conductance = number of clicks(=1/Resistance).(The Greater the number of clicks the user made from the given ip address, the closer he will be to the node, hence the lower resistance and the lesser the potential difference b/w the two nodes.)
  • The “Good ness”, like current, will have to flow from the good node to the bad node.
  • Ohm’s law and Kirchoff’s Laws would apply (obviously, voltage is not leaking away at nodes or in the resistors)=> Nodal analysis on this circuit = happily ever after!! 😀

But what happens after happily ever after is that:

  1. I had to read up about the SPICE circuit solver
  2. Weed out the stray nodes (the nodes that arent directly or indirectly connected to the voltage sources) from the given input using a simple Graph traversal algorithm. (Thank YOU prof. Sebastien Thrun!)
  3. Write a little python script to convert the processed input to a SPICE circuit file.

So for a nice little  example, lets look at the circuit:

A historic battle

The spice input file for this problem:

* Circuit simulating fraud detection

Vip:: ip:: 0 DC 100
Vip:: ip:: 0 DC 0
Ruser::a~ip:: user::a ip:: 1
Ruser::a~ip:: user::a ip:: 1
Ruser::a~ip:: user::a ip:: 1

 and the (partial) spice output:
  Node                 Voltage
 ip::          5.000000e+01
 user::a              5.000000e+01
 ip::          0.000000e+00
 ip::      1.000000e+02

SO, this solution was more than interesting for me, not just because of the analogy, but because i was able to apply my knowledge of the SPICE software to solve this… and you have no idea how boring the SPICE lab sessions used to be, and this was the first time i actually dug the internet up for more tutorials for such a boring software!!!



P.S If you find any mistake in this solution, please let me know 🙂

P.P.S For some weird reason, this solution reminds me of

An Interesting solution to a (not so) tricky problem

2 thoughts on “An Interesting solution to a (not so) tricky problem

  1. mg says:

    Loved the analogy too!

    I come from an engineering background and had a lots of contact with kirchhoff laws (circuit analysis, etc) during the making of my master’s thesis. I loved your sublime solution to this problem and after having coded my own circuit solver for the subject I was working for, all I can say is I hope you never have to solve giant network of voltages and resistances because it takes to much time. For example, after optimization, my simulations took roughly 2 days to complete (each one) and 5 before optimization.

    Also, quick question. Whitelist at 100, black at 0. and grey for all the rest. Will you keep it that way or are you going to setup some kind of thresholds to allow more node to be considered “white” and “black” or simply use the 0-100 scale in any other way?

    Great work!

    1. Yups..the size of the problem was actually a scary one …. as it involved solving a huge matrix for answers……

      Fortunately, In the case of this problem, even if we have a million nodes, most of them wont be connected to the the “Good/Bad nodes”… so most of them can happily be removed as “stray nodes”, during the “compile time” (when we convert the given ip-user map into a spice circuit)……. a O(n) operation in the case of a trivial solution….

      for 40,000 Nodes, ngspice took me about 20-30 minutes to give me a solution on my machine…

      And as for your question, that is the actual reason why we were asked to solve this problem! We were supposed to extend our handpicked database of good/bad nodes based on some ( threshold voltage + distance to the voltage source ) metric…


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s