Merge collections without duplicates in C#

Posted by Jonas Elfström Thu, 18 Oct 2012 21:28:00 GMT

Your task is to merge two lists of objects. The resulting collection should be without duplicates, based on a certain property on the objects.

The lists are populated with very simple Person objects.

1
2
3
4
5
class Person
{
        public int Number { get; set; }
        public string Name { get; set; }
}


C# programmers often turn to LINQ, and they should! With LINQ you could end up with something like this:

1
2
3
var merged = new List<Person>(list1);
merged.AddRange(list2.Where(p2 => 
                list1.All(p1 => p1.Number != p2.Number)));


It's not all that obvious how it works and it also has performance problems. This solution has to iterate over list1 for every object in list2. Not every object every time but on average half of them (if the lists contains no duplicates within themselves). If the lists contains a 1000 objects each there's a good chance that list1 will be iterated 500.000 times. That kind of work does not come for free.

We need to get rid of this nested looping. I was pretty sure a dictionary based solution would do the trick.

1
2
3
4
5
6
var dict = list2.ToDictionary(p => p.Number);
foreach (var person in list1)
{
        dict[person.Number] = person;
}
var merged = dict.Values.ToList();


This solution converts list2 to a dictionary and then it loops a single time over list1. It either adds to the dictionary or replaces the value of list2 if it was a duplicate. This seemed to be a much more efficient algorithm.

In C# there's not just Dictionary<TKey, TValue> but we also have HashSet. A HashSet is a collection that contains no duplicates. It also has the UnionWith method and that is exactly what we are looking for. The problem is that it compares object to object and we want to compare based on a property.

Fortunately the HashSet methods honors the default equality comparer. Such a beast could look like this.

1
2
3
4
5
6
7
8
9
10
11
12
class PersonComparer : IEqualityComparer<Person>
{
        public bool Equals(Person p1, Person p2)
        {
                return p1.Number == p2.Number;
        }

        public int GetHashCode(Person p)
        {
                return p.Number;
        }
}


Then just add it as the second parameter to the constructor.

1
2
3
var hs = new HashSet<Person>(list1, new PersonComparer());
hs.UnionWith(list2);
var merged = hs.ToList();


Set theory tells us that this merging is actually a union of the two sets (without duplicates) and LINQ happens to have a Union method.

var merged = list1.Union(list2, new PersonComparer());


By that we've come full circle and it's time to see how the different solutions performs. I wrote a small test program that creates two lists of objects with 1000 objects in each. In each list 500 of the objects property, that we base the merge on, has the same value.

Lists and LINQ merge: 4820ms
Dictionary merge: 16ms
HashSet and IEqualityComparer: 20ms
LINQ Union and IEqualityComparer: 24ms

The first solution is, in this case, 300 times slower than the fastest one!

I prefer LINQ Union because it communicates intent very clearly.

ASP.NET Web API

Posted by Jonas Elfström Sat, 01 Sep 2012 16:05:00 GMT

Here's my slides for a presentation on REST and ASP.NET Web API that I did.

Leif in Clojure 2

Posted by Jonas Elfström Tue, 01 Nov 2011 02:12:00 GMT

I've been meaning to learn a functional language for years. I don't know what put me off besides the fear of parentheses and the notion that maybe, just maybe, Ruby could be an acceptable Lisp. For all I know, Ruby might be just that, but there's one thing about Ruby that is starting to get more and more unacceptable to me.

Ruby is actually not very well fit to take advantage of the multi-core world we are living in. The GIL and the prevalent mutable state are two big problems. As web applications go, that's often not a problem since it's almost always possible to scale without threads there. But if you want to use the full potential of your multi-core machine in a singe Ruby process/VM, you are out of luck. There are workarounds that get you somewhere and there are implementations of Ruby without a GIL (JRuby and soon Rubinius). Even without the GIL we still have the problem of having a lot of mutable state. That will, probably, make it hard to scale over multiple cores.

A little more than a week ago I watched the presentation Simple Made Easy by Rich Hickey. It's an excellent presentation and one of the best I've seen in years. I was also aware that Hickey is the man who designed Clojure. Finally, I was able to make a choice. I chose to begin to learn Clojure.

I'm still very early in this endeavor. Actually, still trying to decide what books I should read. Anyway, the day after I saw the presentation I downloaded clooj (a simple IDE), read some blogs and some documentation, and started to hack on something. Just to try to get a feel for how hard this was going to be. The parentheses didn't bother me all that much, it was much harder to accept that operators are just functions and that they get no special treatment. It seems you simply have to (+ 31 11) to add 31 and 11. I think I can get over that.

As a side note, it turns out I wrote my first Clojure program on 2011-10-24 and that happened to be the day that John McCarthy, the father of Lisp, died. Yet another of those strange coincidences that seems to happens more often than they should.

The program I wrote is yet another implementation of the stupid simple "algorithm" that I've got to call Leif. If you take Conway's Game of Life, remove the rules and replace them with randomness, you could end up with something like Leif. Quite silly, actually.

Hopefully I will have something more interesting to tell in a couple of weeksmonths, after I've got some more experience with Clojure. For now, Leif will have to do. Here it is.

This is what it looks like at the beginning

and after a couple of minutes it looks like this

One thing that surprised me is that it seems to use both of my cores even though I put no effort whatsoever in making it do so. Probably just the Java GC or something.

Inspired by randomness 1

Posted by Jonas Elfström Thu, 27 Oct 2011 21:19:00 GMT

Inspired by Randomly not so random I decided to play around with the subject.

Both Java and C# uses a linear congruential generator as their pseudorandom number generators.

The current implementation of the Random class is based on a modified version of Donald E. Knuth's subtractive random number generator algorithm."

About System.Random at MSDN.

I tried to find information on what values Microsoft uses for the m, a, and c parameters in their LGC implementation but I failed to do so.

A good random number generator should have certain properties.

The German Federal Office for Information Security (BSI) has established four criteria for quality of deterministic random number generators. They are summarized here:
K1 — A sequence of random numbers with a low probability of containing identical consecutive elements.

From the pseudorandom number generator article on Wikipedia. Let's see what we can do about that.

1
2
3
4
5
var random = new Random(116793166);
for (int i = 0; i < 9; i++)
{
    Console.Write(random.Next(10) + " ");
}


Outputs:

1 1 1 1 1 1 1 1 1

Did I just break the K1 criteria above? You know I didn't but you might be interested in how I found the seed number. It's easy but also quite computational intensive, that's why I used Parallel.For.

1
2
3
4
5
6
7
8
9
10
11
12
Parallel.For(0, int.MaxValue, i =>
{
    var rnd = new Random(i);
    bool allSame = true;
    for (int j = 0; j < 9; j++)
    {
        allSame = allSame && rnd.Next(10) == 1;
        if (!allSame) break;
    }
    if (allSame)
        Console.WriteLine(i); 
});


It took a couple of minutes for the number 116793166 to show up in my console.

To keep on playing I needed a random string generator. Mine looks like this.

1
2
3
4
5
6
7
8
9
private const string _chars = "abcdefghijklmnopqrstuvwxyz";

private static string RandomString(int size, Random rng)
{
    var buffer = new char[size];
    for (int i = 0; i < size; i++)
        buffer[i] = _chars[rng.Next(_chars.Length)];
    return new string(buffer);
}


If the random generators were the same it looks like it should give the same result as the Java one but RandomString(5, new Random(-229985452)) returns tfsld. Either my RandomString method works different than the Java one or it's the case that Java and .NET has slightly different settings of their LGCs.

Mathematically there's an infinite amount of inputs that results in a returned hello, for instance 1382472294, but here we are limited by the size of our integers.

I found the seed 1382472294 with the following little method and loop

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
private static bool CompareToRandomString(string str, Random rng) 
{ 
    for (int i = 0; i < str.Length; i++)
    {
        if (_chars[rng.Next(_chars.Length)] != str[i])
            return false;
    }
    return true;
}

...

const string lookingFor = "hello";

Parallel.For(0, int.MaxValue, i =>
{
    var rnd = new Random(i);
    if (CompareToRandomString(lookingFor, rnd))
        Console.WriteLine(i);
});


I leave it up to you to implement a break out of that loop.

Ruby does not use a LGC but a Mersenne twister instead. It's still only pseudorandom so there's no problem finding patterns you like and being able to repeat them.

1
2
srand(2570940381)
9.times { print "%d " % rand(10) }

Outputs:
1 2 3 4 5 6 7 8 9

1
2
3
charset=('a'..'z').to_a
srand(124931)
puts (0...4).map{ charset.to_a[rand(charset.size)] }.join

Outputs:
ruby

Sudoku solver in CoffeeScript 6

Posted by Jonas Elfström Tue, 19 Apr 2011 13:46:00 GMT

CoffeeScript is inspired by Ruby and Python but what's most peculiar with it is that it compiles to JavaScript. The generated JavaScript isn't all that bad and it even passes JavaScript lint without warnings.

"Underneath all of those embarrassing braces and semicolons, JavaScript has always had a gorgeous object model at its heart." - Jeremy Ashkenas

About a week ago I stumbled on the very clever Sudoku solver by Peter Norvig. I have nothing (or at least not much) against Python but I pretty soon checked out the Ruby translations of the solver. I then refactored one of the solutions to get a chance to get to know the algorithm better.

Now that I finally installed CoffeeScript the Sudoku solver came to mind. I dived in head first and got in trouble pretty soon. It turns out that Array Comprehensions in CoffeeScript differs some from the List Comprehensions in Python.

1
2
3
def cross(A, B):
       "Cross product of elements in A and elements in B."
       return [a+b for a in A for b in B]

returns an one-dimensional array if you call it with two arrays (or strings).

But in CoffeeScript

cross = (A, B) -> (a+b for a in A for b in B)

returns a two-dimensional array.

The jury is still out on if this is intended or not but either way the array comprehensions in CoffeeScript are still very useful.

EDIT 2011-06-02
It's been decided that this is by design. The issue has been closed.


For the cross-function I ended up with

1
2
3
4
5
6
cross = (A, B) ->
  results = []
  for a in A
    for b in B
      results.push a + b
  results


EDIT 2011-06-02
Using `map` I could have got much closer to the Ruby version.
1
2
cross = (cols, rows) ->
  [].concat (cols.map (x) -> rows.map (y) -> y+x)...
I'm still more of a map/reduce guy than a list comprehension ninja.



You can find the CoffeScript Sudoku solver as a Gist. Compile it with
coffee -c sudoku.coffee
and then run it with
node sudoku.js

I don't know how I missed it but as Trevor pointed out below all you have to do is
coffee sudoku.coffee

    1 2 3 4 5 6 7 8 9
  |------+-----+-----|
A | 4 8 3|9 2 1|6 5 7|
B | 9 6 7|3 4 5|8 2 1|
C | 2 5 1|8 7 6|4 9 3|
  |------+-----+-----|
D | 5 4 8|1 3 2|9 7 6|
E | 7 2 9|5 6 4|1 3 8|
F | 1 3 6|7 9 8|2 4 5|
  |------+-----+-----|
G | 3 7 2|6 8 9|5 1 4|
H | 8 1 4|2 5 3|7 6 9|
I | 6 9 5|4 1 7|3 8 2|
  |------+-----+-----|

Here's the generated sudoku.js.

This is the only CoffeeScript I've ever written but I already like it (more than JavaScript). Please correct me if I strayed from the CoffeeScript way.

Older posts: 1 2 3 ... 12