String concatenation in Ruby
There's no StringBuilder class in Ruby because the String class has the << operator for appending. The problem is that not every Ruby programmer seems to be aware of it. Recently I've seen += being used to append to strings where << would have been a much better choice.
The problem with using += is that it creates a new String instance every time, copying everything appended so far. Do that in a loop and you can get really horrible performance.
If you are dealing with an array you don't even have to use << because Array#join is even faster and shows intent in a nice way.
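A minimal sketch of the difference, using object identity rather than timing. The variable names are only for illustration, and String.new is used so the strings are mutable even if string literals are frozen:

```ruby
a = String.new("foo")
b = a                          # b refers to the same String object as a
a += "bar"                     # builds a brand-new String and rebinds a
plus_rebinds = !a.equal?(b)    # a now points at a different object; b is untouched

c = String.new("foo")
d = c
c << "bar"                     # appends in place, no new String is created
append_mutates = c.equal?(d)   # still the same object, so d sees the change too

joined = %w[foo bar baz].join  # Array#join builds the result in one call
```

Because << mutates its receiver, every other reference to that string sees the appended text as well, which is worth keeping in mind when the string came from somewhere else.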
    require 'benchmark'

    # 262,144 random letters, joined and then split into 8-character chunks
    array_of_rnd_strings = (0...262144).map { (65 + rand(25)).chr }.join.scan(/.{1,8}/m)

    Benchmark.bm do |benchmark|
      benchmark.report do
        str = array_of_rnd_strings.join
      end

      benchmark.report do
        str2 = ""
        array_of_rnd_strings.each do |s|
          str2 << s
        end
      end

      benchmark.report do
        str3 = ""
        array_of_rnd_strings.each do |s|
          str3 += s
        end
      end
    end
| user | system | total | real |
| 0.030000 | 0.000000 | 0.030000 | ( 0.027184) |
| 0.160000 | 0.010000 | 0.170000 | ( 0.190277) |
| 106.020000 | 0.300000 | 106.320000 | (113.457793) |
The three rows are, in order, Array#join, << and +=. The performance of += was even worse than I imagined!
Finding primes in parallel
Justin Etheredge has been blogging about his challenge to find prime numbers with LINQ. He later used AsParallel() (coming in .NET 4) to speed things up and then followed that up with a post about using The Sieve Of Eratosthenes.
As you can see in the comments on those posts, I tried to speed up the Sieve of Eratosthenes by using Parallel.For in the inner loop. I also tried AsParallel() in the LINQ expression, but neither made much difference: at most it got 5% faster. I'm not sure why, but because SoE is very memory intensive we could have a scaling issue and maybe also memory bandwidth exhaustion. This is mere speculation.
I then searched for other algorithms and found The Sieve of Atkin. It uses less memory than SoE so I thought I'd give it a try.
I set the limit to 20,000,000 and then benchmarked it. It clocked in at 2.48s, so actually worse than the 2.2s that SoE took. Not good!
Then I added Parallel.For in the loop that did most of the work and lo and behold, it scaled! I have two cores in my machine (T7200@2.0GHz) and the average runtime went down to 1.26s. That's almost linear and surprisingly good! If you happen to have a quad core (or more) and feel like trying it out then please contact me. It would be interesting to see if it scales further.
    static List<int> FindPrimesBySieveOfAtkins(int max)
    {
        var isPrime = new BitArray(max + 1, false);
        var sqrt = (int)Math.Sqrt(max);

        // Toggle the candidates in parallel. Note that BitArray is not
        // thread-safe, so simultaneous toggles from different threads can
        // be lost; treat the result as unverified.
        Parallel.For(1, sqrt, x =>
        {
            var xx = x * x;
            for (int y = 1; y <= sqrt; y++)
            {
                var yy = y * y;

                var n = 4 * xx + yy;
                if (n <= max && (n % 12 == 1 || n % 12 == 5))
                    isPrime[n] = !isPrime[n];

                n = 3 * xx + yy;
                if (n <= max && n % 12 == 7)
                    isPrime[n] = !isPrime[n];

                n = 3 * xx - yy;
                if (x > y && n <= max && n % 12 == 11)
                    isPrime[n] = !isPrime[n];
            }
        });

        // Collect primes up to sqrt and clear the multiples of their squares.
        var primes = new List<int>() { 2, 3 };
        for (int n = 5; n <= sqrt; n++)
        {
            if (isPrime[n])
            {
                primes.Add(n);
                int nn = n * n;
                for (int k = nn; k <= max; k += nn)
                    isPrime[k] = false;
            }
        }

        // Everything still marked above sqrt is prime.
        for (int n = sqrt + 1; n <= max; n++)
            if (isPrime[n])
                primes.Add(n);

        return primes;
    }
This is C# 4.0 code, compiled in Visual C# 2010 Express Beta 2.
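For anyone who wants to check the sieve's output without the parallel part, here is a sequential Ruby sketch of the same toggling rules. It is a port for illustration, not the code I benchmarked: the method name sieve_of_atkin is made up, it uses a plain Array of booleans instead of a BitArray, and it assumes Ruby 2.5 or later for Integer.sqrt.

```ruby
# Sequential sketch of the Sieve of Atkin (hypothetical helper name).
def sieve_of_atkin(max)
  return [2, 3].select { |p| p <= max } if max < 5

  is_prime = Array.new(max + 1, false)
  sqrt = Integer.sqrt(max)

  # Toggle every n that has a solution to one of the three quadratic forms.
  (1..sqrt).each do |x|
    xx = x * x
    (1..sqrt).each do |y|
      yy = y * y

      n = 4 * xx + yy
      is_prime[n] = !is_prime[n] if n <= max && (n % 12 == 1 || n % 12 == 5)

      n = 3 * xx + yy
      is_prime[n] = !is_prime[n] if n <= max && n % 12 == 7

      n = 3 * xx - yy
      is_prime[n] = !is_prime[n] if x > y && n <= max && n % 12 == 11
    end
  end

  # Clear multiples of the square of every prime found up to sqrt.
  (5..sqrt).each do |n|
    next unless is_prime[n]
    nn = n * n
    nn.step(max, nn) { |k| is_prime[k] = false }
  end

  [2, 3] + (5..max).select { |n| is_prime[n] }
end

small_primes = sieve_of_atkin(30)
```

Comparing its output with a naive trial-division check for a small limit is a cheap way to confirm that a parallel version has not lost any toggles.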
Edit 2010-01-20
Indications are that this does not in fact scale very well on a quad core. It's even worse: it seems to scale well on my old T7200 but not on a dual core E6320. I don't know why, but of course the shared state of the isPrime BitArray is a huge problem, and maybe differences in CPU architecture (FSB speed, caches and so on) in the E6320 are an explanation. Average execution time on the E6320 was 1290ms in a single thread and 1064ms in two.
If you want to try this in an older version of C# than 4.0 then check out this post.
A reader asked how I timed the executions. Here's how.
    var steps = new List<long>();
    var watch = new Stopwatch();

    // Ten runs, each timed separately, then averaged.
    for (int i = 0; i < 10; i++)
    {
        watch.Reset();
        watch.Start();
        var primes = FindPrimesBySieveOfAtkins(20000000);
        watch.Stop();
        Console.WriteLine(watch.ElapsedMilliseconds.ToString());
        steps.Add(watch.ElapsedMilliseconds);
    }
    Console.WriteLine("Average: " + steps.Average().ToString());
Comparing instance variables in Ruby
Say you have two objects of the same class and you want to know what differs between them. Well, actually you just want to know which instance variables in object b differ from the ones in object a.
To begin with, we need a class. I like cheese.
    class Cheese
      attr_accessor :name, :weight, :expire_date
      def initialize(name, weight, expire_date)
        @name, @weight, @expire_date = name, weight, expire_date
      end
    end
Then we need some cheese objects.
    require 'date' # needed for Date.parse

    stilton = Cheese.new('Stilton', 250, Date.parse("2009-11-02"))
    gorgonzola = Cheese.new('Gorgonzola', 250, Date.parse("2009-11-17"))
With only name, weight and an expiration date it would be easy to compare those, but imagine that these two objects have 42 properties. It does not stop there: you are being asked to compare 24 different classes in this way. Are you cringing yet?
Object#instance_variables to the rescue! Well, that and a small hack by me. Below I add a new method called instance_variables_compare to Object. The long method name is because I wanted to follow the naming already in place. Usually I prefer to do this kind of thing as a module and then include it where appropriate, but in this case I find that a monkey patch will do.
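    class Object
      # Returns a hash of the instance variables whose values in o differ
      # from the ones in self, mapped to o's values.
      def instance_variables_compare(o)
        Hash[*self.instance_variables.map {|v|
          self.instance_variable_get(v) != o.instance_variable_get(v) ?
            [v, o.instance_variable_get(v)] : []
        }.flatten(1)] # flatten one level only, so array values stay intact
      end
    end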
It returns the instance variables that differ as a hash because it's handy and because I like it that way.
    >> stilton.instance_variables_compare(gorgonzola)
    => {"@name"=>"Gorgonzola", "@expire_date"=>#<Date: 4910305/2,0,2299161>}
    >> gorgonzola.instance_variables_compare(stilton)
    => {"@name"=>"Stilton", "@expire_date"=>#<Date: 4910275/2,0,2299161>}
    >> stilton.expire_date=gorgonzola.expire_date
    => #<Date: 4910305/2,0,2299161>
    >> stilton.instance_variables_compare(gorgonzola)
    => {"@name"=>"Gorgonzola"}
    >> stilton.instance_variables_compare(stilton)
    => {}
If you ever think of using this code you should be aware of two things.
- This code is very untested and comes with no guarantees.
- Since instance variables spring into life the first time they are assigned to, you either have to work with objects that always initialize everything or you have to change instance_variables_compare to handle this.
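One possible way to handle the second point, sketched as a module instead of a monkey patch. The module and method name InstanceVariablesDiff / instance_variables_diff and the origin attribute are made up for this example. It looks at the instance variables of both objects and treats a variable that has never been assigned as nil, which is an assumption you may not want. On Ruby 1.9 and later the keys are symbols rather than strings.

```ruby
# Hypothetical module; include it only in the classes that need it.
module InstanceVariablesDiff
  # Returns a hash of the instance variables that differ, mapped to other's
  # values. Variables set on only one of the two objects are included.
  def instance_variables_diff(other)
    names = instance_variables | other.instance_variables
    names.each_with_object({}) do |name, diff|
      mine = instance_variable_get(name)         # nil if never assigned
      theirs = other.instance_variable_get(name) # nil if never assigned
      diff[name] = theirs unless mine == theirs
    end
  end
end

class Cheese
  include InstanceVariablesDiff
  attr_accessor :name, :weight, :origin

  def initialize(name, weight)
    @name, @weight = name, weight # @origin is deliberately left unset
  end
end

stilton = Cheese.new('Stilton', 250)
brie = Cheese.new('Brie', 250)
brie.origin = 'France' # @origin now exists on brie only

diff = stilton.instance_variables_diff(brie)
reverse_diff = brie.instance_variables_diff(stilton)
```

Here diff contains both @name and @origin, while reverse_diff reports @origin as nil because stilton never assigned it.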
Infinite ranges in C#
I recently learned that C# is compliant with the IEEE 754-1985 standard for floating-point arithmetic. That wasn't a big surprise, but that it defines floating-point division by zero as Infinity was! It actually kind of bothers me that I didn't know this.
In mathematics division by zero is undefined for real numbers, but I guess Infinity is a more pragmatic result. Or as a friend put it: "IEEE stands for Institute of Electrical and Electronics Engineers, not Institute of Mathematics."
    double n = 1.0;
    n = n / 0;
    if (n > 636413622384679305)
        System.Console.WriteLine("Yes it certainly is!");
This C# code does not throw an exception; it simply leaves n defined as Infinity and writes a line to the console.
Ruby is also IEEE 754-1985 compliant. It even lets you define infinite ranges.
    Infinity = 1.0/0
    => Infinity
    (1..Infinity).include?(162259276829213363391578010288127)
    => true
    (7..Infinity).step(7).take(3).inject(&:+) # 7+14+21
    => 42
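A small sketch of the same ideas on a newer Ruby, assuming 2.0 or later, where Float::INFINITY is predefined and Enumerable#lazy is available. It also shows that only floating-point division gives Infinity; integer division by zero still raises. The variable names are just for illustration.

```ruby
inf = Float::INFINITY            # predefined, same value as 1.0 / 0
same_value = (1.0 / 0) == inf    # float division by zero gives Infinity
not_a_number = (0.0 / 0).nan?    # zero divided by zero is NaN, not Infinity

# Integer division by zero is not covered by IEEE 754 and raises instead.
integer_division_raises =
  begin
    1 / 0
    false
  rescue ZeroDivisionError
    true
  end

# An infinite range can be consumed lazily without ever being materialized.
evens = (1..inf).lazy.select(&:even?).first(3)
```

The lazy call matters here: without it, select would try to walk the whole infinite range before first ever got a chance to stop it.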
I can't say I see very much use for this, but it brings a kind of completeness to the handling of infinities. Unfortunately it seems we don't get that in C# out of the box, because Enumerable.Range takes two int parameters and there's no Infinity defined for int. That's unless someone wrote a generic Range class. Turns out none other than Jon Skeet did in his MiscUtil. Download MiscUtil, add using MiscUtil.Collections; and you can:
    double n = 1.0;
    var infinity = n / 0;
    var r = new Range<double>(0, infinity);
    if (r.Contains(4711))
        System.Console.WriteLine("Yes it certainly does!");
    var sum = r.Step(7.0).Take(3).Sum();
And guess what, it works like a charm! 4711 is part of positive infinity and sum is 42.0 and all is good.
Edit
There are also a couple of predefined constants. Thanks to Eric for pointing that out.
    var r = new Range<double>(7, System.Double.PositiveInfinity);
    var sum = r.Step(7.0).Take(3).Sum();
Counting the number of Google Readers
I run this blog on a 9-year-old laptop hidden in a cabinet in the living room. It's not a powerful machine but it has been up to the job since I turned it into a web server 7 years ago. This could maybe be one of the last HP Omnibook 4150b still in use; at least it has to be in a very exclusive club of laptops that have been switched on for the past 7.5 years. Recently I've seen an increase in traffic, especially from Feedfetcher-Google. It so happens that Feedfetcher also shows the number of subscribers.
[19/Oct/2009:22:01:19 +0200] "GET /xml/rss20/feed.xml HTTP/1.1" 304 0 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=7686756599804593322)"
The above is only one out of five different feed-ids because I have both atom and rss and for a short while this blog was at another address. The fifth feed is actually myself subscribing to the comments.
I'm not using FeedBurner so I can't get my statistics from there but I still wanted to be able to see the number of Google Readers of my blog (as far as I can see I only have one other type of subscriber).
Usually I script anything more advanced than a grep in Ruby but this time I made an exception and stayed in Bash.
    tail -1000 /www/logs/access.log |
      grep Feedfetcher |
      cut -d ";" -f 4 |
      sort -u |
      while IFS= read -r line
      do
        # quoted and matched as a fixed string, not as a regexp
        tac /www/logs/access.log | grep -m 1 -F "$line"
      done |
      sed 's/^.*html; \([0-9]*\) subscribers.*/\1/' |
      awk '{tot=tot+$1} END {print tot}'
Most certainly this can be optimized in a number of ways. Don't be shy, just tell me!
So what's going on there? Well, first I get the last 1000 rows from my access log; right now my traffic is so low that this is way more than I really need. Then I get all unique feed-ids from the rows containing Feedfetcher. I pipe those to a loop that gets the very last access for each one of them. Then I parse out the number of subscribers with a regexp in sed and sum them with awk.
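For comparison, here is roughly how the same count could look in Ruby. This is a sketch, not what runs on my server: it reads forward and keeps the last count seen per feed-id instead of searching backwards with tac. The first sample line is the log entry shown earlier; the other two, including the feed-id 1111 and the atom path, are made up for the example. Array#sum assumes Ruby 2.4 or later.

```ruby
# Sample access log lines. Only the first one is real; the rest are hypothetical.
log_lines = [
  '[19/Oct/2009:22:01:19 +0200] "GET /xml/rss20/feed.xml HTTP/1.1" 304 0 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 4 subscribers; feed-id=7686756599804593322)"',
  '[19/Oct/2009:22:05:00 +0200] "GET /xml/atom10/feed.xml HTTP/1.1" 304 0 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 2 subscribers; feed-id=1111)"',
  '[19/Oct/2009:23:01:19 +0200] "GET /xml/rss20/feed.xml HTTP/1.1" 304 0 "-" "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html; 5 subscribers; feed-id=7686756599804593322)"'
]

# Captures the subscriber count and the feed-id from a Feedfetcher user agent.
PATTERN = /Feedfetcher-Google;.*; (\d+) subscribers; feed-id=(\d+)/

# Later lines overwrite earlier ones, so each feed-id ends up with its
# most recent subscriber count.
latest = {}
log_lines.each do |line|
  next unless (match = PATTERN.match(line))
  latest[match[2]] = match[1].to_i
end

total = latest.values.sum
```

With the three sample lines, the first feed-id ends up with its later count of 5 and the made-up one with 2, so total is 7.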
It turns out that I have a whopping 15 subscribers and I am one of them.