每周源代码9 – WideFinder版本


[原文发表地址]The Weekly Source Code 9 – WideFinder Edition

[原文发表时间] 2007-10-23 05:24 AM

沿袭我一贯的要求阅读源代码以编出更好的程序,我为大家奉上第2篇每周系列“每周源代码”,之后还将源源不断地继续。这里是我这周在读的一些代码。

没错,我知道这篇博文距上一篇不过4天,但我实在是觉得很有趣,等不及到周三再发了。另外这对你来说也是新的一周。

上个月,Tim Bray分享了他写程序的经历,这里是Joe Cheng简洁的描述。Tim把项目叫做WideFinder。

WideFinder的挑战是要写一个能做到如下要求的程序:

  • 在博客文章上扫描日志文件获取点击率
  • 被计数
  • 并排序
  • 将前十篇最热门的标准输出它也应该
  • 要和Tim的Ruby版本一样优雅简洁
  • 其性能应与附加CPU内核相称

这得在250兆左右的相当大的日志文件上完成。虽然第六点是最有趣的,但很多同行们都关注于第五点。不管怎样,这是一系列非常有趣的问题,比FizzBuzz有趣得多,是你值得去研究的对象。

我建议你去查看下Tim的网站,他持续在博客上列出一些他觉得最有趣的源代码作为一个主要的C#程序员,我一直在寻求拓展我得心应手的领域,这些是我觉得有趣的东西,根据我认为得有趣程度给它们排了序。

Don Box C# 3.0之初体验 ——显然这种代码Don可以在两杯啤酒下肚之后写出来。注意yield的使用,让“将LINQ用于CR/LF strings文本文件”。这就是众多“不停不停写”的辅助函数中的一个,让我百思不得其解,为什么不把它包含进去。

         static void Main(string[] args)
         {
             var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");
             var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")
             let match = regex.Match(line)
             where match.Success
             let url = match.Value
             group url by url;
 
             var ordered = from g in grouped
             let count = g.Count()
             orderby count descending
             select new { Count = count, Key = g.Key };
       
             foreach (var item in ordered.Take(10))
             Console.WriteLine("{0}: {1}", item.Count, item.Key);
         }
 
        // LINQ-compatible streaming I/O helper
        public static IEnumerable<string> ReadLinesFromFile(string filename)
        {
            using (StreamReader reader = new StreamReader(filename)) {
            while (true)
            {
                string s = reader.ReadLine();
                if (s == null)
                break;
                yield return s;
            }
        }
    }

Joe Cheng将这个和他的LINQ技巧紧密联系起来,在一个扩充的foop中做了所有的排列组合。插播一下,我下周想看下他的QuickTimer类。他用了我最喜欢的C#关键字 – IDisposable/using。Joe 还提到了些很容易用PLINQ添加的并行。也许我们很快就能看到那样的代码。

using (new QuickTimer(“Total time”))
{
    IEnumerable<string> data = new LineReader(args[0]);

    Regex regex = new Regex(@”GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) “,
    RegexOptions.Compiled | RegexOptions.CultureInvariant);

    var result = from line in data
    let match = regex.Match(line)
    where match.Success
    group match by match.Groups[1].Value into grp
    orderby grp.Count() descending
    select new { Article = grp.Key, Count = grp.Count() };

    foreach (var v in result.Take(10))
    Console.WriteLine(“{0}: {1}”, v.Article, v.Count);
}

随着Jomo Fisher发表了F# WideFinder初体验一文,我的程序员大追踪也一如既往地继续。注意每个人都喜欢用“天真”来表示“我知道可以做得更好,所以仁慈些。”我不能很大公无私地告诉你我完全理解,但真的挺神奇的。

#light

open System.Text.RegularExpressions

open System.IO

open System.Text

 
let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)

let seqRead fileName =

    seq { use reader = new StreamReader(File.OpenRead(fileName))

          while not reader.EndOfStream do

              yield reader.ReadLine() }

let query fileName =

    seqRead fileName

    |> Seq.map (fun line -> regex.Match(line))

    |> Seq.filter (fun regMatch -> regMatch.Success)

    |> Seq.map (fun regMatch -> regMatch.Value)

    |> Seq.countBy (fun url -> url)

*And here‘s the code to call it:   

for result in query @"file.txt" do

    let url, count = result

•暂时摆脱微软语言一会,这里是Tim的Ruby实例:

counts = {}

counts.default = 0

ARGF.each_line do |line|

  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

    counts[$1] += 1

  end

end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }

keys_by_count[0 .. 9].each do |key|

  puts "#{counts[key]}: #{key}"

end

这是UnintentionalObjectRetention中的Java版本。我从1997年在Nike担任约聘员工之后就再也没有做过Java了:

public class WideFinder

{    
    public static void main(String[] args) throws IOException

    {     
        Map<String, Integer> counts = new HashMap<String, Integer>();     
        Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) "); 
        BufferedReader in = new BufferedReader(new InputStreamReader(          
        new FileInputStream(args[0]), "US-ASCII"));        
        String line = null;    
        while ((line = in.readLine()) != null)

        {      
            Matcher m = p.matcher(line);      
            if (m.find()) {           
            String key = m.group();    
            Integer currentCount = counts.get(key);  
            counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));    
            }

        }   
        in.close();     
        List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());      Collections.sort(results, new Comparator<Entry<String, Integer>>() {  
        public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)       
        {           
            return o2.getValue().compareTo(o1.getValue());    
        }    
        });  
        for(int i = 0; i < 10; i++) {   
            System.out.println(results.get(i));   
        }

    }

}

最后,但也是最重要的,这是Nate的LISP

(defun run (&rest logs)

  (let ((counts (make-hash-table :test #’equal)))

    (dolist (filename logs)

      (with-open-file (stream filename :direction :input

                              :external-format :latin-1)

        (loop for line = (read-line stream nil stream)

           until (eq line stream)

           do (cl-ppcre:register-groups-bind (match)

                  ("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) " line)

                (incf (gethash match counts 0))))))

    (loop for key being the hash-keys of counts

       collect key into keys

       finally (map nil #'(lambda (x)

                            (format t "~D: ~A~%" (gethash x counts) x))

                    (subseq (sort keys #’>

                                  :key #'(lambda (x) (gethash x counts))) 0 10)))))

一种很好的理解别的语言(程序或人类语言)的方法就是阅读一个故事的各种语言版本,并进行比较。Tim的问题就很好地达到了那个目的。

如果你想知道我们为什么用托管代码来编程,就去看看C版本吧

如果你找到其他还没被好好读过的比较酷的代码,欢迎你随时发链接给我。


Comments (0)

Skip to main content