Using a Java Framework with F#: The Stanford Parser for NLP

02/05/2013

We like to say "F# loves R", because we can use R packages from F#, through an R type provider for F#.

We like to say "F# loves TypeScript", because we can use TypeScript Interface Definition Files from F#, through a TypeScript type provider for F#. This applies when compiling F# to JavaScript through FunScript or WebSharper.

But what about Java and JVM-based packages? Well, believe it or not, F# and JVM-based packages can work surprisingly well together, particularly through IKVM. This magical piece of software allows you to use many of those lovely, cross-platform Java frameworks out there as part of your F# applications. Sergey Tihon has been exploring this by using F# with the Stanford Parser, one of the Stanford NLP Group's statistical NLP toolkits for major computational linguistics problems. Sergey's step-by-step guide for using this JVM-based framework with F# is copied below, and please follow his blog too.

Of course, it could also be of interest to write an F# type provider for JAR packages. This could execute IKVM "behind the scenes".

Enjoy!

Don

p.s. I know that some other language-interop puzzles on the F# community's radar are F#-to-Python, F#-to-Matlab and F#-to-Mathematica. Perhaps some intrepid people at Facebook will even write F#-to-PHP one day :-)

IKVM overview.

IKVM is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

A Java Virtual Machine implemented in .NET

A .NET implementation of the Java class libraries

Tools that enable Java and .NET interoperability

Read more about what you can do with IKVM.NET.

About Stanford NLP

The Stanford NLP Group makes parts of our Natural Language Processing software available to the public. These are statistical NLP toolkits for various major computational linguistics problems. They can be incorporated into applications with human language technology needs.

All the software we distribute is written in Java. All recent distributions require Sun/Oracle JDK 1.5+. Distribution packages include components for command-line invocation, jar files, a Java API, and source code.

IKVM .jar to .dll compilation

First of all, we need to download and install IKVM.NET. You can get it from SourceForge. The next step is to download the Stanford Parser (the latest version at the time of writing is 2.0.4, from 2012-11-12). Now we need to compile stanford-parser.jar to a .NET assembly. You can do it with the following command:

ikvmc.exe stanford-parser.jar

...

Let’s play!

That’s all! Now we are ready to start playing with the Stanford Parser. Here is one of the standard examples (ParserDemo.fs); the second one is available on GitHub with the other sources.

let demoAPI (lp:LexicalizedParser) =
    // This option shows parsing a list of correctly tokenized words
    let sent = [|"This"; "is"; "an"; "easy"; "sentence"; "." |]
    let rawWords = Sentence.toCoreLabelList(sent)
    let parse = lp.apply(rawWords)
    parse.pennPrint()

    // This option shows loading and using an explicit tokenizer
    let sent2 = "This is another sentence."
    let tokenizerFactory = PTBTokenizer.factory(CoreLabelTokenFactory(), "")
    use sent2Reader = new StringReader(sent2)
    let rawWords2 = tokenizerFactory.getTokenizer(sent2Reader).tokenize()
    // Shadows the earlier 'parse': everything below describes the second sentence
    let parse = lp.apply(rawWords2)

    let tlp = PennTreebankLanguagePack()
    let gsf = tlp.grammaticalStructureFactory()
    let gs = gsf.newGrammaticalStructure(parse)
    let tdl = gs.typedDependenciesCCprocessed()
    printfn "\n%O\n" tdl

    let tp = new TreePrint("penn,typedDependenciesCollapsed")
    tp.printTree(parse)

let main fileName =
    let lp = LexicalizedParser.loadModel(@"..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz")
    match fileName with
    // demoDP is the second example, which is not shown in this listing
    | Some(file) -> demoDP lp file
    | None -> demoAPI lp

What are we doing here? First of all, we instantiate LexicalizedParser and initialize it with the englishPCFG.ser.gz model. Then we create two sentences. The first is built from an already tokenized string (a string array, in this sample). The second is built from a plain string using PTBTokenizer. After that we create a lexical parser that is trained on the Penn Treebank corpus. Finally, we parse our sentences using this parser. The resulting output is shown below.
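
The listing appears to start partway through ParserDemo.fs, so the assembly references and namespace imports it depends on are not shown. As a rough sketch only, assuming an F# script that sits next to the compiled stanford-parser.dll and the IKVM runtime assemblies, the preamble could look something like the following. The relative paths are placeholders, and further IKVM.OpenJDK.* assemblies may turn out to be required. The namespaces are simply the Java package names of the classes used in the listing.

```fsharp
// Assumed locations: adjust to wherever ikvmc wrote stanford-parser.dll
// and wherever the IKVM binaries were unpacked.
#r "IKVM.Runtime.dll"
#r "IKVM.OpenJDK.Core.dll"
#r "stanford-parser.dll"

open java.io                             // StringReader
open edu.stanford.nlp.ling               // Sentence, CoreLabel
open edu.stanford.nlp.``process``        // PTBTokenizer, CoreLabelTokenFactory
                                         // (backticks because 'process' is reserved in F#)
open edu.stanford.nlp.trees              // PennTreebankLanguagePack, TreePrint
open edu.stanford.nlp.parser.lexparser   // LexicalizedParser
```

The #r directives apply to F# scripts only. In a compiled project the same assemblies would be added as ordinary project references, and only the open declarations would remain in the source file.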

[|"1"|]
Loading parser from serialized file ..\..\..\..\StanfordNLPLibraries\stanford-parser\stanford-parser-2.0.4-models\englishPCFG.ser.gz ... done [1.5 sec].
(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT an) (JJ easy) (NN sentence))) (. .)))

[nsubj(sentence-4, This-1), cop(sentence-4, is-2), det(sentence-4, another-3), root(ROOT-0, sentence-4)]

(ROOT (S (NP (DT This)) (VP (VBZ is) (NP (DT another) (NN sentence))) (. .)))

nsubj(sentence-4, This-1)
cop(sentence-4, is-2)
det(sentence-4, another-3)
root(ROOT-0, sentence-4)
