When you want to assign a probability to a sequence of words you will run into the problem that longer sequences are very rare. People fight this problem by using smoothing techniques and by interpolating higher-order models (models over longer word sequences) with lower-order language models. While this idea is strong and helpful, it is usually applied in the same way: in order to use a shorter model, the first word of the sequence is omitted, and this is iterated. The problem occurs if one of the last words of the sequence is the really rare word. In that case omitting words at the front will not help.
So the simple trick of generalized language models is to smooth a sequence of n words with n-1 shorter models which skip a word at position 1 to n-1 respectively.
Then we combine everything with modified Kneser-Ney smoothing, just as it was done with the previous smoothing methods.
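To illustrate the skip patterns described above (this is only a small sketch with made-up variable names, not our actual implementation), the generalized lower-order patterns for the 4-gram "the quick brown fox" can be enumerated like this:
// Sketch: enumerate the generalized lower-order patterns of a 4-gram.
// Each word at positions 1 to n-1 is skipped in turn; the last word of
// the sequence is never skipped.
String[] ngram = {"the", "quick", "brown", "fox"};
for (int skip = 0; skip < ngram.length - 1; skip++) {
    StringBuilder pattern = new StringBuilder();
    for (int i = 0; i < ngram.length; i++) {
        pattern.append(i == skip ? "_" : ngram[i]);
        if (i < ngram.length - 1) pattern.append(" ");
    }
    System.out.println(pattern); // "_ quick brown fox", "the _ brown fox", "the quick _ fox"
}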
Language models have a huge variety of applications: spell checking, speech recognition, next-word prediction (autocompletion), machine translation, question answering, …
Most of these problems make use of a language model at some place. Creating language models with lower perplexity lets us hope to increase the performance of the above-mentioned applications.
The data sets come in the form of structured text corpora which we cleaned from markup and tokenized to generate word sequences.
We filtered the word tokens by removing all character sequences which did not contain any letter, digit or common punctuation marks.
Finally, the word token sequences were split into word sequences of length n, which provided the basis for the training and test sets for all algorithms.
Note that we did not perform case-folding nor did we apply stemming algorithms to normalize the word forms.
Also, we did our evaluation using case sensitive training and test data.
Additionally, we kept all tokens for named entities such as names of persons or places.
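Just to make the filtering rule from above concrete, here is a minimal sketch of such a token filter (the exact set of "common punctuation marks" is an assumption of this sketch, not taken from our code):
// Keep a token only if it contains at least one letter, digit or
// common punctuation mark (the punctuation set below is illustrative).
boolean keepToken(String token) {
    return token.matches(".*[\\p{L}\\p{Nd}.,;:!?'\"()-].*");
}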
All data sets have been randomly split into a training and a test set on a sentence level.
The training sets consist of 80% of the sentences, which have been used to derive n-grams, skip n-grams and corresponding continuation counts for values of n between 1 and 5.
Note that we have trained a prediction model for each data set individually.
From the remaining 20% of the sequences we have randomly sampled a separate set of 100,000 sequences of 5 words each.
These test sequences have also been shortened to sequences of length 3 and 4 and provide the basis for conducting our final experiments to evaluate the performance of the different algorithms.
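As a rough sketch of the training data preparation (variable names are illustrative and not taken from the released source code; the skip n-grams and continuation counts are extracted analogously), counting all n-grams of length 1 to 5 in one tokenized training sentence could look like this, assuming sentence holds the whitespace-separated tokens:
// Count all n-grams of length 1 to 5 in a single tokenized sentence.
Map<String, Integer> ngramCounts = new HashMap<String, Integer>();
String[] tokens = sentence.split(" ");
for (int n = 1; n <= 5; n++) {
    for (int start = 0; start + n <= tokens.length; start++) {
        StringBuilder ngram = new StringBuilder();
        for (int i = start; i < start + n; i++) {
            if (i > start) ngram.append(" ");
            ngram.append(tokens[i]);
        }
        Integer old = ngramCounts.get(ngram.toString());
        ngramCounts.put(ngram.toString(), old == null ? 1 : old + 1);
    }
}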
We trained the generalized language models on the same split of the training corpus as the standard language models with modified Kneser-Ney smoothing, and we also used the same set of test sequences for a direct comparison.
To ensure rigour and openness of research, you can download the data set for training as well as the test sequences, and you can also download the entire source code.
We compared the probabilities of our language model implementation (which is a subset of the generalized language model) using KN as well as MKN smoothing with those of the Kyoto Language Model Toolkit. Since we got the same results for small n and small data sets, we believe that our implementation is correct.
In a second experiment we have investigated the impact of the size of the training data set.
The wikipedia corpus consists of 1.7 bn. words.
Thus, the 80% split for training consists of 1.3 bn. words.
We have iteratively created smaller training sets by decreasing the split factor by an order of magnitude.
So we created an 8% / 92% split, a 0.8% / 99.2% split, and so on.
We stopped at the 0.008% / 99.992% split, as the training data set in this case consisted of fewer words than our 100k test sequences, which we still randomly sampled from the test data of each split.
Then we trained a generalized language model as well as a standard language model with modified Kneser-Ney smoothing on each of these samples of the training data.
Again we have evaluated these language models on the same random sample of 100,000 sequences as mentioned above.
We have used perplexity as the standard metric to evaluate our language models.
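For reference, this is the standard definition: given a test set of N word tokens, where a model assigns the probability P(w_i | h_i) to the i-th word given its history h_i, the perplexity is
\[ \mathrm{PP} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid h_i)} \]
so a lower perplexity means the model assigns a higher probability to the unseen test sequences.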
As a baseline for our generalized language model (GLM) we have trained standard language models using modified Kneser-Ney Smoothing (MKN).
These models have been trained for model lengths 3 to 5.
For unigram and bigram models MKN and GLM are identical.
The perplexity values for all data sets and various model orders can be seen in the next table.
In this table we also present the relative reduction of perplexity in comparison to the baseline.
With these improvements we will continue to evaluate other methods of generalization and also try to see whether the novel methodology works well in the applications of language models mentioned above. You can find more resources at the following links:
If you have questions, research ideas or want to collaborate on one of my ideas feel free to contact me.
The course was designed for 16 highly gifted high school students (11th and 12th grade). The level was supposed to be manageable for a second-year undergraduate student. Since our students came from different grades and schools, we were forced to sacrifice some course time to teach basic programming skills. Thus we could not cover all aspects of Web Science. Instead we focused on three main course objectives:
By the end of the course our students should…
All students were asked to prepare a talk and to read the book "Weaving the Web" by Sir Tim Berners-Lee before the summer school started. Ten of the talks covered the technical foundations, starting with binary numbers and going all the way up to the application layer and all the necessary protocols. This included the theoretical study of IP, TCP and HTTP as well as routing algorithms (BGP) and DNS. To ensure a better understanding, the students had to form groups and implement a simple web server and a web client that were able to process HTTP/1.0 GET requests during course time. This was done using the Java programming language and the socket classes from the Java API. These topics were covered in the first week of the course.
In the second half we focused on the ethics of the web. After each talk on an ethical topic, which was supposed to give an overview in about 20 minutes, we entered a two-hour group discussion. For example, for the discussion on net neutrality we knew the following interest groups from the overview talk: large internet providers, big web companies, small web companies, politicians and consumers. Students were randomly assigned to one of these groups. Within 10 minutes they had to prepare a list of arguments that would reflect the interests of their particular group as well as arguments they would expect from the other groups. While discussing the issue at a round table they had to find a good solution respecting the technical nature of the web and the interests of their group.
Even though the summer school is very competitive, participation is voluntary, so there can't be an exam or anything similar at the end. Also, all work had to be completed during the 50 hours of course time, without any homework assignments. We had three evaluation methods to ensure the comprehension of the course content.
1. Hacking Project: As already mentioned, the students implemented a web server and a web client during the first half of the course. Since they worked in groups of 2 or 3 students and were new to programming, we teachers helped them out, which gave us good feedback on whether or not the students had understood the content.
2. Oral presentation: After the middle of the course the students had to prepare and give a presentation for an interdisciplinary audience, i.e. the students from the other courses of the summer school, none of which covered any IT topics. We asked the students to create a theatre role-play of what happens when someone types www.wikipedia.org into a web browser and hits the enter key. The students placed routing tables on the seats of the audience, created TCP/IP packets (filled with candy that represented the time to live) and routed DNS requests as well as HTTP requests together with the TCP handshake around the audience in the classroom, demonstrating that the basic decentralized web architecture was understood by everyone in the course.
3. Paper Writing: During the last days of the course the students were expected to collectively prepare a 25-page documentation, meeting scientific standards, of what they had learned during the summer school. The process of creating this documentation was not only guided by us teachers but also gave us a nice feedback loop to see whether the goals of the course had been achieved.
Overall we can say that the concept of the course worked really well. Especially putting such a strong focus on the web architecture and actually letting the students implement protocols helped them gain a deeper understanding.
Paper1 <--[ref]--> Paper2
   |                  |
   | [author]         | [author]
   v                  v
Author1            Author2
For the benchmark we were trying to find co-authors, which is basically a friend-of-a-friend query following the author relationship (or a breadth-first search of depth 2).
As we know there are basically 3 ways of communicating with the neo4j Database:
Here you work on the node and relationship objects within Java. Once you have fixed an author node, formulating the query looks pretty much like this:
// Count all co-authors of the given author node using the core API.
int resCnt = 0;
for (Relationship rel : author.getRelationships(RelationshipTypes.AUTHOROF)) {
    Node paper = rel.getOtherNode(author);
    // every other author of this paper is a co-author
    for (Relationship coAuthorRel : paper.getRelationships(RelationshipTypes.AUTHOROF)) {
        Node coAuthor = coAuthorRel.getOtherNode(paper);
        // skip the author we started from
        if (coAuthor.getId() == author.getId()) continue;
        resCnt++;
    }
}
We see that the code can easily look very confusing (if the queries get more complicated). On the other hand one can easily combine several similar traversals into one big query, making readability worse but increasing performance.
The Traverser framework ships with the Java API and I really like the idea of it. I think it is really easy to understand the meaning of a query, and in my opinion it really helps to keep the code readable.
// The same co-author query with the Traverser framework: breadth first,
// following AUTHOROF relationships, stopping at depth 2.
int resCnt = 0;
TraversalDescription description = Traversal.description()
        .breadthFirst()
        .relationships(RelationshipTypes.AUTHOROF)
        .evaluator(Evaluators.atDepth(2))
        .uniqueness(Uniqueness.NONE);
for (Path p : description.traverse(author)) {
    Node coAuthor = p.endNode();
    resCnt++;
}
Especially if you have a lot of similar queries or queries that are refinements of other queries you can save them and extend them using the Traverser Framework. What a cool technique.
And then there is the Cypher query language, an interface pushed a lot by neo4j. If you look at the query you can totally understand why. It is a really beautiful language that is close to SQL (looking at Stack Overflow it is actually frightening how many people are trying to answer friend-of-a-friend queries using MySQL) but still emphasizes the graph-like structure.
// The same co-author query expressed in Cypher (1.x syntax).
int resCnt = 0;
ExecutionEngine engine = new ExecutionEngine(graphDB);
String query = "START author=node(" + author.getId() +
        ") MATCH author-[:" + RelationshipTypes.AUTHOROF.name() +
        "]-()-[:" + RelationshipTypes.AUTHOROF.name() +
        "]- coAuthor RETURN coAuthor";
ExecutionResult result = engine.execute(query);
// columnAs gives an iterator over the coAuthor column of the result
scala.collection.Iterator<Node> it = result.columnAs("coAuthor");
while (it.hasNext()) {
    Node coAuthor = it.next();
    resCnt++;
}
I was always wondering about the performance of this query language. Writing a query language is a very complex task, and the more expressive the language is, the harder it is to achieve good performance (the same holds true for SPARQL in the semantic web). And let's just point out that Cypher is quite expressive.
I was shocked, so I talked with Andres Taylor from neo4j, who mainly works on Cypher. He asked me which neo4j version I used and I said it was 1.7. He told me I should check out 1.9 since Cypher had become more performant. So I ran the benchmarks on neo4j 1.8 and neo4j 1.9, but unfortunately Cypher became slower in the newer neo4j releases.
Cypher is just over a year old. Since we are very constrained on developers, we have had to be very picky about what we work on; the focus in this first phase has been to explore the language, to learn about how our users use the query language, and to expand the feature set to a reasonable level.
I believe that Cypher is our future API. I know you can very easily outperform Cypher by handwriting queries. Like in every language ever created, in the beginning you can always do better than the compiler by writing by hand, but eventually the compiler catches up.
So far I have only been using the Java core API when working with neo4j, and I will continue to do so.
If you are in a high-speed scenario (and I believe every web application is one) you should really think about switching to the neo4j Java core API for writing your queries. It might not look as nice as Cypher or the Traverser framework, but the gain in speed pays off.
Also I personally like the amount of control that you have when traversing over the core yourself.
Additionally, I will soon post an article on why scripting languages like PHP, Python or Ruby are not suitable for building web applications anyway. So changing to the core API makes sense for several reasons.
The complete source code of the benchmark can be found at https://github.com/renepickhardt/related-work.net/blob/master/RelatedWork/src/net/relatedwork/server/neo4jHelper/benchmarks/FriendOfAFriendQueryBenchmark.java (commit: 0d73a2e6fc41177f3249f773f7e96278c1b56610)
The detailed results can be found in this spreadsheet.
In the last month I have created quite some content for my blog, and it will be published over the next weeks. So watch out for screencasts on how to create an autocompletion in GWT with neo4j, how to create n-grams from Wikipedia, thoughts and techniques for related work, and research ideas and questions that we found but probably do not have the time to work on.
First of all I think it is really important to understand that in China everything is related to your relations (http://en.wikipedia.org/wiki/Guanxi). A Chinese business card will always name a few of your best and strongest contacts. This is more important than your address, for example. When a conference starts, people exchange name cards before they sit down and discuss.
This principle of Guanxi is also reflected in the style in which presentations are made. Here are some basic rules:
My way of respecting these principles:
The second thing is that in China the concept of family is very important. As a rule of thumb I would say that if you want to do business with someone in China and you haven't been introduced to their family, things are not going to go the way you might expect.
For this reason I included some slides with a world map, zooming further down to the place where I was born, where I studied and where my parents still live!
When I chose a world map I did not only take one in Chinese, I also took one where China was centered. In my contact data I also put Chinese social networks. Remember that Twitter, Facebook and many other sites are blocked in China. So if you really want to communicate with Chinese people, why not get a QQ number or a Weibo account?
You have seen this at conferences many times: Chinese people just put a heck of a lot of stuff on a slide. I strongly believe this is due to the fact that reading and recognizing Chinese characters is much faster than reading Western characters. So if your presentation is in Chinese, don't be afraid to stuff your slides with information. I have seen many talks by Chinese speakers who were literally reading word by word what was written on the slides. Whereas in Western countries this is considered bad practice, in China it is all right.
Speaking of language: of course, if you know some Chinese it shows respect if you at least try to include some of it. I split my presentation into two parts, one in Chinese and one in English.
So in my case I included the fact that we have PhD positions and scholarships open, and that our institute is really international with English as the working language. Of course I also included some slides about my past and current research, like Graphity and Typology.
In China it is not rude at all if one's cell phone rings and one has more important things to do. As the presenter you should switch off your phone, but you should not be disturbed or annoyed if people in the audience receive phone calls and leave the room to take care of that business. This is very common in China.
I am sure there are many more rules on how to give a presentation in China, and maybe I even made some mistakes in my own presentation, but at least I have the feeling that the reaction was quite positive. So if you have questions, suggestions or feedback, feel free to drop me a line; I am more than happy to discuss cultural topics!
Even though this development is very good to see, I am not happy about how the discussion is now going on about models for fulfilling the goals of the Royal Society.
Neelie Kroes from the European Commission posted a really nice answer!
I am glad to see this step forward. After my successful submission of Graphity and after reading the IEEE copyright form which I had to sign, I really did have concerns about publishing my work with them.
I am still considering not submitting to big journals and conferences anymore and instead just publishing on my university's website, on my blog and/or on open preprint archives.
While I was wondering about possible research topics, many people told me that running a social network like metalcon and seeing so much data would be a treasure trove of research questions. Even though at that time our goal was just to keep metalcon running in its current state, I started thinking about what research questions could be asked. For most questions we just did not collect good data. In this sense I realized that if metalcon should really give rise to research questions, I would have to reimplement big parts of it.
Since metalcon also had the problem of slow page loading times, I decided to recode the project efficiently. I knew that running metalcon on a graph database like neo4j would be a good idea. That is why I started wondering about how to implement a news feed system efficiently and in a scalable manner using a graph database.
==> A research question was formulated.
After 3 months of hard thinking and trying out different database designs I had a meeting with my potential supervisor Prof. Dr. Steffen Staab. He would have loved to see some preliminary results. While summing up everything for the meeting, the solution, or rather the idea of Graphity, was suddenly clear before my eyes.
In late August I talked about the ideas in the Oberseminar and I presented a poster at the European Summer School in Information Retrieval. I wondered where to submit this.
Steffen thought that the idea was pretty smart and could be published in a top journal like VLDB, which he suggested I choose. He said it would be my first scientific work, and VLDB has a very short reviewing cycle of less than one month that could provide me with fast feedback. A piece of advice which I did not understand and also did not follow. Well, sometimes you have to learn the hard way…
Being confident about the quality of the index, also due to Steffen's positive feedback and plans for VLDB, my plan was rather to submit to the WWW conference. I thought social networks would be a relevant topic for this conference. Steffen agreed but pointed out that he was a program chair of that conference, which would mean that submissions from his own institute would not be the best idea.
Even though Graphity was not implemented or evaluated yet and no related work had been read, we decided that the paper should be submitted to the SIGMOD conference, with a deadline only two months away.
By the way, having these results and starting this hard work got me funding for three years, starting in October 2012!
After two months of very hard work, especially together with Jonas Kunze, a physicist from metalcon from whom I really learnt a lot about setting up software and evaluation frameworks, the paper was submitted to SIGMOD. Meanwhile I tried to get an overview of the related work (which at that time was kind of the hardest part), since I really did not know how to start: my research question came from a very practical and relevant use case and had not emerged from reading many papers.
Anyway, we finished our submission just in time, together with all the experiments and a text about Graphity that I would still call decent. (You can have the final copy of the SIGMOD submission on request.)
I talked to my now-supervisor Steffen and asked him what he thought about blogging the results of Graphity already, trying to create some awareness in the community. He was a little skeptical, since blogs cannot really be cited and the paper was not published yet. I argued that blogs are becoming more and more important in research. I said that even if a blog post is not a publication in the scientific sense, it is still a form of publication and gains visibility. Steffen agreed to test this out, and I have to say the idea to do this was perfect. I put the core results on my blog and received very positive feedback from people working at LinkedIn, Microsoft and from other hackers. I also realized that the problem was currently unpublished/unsolved and relevant to many people.
Interestingly, one of my co-authors tried to persuade me to publish the SIGMOD version of the paper as a technical report. I did not get the point of doing so (he said something like "then it is officially published and no one can steal the idea…"; up to today I don't get the point of doing this other than manipulating one's official publication count…).
After all the strong feedback on my blog and via mail, the reviews from the SIGMOD conference were quite disappointing. Every single reviewer highly valued the idea and the relevance of the problem, but they criticized the evaluation and the related work.
The most disappointing of all the feedback was the following quote:
W1: I cannot find any strong and novel technical contribution in their algorithms. Their proposed method is too simple and straightforward. It is simply an application of using doubly linked list to graph structures.
To give an answer to this at least once: yes, the algorithm is simple and only based on doubly linked lists. BUT it is in the best complexity class one can imagine for the problem, it scales perfectly, and this was proven. How can one throw away an answer because it is too simple if it is the best possible answer? I sense that this is one of the biggest differences between a mathematician and a computer scientist. In particular, the same reviewer pointed out that the problem was relevant and important.
Having this feedback we came to the conclusion that the database community might have been the wrong community for the problem. Again, we would have figured this out much faster if we had submitted to VLDB as Steffen had suggested.
The submission deadline for the Hypertext conference was only one week after the feedback. Since one week was not really much time, we decided not to implement any of the feedback and just check what another community would say. My supervisor Steffen was not really convinced of this idea, but we argued that the notification of the Hypertext conference was only one month later and this quick additional feedback might be of help.
The reviews from the Hypertext conference were rather strange. One strong accept, but the reviewer said he was not an expert in the field. One strong reject from a person who did not understand the correctness of our top-k n-way merge (which was not even core to the paper). And a borderline from one reviewer who really had some interesting comments on our terminology and the related work.
Overall the paper was accepted as a poster presentation, and we decided that this was not sufficient for our work, so we withdrew our submission.
Discussions arose about which conference we should target next. We were sure that the style of the evaluation was nothing for the database community. On the other hand, graph databases alone would not be sufficient for semantic conferences, and social is such a hype right now that we really were not sure.
Steffen would have liked to submit the paper to ISWC, since this is an important community for our institute. I suggested SocialCom, since the core of our paper really lies in social networking. All reviews so far had valued the problem as important and relevant for social networks, and people basically liked the idea.
We tried to find related work for the baselines we used and figured out that for our strongest baseline we could not find any related work. Three days before the submission to SocialCom, Steffen argued that we should drop the name of the paper and not call it Graphity anymore. He said that we should just sell the work as two new indices for social news streams (which I thought was kind of reasonable, since the baseline to Graphity really makes sense to use and was not presented in any related work). Also, we could enhance the paper with some feedback from my blog and the neo4j mailing list. The only downside of this strategy was that changing the title and the storyline would mean a complete rewrite of the paper, something I was not too keen on doing within three days. My co-author and I were convinced that we should stick to our changes from working in the feedback and to our argument for the baseline without related work.
We talked to Steffen again. He said he left the final decision to us but would strongly recommend changing the storyline of the paper and dropping the name. Even though I was not 100% convinced, and my co-author also did not want to rewrite the paper, I decided to follow Steffen's experience. We rewrote the storyline.
The paper was accepted as a full paper at SocialCom, so following Steffen's advice was a good idea. Also, not being discouraged by bad reviews from the wrong community and staying convinced that the work had some strong results turned out to be a good thing. I am sure that posting preliminary results on the blog was really helpful: in this way we could integrate openly accessible feedback from the community into our third submission. I am excited to see how much I will have learnt from this experience for later submissions.
I will publish the paper on the blog as soon as the camera-ready version is done! Subscribe to my newsletter, RSS or Twitter if you don't want to miss the final version of the paper!
This chapter is split into several sections and is supposed to give some motivation. It already demonstrates in a good way that in order to understand natural language processing you really have to bridge the gap between mathematics and computer science on the one hand and linguistics on the other hand. Even in the basic examples given from linguistics there was some notation that I did not understand right away. In this sense I am really looking forward to reading chapter 3 which is supposed to give a rich overview of all the concepts needed from linguistics.
I personally found the motivating section of chapter 1 also too long. I am about to learn some concepts and right now I don’t really have the feeling that I need to understand all the philosophical discussions about grammar, syntax and semantics.
What I really loved on the other hand was the last section, "dirty hands". In this section a small corpus (Tom Sawyer) was used to introduce some of the phenomena that one faces in natural language processing. Some of these I had already discussed, without knowing it, in the article about text mining for linguists on Ulysses. In the book they are of course discussed in a more structured way, but one can easily download the source code from the above-mentioned article and play around with it to understand the basic concepts from the book. Among these were:
Word Counts / Tokens / Types: The most basic operation one can do on a text is counting words. This is something that I already did in the Ulysses article. Counting words is interesting since in today's world it can be automated. What I didn't see in my last blog post is that counting words already leads to some more insights than just a distribution of words, which I will discuss now:
Distinctive words can be spotted. Once I have a corpus consisting of many different texts, I can count all words and create a ranking of the most frequent words. I will realize that for any given text the ranking looks quite similar to that global ranking. But every once in a while I might spot some words in the top 100 most frequent words of a single text that would not appear (let's say) in the top 200 of the global ranking. Those words seem to be distinctive words that are of particular interest for the current text. In the example of Tom Sawyer, "Tom" is such a distinctive word.
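A small sketch of this idea (the two helper methods and the keyword choice are hypothetical, only meant to illustrate the comparison of a local with a global ranking):
// Words that are very frequent in this text but not globally frequent
// are candidates for distinctive words (like "Tom" in Tom Sawyer).
List<String> localTop100 = topKWordsByFrequency(text, 100); // hypothetical helper
Set<String> globalTop200 = loadGlobalTopWords(200);         // hypothetical helper
for (String word : localTop100) {
    if (!globalTop200.contains(word)) {
        System.out.println("distinctive word candidate: " + word);
    }
}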
Hapax Legomena: If one looks at all the words in a given text, one will realize that most words occur fewer than three times, and a word that occurs only once is called a hapax legomenon. This demonstrates the difficulty of natural language processing: the data one analyses is very sparse. The frequent words are most of the time grammatical function words, whereas the infrequent words carry the semantics. From this phenomenon one gets very quickly to:
Zipf's law: Roughly speaking, Zipf's law says that when you count the word frequencies of a text and order the words from the most frequent to the least frequent, you get a table in which you can multiply the rank of a word with its frequency and will always get about the same number (that is, the frequency of a word is inversely proportional to its rank). This is of course only an approximation: just imagine that the most frequent word occurs an odd number of times; then there is no integer frequency for the word at rank 2 that, multiplied by 2, gives exactly the frequency of the most frequent word.
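Written as a formula, Zipf's law says that the frequency f of the word at rank r is roughly
\[ f(r) \approx \frac{C}{r} \quad\text{or equivalently}\quad r \cdot f(r) \approx C \]
for some constant C that depends on the text.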
Anyway, Zipf's law was a very important discovery and has been generalized by Mandelbrot's law (which I so far only knew from chaos theory and fractals). Maybe sometime in the near future I will find some time to calculate the word frequencies of my blog and see whether Zipf's law holds (:
Collocation / Bigrams: Another important concept is that of collocation. Many words only have a meaning together; in this sense "New York" or "United States" are more than the sum of the single words. The text pointed out that it is not sufficient to find the most frequent bigrams in order to find good collocations. Those bigrams have to be filtered, either with grammatical structures or by normalization. I think calculating a Jaccard coefficient might be interesting (even though it was not mentioned in the text). Should I really try to verify Zipf's law in the near future, I will also try to verify my method for calculating collocations. I hope that I would find collocations in my blog like "social network", "graph database" or "news stream"…
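To be explicit about what I mean by that (this is my own idea, not something from the book): one common way to apply the Jaccard coefficient to a bigram w1 w2 is
\[ J(w_1, w_2) = \frac{c(w_1 w_2)}{c(w_1) + c(w_2) - c(w_1 w_2)} \]
where c(w1 w2) is the frequency of the bigram and c(w1), c(w2) are the frequencies of the single words; the score is close to 1 only if the two words occur almost exclusively together.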
KWIC: Something I did not have in mind so far for the analysis of text is keyword-in-context (KWIC) analysis. What happens here is that you look at all text snippets that occur within a certain window around a keyword. This seems more like work from linguistics, but I think automating this task would also be useful in natural language processing. So far it never came to my mind when using a computer system that it would also make sense to describe words by their context. Actually pretty funny, since this is the most natural operation we do when learning a new language.
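A minimal KWIC sketch (the window size and the example keyword are made up for illustration, and text is assumed to hold the raw corpus):
// Print every occurrence of the keyword together with a window of
// 3 words of context on each side (keyword in context).
String[] tokens = text.split("\\s+");
int window = 3;
for (int i = 0; i < tokens.length; i++) {
    if (!tokens[i].equalsIgnoreCase("whitewash")) continue; // example keyword
    int from = Math.max(0, i - window);
    int to = Math.min(tokens.length, i + window + 1);
    StringBuilder line = new StringBuilder();
    for (int j = from; j < to; j++) {
        if (j > from) line.append(" ");
        line.append(tokens[j]);
    }
    System.out.println(line);
}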
Exercises
I really liked the exercises in the book. Some of them were really straightforward, but I had the feeling they were carefully chosen to demonstrate some of the information given in the book so far.
What I was missing in all these basic word statistics is the order of the words. This was somewhat reflected in the collocation problem. Even though I am not an expert yet, I have the feeling that most methods in statistical natural language processing seem to "forget" the order in which the words appear in the text. I am convinced that this is an important piece of information, which already inspired me in my diploma thesis to create a method similar to explicit semantic analysis, and which is a core element in Typology!
Anyway, reading the first chapter of the book I did not really learn anything new, but it helped me to take on a certain point of view. I am really excited to proceed. The next chapter will be about probability theory. I already saw that it is written in a pure math style, with examples like rolling dice rather than examples from corpus linguistics, which I find sad.