graph – Data Science, Data Analytics and Machine Learning Consulting in Koblenz Germany
https://www.rene-pickhardt.de

Typology using neo4j wins 2 awards at the German federal competition young scientists
https://www.rene-pickhardt.de/typology-using-neo4j-wins-2-awards-at-the-german-federal-competition-young-scientists/
Mon, 21 May 2012

Two days ago I arrived in Erfurt to attend the German federal young scientists competition (Jugend Forscht). I have reported before on the project Typology by Till Speicher and Paul Wagner, which I supervised over the last half year and which has already won many awards.
On Saturday night they won a special award donated by the Gesellschaft für Informatik, titled “special award for a contribution which particularly demonstrates the usefulness of computer science for society” (Sonderpreis für eine Arbeit, die in besonderer Art und Weise den Nutzen der Informatik verdeutlicht). The award came with 1500 Euro in cash!
Yesterday was the final award ceremony. I was quite excited to see how the jury would evaluate the hard work that Till and Paul put into their research project. Out of 457 submissions in a tough competition, the top 5 projects received awards. Till and Paul came in 4th and will now be invited to visit the German chancellor Angela Merkel in Berlin.
Using spreading activation, Typology is able to make precise predictions of what you are going to type next on your smartphone. It outperforms the current scientific standard (language models) by more than 100% and has a precision of 67%!
A demo of the project can be found at www.typology.de
The Android app of this system is available in the app store. It currently supports only German text and is in beta. The source code is open, and anybody who wants to contribute is welcome to check out the code. A mailing list will be set up soon; anyone who is interested can already drop a short message to mail@typology.de and will be added to the mailing list as soon as it is established.
You can also look at the documentation, which will soon be available in English. Alternatively, you can report bugs and request features in our bug tracker.
I am currently trying to improve the data structures and the database design in order to make retrieval of suggestions faster and reduce the computational load on our web server. If you have expertise in top-k aggregation joins in combination with prefix filtering, drop me a line and we can discuss this issue!
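
To make the retrieval problem concrete, here is a minimal sketch of top-k aggregation with prefix filtering. It is not Typology's actual implementation: the function name, the idea of one score list per context position and all numbers are assumptions made for illustration. It is also the naive baseline that scans every candidate, which is exactly the cost a smarter top-k join would try to avoid.

```python
import heapq
from collections import defaultdict


def top_k_suggestions(score_lists, prefix, k=5):
    """Sum each candidate's scores over all lists, keep only words that
    start with `prefix`, and return the k best (word, score) pairs."""
    totals = defaultdict(float)
    for scores in score_lists:
        for word, score in scores.items():
            # Prefix filtering: drop candidates the user can no longer mean.
            if word.startswith(prefix):
                totals[word] += score
    # Highest score first; ties are broken alphabetically to stay deterministic.
    return heapq.nsmallest(k, totals.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical score lists, e.g. one per preceding word (made-up numbers).
example_lists = [
    {"haus": 3.0, "hand": 2.0, "baum": 5.0},
    {"haus": 1.0, "hund": 2.5, "hand": 0.5},
]
print(top_k_suggestions(example_lists, "h", k=2))
```

Every typed letter lengthens the prefix and shrinks the candidate set, so the interesting question is how to index the score lists so that the scan over all words becomes unnecessary.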
Happy winners Paul Wagner (left) and Till Speicher (right) with me

Question of the Day: How the hell do we reach more people?
https://www.rene-pickhardt.de/question-of-the-day-how-the-hell-do-we-reach-more-people/
Sun, 19 Feb 2012

Recently I received an email from a musician, who wishes to stay unnamed, telling me that many people out there love his music but it just hasn’t spread very far. His basic question is: how can his band reach more people on the web, especially with regard to an upcoming video?
His promoter suggested something like:

  1. You should send a press release to all music related websites
  2. You should show the videos to your friends and ask them to reshare it with their friends
  3. You should run a raffle around the video

Before I start with my thoughts: in this blog article I will explain some fundamental concepts of online marketing. This is interesting not only for musicians but for any brand! The concepts of branding will be questioned hard.

The question asked is fundamental for online (music) marketing

The observation of this musician is quite right and holds true for many musicians, but also for other products. Actually it is a nice application of the Pareto principle (8.9% of bands make up 91.1% of plays on Last.fm). And the problem exists not only on the web. Once you start a new thing with a lot of enthusiasm, you will almost certainly come to realize that no one really cares, despite the fact that your product / music is good and you receive some nice feedback. I looked at some of the social media response for this particular artist and found that he really is in the lucky position of receiving a lot of great feedback and seeing that quite a lot of people care. Still, it is hard for him to increase his reach and inform people about this very product, that is, to introduce the music to them.

Problem of overcoming the borders of one’s own social circle

The entire problem boils down to overcoming one’s own social circle. If you understand that the entire world is a network (it is really hard to fully understand all the implications of this), you understand that the big problem in (online) marketing is the following:
You have your ego network – or let’s call it your social circle – of people that you can reach with an idea. You will realize that within your social circle some ties are stronger and some are weaker. Some ties are so strong that they might even help you out (for example because they are your record label, promoter, booker or die-hard fans). But in the end they can all be combined into one node in the network that has a (somewhat bigger) social circle. That social circle will look like the following pictures, which I took from Marc Smith’s blog post about social networks in the news (he was a tutor of mine at the Webscie summer school 2011).
Ego network and social circle
In this series of pictures you can actually see how weak your social circle is (and again, for a music band the circle might be bigger, but the overall picture will look similar):
Rest of an ego network
Even though you can reach anyone in our worldwide friendship network with only 6 people in between, your social circle will only reach a few ten thousand people! So the question is: “How do you get the other 7 billion?”
To make this even clearer: as a b(r)and you might “know” an amazing number of, let us say, 3500 fans (maybe you reach even more) and become a “rather” central node in the social graph of people and brands. But let us see why this is not of any help!

  1. You know 3500 fans.
  2. Those fans already have 410,000 (almost half a million) people in their ego networks, which is probably what makes the use of Facebook so tempting.
  3. Your fans’ friends altogether know 51 million people.
  4. Finally, these people that you already reach over three hops will know the entire rest of the world, i.e. the other 7 billion people.

Do you really think your music / product / message / idea is so great that your friends will tell their friends who will tell their friends who will tell the others? That is really all it takes! 4 hops. Sorry to disappoint you. This is almost impossible to achieve!
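
The numbers above can be turned into a small back-of-the-envelope calculation. The sketch below uses only the four audience sizes from the list; the reshare model is my own crude assumption (a fixed share of the people reached at each hop passes the message on, coverage is proportional, overlaps are ignored), and the 1% rate is made up.

```python
# Audience sizes per hop, taken from the list above.
HOP_SIZES = [3_500, 410_000, 51_000_000, 7_000_000_000]


def growth_factors(sizes):
    """How many times bigger each hop is than the one before it."""
    return [later / earlier for earlier, later in zip(sizes, sizes[1:])]


def expected_reach(sizes, reshare_rate):
    """People reached per hop if only `reshare_rate` of the people reached
    at one hop pass the message on to the next (assumed toy model)."""
    return [size * reshare_rate ** hop for hop, size in enumerate(sizes)]


print([round(factor) for factor in growth_factors(HOP_SIZES)])
print([round(people) for people in expected_reach(HOP_SIZES, 0.01)])
```

Each hop multiplies the potential audience by a bit more than a hundred, but under this model a reshare rate of 1% eats up almost all of that growth: the fourth hop then reaches about 7,000 people instead of 7 billion.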

The classic answer before the viral web was born: Advertising

Well, the problem existed long before the web was born, and people had already figured out a solution: you ask someone who has a huge reach to tell his audience about your new idea! Less elegantly, we can summarize this idea in one word: “advertising”. You go to any media outlet, pay them some money, and they will talk about you. This worked amazingly well in earlier days. There wasn’t that much media, and if you had your product in some media you could be certain that you had at least some brand awareness. But there is a limitation of advertising…

A Problem of advertising and some advantages and disadvantages of online advertising

Of course, once you run advertising campaigns you will receive some attention, no matter how good or bad your product is. That was the principle during the dot-com bubble, when many start-ups spent far too many advertising dollars to build reach instead of focusing on a great product. This is the reason why every ad campaign should carefully measure several things (fortunately, in online marketing this is easier than anywhere else!):

  • Conversion rate – the percentage of people that convert (to fans, customers, email addresses, … basically whatever goal you have for the campaign)
  • Bounce rate – the share of people that clicked your ad but left your page / product right away
  • The price you pay per click (or per 1 thousand impressions)
  • The value to you for a new fan (customer).

Once you know these numbers you can easily calculate whether an ad campaign is useful for you or not! And you can obtain those numbers easily. The first three are given or measured during the ad campaign. If you don’t know how to do this, contact me. The last one has its own paragraph.
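
As a rough sketch of that calculation, the snippet below derives the rates from raw campaign counts and compares the cost of winning one fan with what a fan is worth. The class, its field names and the example numbers are all hypothetical; here the conversion rate is taken relative to all clicks, which is one of several possible definitions.

```python
from dataclasses import dataclass


@dataclass
class Campaign:
    clicks: int            # people who clicked the ad
    bounces: int           # clicks that left the page right away
    conversions: int       # clicks that became fans / customers
    cost_per_click: float  # in Euro

    @property
    def conversion_rate(self) -> float:
        return self.conversions / self.clicks

    @property
    def bounce_rate(self) -> float:
        return self.bounces / self.clicks

    @property
    def cost_per_fan(self) -> float:
        # Total ad spend divided by the number of fans it produced.
        if self.conversions == 0:
            return float("inf")
        return self.clicks * self.cost_per_click / self.conversions


def is_worthwhile(campaign: Campaign, value_per_fan: float) -> bool:
    """A campaign pays off if a new fan is worth more than it costs to win one."""
    return value_per_fan > campaign.cost_per_fan


# Made-up example campaign.
example = Campaign(clicks=1000, bounces=600, conversions=50, cost_per_click=0.20)
print(example.conversion_rate, example.bounce_rate, example.cost_per_fan)
print(is_worthwhile(example, value_per_fan=6.0))
```

With these made-up numbers a new fan costs about 4 Euro, so the campaign pays off as long as a fan is worth more than that to you.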

What is the value of one new fan?

To estimate the value of a new fan you could do the following. Count the number of fans you have (e.g. size of your newsletter / Facebook fans …) and then look at your revenue of the last year (sum up: merch + concerts + sold music). Divide this revenue by the number of fans and you see how much revenue one fan produces in one year. I know it is only a rough estimate, but I wonder how many bands have calculated this number! Look at other products and markets. Everyone who is in the business of direct marketing maintains a customer database, which is a solid asset to the business. Groupon, for example, gives away 6 Euro for every new customer someone refers. (This means that Groupon thinks a new contact is worth at least 6 Euro.)
As a musician you should know this number. Assume a new fan is worth 6 Euro to you: this would mean that knowing 5 thousand fans is worth 30,000 Euro, as much as the production of an entire record. Could this be true? I will try to start some calculations together with my colleagues from In Legend soon and work out how much we should spend to gain a new fan!
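
The estimate described above is simple enough to write down directly. In this sketch the function names are mine and the revenue figures are invented so that the result matches the 6 Euro used in the text; only the 5 thousand fans and the 6 Euro per fan come from the paragraph above.

```python
def revenue_per_fan(merch, concerts, music_sales, fans):
    """Last year's revenue divided by the number of known fans (rough estimate)."""
    return (merch + concerts + music_sales) / fans


def fan_base_value(fans, value_per_fan):
    """What a fan base is worth if every fan is valued the same."""
    return fans * value_per_fan


# Made-up yearly revenue in Euro for a band with 3500 known fans.
print(revenue_per_fan(merch=8_000, concerts=10_000, music_sales=3_000, fans=3_500))

# The example from the text: 5 thousand fans at 6 Euro each.
print(fan_base_value(5_000, 6))
```

The result is also an upper bound for advertising: paying more per new fan than one fan brings in is hard to justify unless fans stay with you for several years.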

Problem of aggressive marketing vs spreading of new information

On the web the problem is that people have so much choice that advertising only helps to some degree. With Google AdWords, for example, you only pay for clicks, so you definitely obtain the attention. On the other hand, people drift away very quickly and their attention span is very short. Another drawback is that many things on the web are free. Even though advertising music on the web is so much cheaper than offline advertising, it is not yet standard practice. Record labels and promoters so far refuse to invest money in it. That is sad, because there is a lot of potential in online advertising (especially once you measure your conversion and bounce rates as well as the click-through rates).
Luckily such feedback can already be measured before you run a paid ad campaign. Money to increase reach should therefore never be invested in a completely new product. A completely new product should be tested first: you measure the feedback, and once you see that people are happy you are safe to pay money to increase your reach. In In Legend’s case, for example, the YouTube video for Pandemonium received much better feedback than the Vortex video, and so far the Soul Apart video seems to receive the best feedback overall, judged by user ratings, average daily views and user comments. So you had better go out and advertise with your best-performing content. Will you care that this is not your newest piece of work, or even that it is your first video? Hell no! Advertising is always targeted at people that don’t know you. You can even show them a 10-year-old video and they might not realize its age! Once they like you they can still discover your newer stuff.
Attention: the above suggestion changes drastically if you are a mature b(r)and. In that case, of course, your star products are already well known and you should use your promotional power and brand recognition to get the new product out. This leads to the next paragraph.

Do all these calculations depend on the maturity of the band?

The answer is clearly yes! A mature band will still need promotion and advertising, but a lot of its messages also spread almost by themselves. Some other things are achieved more easily (magazine cover stories and so on…). But the obvious message is the following. The cost of advertising and your conversion rate will most probably stay constant and remain independent of the band’s maturity: Metallica has to pay the same price for YouTube ads as In Legend. But a fan is worth more if a band is not mature yet. A fan won with the first record might go to many tours, buy a lot of products and also help spread the word in a viral manner. For a mature band, chances are pretty high that people who see the ad already know the brand and have already decided not to like the band. So even though a young band does not have much money and cannot be sure that the product is already sufficiently good (which would lead to a drop in conversion rate, i.e. a higher acquisition cost per new fan), ads make much more sense if the band is totally unknown. Be courageous, invest some money and bootstrap your product! You are also courageous when going on stage. So please also enter the web stage and build your reach! I know it’s less fun than rocking the audience in a live concert, but the effect should be pretty much the same.

Other non-advertising factors for overcoming the borders of one’s social circle

Since advertising is very expensive and probably not sustainable, one can wonder whether there are other ways to overcome the borders of your social circle. I would say there are. The best way to do so, in my purely empirical experience, is transparency, openness, trust in other people and interactive communication with them. Take me as an example. When I went to China I had basically no one to talk to about my interests. Now I have started blogging and sharing my ideas and – some people have already called me crazy – parts of my intellectual property. Guess what! By doing so I was able to find more and more people with similar interests from all around the globe, people I would never have met in my own social circle, who give me valuable feedback on my ideas. My reading club on distributed graph databases is just a recent example of this added value from transparency.
So as a band here are some things you could and should do besides advertising your star products (and yes it is not the mainstream and might take some courage to do so!):

  1. Interact with your fans. Treat them with respect, not just by telling them but by really showing them. Don’t be so “kind” as to share only news about yourself. That’s not interacting, that’s publishing! Join the discussion on current topics (in music, news, …) or respond to what your fans say! Find out what is relevant and stay in the conversation. I know it is hard: fans can become annoying, and what is already hard for a single person is harder for a band, which has more fame and more people and ideas to take care of.
  2. And please don’t be fake-interactive by asking questions like “Heya Legends.. A new week has begun, what is your sound for depressing mondays and what gets your through the week???” Asking questions, especially ones that don’t really carry meaning, is not interactive. It is just embarrassing. That’s the reason why this kind of question doesn’t take your marketing anywhere.
  3. Integrate your fans! That is by far the most promising strategy. In a fan base of already 3,500 people there is so much diversity, creativity and so on that you will be able to achieve extraordinary things. I am sure some of your fans have access to recording studios or to cheap video production, and maybe there are web designers and photographers among them. If you interact with your fans in a smart way you don’t even have to ask them whether there is a REPLACE BY WHATEVER YOU NEED among them. You will just know who it is. This leads me to the 4th point.
  4. Use the upcoming social network Google+. First of all, it is very obvious that it will kill Facebook in the long term (probably even in the short term), but more importantly it helps you to follow your fans, since you can put them into circles. So all photographers go in one circle, all reviewers in another, all bloggers in the next, then all technicians, all bookers, all concert organizers and so on. Street team members go in one circle. Follow those people: they like your music and could be valuable to you.
  5. Speaking of street teams: I once talked to someone who worked at a record label. She told me that in her experience the single best promotion tool for a band is a street team. Unfortunately I have seen many street teams that did not receive the respect they deserved from band members. A band’s street team has an incredible impact. I once wondered how a poster of my favourite band came to be placed in a train station of a minor German city. It was much later that I realized that this must have been the work of street team members… On the web, by being interactive, you can build your own global street team at almost no cost besides time! Treat your street team well. Have a street team meeting at least once a year, have all band members come and have a great party with them. You can also hold virtual meetings.
  6. Last but not least: do all the things your promoter suggested at the very top of my post. Those are the core homework chores.

To close: have fun making music and enjoy the most recent video of my band.

Google Pregel vs Signal Collect for distributed Graph Processing – pros and cons
https://www.rene-pickhardt.de/google-pregel-vs-signal-collect-for-distributed-graph-processing-pros-and-cons/
Sun, 19 Feb 2012

One of the reading club assignments was to read the papers about Google Pregel and Signal Collect, compare them and point out pros and cons of both approaches.
So after I read both papers as well as Claudio’s overview of Pregel clones and took some notes, here are my thoughts. But first, a short summary of both papers.

Summary of Google Pregel

The methodology is heavily based on the Bulk Synchronous Parallel model (BSP) and also has some similarities to MapReduce (which corresponds to just one superstep). The main idea is to spread the data over several machines and to organize the computation in supersteps. In each superstep every vertex of the graph evaluates a function that is given by the programmer.
This enables one to process large graphs which are distributed over several machines. The paper describes how to use Checkpoints to increase fault tolerance and also how to make good use of the Google File System in order to partition the graph data on the workers. The authors mention that smarter hashing functions could help to distribute the vertices not randomly but rather in a way they are connected on the graph which could potentially increase performance.
Overall, the goal of Google Pregel seems to be to enable one to process large graph data and gain knowledge from it. The focus does not seem to be on using the computational power of the distributed system efficiently. Instead, it rather seems to be on creating a system that makes the distribution of data that will not fit into one machine possible at a decent speed and without much effort for the programmer, by introducing methods for increasing fault tolerance.
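
To illustrate the superstep idea, here is a toy, single-process simulation of a vertex-centric computation. It is not Pregel's API: the function names, the message format and the halting rule are simplifications I chose for this sketch, and nothing is distributed or checkpointed. The vertex function spreads the maximum value through the graph.

```python
def run_supersteps(graph, values, compute, max_supersteps=50):
    """graph maps each vertex to its out-neighbours; compute(vertex, value,
    messages, superstep) returns (new_value, message_or_None)."""
    values = dict(values)
    inbox = {v: [] for v in graph}
    active = set(graph)
    superstep = 0
    while active and superstep < max_supersteps:
        outbox = {v: [] for v in graph}
        for v in active:
            new_value, message = compute(v, values[v], inbox[v], superstep)
            values[v] = new_value
            if message is not None:
                for neighbour in graph[v]:
                    outbox[neighbour].append(message)
        # Barrier: messages sent now only become visible in the next superstep.
        inbox = outbox
        # A vertex without incoming messages stays halted.
        active = {v for v in graph if inbox[v]}
        superstep += 1
    return values


def propagate_max(vertex, value, messages, superstep):
    """User-supplied vertex function: adopt the largest value seen so far."""
    best = max([value] + messages)
    # Tell the neighbours in the first superstep and whenever the value grew.
    if superstep == 0 or best > value:
        return best, best
    return best, None


# A small chain a - b - c - d with edges in both directions.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
print(run_supersteps(chain, start, propagate_max))
```

The programmer only writes `propagate_max`; the loop around it plays the role of the framework. In a real deployment the barrier between supersteps is where all workers have to wait for each other.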

Summary of Signal Collect

Signal Collect as a system is pretty similar to Google Pregel. The main difference is that the authors introduce threshold scores which are used to decide whether a node should collect its signals or whether it should send signals. Using these scores the processing of algorithms can be accelerated, because in every superstep signal and collect operations are only performed if a certain threshold is exceeded.
From here the authors say that one can get rid of the superstep model and make the entire calculation asynchronous. This is done by randomizing the order of the vertices on which signal and collect operations are performed (as long as the threshold scores are exceeded).
The entire system is implemented on a single machine, but the vertices of the compute graph are processed by different workers (in this setting, threads). All threads share the main memory of the system, which makes message passing for signal and collect operations unnecessary. The authors show how, in the asynchronous setting, the runtime of the algorithm drops roughly in inverse proportion to the number of workers. They also give evidence that different scheduling strategies fit the needs of different graphs or algorithms.
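
The following sketch shows the flavour of the asynchronous variant for single-source shortest paths. It is a single-threaded simulation, not the Signal Collect API: the random choice of the next vertex stands in for asynchronous scheduling, and the two scoring functions and the threshold are my own simplified assumptions.

```python
import math
import random


def signal_collect_sssp(edges, source, threshold=0.0, seed=0):
    """edges maps every vertex to a dict {neighbour: non-negative weight}."""
    rng = random.Random(seed)
    state = {v: math.inf for v in edges}
    state[source] = 0.0
    last_signaled = {v: None for v in edges}  # state when v last signalled
    inbox = {v: [] for v in edges}            # signals not collected yet

    def signal_score(v):
        # Signalling is only worthwhile if the state changed since last time.
        changed = state[v] != last_signaled[v]
        return 1.0 if changed and state[v] < math.inf else 0.0

    def collect_score(v):
        return float(len(inbox[v]))

    while True:
        pending = [v for v in edges
                   if signal_score(v) > threshold or collect_score(v) > threshold]
        if not pending:
            return state
        v = rng.choice(pending)  # no superstep: any pending vertex may go next
        if collect_score(v) > threshold:
            # Collect: keep the shortest distance offered so far.
            state[v] = min([state[v]] + inbox[v])
            inbox[v].clear()
        if signal_score(v) > threshold:
            # Signal: offer each neighbour a path through this vertex.
            for neighbour, weight in edges[v].items():
                inbox[neighbour].append(state[v] + weight)
            last_signaled[v] = state[v]


roads = {"a": {"b": 1, "c": 4}, "b": {"c": 2, "d": 6}, "c": {"d": 1}, "d": {}}
print(signal_collect_sssp(roads, "a"))
```

The order in which vertices are processed changes with the seed, but the result does not, which is what makes it possible to drop the barrier. Note that `state` and `inbox` are plain shared dictionaries here, mirroring the shared main memory of the original system.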

Discussion of Pros and Cons

  • From the text above it seems very obvious that Signal Collect with its asynchronous programming model is superior. But, unlike the authors, I have so far left out the drawbacks of one small but important detail. The fact that all workers share common state, which they can reach through random access to the main memory of the machine, is what allows their model to be so fast while being asynchronous. It is not clear how to maintain this speed in a truly distributed system. So in this sense Signal Collect only gives a proof of concept that an abstract programming model for graph processing exists which enables fast distribution in theory.
  • Pregel is an actual framework that can really distribute large data to clusters of several thousand machines, which for sure is a huge pro.
  • Signal Collect claims to be more general than Pregel, since Pregel can only handle one vertex type and edges are stored implicitly, whereas Signal Collect is able to store RDF graphs. As I understand it, Signal Collect can only send signals from one vertex to another if an edge exists, and it is also not able to add or remove edges or vertices. In this sense I still think that Pregel is the more general system. But I guess one can still argue about my point of view.
  • Pregel’s big drawback in my opinion is that the system is not optimized for speed. As already discussed in the last meeting of the reading club, MapReduce, with its single superstep, is able to start backup tasks towards the end of the computation in order to fight stragglers. Pregel has to wait for those stragglers in every superstep in order to make synchronous barriers possible.
  • Another point that is unique to Pregel is the deep integration with the Google File System (by the way, I am almost through the Google File System paper, and even if you already know the idea it is absolutely worthwhile to read it and understand the arguments for its design decisions). So far I am not sure whether this integration is a strong or a weak point, because I can’t see all the implications. However, it strengthens my argument that for a distributed system things like network protocols and file systems should be considered, since they seem to have a strong impact on the entire system.
  • Both systems in my opinion fail to treat partitioning of the graph and a different network protocol as important tasks. Especially for Pregel I do not understand this, since it already has so much network traffic. Partitioning the graph might increase start-up traffic on the one hand, but could decrease overall traffic in the long term.
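
The partitioning point in the last bullet can be made tangible by counting how many edges cross machine boundaries, since every such edge means network traffic whenever a message travels along it. The sketch below is purely illustrative: the toy graph, the helper names and the modulo placement (a stand-in for structure-agnostic hashing) are my own assumptions.

```python
from itertools import combinations


def edge_cut(edges, assignment):
    """Number of edges whose two endpoints live on different workers."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


def modulo_partition(vertices, workers):
    """Placement that ignores the graph structure: vertex id modulo worker count."""
    return {v: v % workers for v in vertices}


# Toy graph: two tightly connected groups of four vertices joined by one bridge.
group_a, group_b = [0, 1, 2, 3], [4, 5, 6, 7]
toy_edges = (list(combinations(group_a, 2))
             + list(combinations(group_b, 2))
             + [(3, 4)])

structure_agnostic = modulo_partition(group_a + group_b, workers=2)
structure_aware = {v: 0 for v in group_a} | {v: 1 for v in group_b}

print(edge_cut(toy_edges, structure_agnostic), "of", len(toy_edges), "edges cut")
print(edge_cut(toy_edges, structure_aware), "of", len(toy_edges), "edges cut")
```

Finding a placement like the second one costs extra work before the computation starts, which is the start-up traffic mentioned above; the payoff is that far fewer messages have to cross the network in every later step.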

Outlook and personal thoughts:

I am considering inviting the authors of both papers to next week’s reading club. It would be even more interesting to discuss these and other questions directly with the people who built these systems.
I also like Schegi’s idea to see what happens if one actually runs several neo4j servers on different machines and just uses a model similar to Signal Collect or Pregel to perform some computations. In this way a programming model would be given, and research on the core distribution framework, relying on good technologies for the workers, could be done.
For the development of the first version of metalcon we used memcached. I have read a lot that memcached scales perfectly horizontally over several machines. I wonder how an integration of memcached into Signal Collect would work in order to make the asynchronous computation possible in a distributed fashion. Since random access memory is a bottleneck in any application, I suggest putting the original memcached paper on our reading list.
One last point to mention is that both systems still don’t seem to be useful as a technology to build a distributed graph database which enables online query processing.
