A programmer's blog

Fun at Hack/Reduce!

Posted in hackreduce by pascaldimassimo on March 28, 2011

Last Saturday, I went to Hack/Reduce. Organized by the folks at Hopper, the event was an opportunity to learn to use Hadoop. They took the time to prepare an EC2 cluster of more than a hundred nodes. They also loaded a set of popular datasets for us to play with. With clear instructions on how to deploy our jobs to the cluster, we were ready to hack! Each team had an idea about what they wanted to do with the data. The Hopper guys were there to help us realize it!

I teamed up with Mathieu Carbou and David Avenante to build a full inverted index from the Wikipedia dataset. Our job used Mathieu’s xmltool to parse each Wikipedia page and some Lucene tokenizers to extract words and positions. It was a lot of fun to see this running on more than 400 CPUs!

At the end of the day, each team took the time to present what they were able to accomplish. It was really impressive to see what was done in a single day! One team dug through flight data and discovered that it is cheaper to travel on Friday and Saturday. Many teams also leveraged the Bixi dataset to extract interesting information about Montrealers’ usage of the bikes. Neat stuff!

I’d like to thank the Hopper staff for such a nice event! Well done guys!

Confoo Day 3

Posted in confoo by pascaldimassimo on March 11, 2011

Today started with Scaling Web Apps with RabbitMQ. We were introduced to the basics of AMQP. We went through some use cases where using a message queue system makes sense for a web application. An example is image processing, which can be done asynchronously by sending those jobs to a worker process via RabbitMQ.

Next it was Varnish in action. Varnish is a web server accelerator that works by caching page content. It is configured to sit in front of the web server so that it shields it from serving requests for which the response is already in the cache. This is also called a reverse proxy. We were presented with some general configuration guidelines.

I was then introduced to What every developer should know about Performance. It was the third talk I attended by Morgan Tocker and I did not regret it because he’s a great speaker. For him, the main aspect to consider when trying to optimize the performance of a web application is the response time, and not only the average response time. It is important to properly log the response time of all requests in order to determine in what circumstances it is bad. We should also log what each request is doing so that we know where the time was spent. He mentioned that we should not be afraid of activating those logs in production, as their overhead is generally low and this data is crucial for optimizing performance.

Just before lunch, we were lectured by a Microsoft representative on Interoperability and Web Standards. The presenter worked hard to convince us that Microsoft has changed and that it is now working to implement open standards. This was an awkward moment.

After lunch was a presentation on Solr Search Engine: Beyond The Basics. Despite the fact that I already know Solr well, I learned a couple of things during this talk, like that you can define a default core. The presenter was obviously well-versed in Solr and the slides were funny!

My last talk of the day was Step by Step: GC Tuning in the HotSpot JVM. That was pretty technical stuff for a Friday afternoon! We learned the basics of generational GC in HotSpot. We discussed the differences between the Parallel Old collector and the Concurrent Mark-Sweep collector. We went through some settings to control the behavior of those collectors. The presenter suggested that we should always leave GC logging enabled in production in order to monitor any GC-related problem in an application. Similar to what Morgan Tocker said earlier that day, we should not fear the overhead incurred by logging, as this is vital data.

So that was it! I really enjoyed the time spent at Confoo. I learned quite a lot!

Confoo Day 2

Posted in clojure by pascaldimassimo on March 10, 2011

Another day at Confoo!

I started the day with Designing HTTP Interfaces and RESTful Web Services. So far, it’s my favorite talk of the event. The speaker is very knowledgeable about all things HTTP and REST and had a funny way of presenting. We learned what it really takes to design clean and efficient HTTP and REST APIs. It was the first time I heard about the Richardson Maturity Model. I encourage you to take a look at the slides of this one. It was really good!

Then I went to Scalable Architecture 101. I think there was too much information in this one. We basically went through all the typical servers (web, database, mail, cache, DNS and others!) and talked about what it takes to make them scalable. One point that the presenter insisted on, and that I agree with, is to never use a database to store web sessions. Use Memcached!

After a nice lunch with other attendees, I attended a Panel: Which NoSQL database should you use? Three panelists representing CouchDB, Cassandra and MongoDB respectively were there answering questions from a moderator and the audience. Basically, they talked about how those three products handle ACID. To me, CouchDB stood out as the most interesting of the group, but I do not think the Cassandra representative did a good job of promoting that database.

I concluded the day with Linked Data: The new black. It was an introduction to RDF and the principles of the semantic web. I think I was a bit tired at this point because I had trouble keeping my focus. But the speaker did a fairly good job of showing how this is useful and that we are only at the beginning! Web 3.0 looks promising!

Interesting day!

Confoo Day 1

Posted in clojure by pascaldimassimo on March 9, 2011

I spent my first day at Confoo.ca here in Montréal. I learned a lot in a single day! Here are my thoughts on the talks I attended.

I started the day with Java EE 6 – how J2EE became popular again. We learned about the new Java EE 6 released a couple of months ago. We spent most of the time exploring the features of the web profile. It’s interesting to see how the standard Java technologies are now influenced by open source products like Spring, Guice and Hibernate.

Then, I went to a presentation on whether or not we still need Java web frameworks. The presenter’s main argument was that the traditional Java web frameworks like Struts, Tapestry or Wicket are not appropriate anymore. He even went as far as to say that MVC is not valid anymore! Some attendees were skeptical. He advocated picking only the products that are needed to get the job done, rather than relying on a complete framework. He then presented the solution that they built for a client using different products like Jersey, Socket.io and Redis.

After lunch, I went to a somewhat related talk entitled Why MVC is not an application architecture. The presenter started by explaining to us that, originally, the MVC pattern was nothing more than the Observer pattern for UI. According to him, it has nothing to do with the Web, especially not with today’s web applications. He then got into different patterns to help PHP developers build more layered applications. To me, it was interesting to hear how the PHP community is applying patterns similar to the ones described in the Core J2EE Patterns book.

After that, I got into a completely different subject: An Overview of Flash Storage for Databases. The talk was an overview of the current state of the technologies related to enterprise SSD drives. While still expensive, those drives are way more efficient in terms of IOPS. The presenter claimed that, to really benefit from them, applications, especially RDBMSs, need to change the algorithms they use to read and write data to disk.

The last talk I attended was Building servers with Node.js. We started by being introduced to the general principles of evented I/O. We then got a nice introduction to Node.js by seeing the traditional echo server example. While I really think that asynchronous I/O is the way to go for scalable web servers, I am not comfortable with the spaghetti code that results from the use of callbacks. I like the approach taken by other evented I/O products like cool.io and async-http-client to make the code more readable. But nevertheless, I think this is a really nice piece of software!

I really enjoyed that first day and I am looking forward to the next!

Java and the Reactor pattern

Posted in java by pascaldimassimo on February 10, 2011

The Reactor pattern is a common design pattern to provide nonblocking I/O. Instead of having multiple threads that are blocked waiting for IO to complete on a connection, you assign a single thread that is responsible for monitoring all the connections. When all the IO operations are completed for a connection, that thread can fire an event so that another thread starts processing the data coming from the connection. This approach works well when you have to handle a lot of connections, because you are not forced to dedicate a thread to each connection, which might consume a lot of resources if the number of connections is high.

Since Java 1.4, the Selector class provides an implementation of this pattern. You start by registering connections with a Selector instance. Then you call the select method of the Selector to get a list of all the connections that are ready to do IO operations, like read and write. A single thread can be assigned the responsibility of polling the selector and sending notifications when the IO on a connection is complete. See this nice tutorial for more details on how to use the Selector class.
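To make the register/select cycle concrete, here is a minimal sketch of one polling pass. The class and method names (SelectorLoopSketch, watchForReads, pollOnce) are made up for illustration and are not taken from nio-crawler. Decoding each chunk as UTF-8 is a simplification.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.SelectableChannel;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper class; names are illustrative only.
class SelectorLoopSketch {

    // Registers a channel so the selector reports when it has data to read.
    static void watchForReads(Selector selector, SelectableChannel channel) throws IOException {
        channel.configureBlocking(false); // selectors only accept non-blocking channels
        channel.register(selector, SelectionKey.OP_READ);
    }

    // Waits up to timeoutMillis for ready channels, then reads each of them once.
    static List<String> pollOnce(Selector selector, long timeoutMillis) throws IOException {
        List<String> chunks = new ArrayList<>();
        selector.select(timeoutMillis);
        Iterator<SelectionKey> keys = selector.selectedKeys().iterator();
        while (keys.hasNext()) {
            SelectionKey key = keys.next();
            keys.remove(); // the selected-key set is not cleared automatically
            if (!key.isValid() || !key.isReadable()) {
                continue;
            }
            ByteBuffer buffer = ByteBuffer.allocate(4096);
            int read = ((ReadableByteChannel) key.channel()).read(buffer);
            if (read < 0) {
                key.cancel(); // end of stream: stop watching this channel
            } else if (read > 0) {
                buffer.flip();
                chunks.add(StandardCharsets.UTF_8.decode(buffer).toString());
            }
        }
        return chunks;
    }
}
```

A real reactor would call something like pollOnce in a loop on its single IO thread and hand the collected data to worker threads instead of returning it.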

To familiarize myself with the Selector class, I wrote nio-crawler. It uses a single thread to fetch links from the web, using the Selector class. When a page is fully downloaded, it is passed to a handler thread that parses the HTTP and the HTML to get the new links to follow. The handler threads never block on IO, so they never sit idle (as long as there is data to parse, of course).

Comments are welcome!

Lucene and Solr Introduction

Posted in java by pascaldimassimo on January 29, 2011

Here is a presentation I gave on November 18th 2010 at the Montreal JUG.

Learning Clojure

Posted in clojure by pascaldimassimo on November 3, 2010

I started learning Clojure a couple of weeks ago. Though my brain is hurting, I must say that I really enjoy it. It is so refreshing! Clojure has been on my radar for a year or two, but that video from Rich Hickey gave me the motivation to actually start learning it. I am currently using the labrepl to practice writing Clojure code. Nice stuff!

Browsing through Github, you can find a couple of really high-quality projects written in Clojure, like Compojure and Leiningen. It always amazes me to see how fast the community can build around a promising new language.

Last week, Eugene Wallingford said that the time is right for functional design patterns.

Who knows?

How to re-crawl with Nutch

Posted in java, nutch by pascaldimassimo on June 11, 2010

Nutch lets you crawl a site or a collection of sites. If your objective is simply to crawl the content once, it is fairly easy. But if you want to continuously monitor a site and crawl updates, it can be harder, because the Nutch documentation does not have many details about that.

After a bit of digging, I found that Nutch offers an Adaptive Fetch Schedule class that can be used for that purpose. To understand how this class works, let’s recap how Nutch manages a crawl.

Nutch maintains a record on file of all the URLs that it has encountered while crawling. This record is called the crawl db. Initially, the crawl db is built from a list of URLs provided by the user using the inject command. An important concept in Nutch is the generate/fetch/update process. The generate command looks in the crawl db for all the URLs due for fetch and groups them into a segment. A URL is due for fetch if it is either a new URL or if it is time to re-crawl it. More on that later. The fetch command will, well, fetch from the web all the URLs of the segment. After that, the update command will add the results of the crawl (stored in the segment) to the crawl db. Each URL crawled will be updated to indicate the fetch time and the next scheduled fetch. New URLs discovered will also be added and marked as not fetched.

By default, Nutch will set the next scheduled fetch of a page to be the fetch time plus a constant interval. The default value is 30 days, but it can be changed in the file nutch-site.xml via the db.fetch.interval.default property. On a later generate call, if the time has come, the URL will be added to a segment and re-crawled. This default behavior can be acceptable if roughly all pages of a site change at approximately the same rhythm. But if the site being crawled contains a lot of pages that almost never change, you would probably want Nutch to visit these pages less often and concentrate on the ones that change frequently. That is not possible with the default fetch schedule, which uses the same constant interval for each URL.
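The fixed-interval rule can be sketched in a few lines. This is only a model of the behavior described above, not Nutch's code. The class and method names are hypothetical, and the 30-day constant simply mirrors the default mentioned in the text.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of a fixed-interval schedule; not Nutch's implementation.
class FixedScheduleSketch {

    // Mirrors the 30-day default interval described in the post.
    static final Duration DEFAULT_INTERVAL = Duration.ofDays(30);

    // Next scheduled fetch = fetch time + constant interval.
    static Instant nextFetch(Instant fetchTime, Duration interval) {
        return fetchTime.plus(interval);
    }

    // A URL is due once the current time has reached its next scheduled fetch.
    static boolean isDue(Instant nextFetch, Instant now) {
        return !now.isBefore(nextFetch);
    }
}
```

Every page gets the same interval here, which is exactly the limitation when some pages change daily and others almost never.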

Enter the Adaptive Fetch schedule. This fetch schedule will adapt to the rhythm of changes of a page and set the next scheduled fetch accordingly. When a new URL is added to the crawl db, it is initially set to be re-fetched at the default interval. The next time the page is visited, the Adaptive Fetch schedule will increase the interval before the next fetch if the page has not changed and decrease it if the page has changed. Note that a maximum and a minimum interval are defined in the configuration. The interval will never be longer than that maximum or shorter than the minimum. So after a while, the pages that change often will tend to be visited more than the ones that do not.

db.fetch.schedule.class: The implementation of fetch schedule
db.fetch.interval.default: The default number of seconds between re-fetches of a page
db.fetch.schedule.adaptive.min_interval: The min number of seconds between re-fetches of a page
db.fetch.schedule.adaptive.max_interval: The max number of seconds between re-fetches of a page
db.fetch.schedule.adaptive.inc_rate: If a page is unmodified, the interval before the next fetch will be increased by this rate
db.fetch.schedule.adaptive.dec_rate: If a page is modified, the interval before the next fetch will be decreased by this rate
db.fetch.schedule.adaptive.sync_delta: If true, try to synchronize with the time of page change by shifting the next fetchTime by a fraction (sync_rate) of the difference between the last modification time and the last fetch time
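As a rough illustration of how inc_rate, dec_rate and the min/max bounds interact, here is a simplified model. It assumes the rates are applied as multiplicative factors and it ignores sync_delta. The class and method names are hypothetical, so read it as a sketch and not as the actual Nutch implementation.

```java
// Simplified, hypothetical model of an adaptive interval; ignores sync_delta.
class AdaptiveIntervalSketch {

    // Computes the next interval (in seconds) from the current one.
    static double nextInterval(double currentSeconds, boolean pageModified,
                               double incRate, double decRate,
                               double minSeconds, double maxSeconds) {
        double next = pageModified
                ? currentSeconds * (1.0 - decRate)  // page changed: come back sooner
                : currentSeconds * (1.0 + incRate); // page unchanged: wait longer
        // Keep the result within the configured bounds.
        return Math.max(minSeconds, Math.min(maxSeconds, next));
    }
}
```

With an illustrative inc_rate of 0.4, an unchanged page on a 1000-second interval would move to roughly 1400 seconds, and it would keep growing on each unchanged visit until it hits the maximum.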

If a page was modified, the Adaptive Fetch schedule will store the last fetch time as the last modification time. Nutch will use that information in the If-Modified-Since header of the HTTP request of the next fetch. If the web server supports this and the page has not changed since, it will only return a 304 code. Note that there is a bug in Nutch 1.0 that prevents this from working properly. I have reported the bug and it will be fixed for Nutch 1.1. You can use the trunk in the meantime.

How can Nutch detect whether a page has changed? Each time a page is fetched, Nutch computes a signature for the page. At the next fetch, if the signature is the same (or if a 304 is returned by the web server because of the If-Modified-Since header), Nutch knows that the page was not modified. By default the signature of a page is built not only with its content, but also with the HTTP headers returned with the page. So even if the content of a page has not changed, if an HTTP header is not the same (like an etag or a date), the signature changes. To solve that problem, there is the TextProfileSignature class. It is designed to look only at the text content of a page to build the signature. To use it, you need to set the db.signature.class property to org.apache.nutch.crawl.TextProfileSignature.
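The idea of a text-only signature can be sketched as follows. This is not how TextProfileSignature is implemented. It is a deliberately naive stand-in that hashes the normalized text with MD5, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Locale;

// Naive, hypothetical text-only signature; not Nutch's TextProfileSignature.
class TextSignatureSketch {

    // Hashes only the page text, after normalizing case and whitespace.
    static String signature(String pageText) {
        String normalized = pageText.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(normalized.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // A page counts as changed when its signature differs from the stored one.
    static boolean hasChanged(String previousSignature, String pageText) {
        return !signature(pageText).equals(previousSignature);
    }
}
```

Because nothing but the text goes into the hash, a different Date or ETag header on the next fetch would leave the signature untouched.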

A word about the setting db.fetch.schedule.adaptive.sync_delta. I set it to false for my crawls because I have not been able to really understand what it is good for. As I described earlier, the next fetch time is computed by adding a dynamic interval to the last fetch time. But with this setting set to true, the interval is applied to a reference time located between the last fetch time and the last modification time. If someone can enlighten me about the usefulness of this, please do!

How to return a single JSON list out of MappingJacksonJsonView

Posted in java, spring by pascaldimassimo on April 13, 2010

I really like to build REST web services with Spring Web MVC. The ContentNegotiatingViewResolver makes it easy to render an object or a collection of objects to many representations, like XML, JSON or good old HTML.

In a recent project, I was using the MappingJacksonJsonView to render my objects to JSON. It works pretty well. Just put your POJOs in a ModelAndView and they will be transformed to a JSON map. It also works with collections of objects. If you add an ArrayList of objects to a ModelAndView, MappingJacksonJsonView will transform it to a JSON list containing a map for each object. However, MappingJacksonJsonView always wraps the objects you put in the ModelAndView inside a JSON map. It makes sense in the majority of cases, but when you want to return a single list, it will be embedded inside a map, which was not what I wanted.

So, by default, this code:

ModelAndView mav = new ModelAndView();
mav.addObject(listOfObjects);
return mav;

will produce this JSON representation:

{"objectList" : [{"name":"object1"}, {"name":"object2"}]}

But what I would like to have is simply:

[{"name":"object1"}, {"name":"object2"}]

In order to do that, I had to subclass the MappingJacksonJsonView class to override the filterModel method. This method returns the map of all objects to be transformed to JSON. But since it always returns a map, the final output is always a JSON map. So what I did in my subclass is, after executing the parent’s filterModel method, check if the map contains a single object and, if that is the case, return that single object out of the map.

import java.util.Map;

import org.springframework.web.servlet.view.json.MappingJacksonJsonView;

/**
 * This class will make sure that if there is a single object to
 * transform to JSON, it won't be rendered inside a map.
 */
public class MappingJacksonJsonViewEx extends MappingJacksonJsonView
{
    @Override
    protected Object filterModel(Map<String, Object> model)
    {
        Object result = super.filterModel(model);
        if (!(result instanceof Map))
        {
            return result;
        }

        // A single entry is returned bare so that it is not wrapped in a JSON map.
        Map<?, ?> map = (Map<?, ?>) result;
        if (map.size() == 1)
        {
            return map.values().iterator().next();
        }
        return map;
    }
}

Voilà! Now it only returns the list.
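The unwrapping rule can also be tried outside of Spring. The following is a plain-Java sketch of the same check, with a hypothetical class and method name, meant only to show what happens to a model with one entry versus several.

```java
import java.util.Map;

// Hypothetical stand-alone version of the single-entry check; no Spring involved.
class SingleEntryUnwrapSketch {

    // Returns the lone value when the model has exactly one entry, else the model itself.
    static Object unwrapIfSingle(Map<String, Object> model) {
        if (model.size() == 1) {
            return model.values().iterator().next();
        }
        return model;
    }
}
```

Note that any single entry is unwrapped, not only lists, so a model holding one POJO would be rendered as a bare JSON object as well.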
