UVA 12357 – Ball Stacking

Ball stacking is a problem I had tried to solve a long time ago with no success. A week ago I decided to give it another try and I finally solved it.

After a first read, the more experienced competitive programmer will see that the way to goes through dynamic programming, but the real trick behind it is the way you look at the problem. In my first attempt, I tried to see a solution thinking of rows, instead of columns, but, it’s not hard to see that dealing with the selection of multiple balls in a row is now easy task by the nature of how we are obligated to select their parent-balls.

The easiest way to solve it is to solve by columns, from left to right. As the input is given as a triangle, you’ll have to imagine that the stack is moved around 30~45 degrees counterclockwise, and them think of how to solve it thinking of the columns. After thinking for a couple minutes, you can reach a few conclusions:

  • Now, when we select a ball, it’s rather simple to calculate it real cost: its the ball value summed to the balls above it in the current and previous column.

Now the dynamic programming solution looks clearer. We have to compute the best sum in the i-th line and j-th column, checking which ball from the previous column should be taken before the one in (i, j). This case can be calculated using the states from the previous column that had been calculated.

This current runs in O(n^3), which is enough for the online judges. To go even further, we can reduce the complexity to O(n^2) by keeping track of which is the best sum we could have by taking a ball at any line from the previous column, by simply keeping an auxiliary array that has in the position [p], the largest value from the dynamic programming memoizatin array in the interval [p, n],  preventing us to search for it in linear time.

The solution implemented is available here.

A Detailed Look: Merge Sort

Tonight, in an impromptu, I decided to code a merge sort. Why would I do that ? I know it’s a cool sorting algorithm and one of the first examples we are taught about Divide and Conquer, but it’s not a good reason to do that out of the blue anyway, so, as an attempt to make this sudden code-wish more productive, I decided to write a post looking this algorithm closely.

Basic Idea:

The principle behind Merge-Sort is rather simple, yet powerful. One have to keep dividing the object to be sorted (an array, in our case), until it’s unit fraction, meaning an array with one single element. From this principle, an unitary array is always sorted, and merging two sorted arrays is simple (more on that later). Then, by induction, we know how to sort two sorted arrays with 2 units, then with 4 and so on, an classic example of the divide and conquer technique.

Words are beautiful but how do we code it ? The code examples in this post are in C language (yeah, pointers, I know).

void merge_sort(int* array, int begin, int end) {
  if (end - begin + 1 <= 1) return;

  int m = (begin + end) / 2;

  merge_sort(array, begin, m);
  merge_sort(array, m + 1, end);

  merge(array, begin, m, m + 1, end);

When I first looked at this code years ago, I just couldn’t believe this actually sorted an array. The real mystery behind this code is that it uses recursion, it’s the first ‘big rock’ we encounter in the path of learning computer science, as it just doesn’t make sense in the first sight. I’ll suppose that readers as this point are know of recursion and it’s no longer a problem.

The best way to understand what’s happening with the recursive calls is graphically. One of the best images that shows how this division happens is available in Wikipedia merge-sort page:


The merge function keeps breaking the array in 2 parts until it reaches it’s atomic part (1 single element). Try to simulate one vector division by hand to deepens the understanding of the function (give a preference for vectors with size 2^n).

With the division part clear now, we have to understand the merge part and it’s complexity. The standard merging algorithm is presented below:

 * Merges two sorted pieces from *array:
 * [sl, el] and [sr, er]
void merge(int* array, int sl, int el, int sr, int er) {
  int len_l = el - sl + 1;
  int len_r = er - sr + 1;

  int pos = 0;

  // Creates a buffer to store the sorted interval temporarily
  // It is not possible to sort two sorted arrays in O(n) without additional memory
  int* buffer = (int*) malloc((len_l + len_r) * sizeof(int));

  while (sl <= el && sr <= er) {
    if (array[sl] <= array[sr]) {
      buffer[pos++] = array[sl++];
    } else {
      buffer[pos++] = array[sr++];

  while (sl <= el) {
    buffer[pos++] = array[sl++];
  while (sr <= er) {
    buffer[pos++] = array[sr++];

  sl -= len_l;
  sr -= len_r;

  for (pos = 0; pos < len_l; pos++) {
    array[sl++] = buffer[pos];
  for (pos = len_l; pos < len_l + len_r; pos++) {
    array[sr++] = buffer[pos];


As both array pieces are sorted we use a two-pointer technique by starting in the beginning of both arrays and keep comparing them, always adding the smallest one in a spare piece of memory so that we can retrieve from there later. As each single element from both arrays is visited only once, it’s easy to see the linear complexity in the function.

This is the time to ask whether it’s necessary to use this spare piece of memory. As an exercise, try to create a merge function that doesn’t use extra memory.


For those who are acquainted with computational complexity know have a feeling about the cost of this function. T(n) = 2 * T(n / 2) +\mathcal{O}(n). This recurrence relation can be easily solved using the Master Theorem, but as the recursion in simple in this case, we can find the complexity by hand:

\begin{aligned} T(n) &= 2 \times T(n / 2) + \mathcal{O}(n) \\ T(n) &= 2 \times [2 \times T(n / 4) + \mathcal{O}(n/2)] \\ T(n) &= 4 \times [2 \times T(n / 8) + \mathcal{O}(n/4)] + 2 \times \mathcal{O}(n) \\ T(n) &= 8 \times [2 \times T(n / 16) + \mathcal{O}(n/8)] + 3 \times \mathcal{O}(n) \end{aligned}

It’s clear that the recursion ends when n = 1, because k is how deep the recursion goes:

\begin{aligned}  \dfrac{n}{2^k} &>= 1 \\  n &>= 2^k \\  \log_2 n &>= k \times \log_2 2  \end{aligned}

Now we know that the largest possible value for k is \log_2 n, hence:

\begin{aligned}  T(n) &= 2^{\log_2 n} * T(1) + \log_2 n * \mathcal{O}(n) \\  T(n) &\in \mathcal{O}(n \times \log_2 n)  \end{aligned}

Independence = Parallelism ?

If you read the past posts from this blog, you already know that I’ve just finished a parallel programming course that blew my mind, so obviously I’d try to add some parallelism to merge-sort. Unfortunately, merge-sort isn’t the best example to show the power of parallelism as other problems (Sieve of Eratosthenes, Conway’s Game of Life), but we are still able to see the improvements of doing more than one thing in the same time anyway.

The first thing to think before adding (simple?) parallelism to some computation is to identify independent regions in the algorithm. In our case, for each level of splitting and merging is independent, hence, we can make each of them to be run by a different thread.

I’ll present three ways to add parallelism to merge-sort, both them using the fork-join method through OpenMP. The first one uses parallel sections, enabling us to keep our code recursive, the second one will use the simplest OpenMP directive (parallel for) and the third one uses task parallelism,. that is new to the version 3.0 of OpenMP.

Parallel Sections:

Using parallel sections, we are able to create blocks of code were each statement (function call) surrounded by an OpenMP directive is given to a different thread. In this case, we can put both recursive calls of merge inside a parallel section and make each of the calls to be run with different threads:

void merge_sort(int* array, int b, int e) {
  if (e - b + 1 <= 1) return;

  int m = (b + e) / 2;

  #pragma omp parallel
    #pragma omp sections
      #pragma omp section
      merge_sort(array, b, m);

      #pragma omp section
      merge_sort(array, m + 1, e);
  merge(array, b, m, m + 1, e);

Parallel For:

The directive omp parallel for is probably the first one taught when learning the OpenMP standard. The advantage of using it is that we can give a little less thinking about the parallelism (it’s often a bad idea to not think much about how we are doing parallelism). We just know that adding #pragma omp parallel for, each of the iterations will be given to a different thread. Of course, it’s not always that simple, and we need to take care about independence of variables (private, shared), also, the standard force us to have a well defined stop point to the iteration. More on the standard is available here.

void merge_sort(int* array, int len) {
  int i, j;

   * `i` Represents the size of a block compared.
   * It goes from 1 to the largest possible value smaller than N.
   * It grows as a power of 2 (due to the bitwise operation << ).
  for (i = 1; i < len; i <<= 1) {
     * Now merging the sorted intervals [j, j + i - 1], [j + i, j + 2 * i - 1].
    #pragma omp parallel for
    for (j = 0; j < len; j += 2 * i) {
      int start_l = j;
      int end_l = fmin(len - 1, j + i - 1);

      int start_r = end_l + 1;
      int end_r = fmin(len - 1, j + 2 * i - 1);

      merge(array, start_l, end_l, start_r, end_r);

Task Parallelism:

Task parallelism was basically the last subject taught in course I’ve taken (apparently because it’s kind of recent in the OpenMP standard). It’s not my plan here to teach about task parallelism, in parts because there’s a lot of other resources available that can teach it way better than me, so I’ll jump right on how it’s used in this case.

As an advantage, we can get back to our recursive version, is easier to code. Here, we can spawn new threads asynchronously on demand and synchronize them with 2 directives (task, taskwait).

void merge_sort(int* array, int begin, int end) {
  if (end - begin + 1 <= 1) return;

  int m = (begin + end) / 2;

  #pragma omp task
  merge_sort(array, begin, m);

  #pragma omp task
  merge_sort(array, m + 1, end);

  #pragma omp taskwait
  merge(array, begin, m, m + 1, end);


As most of us have seen merge-sort sometime in life, it’s often good to see one of those fundamental algorithms a little closer, trying different implementations and approaches. It’s been a while since I don’t take an individual algorithm to pay much attention so it was fun.

parallel bitset – an attempt

A few months ago, while thinking and coding a solution to a class problem that was the perfect case for the bitset, an side effect caused by my recent course in parallel computing showed up. I realized that operations inside a bitset are naturally independent (bits are represented as literal bits from integers, and bitsets larger than 64 bits are vectors of integers that can be processed individually).

As a reaction of this effect and with this parallel mindset still fresh in my mind, I decided to code a simple version of a STL compliant parallel bitset using OpenMP (fork/join model). It’s in early stage and some of the functions aren’t implemented yet, also the way it’s currently built and tested is pathetic (sorry).  The next steps are improving build/test, implement the missing functions and create benchmarks comparing it with the regular sequential implementation.

Any help to the small library is welcome 🙂

parallel programming – first two months

I decided to take CE-265 (Parallel Programming) this semester in grad school. As I never tried parallel programming before I decided to give it a shot, the fact the the professor is one of the biggest names in the area also weighed in my decision.

So far this is one of the best courses I’ve taken in my whole academic life. The lectures are dense and enlightening and the exercises are fun.  To run our programming in big clusters, we’ve been using the SDumont super computer. In this first two months we have explored machine architectures (SISD, SIMD, MIMD), Amdhal Law, OpenMP, Cuda and MPI. The highlights for me were the data-parallel algorithms, which take a lot of advantage of machines with SIMD architecture (GPUs), solving problems with a time complexity I thought was impossible (prefix sum in O(log n) for example), another amazing example of the power of parallel computing is the Bitonic Sort, where you can sort N numbers in O(log^2(n)) time using N processors, the idea behind the algorithm is bold and require some thinking as some of the steps are non intuitive making the algorithm even more surprising.

The course is hard because there’s one exercise given per week, requiring a full report with code, outputs and analysis. Some of them are: simulate the glider move in the game of life (OpenMP, Cuda and MPI (with ghost zones)), sieve of Eratosthenes with OpenMP and PI approximation with MPI, they were all fun (and time consuming) and helped me to synthesize my ideas about parallelism.

This post had the goal to just go through the topics covered in the course and my experience in it so far, a lot of what is covered by this course is also available in online MOOCs about parallel computing.

GSOC ’17 – The Last Week

A few weeks ago I wrote a very late post about my experience in the Google Summer of Code project. Surprisingly, those 3 months passed so fast that it’s the last week and this blog post wrapping up the whole work done in the project.

My accepted proposal to the program included 6 features that should be implemented through the three month span, three of them have been successfully merged to the LuaRocks code base, two are pending code review from my mentor and one is graphical (creating a ribbon) and it’s being designed. As the Work Product Submission Guideline suggests, this post will list the features listed, a link to their code and some commentaries.

  • Add a feature to Follow users

This one was the very first feature that I coded, it seemed to be the easiest one and was a good choice to get my feet wet. I involved a small change in the database, re-using the code used to follow modules and write tests. The code for this feature is here

  • Add a Login with GitHub

I was more comfortable with MoonScript when I started to code this one. LuaRocks already had a feature that allowed users to link their GitHub account with their LuaRocks one so it was a matter to fetch tokens from the GitHub API, adapt the database, add a button and write tests. Code is here.

  • Add a distinction between Following/Starring modules

This one involved a lot of commits and code. I had to change the database add new items in ui, models, controllers and tests, in the end I made a mess with a git rebase and the commit history ended up dirty :(. Code is here.

  • Create update feed

This one is currently pending code review. The longest feature so far, involving create new database tables, models, flows, adding new actions to controllers and writing tests. This one also has a ‘not so clean’ history so before it gets merged I expect to make it clear. The pull request is here

  • Send an Email Digest to users

This one is currently pending code review. LuaRocks already have an email service so the feature consisted of creating the weekly email content and the functions to populate the email it. The pull request is here.

  • Create a LuaRocks ribbon

This is the last one, it has the goal to the attached to sites and READMEs to display projects that are available through LuaRocks. The discussion over the ribbon is here.

Well, I guess that’s all. As there’s some more work left after the pending reviews, the project is close to the end. The experience so far has been amazing, having my code shipped to a real life open source project is fun, the feeling of contributing to something that will benefit the whole Lua/MoonScript community is a amaing. My mentor (Leaf) has helped me a lot through the whole process, I’ve learned a lot through his code reviews and I can say I’m a better develop after all these work he guided me through.  I’d like to end thanking him for all the help. I’d like to thank Hisham too for making LuaRocks and being supportive to the GSOC project.

The halfway of GSOC ’17

This post starts with some regrets. In my mind, I was supposed to keep a blog series (or a daily log) about my experience being on the Google Summer of Code ’17. This is my first time in the program and I’m working on the Luarocks project. The goal of my work in the organization is to improve the social aspects of the website, more information about Luarocks can be found in the previous link but long story short, it serves as a hub for Lua programmers to host their modules.

As I’ve reached half of the project (just delivered my second review), I’ll try to give some insights about the program and how it’s been affecting me as a software engineer to be. If you reached this post, I”ll suppose that you are aware of what GSOC is and how it works so I can use some words here without explaining their meaning in the program context.

Shortly after the excitement of being accepted in the program I realized that the community bonding phase was about to start. The core community of LuaRocks is rather small and their website, which is the part of the project I’m currently working has basically one maintainer (which happens to be my mentor). That being said, my community bonding phase was basically getting to know him better, learning Moonscript and figuring out how Lapis works. I’ve some background of Lua programming thanks to my use of the awesome window manager but the syntax differences of Moonscript caught me off-guard but I’m much more comfortable now. Lapis has been proving to be a great tool, it inherits some characteristics from Rails which made the learning curve smoother.

By that time I’ve managed to deliver 2 features that was proposed. 1. Adding the ability to follow developers (previously only modules were able to being followed). 2. Allowing users to register and login using their GitHub accounts. (There are two waiting for review and 2 to get started). I have to luck to have a very approachable mentor so I’m able to discuss all my ideas with him.

I still have one month left to implement my last two features an review possible issues the other two pending reviewing. The experience of having my code shipped to production and to be able to open the site and see a small change that I was responsible to is great, I’m looking forward for more of it, the program is a great opportunity to have contact with real-life codebases and to do meaningful contributions to them so I strongly recommend anyone interested in the program to move forward.

My contributions in the project can be seen here.

A confusion in graphs vertex-labels

A few days ago my friend at school presented me this problem. After reading carefully, one understand that it’s a wise decision to think about the swaps between characters in the strings as a graph, and that it’s (presumably) optimal to go from left to right and put in the position i of the string the largest character possible, and in case of multiples being available, use the rightmost one.

But here our confusion started. Thinking about the graph generated by the allowed swaps, can we always swaps the labels of 2 distinct vertexes without messing with the labels of other vertexes ?

The trick here is that if you start with an idea of finding a cycle that contain the both vertexes of interest and find a rotation between them, you will get stuck (that what happened to us for a while). The right idea here is to think about a single path between those interest vertexes and work exclusively in this path.

Let’s create an example, imagine that there is a path between a and j represented by [a, c, z, d, e, f, j], if you move a to the position of j (doing the mandatory adjacent swaps), the path between the a-j is now [c, z, d, e, f, j, a], that is, besides a, all other elements in the array shifted one position to the left, and moving to it’s position of interest (the first position), would make all other labels to do a right shift, moving them to their original position.

With that in mind, we can state that for every un-directed graph with labeled vertexes, we can re-arrange the labels between any pair of connected vertexes, without messing the labels of any of the other vertex. As for a path of size P between two vertexes (a, b)we need P – 1 swaps to move a vertex from their initial position to their goal position (a goes where b is) and P – 2 swaps to move the to a’s initial position.

I hope this small talk can help someone with silly doubts on graph theory. Obviously, this idea can be used to solve more elaborated problems like this one.

Dynamic Median Problem

Hi there,

Today I faced an interesting problem in 2012 NCPC called Cookie Collection. Long story short, the problem asks for online queries of the kind:

  1.  Insert a value x in a structure
  2. Retrieve and remove the current median value in the structure

Let’s define what is a median here: Given an ordered structure P with N elements, the median is the element P[N / 2] position, if N is odd, then the median the minimum value between P[N / 2] an P[(N / 2) + 1].

Approach 1:

In a first sight, we are tempted to keep this structure as an array, which give us a complexity of O(1) per type 1 query and O(n * log(n)) for type 2 queries. As N can be as large as 600000, this complexity is far from optimal, so, we can drop this idea right here.

Approach 2:

If we think more carefully, we come up with a different and more clever structure. What about keep 2 heaps, one to keep the N / 2 smallest values from our structure and the other one to keep the N / 2 greater values.

Lets keep our heap to keep the N / 2 smallest values as a max-heap, and the other one as a min heap. We know that both heaps will keep half of the values of our structure (one of them will have (N / 2) + 1 values in case our structure has an odd number of elements), hence, given the way the heaps are ordered, it’s easy to see that the median is in one of the heads of heaps, here’s a proof.


Let’s S and M be 2 heaps with the same properties stated above, storing elements from an ordered structure P with even size. We know that S has the smaller half of elements of P and M has the greater half of elements of P. If the portion of elements is equally distributed between S and M, we know that the median in P is the minimum value between the elements in the top of S and M, as the top of S is the greater elements form the lower half of P and the top elements from M is the smallest value from the greater half of P, the top element of both heaps are respectively P[N / 2] and P[(N / 2) + 1], hence the median is the smallest between both values.

With this idea in mind, the proof for P with an odd number of elements is trivial.


They key idea to implement this idea is to keep both heaps, and a key function to balance the heaps.

To insert an element we check whether it’s greater than our current median, if it is, we know that it belong to the greater half heap, otherwise, to the lower one.

To retrieve the median, it’s already explained in the proof above.

After insertions, our heaps may have a size difference greater than 1, when it happens, we need to balance then. To balance our heaps we have to keep removing the top element from the heap with greater size and insert it in the heap with smaller size, until the absolute size difference between both heaps is greater than 1.

As inserting and removing from a heap has O(log(n)) complexity, this approach has O(log(n)) complexity for queries of both type 1 and 2.

//Get median of a sequence in O(log(n))
void balance() {
while (abs((int) (minHeap.size() – maxHeap.size())) > 1) {
if (minHeap.size() > maxHeap.size()) {
int tmp = minHeap.top();
} else {
int tmp = maxHeap.top();
int median_retrieve(void) {
if (minHeap.empty() && maxHeap.empty()) return 0;
if (minHeap.size() == maxHeap.size()) {
return min(minHeap.top(), maxHeap.top());
} else {
if (minHeap.size() > maxHeap.size()) {
return minHeap.top();
} else {
return maxHeap.top();
void median_insert(int x) {
if (x > median_retrieve()) {
} else {

view raw
Dynamic Median.cpp
hosted with ❤ by GitHub

SPOJ – GREED – Greedy Island

I often keep looking at SPOJ’s status pages to decide which problem to read next. I ended up looking at this one yesterday.

The statement is simple, you have N cards, and some of them are repeated. You are given rules of card trades and have to answer the minimum number of trades to be made in order to have the N different cards.

The first thing to notice is that a simple greedy solution doesn’t work, because the order you trade your cards matters, and there isn’t a general rule for trade’s order.

The solution becomes clear when you have prior knowledge in Minimum-Cost Flow theory. One can see that we can model a bipartite graph, labeling the left side of this graph as the cards we current have and the right side as our goal. We can define directed edges between this sides going from the left side to the right side as trades, an edge between the sides (from node i to node j) can be defined with an infinity flow and a cost as the minimum number of trades to go from a card i to a card j (the cost of trading cards i and j can be defined as their shortest path between i and j in the trades graph). The problem is basically solved, we only need to assign edges from the source to the left side and from the right side to the sink.

From the right side to the sink, edges has negligible cost (0) and flow 1 unit, as an exercise to the reader, think about what would be the flow going from the source to the nodes of the left side of the bipartite graph.


Thanks for reading.


Red Programacion – Actividad 7 Partial-Editorial

Hi, it’s been a while. I took part in this fine programming contest held by Red Programacion statements here, I solved only 6 out of 12 problems proposed, and as I found the round very similar to South America ICPC Regional, I’ll write about the problems I solved.

A – Chasing the Cheetahs 

Given that all the cheetahs starts from the same point, we know that as different cheetahs has different speeds and release times, we can state a set S with cheetahs and a function F(x) = (the largest difference between 2 cheetahs in set S in time x). As cheetahs has fixed speed, F(x) is a parabola, which we can do a ternary search to find it’s global minima.

B – Falcon Dive

This is a simple one, you know the difference between the positions of the falcon in the two images, you know how the falcon should move in the third image. Unfortunately the input of this problem in the file is an image, you can’t copy/paste it, so you have to manually type all those characters to test (I just created my own test cases).

F – Lunch Menu

This is a rather interesting problem. We have count in how many different ways we can select one element of each of the 4 different kind of dishes and the sum of their values has to smaller than an integer L.

Let’s think about a reduced versions of this problem. If we only had one kind of dish, it could be solved in O(log(n)) time with a binary search. With two kinds of dishes, then we have to iterate over one kind and binary search in the other one, leading to O(n * log(n)). Using  tree dishes, we iterate over 2 of them an binary search the third one in O(n^2 * log(n)). With 4 kinds, we can take two of them, and pre-process all their possible n^2 combinations, sort them and solve it as we only had 3 kinds in O(n^2 * 2* log(n)).

G – The Owl and The Fox

Another straightforward one, one has to decrease by 1 the rightmost non-zero digit from the N

K – How Sweet It Is

Straightforward simulation, keep track of your amount of candies, if it’s enough to buys candies, buy it.

L – Lottery Co-primes

This one is straightforward, try all N – 2 divisions to the string and calculate their greatest common divisor to check whether they are co-primes.