Predicting the Movement of Individuals in a Multi-state Process Using a Hazard Rate Model

Written by: Lotte Harrijvan

During my time at Notilyze I developed a quantitative model for one of their biggest clients as part of my thesis for the master's in Quantitative Finance. The aim of the project was to map the movement of individuals through a multi-state process and to use these movements to forecast future cash flows. The idea is to use survival analysis to estimate the hazard, or risk, of transitioning to another state, which lets us predict the future path of an individual in the process. The hazard of transitioning can be estimated with a Cox regression, which allows us to incorporate individual variables that may affect the transitions, so the hazards can be estimated on a monthly basis for each individual separately. These hazards are then transformed into transition probabilities, which we collect in monthly transition matrices. With these matrices we can predict the future path of the individual and produce a state variable over time. This state variable is a dummy variable indicating the active state of the individual.
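To make the step from hazards to transition matrices concrete, here is a minimal sketch in Python. It is not the model from the thesis: it assumes that the hazards are constant within a month and compete with each other, so that the probability of staying in state i is exp(-sum of hazards out of i) and the probability of leaving is split in proportion to the individual hazards. The function names and the example hazards are hypothetical.

```python
import numpy as np

def hazards_to_transition_matrix(hazards):
    """Turn monthly hazards into a one-month transition matrix.

    hazards[i, j] is the hazard of moving from state i to state j,
    assumed constant within the month. The diagonal is ignored.
    """
    h = np.array(hazards, dtype=float)
    np.fill_diagonal(h, 0.0)
    total = h.sum(axis=1)              # total hazard of leaving each state
    stay = np.exp(-total)              # probability of no transition this month
    P = np.zeros_like(h)
    leaving = total > 0
    # Split the probability of leaving in proportion to the separate hazards.
    share = h[leaving] / total[leaving][:, None]
    P[leaving] = share * (1.0 - stay[leaving][:, None])
    P[np.diag_indices_from(P)] = stay
    return P

def propagate(initial, monthly_matrices):
    """Push a state distribution through a sequence of monthly matrices."""
    dist = np.asarray(initial, dtype=float)
    path = [dist]
    for P in monthly_matrices:
        dist = dist @ P
        path.append(dist)
    return np.vstack(path)

# Hypothetical hazards for one individual; state 2 is absorbing.
hazards = np.array([[0.00, 0.10, 0.05],
                    [0.20, 0.00, 0.10],
                    [0.00, 0.00, 0.00]])
P = hazards_to_transition_matrix(hazards)
print(propagate([1.0, 0.0, 0.0], [P, P, P]))
```

Each row of the printed array is the probability of being in each state after 0, 1, 2 and 3 months. In practice every month would get its own matrix, built from that month's estimated hazards.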

Next, we use these state variables to predict the monthly payments of an individual, because these payments depend on the states. This is how I was able to link the thesis to a financial concept. A logistic regression is used for these payment predictions, with the state variables serving as explanatory variables. Ultimately, the process produces a monthly overview of the individuals distributed over the different states (see figure) together with the monthly payment predictions for each individual. This gives the client insight into their core process and can therefore help them in their decision making. The next step in finishing my graduate internship at Notilyze is to implement the model, which I will do in cooperation with the client. I am very excited to see my model actually being implemented and used.

SAS EMEA hackathon 2020 – Personal experience team members

Written by: Monica Knook

Read time: 5 minutes

Notilyze participated in the SAS EMEA Hackathon 2020. The goal of this Hackathon was to find a way to add sustainable value in real-life business. As the connection with data for good was important, the Notilyze team made a model that extracts information from satellite images to help estimate the number of refugees in refugee camps in Nigeria. Better camp size estimation makes it easier to meet the demand in each camp, as surpluses of goods can be moved to camps with shortages, leading to more effective cross-camp collaboration. With this case Notilyze won the first prize. For this article, a few team members were interviewed about their personal experience of participating in the hackathon.

You’ve accomplished a great case in the hackathon. What was your role in the team?

Fleur: I was part of the start of the project, where we built the Camp Forecast tool with IOM and ELVA. During that project I analyzed the data, and I used this knowledge for the Hackathon project. In addition, with a helicopter view of the project, I helped decide how to present and communicate our solution to the hackathon challenge in the form of a video.






Paul: As Data Analyst, my role in the team consisted of multiple tasks. First of all, I was involved in combining the data sources (satellite imagery and surveys). Secondly, I built an Object Detection model to detect the tents from the training dataset. Making this training dataset came along with a lot of labelling, which really was a team effort. Finally, I supervised the creation of a dashboard with useful insights, which has an easy user interface.

Quinten: As a Data Scientist at Notilyze, my role in the SAS EMEA Hackathon was mainly to preprocess all satellite images obtained from Google Earth. The most crucial part of this preprocessing was the contrast stretching of the images, which greatly improved the accuracy of our object detection model (see our previous blog for more detail on this). I was responsible for writing a script in SAS Viya that could efficiently preprocess tens of thousands of images.

And what did you like or enjoy most about participating in the hackathon?

Paul: I really liked the aspect of a central theme, without getting a pre-specified problem to solve. Although this kind of hackathon requires a little more creativity in the beginning, it results in really different and interesting cases from all teams. I think in this way more value is obtained from a Hackathon.

Fleur: Working with a great team on a project that is so valuable to society.






What was your biggest challenge in the hackathon?

Fleur: To visualize our solution in a video, which was a great challenge!

Paul: My personal biggest challenge was to use SAS VDMML to create an Object Detection model called a ‘Faster R-CNN’. Luckily I could reach out to Jaimy van Dijk, Data Scientist at SAS. She could answer some of our most prevalent questions, leading to a working model.

What have you learned by participating in the hackathon?

Quinten: From a technical perspective I learned to use the SAS Viya CAS Image Action Set and how to get the most out of its functionalities. It turns out there are plentiful possibilities in SAS to preprocess images in a meaningful way, all the while being super quick compared to other software. In a broader sense, it was an interesting challenge to go through the full process (from defining the problem to obtaining results) in a relatively short period of time.








Paul: Considering the possibilities with SAS, I have learned how to build an Object Detection model. I have also learned how to incorporate images in a dashboard as Data Driven Objects, with some help from Peter Kleuver (SAS). Another thing I learned is how much effort is put into marketing. As I am currently finishing my MSc in Econometrics and Management Science, this was the first project in which we also wanted to involve a lot of people outside the world of data science. We achieved this by writing articles and making small videos, which were all very new things for me to do!

Fleur: To have a role in different fields in one project, from data analytics, marketing to graphic design.

So now the hackathon is finished, what does the future look like for this project?

Our collaboration with ELVA and IOM does not stop here. Based on the work done during the hackathon, we will jointly determine what steps need to be taken to get this dashboard into production. A specific and tangible step to improve both the model and the applicability of our solution is to find a provider of satellite imagery. During the hackathon we have worked with imagery from Google Earth, but it is necessary to get a more stable and more frequent stream of images for all areas of interest.

How did you watch the announcement of the winners?

Together with colleagues we watched the announcement, enjoying some pizza and drinks.

What was your reaction when you heard Notilyze had won the first prize?

Paul: Of course I was very excited at the moment I heard we had won the first prize! But to be honest, it took some time for me to really realize what this meant for us. I am really excited to go to Cary somewhere in the near future!

Fleur: I was so happy and enjoyed celebrating this with the team!

In the near future we will update you about the next steps that need to be taken and about our road to Cary. So stay tuned for more…

Monica Knook is currently doing her graduation Internship in Global, Marketing & Sales at Notilyze. Besides doing her internship, she was also part of Notilyze’s SAS Hackathon team. In this team she was involved in the Marketing of Notilyze’s case.

Improving Object Detection with Contrast Stretching (Part 2/2)

In the previous article about contrast stretching, we explored percentile contrast stretching and how to apply this to obtain better performance in object detection models. Percentile contrast stretching is also called (histogram) normalization, as we normalize the range of the pixel intensities. In this article we will examine another contrast stretching method, called histogram equalization. Besides comparing model performance, we will also compare the preprocessing speed of histogram equalization in SAS with that of our own implementation of percentile contrast stretching in Python.
A small recap of the previous article: we want to train an object detection model (Faster R-CNN) to detect tents in refugee camps. In Figure 1 the mAP plot of our base case can be seen, together with the rolling mean (window=100) of a measure for the area size of the detections (sqrt(Area)). We notice two things: although a lot of tents are found (recall=0.7), we have a lot of false positives (precision=0.2). The mAP score is 31%. We want to improve this result by using contrast stretching, and in this article we specifically look into histogram equalization.

Figure 1: Precision-Recall plot before contrast stretching, 10 epochs

We will first explain the math behind histogram equalization for all math enthusiasts, but feel free to skip directly to the results!

The math

With histogram equalization one can uniformly distribute all pixel intensities over the range [0,255]. Instead of simply spreading out these values more, which was what we did with percentile contrast stretching, we now choose new values for each pixel intensity in such a way that the histogram becomes uniformly distributed. To do this, we are looking for the transformation y=f(k), where y is a new pixel intensity based on the old pixel intensity k.

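We will approach the histogram of pixel intensities from a probabilistic point of view. We can then treat the N pixels with intensities $x_i$ as draws from a random variable X. We can express the occurrence of one value k as a probability with

$$p_X(k) = \frac{1}{N} \sum_{i=1}^{N} I\{x_i = k\},$$

where $I\{x_i = k\}$ is an indicator function:

$$I\{x_i = k\} = \begin{cases} 1 & \text{if } x_i = k, \\ 0 & \text{otherwise.} \end{cases}$$

Then the discrete cumulative distribution function (CDF) is

$$F_X(k) = \sum_{j=0}^{k} p_X(j).$$

To make y uniformly distributed on the range [0,255], we will use the transformation

$$y = f(k) = 255 \cdot F_X(k).$$

We prove that this transformation results in a uniform distribution as follows, treating the intensities as continuous for the purpose of the derivation. As we introduced $p_X(k)$, the transformation Y=f(X) leads to a new distribution $p_Y(y)$, which can be deduced by using the inverse CDF method:

$$F_Y(y) = P(f(X) \le y) = P(X \le f^{-1}(y)) = F_X(f^{-1}(y)).$$

Taking the derivatives with respect to y of both sides gives

$$p_Y(y) = p_X(f^{-1}(y)) \cdot \frac{d f^{-1}(y)}{dy}.$$

As $f^{-1}(y) = k$ by definition, we get

$$p_Y(y) = p_X(k) \cdot \frac{dk}{dy} = p_X(k) \cdot \left( \frac{dy}{dk} \right)^{-1}.$$

Substituting y=f(k), so that $dy/dk = 255 \cdot p_X(k)$, we obtain

$$p_Y(y) = \frac{p_X(k)}{255 \cdot p_X(k)} = \frac{1}{255},$$

which equals the probability density function of a uniform distribution on the domain [0,255].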
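The transformation described above can be sketched in a few lines of NumPy. This is an illustration and not the SAS routine we used in the Hackathon: it assumes 8-bit channels and takes the transformation to be 255 times the empirical CDF, rounded to the nearest integer. The function name is hypothetical.

```python
import numpy as np

def equalize_channel(channel):
    """Histogram-equalize one 8-bit image channel."""
    counts = np.bincount(channel.ravel(), minlength=256)  # N * p_X(k)
    cdf = np.cumsum(counts) / channel.size                # F_X(k)
    lookup = np.round(255 * cdf).astype(np.uint8)         # y = f(k) for every k
    return lookup[channel]                                # apply f to each pixel

# A synthetic low-contrast channel: all intensities between 100 and 149.
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 150, size=(400, 400), dtype=np.uint8)
equalized = equalize_channel(low_contrast)

print("std before:", low_contrast.std())
print("std after: ", equalized.std())
```

For a colour image the same function would be applied to each of the three channels separately.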


In the first row of Figure 2 an example of an image slice before (left) and after histogram equalization (right) can be found. In the middle row of this figure the histograms of the 400×400 pixel values of the original slice are shown, together with the histograms of the two stretched slices. The difference between the stretching methods is especially clear in the tails. For the 2-98 percentile stretching, a larger number of pixels has pixel value 0, as all pixels with an original value lower than the 2nd percentile value are set to this value. The same holds for the value 255 and the 98th percentile.

Figure 2: Top row: Image slices (Source: Google Earth, Maxar Technologies; second and third slice are edited). Middle row: Histogram of pixel distribution from slice above. Lower row: distribution of the standard deviation per image slice for each image band (GBR) and the mean of the standard deviation of the three channels (in orange).

We also compare the standard deviation of the pixel values in each image slice again. The standard deviations of all 58,163 image slices are plotted in the bottom row of Figure 2. An observation that stands out is that the spread of sigma is a lot smaller for the set of images after histogram equalization. This makes sense, as we are trying to make the pixel values of each image slice uniformly distributed. For this reason, the variance of the values of each slice will also approach the theoretical variance of a uniform distribution on [a,b]:

$$\sigma^2 = \frac{(b-a)^2}{12}.$$

With a=0 and b=255 we get

$$\sigma^2 = \frac{255^2}{12} = 5418.75.$$

This leads to a standard deviation

$$\sigma = \sqrt{5418.75} \approx 73.6,$$

which is exactly where you can find the peak in the histogram.

In Figure 3 we see that the mAP has increased from 31.03% to 40.93%. The effect is a little larger than the percentile contrast stretching from our previous post (mAP=40.58%).

Figure 3: Precision-Recall plot after histogram equalization, 10 epochs


Comparing processing speed

Although our two methods do not differ considerably in how much they improve model performance, we are interested in whether one method significantly outperforms the other on speed. Therefore we ran both methods serially on the same virtual machine with 16 cores of 2.4 GHz (8 CPUs with 2 cores each). The total memory of this virtual machine is 264 GB.

To do the percentile contrast stretching in Python, we wrote our own algorithm, as we could not find a function that would do this for us. How you write this piece of code matters a great deal for speed. Our first algorithm could process around 50 image slices per minute. By using more NumPy functions and fewer for-loops we increased the speed to 2,500 image slices per minute.
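The difference between a loop-based and a NumPy-based version can be illustrated as follows. This is a sketch of the general idea, not our original code: both hypothetical functions stretch one channel between two given bounds (assumed to satisfy high > low) to the range [0,255], and they return identical results. Only the second one leaves the per-pixel work to NumPy.

```python
import numpy as np

def stretch_loop(channel, low, high):
    """Naive version: visit every pixel in a Python loop."""
    out = np.empty(channel.shape, dtype=np.uint8)
    for i in range(channel.shape[0]):
        for j in range(channel.shape[1]):
            value = (float(channel[i, j]) - low) / (high - low) * 255.0
            clamped = min(max(value, 0.0), 255.0)
            out[i, j] = int(round(float(clamped)))
    return out

def stretch_vectorized(channel, low, high):
    """Same arithmetic, applied to the whole array at once."""
    scaled = (channel.astype(np.float64) - low) / (high - low) * 255.0
    return np.clip(np.rint(scaled), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
channel = rng.integers(0, 256, size=(50, 50), dtype=np.uint8)
low, high = np.percentile(channel, [2, 98])

same = np.array_equal(stretch_loop(channel, low, high),
                      stretch_vectorized(channel, low, high))
print("identical output:", same)
```

Timing both functions on a 400×400 slice, for example with `timeit`, shows how much of the runtime goes into the Python loop itself.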

When coding in SAS you generally have less flexibility than when coding in Python. That is, in SAS there might be fewer than 5 ways to get to the same outcome, while in Python there are easily more than 20, all depending on different kinds of packages. However, this flexibility (from easy-to-read code to very efficient code) comes with a trade-off in speed. Obtaining a result in Python is not as hard as it is in SAS, but obtaining a result quickly can be much harder in Python because of the large number of options you have when coding a certain program. Therefore a SAS program could have a speed advantage, and this indeed seems to be the case. The histogram equalization of 58,163 image slices of 400×400 pixels each takes 7 minutes, which means we reach a speed of about 8,300 image slices per minute, more than 3 times faster than our Python code.

Although the contrast methods differ and therefore the difference in speed cannot completely be attributed to the difference between Python and SAS, it gives an indication of how fast SAS can be.

If you would like to receive the program codes of both Python and SAS, feel free to reach out via the LinkedIn post about this article!

Improving Object Detection with Contrast Stretching (Part 1/2)

An important part of training neural networks is preprocessing of the input. A lot of performance gain can be obtained by carefully examining, cleaning and transforming the input data. In this post we will consider the influence of contrast stretching of our input images on the performance of a Faster R-CNN network to recognize objects. In this Hackathon these objects are white tents in refugee camps. This post is the first part of our series about contrast stretching. The second part can be found here.


A Faster R-CNN is an object detection model built on a conventional CNN (Convolutional Neural Network), a type of network that is often used to classify images. It extends the CNN with region proposals: regions of the image in which an object is potentially present. The CNN part then classifies these proposals.

Using a training set of 3,500 image slices, we train the Faster R-CNN model for 10 epochs, using 1,000 randomly chosen images in each epoch. We validate our model on a validation set consisting of 2,034 image slices with 1,976 white tents in total.

In Figure 1 the mAP (mean Average Precision) plot can be found, together with the rolling mean (window=100) of a measure for the area size of the detections (sqrt(Area)). Keep in mind that, to create a mAP curve, the detected boxes are sorted by the probability that the box contains a tent, according to the model. We notice two things: although a lot of tents are found (recall=0.7), we have a lot of false positives (precision=0.2). The sharp drop in the mAP curve at the left side of the graph is especially problematic; it indicates that, although the model is quite confident that these boxes contain a tent, in reality they do not contain any object. The model also seems to be more confident when assessing larger bounding boxes, as those appear (incorrectly, as just discussed) at the beginning of the mAP curve.
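How such a curve is built from sorted detections can be sketched as follows. This is a generic illustration rather than the evaluation code used for Figure 1: it assumes that every detection has already been marked as a true or false positive (for instance with an overlap threshold), and it computes a plain, non-interpolated average precision. Evaluation tools often use an interpolated variant, so their scores can differ. The function names and the toy detections are hypothetical.

```python
import numpy as np

def precision_recall_curve(scores, is_true_positive, n_ground_truth):
    """Precision and recall after each detection, sorted by model confidence."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_true_positive, dtype=bool)[order]
    cum_tp = np.cumsum(tp)       # true positives found so far
    cum_fp = np.cumsum(~tp)      # false positives so far
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / n_ground_truth
    return precision, recall

def average_precision(precision, recall):
    """Area under the precision-recall curve as a step sum (no interpolation)."""
    recall_steps = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(recall_steps * precision))

# Toy example: four detections, four tents in the ground truth.
scores = [0.9, 0.8, 0.7, 0.6]
hits = [True, False, True, True]
precision, recall = precision_recall_curve(scores, hits, n_ground_truth=4)
print(precision, recall, average_precision(precision, recall))
```

A confident false positive early in the sorted list pulls the precision down right away, which is the sharp drop at the left side of the curve discussed above.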

Figure 1: Precision-Recall plot before contrast stretching, 10 epochs

We try to improve these results by using contrast stretching. Contrast stretching should theoretically improve the learning ability of the model, as it enhances the contours of objects and emphasizes the difference between object and background. This helps the convolution layers in extracting information and features from the images. Contrast stretching can be done for an image simply by using a formula that scales up the differences in pixel values. As we would like to use colored images, we have to scale up three image ‘channels’: the red channel (R), the green channel (G) and the blue channel (B). A common convention is to represent each pixel of each channel with an integer in the range [0,255], where an RGB value of (0,0,0) represents black and (255,255,255) represents white.

The math

As we deal with image slices of 400×400 pixels, this results in three channels of 400×400 pixels. So in total one image slice can be represented by a 400x400x3 matrix with values between 0 and 255. Before we dive into the formulas, one final remark needs to be made. We are going to split up this 400x400x3 matrix into 3 matrices of 400×400. Then we will apply contrast stretching on each color channel (R, G and B) and afterwards we will combine the three stretched matrices again into a 400x400x3 matrix.

Then stretching can be done by transforming pixel x with a formula like:

where a represents the lower boundary to which we want to scale and b represents the upper boundary. Instead of using the minimum and maximum value in our contrast stretch as lower and upper boundary, we use the 2nd and 98th percentile values respectively. An advantage of taking these percentile values is that the stretch is more robust to outliers: if only one pixel in the image channel were 0 and only one pixel were 255, stretching between the minimum and maximum would not change the image at all. As we use the 2nd and 98th percentile values, we generate more stretching. We should be careful, though: after stretching this way, we have to clip all values below the 2nd percentile to 0 and all values above the 98th percentile to 255. This directly shows the disadvantage of stretching with values other than the minimum and maximum: we lose some information.
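A minimal sketch of the whole procedure (split into channels, stretch, clip, recombine) could look as follows. It assumes one common form of the stretching formula, x' = (x - p_low) / (p_high - p_low) * (b - a) + a, where p_low and p_high are the 2nd and 98th percentile values of the channel. The names p_low, p_high and percentile_stretch are introduced here for illustration only.

```python
import numpy as np

def percentile_stretch(image, lower_pct=2.0, upper_pct=98.0, a=0.0, b=255.0):
    """Stretch each channel of an H x W x 3 image between its own percentiles."""
    image = image.astype(np.float64)
    stretched = []
    for k in range(image.shape[2]):              # split into separate channels
        channel = image[:, :, k]
        p_low, p_high = np.percentile(channel, [lower_pct, upper_pct])
        if p_high > p_low:
            channel = (channel - p_low) / (p_high - p_low) * (b - a) + a
        # Values outside the percentile range end up outside [a, b]: clip them.
        stretched.append(np.clip(channel, a, b))
    combined = np.stack(stretched, axis=2)       # back to H x W x 3
    return np.rint(combined).astype(np.uint8)

# Synthetic low-contrast slice with intensities between 90 and 159.
rng = np.random.default_rng(1)
image = rng.integers(90, 160, size=(400, 400, 3), dtype=np.uint8)
result = percentile_stretch(image)
print(image.min(), image.max(), "->", result.min(), result.max())
```

The clipping step is where the information loss mentioned above occurs: every pixel below the 2nd percentile becomes 0 and every pixel above the 98th percentile becomes 255, regardless of its original value.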

To quantify the effect of the contrast stretching, we introduce a metric that represents the spread in color within one channel k in one image n: σk,n.

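This standard deviation can be obtained by calculating the variance of one 400×400 matrix. If we denote the pixel at row i and column j of this 400×400 matrix by $x_{i,j}$, we get:

$$\sigma^2_{k,n} = \frac{1}{I \cdot J} \sum_{i=1}^{I} \sum_{j=1}^{J} \left( x_{i,j} - \bar{x} \right)^2,$$

where $\bar{x}$ is the average pixel value in the matrix, calculated with

$$\bar{x} = \frac{1}{I \cdot J} \sum_{i=1}^{I} \sum_{j=1}^{J} x_{i,j},$$

with I=J=400. Then $\sigma_{k,n} = \sqrt{\sigma^2_{k,n}}$.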

Our total dataset consists of N=58,163 images.

In the upper two image slices in Figure 2 the qualitative effect of 2-98 percentile contrast stretching can be seen.

In the two diagrams below we have plotted the histograms of our metric. For each image slice we have calculated the standard deviation of each of the three image channels. Then we apply the contrast stretching separately on the three channels, resulting in the diagram on the right. As we would like to monitor the effect of separately applying contrast stretching on each channel, we also plotted a histogram for the total image, by taking the average of the three standard deviations (R, G and B, so K=3):

$$\sigma_n = \frac{1}{K} \sum_{k=1}^{K} \sigma_{k,n}.$$

As can be seen, the shift in the average standard deviation for each image is representative of the shift per channel. Moreover, a significant increase in standard deviation can be observed. We can calculate the average standard deviation of all the image slices with

$$\mu_\sigma = \frac{1}{N} \sum_{n=1}^{N} \sigma_n.$$

Figure 2: Top left: original image slice (Source: Google Earth, Maxar Technologies). Top right: same image slice after 2-98 percentile contrast stretching. In the lower two images the distribution of the standard deviation per image slice for each image band (GBR) and the mean of the standard deviation of the three channels (in orange)

Notice that this standard deviation of the pixel values in an image increases a lot. The average standard deviation of all pixel values within an image slice (400×400 pixels) was μσ = 26 before contrast stretching and after stretching it is μσ = 64. Also the values of the standard deviation are more normally distributed. Every channel has been stretched with its own 2 and 98 percentile values (per image slice), but as can be seen the stretching effect is the same for every channel.
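The metric can be computed for a whole stack of slices at once. The sketch below assumes the slices are stored in one array of shape (N, I, J, K) and uses the population standard deviation (dividing by I·J), which is NumPy's default. The function names are hypothetical.

```python
import numpy as np

def channel_std(slices):
    """Standard deviation per channel k and slice n; result has shape (N, K)."""
    return slices.astype(np.float64).std(axis=(1, 2))

def slice_std(slices):
    """Average of the K channel standard deviations per slice; shape (N,)."""
    return channel_std(slices).mean(axis=1)

def mean_std(slices):
    """Average standard deviation over all N slices (mu_sigma)."""
    return float(slice_std(slices).mean())

# Two tiny example slices: one completely flat, one half 0 and half 10.
flat = np.full((4, 4, 3), 7, dtype=np.uint8)
split = np.zeros((4, 4, 3), dtype=np.uint8)
split[:2] = 10
slices = np.stack([flat, split])

print(channel_std(slices))   # per channel and slice
print(slice_std(slices))     # per slice
print(mean_std(slices))      # over the whole set
```

Running the same functions on the slices before and after contrast stretching gives the two values of μσ that are compared above.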


When training the model on this pre-processed dataset, we end up with a new mAP curve, see Figure 3.

Figure 3: Precision-Recall plot after 2-98 percentile contrast stretching, 10 epochs

When comparing Figure 1 to Figure 3, we notice several improvements. First of all, the recall increases from 0.7 to 0.8, which means 80% of the 1,976 tents are detected. Also, the sharp drop in precision at the start of the mAP plot has become smaller. Together this improves the mAP score by about 9 percentage points. Furthermore, notice that the rolling mean of the area also decreases more gradually. As we do not want the model to be too sensitive to the size of tents, this is an improvement as well. However, we notice that smaller detected bounding boxes still come with a faster drop in precision, as can be seen when following the blue graph from recall=0.6 to the right. So some improvement can still be made.

SAS EMEA Hackathon 2020 – A deeper dive into our case

As stated in our previous post, we are going to make a model that extracts information from satellite images to help estimate the number of refugees in refugee camps in Nigeria. In this post we will dive a little deeper into this goal. We will answer the questions “How do we want to make this model work?” and “How is this going to help in estimating the number of refugees?”

How do we want to make this model work?
In Figure 1 an example of satellite imagery can be found. In this image tents are clearly distinguishable. We think it is possible to estimate the number of people in a refugee camp by considering the buildings in such a camp. Based on the buildings, several features can be extracted, e.g. the number of tents or the total area the tents cover. Bjorgo (2000) has shown that information from satellite imagery can be used in estimating populations in refugee camps, doing this for 5 refugee camps. However, it is cumbersome to count the number of tents or measure the covered area by hand, especially when going from 5 to 50 camps and when considering temporal variation as well (see Figure 2). We want to automate this process in such a way that it can easily be used during daily operations, such as decision making in the distribution of supplies. We will train this object detection model by using survey data collected in these refugee camps.

Figure 1: A refugee camp in Nigeria, 10 June 2015 (Source: Google Earth, Maxar Technologies)


Figure 2: A refugee camp in Nigeria, 2 January 2016 (Source: Google Earth, Maxar Technologies)

How is this going to help in estimating the number of refugees?
When looking at Figure 1, one could question the accuracy of the relationship between the tents and the number of refugees. Tents could be placed in advance to anticipate an increase in the number of refugees, or tents could be empty because of a decrease in the number of refugees in a camp. A shortage of tents could also leave refugees without a shelter, leading to an underestimation of the population when counting tents. However, after discussing these points with ELVA we could quickly conclude that tents are almost never empty. Most of the work in these camps is responsive, as there is often a shortage of supplies and workforce. Also, most of the population in a camp is connected to family by phone, so even if tents were empty, this message would spread rather quickly, resulting in more refugees travelling to the camp(s) concerned. A shortage of tents is more realistic, and in the surveys we also encounter people without a shelter. However, even in these cases it is still useful to estimate the number of people within the camp based on the number of tents, for two reasons. First of all, our model could at least give a lower bound on the number of refugees in a camp, which is better than the current situation. Secondly, our tool will not (directly) replace the surveying, so we can check our direct estimate against the survey observations a month later. Finding systematic underestimation of the number of refugees in a camp could indicate a shortage of tents in that camp. This can lead to more quantitatively based decision making in supply distribution.



Bjorgo, E. (2000). Using very high spatial resolution multispectral satellite sensor imagery to monitor refugee camps. International Journal of Remote Sensing, 21(3), 611-616.

Camp Forecast will respond to the following of the United Nation Sustainable Development Goals (Sustainable Development Goals, 2020)

Notilyze participates in SAS EMEA Hackathon

We proudly announce the participation of Notilyze in the SAS EMEA Hackathon. The Hackathon will take place in February and the goal is to find a way to add sustainable value in real life business. We will take this challenge in cooperation with ELVA Community Engagement and the International Organization for Migration.

Case description

Every year, 1.3 billion dollars of humanitarian aid funding is wasted due to outdated supply management practices in refugee camps (Van der Laan, 2016). As a result, an estimated 1,880,000 children, women and men per year cannot be provided with the essential humanitarian supplies to keep them safe and in good health. Empirical evidence shows that this enormous human toll can be avoided by implementing better demand forecasting techniques within refugee/IDP camps.

Camp managers currently make use of ad-hoc, judgmental forecasting techniques, which are laborious and comparatively ineffective. Existing AI-driven supply chain optimization tools, in contrast to our tool, do not meet the needs of humanitarian missions, as they fail to account for: 1) strong demand uncertainty due to conflict volatility; 2) SPHERE standards; 3) strong divergence of “product baskets” (i.e. foodstuffs, medicine, etc.) dependent on seasonality and camp location.

Camp Forecast (CF) will allow the distribution of life-saving humanitarian supplies, including medication, foodstuffs, blankets, tents and others, to an additional 1,880,000 children, women and men fleeing from conflict worldwide. CF will especially benefit children, pregnant women and IDPs/refugees with special needs, who are most vulnerable and dependent on humanitarian supplies within a camp setting. Keeping in mind that total humanitarian aid in 2017 amounted to 27.8 billion USD, an efficiency increase of 0.1% would already result in 27.8 million USD that could be utilized to better effect.

ELVA, the International Organization for Migration (IOM) and Notilyze have been cooperating to create such a Camp Forecast Tool. With this consortium we combine decades of leading humanitarian experience (IOM) with data collection, analysis and visualization experience in 20 conflict-affected countries worldwide (Elva Community Engagement) and strong commercial expertise building cutting-edge AI-driven supply chain solutions for commercial and non-commercial actors (Notilyze).

WASH Requirements Dashboard

Figure 1: WASH Requirements Dashboard

Until now this consortium has focused on simplifying the inventory of the stocks and the needs for basic WASH supplies in camps. Instead of completing a lengthy monthly survey with questions on population, current WASH stocks and current needs, a camp manager now only needs to fill in the number of people in the camp to get an estimate of the required WASH supplies and the associated costs (see Figure 1).

A disadvantage of IOM’s monthly “camp site assessments” as input for this forecast model is that the data only becomes available a month after the survey has been taken. To increase the quality of these forecasts, a good estimate of the current population in a camp is helpful. Therefore IOM would like to analyse other data sources that would help to gather more detailed data more efficiently and provide more accurate forecasts.

That is where we come in. Using satellite imagery we want to estimate the current number of people in 50 refugee camps all over Nigeria. With this information we have better input for the forecast model and we can revise our forecasts more quickly. Using SAS we want to build an operational object detection model to streamline estimations of camp sizes. The goal is to deliver insightful information on refugee populations with the SAS EMEA Hackathon 2020.

This goal perfectly fits the goal of the Hackathon, which is using data for good and linking the use case to the UN Sustainable Development Goals (see Figure 2).


Figure 2: Camp Forecast will respond to the following of the United Nation Sustainable Development Goals ( Sustainable Development Goals, 2020 )



van der Laan, E., van Dalen, J., Rohrmoser, M., & Simpson, R. (2016). Demand forecasting and order planning for humanitarian logistics: An empirical assessment. Journal of Operations Management, 45, 114-122.

Security awareness

In collaboration with the ComplianceAgency, Notilyze is working on a new information security system, so that we can assure you that your data will be and remain safe with us.

As an element of this process, the Notilyze team held a security awareness session, hosted by the ComplianceAgency. This was to discuss and update our knowledge about data security and to create awareness within the Notilyze team. We as a company find it very important to keep our clients’ data secure as well as to keep our employees up-to-date about the newest trends.

One of the goals of this new information security system is to obtain the ISO 27001 and NEN 7510 certificates.

ISO 27001 describes how to manage information security in a process-oriented way, with the aim of ensuring the confidentiality, availability and integrity of information within your organization. This includes the protection of personal and/or company data and protection against hackers and burglary.

The NEN 7510 norm is a standard for information security for the health care sector in the Netherlands, developed by the Dutch Standardization Institute (NNI).