Hands-on Tutorial

In this tutorial, you can follow a typical process mining scenario, step by step, to get a first overview of the kinds of questions that can be answered with process mining. If you have not installed our process mining software Disco yet, you will find instructions in the Installation chapter.

The goals of this tutorial are:

  • Help you understand the phases of a process mining analysis.
  • Enable you to get started and play around with your own data.

Example Scenario

Imagine you are the manager of the purchasing process in your organization. Once a request is submitted (for example, an employee needs a new computer), it has to be approved by the manager and is then forwarded to the purchasing department, where an agent looks up the best option and places the order with the supplier. Finally, the invoice is paid by the finance department. For simplicity, we assume that the whole process is handled within one system, for example an Enterprise Resource Planning (ERP) system like SAP (see Figure 1).


Figure 1: Example scenario: Purchasing process supported by an ERP system

Imagine further that you have the following three problems with this process:

  1. Inefficient operations: You are looking for ways to make the process more efficient.
  2. Compliance: You were asked to demonstrate that the process is performed according to the purchasing guidelines. You think it is, but now you need to prove that this is the case.
  3. Complaints: You have received complaints that the process takes too long. Normally, the process should be completed within 21 days. You don’t know whether this timeframe is indeed sometimes exceeded, and, if so, whether it is a wide-spread problem or just some individual bad experiences that some people had.

You decide to do a process mining analysis to get an objective picture of what the process looks like. From the above problems, you derive the following analysis goals:

  1. Understand the process in detail
  2. Check whether there are deviations from the payment guidelines
  3. Control performance targets (21 days)

A typical process mining project goes through the following main phases:


Figure 2: Typical phases of a process mining project

In the first phase, also called the Scoping or Questions phase, you define the goals of the process mining project. Which process are you going to analyze? Where does it start, where does it stop? People often have different ideas about the process, even if they use the same name. What are the main questions that you want to answer about the process? Which IT systems are involved in the execution of the process?

In our scenario, we already know the process scope, the involved IT systems, and the questions about the process that we want to answer:


Figure 3: The questions that we want to answer about the purchasing process in this tutorial

So, we can move on to the second phase: The data extraction. As the process owner of the purchasing process you will not extract the data yourself from the ERP system. Instead, you will work with your IT department and request the data from them.

To do this, you need to understand the data requirements for process mining and how they translate for your own process. We will get to this in the next chapter (see Data Requirements). For now, let’s assume that you have received a CSV file from the IT staff.


Figure 4: You will typically not extract the data yourself. Instead, you will work with the IT department and request the data from them

You can download the extracted data for the example scenario as a CSV or Excel file.


Figure 5: With our questions about the process defined, and the data extracted, we can start going through the process mining tutorial step by step.

Now, we can get started with the analysis!

Step 1 - Inspect Data

As a first step, let’s take a look at the data that we have received from the IT department. You can simply open the CSV file in Excel or in a text editor (see Figure 6).


Figure 6: Purchase order 339 highlighted in the example data set

Each row corresponds to one event, which is an activity that has taken place in the process.

The first column provides information about the Case ID that the event belongs to. In this process the Case ID is the purchase order number. A Case ID is necessary to correlate events that belong to the same process instance (as shown for Case 1, Case 2, and Case 3 in the illustration of How Does it Work?). For example, you can see that eight events have been recorded for purchase order 339.

The fourth column shows which activity has taken place. For example, the first activity in case 339 was “Create Purchase Requisition”, the second one was “Analyze Purchase Requisition”, and so on. In process mining, we need at least one activity column to show the steps that were performed in the process (see A, B, C, etc. in the illustration of How Does it Work?).

The second and the third column indicate the date and time at which each activity was started and completed. We need at least one timestamp (either the start or the completion) for each event to bring the steps in each process instance into the right order. [1]

A case ID, an activity name, and at least one timestamp for each event are the minimum requirements for process mining (refer to Data Requirements for further details later). But if we have additional information in the data set, we can also use that for additional analysis. For example, for this purchasing process we also have information about the employee who has performed the activity and about their function in the organization.
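
The three minimum columns can be illustrated with a small Python sketch. It is not part of Disco; the column names and the timestamps below are made up for illustration, and only the purchase order number 339 and the two activity names come from the example. The sketch groups the rows by case ID and uses the timestamp to bring the events of each case into the right order.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical extract: column names and timestamps are assumptions,
# not the actual content of the tutorial's CSV file.
RAW = """Case ID,Complete Timestamp,Activity
339,2011-02-16 14:31:00,Analyze Purchase Requisition
339,2011-02-16 14:25:00,Create Purchase Requisition
940,2011-02-17 09:10:00,Create Purchase Requisition
"""

def load_traces(text):
    """Group events by case ID and order each case by its timestamp."""
    cases = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        ts = datetime.strptime(row["Complete Timestamp"], "%Y-%m-%d %H:%M:%S")
        cases[row["Case ID"]].append((ts, row["Activity"]))
    # Sorting the (timestamp, activity) pairs restores the order of the steps.
    return {case: [activity for _, activity in sorted(events)]
            for case, events in cases.items()}

traces = load_traces(RAW)
print(traces["339"])
```

Note that the rows for case 339 are deliberately given out of order: without the timestamp there would be no way to tell which step came first.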

Step 2 - Import Data

Now let’s get started with process mining and import the data set into Disco! You can click the ‘Open’ button in the top-left corner and locate the file on your hard disk. Once you have selected it, you will see a preview of the first 1000 rows of the data set in a view similar to the one we have just seen in Excel (see Figure 7).


Figure 7: The import screen lets you configure the meaning of each column in the data set

You can now select each column (it will be highlighted in blue) and tell Disco how it should interpret this column: At the top you find configuration options for the Case ID, the Activity name, Timestamps, Resource, and Other (which are additional attributes).

For example, the first column is currently selected and above you can see that it is configured as the Case ID. Disco tries to guess the right configuration for your data, but to make sure it got it right, go through each of the columns and choose the right configuration at the top. The two timestamp columns should be set as Timestamp, the Activity column as Activity, the Resource column as Resource, and the Role column as Other.

When you are finished, you can verify that you have configured the data correctly by comparing the little icon that you see next to the header of each column with the screenshot in Figure 7. Then, click the Start import button in the lower right corner.

Step 3 - Inspect Process

As soon as you click ‘Start import’, Disco will mine your data set and automatically display a process map that shows you how the process was really performed (see Figure 8). Note that we did not have a process model at hand, but using process mining we have automatically discovered the process just from the historical data!


If you hit the import limit of the demo version, you can either follow the tutorial from here using the Sandbox example that comes with Disco (see First Steps after Installation: The Sandbox) or you can download this FBT file (a special file that behaves like a CSV file but can be opened with the demo version of Disco).


Figure 8: After the import, you are directly brought to the process map that shows you how the process was actually performed. We can see that there were a lot of amendments (rework) in this process

At the top of the process map you can see a little triangle. This is the start point of the process. We can see that there are 608 cases (608 purchase orders) in the data set, and all 608 cases started with the activity “Create Purchase Requisition” as the first step in the process.

Afterwards, the process splits into two different paths: 374 times it goes to the left and 234 times it goes to the right. The numbers, the thickness of the arrows and the coloring all indicate the frequency of how often certain parts of the process have been performed.

Immediately, we can see one unexpected pattern in the purchasing process: The “Amend Request for Quotation” activity is only supposed to be performed in exceptional situations, because in this step a change is being made to an existing request. However, we can see now that this activity was performed more than 500 times for just 608 cases! This does not look like an exception, and we need to find out why this happens so frequently.
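
To make the numbers in the process map more concrete, here is a minimal Python sketch of how such frequencies could be counted from ordered traces. It is an illustration only and not Disco's implementation; the shortened activity names and the three toy cases are made up. It counts how often each activity occurs, in how many cases it occurs at least once, and how often one activity is directly followed by another.

```python
from collections import Counter

# Made-up toy traces (one ordered list of activities per case).
traces = {
    "c1": ["Create", "Analyze", "Amend", "Analyze", "Amend", "Analyze", "Pay"],
    "c2": ["Create", "Analyze", "Pay"],
    "c3": ["Create", "Pay"],
}

def activity_counts(traces):
    absolute = Counter()  # how often the activity was performed overall
    per_case = Counter()  # in how many cases it was performed at least once
    for steps in traces.values():
        absolute.update(steps)
        per_case.update(set(steps))
    return absolute, per_case

def path_counts(traces):
    """Count how often one activity is directly followed by another."""
    paths = Counter()
    for steps in traces.values():
        paths.update(zip(steps, steps[1:]))
    return paths

absolute, per_case = activity_counts(traces)
paths = path_counts(traces)
print(absolute["Amend"], per_case["Amend"], paths[("Create", "Analyze")])
```

A rework activity such as "Amend" can have a much higher absolute count than case count, because a single case may go through the loop several times.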


Figure 9: You can reduce the number of activities that are displayed by pulling down the ‘Activities’ slider. At the lowest point only the activities of the most frequent variant are shown

One thing that will become apparent once you start analyzing your own data is that real-world processes become very complex very quickly. Therefore, we need to be able to deal with that complexity. Fortunately, with process mining and Disco you can adjust the level of detail that you want to see in your process map.

To try this out, first pull down the Activities slider on the right (see Figure 9). As a result, you now see a simplified process map that only shows the activities of the most frequent variant. This is the main flow of the process.


Figure 10: At 100% you see all of the activities, but you are still looking at a simplified version of the process. For example, some numbers of the incoming and outgoing arcs do not add up yet

When you start pulling up the Activities slider again, then gradually more and more of the less frequent activities are shown. At 100% all of the steps that are recorded in the data are visible again. For example, activity “Amend Purchase Requisition” was performed just 11 times and came in as one of the last steps (see Figure 10).

However, we can see that we are still looking at a simplified version of the process, because the numbers don’t add up yet. Right? For example, “Amend Purchase Requisition” was performed 11 times in total. But while there is an incoming path from activity “Analyze Purchase Requisition” with frequency 11, the outgoing path only shows a frequency count of 8. Where are the other three?


Figure 11: If both the ‘Activities’ and the ‘Paths’ sliders are pulled up completely, then you see all of the activities and all transitions between them (nothing is hidden)

The reason is that while we see all activities in the process map, we currently see only the most important process flows between these activities. To reveal the full process, including all activities and all transitions between them, now also pull up the “Paths” slider on the right (see Figure 11).

You can now see that the missing three are going out to activity “Create Request for Quotation”. This arrow with frequency count 3 was hidden before, but now with both the Activities and the Paths sliders at 100% we really see everything that has happened in this process.
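
The arithmetic behind the "missing three" can be sketched in a few lines of Python. Please note the assumptions: a plain frequency cut-off is used here as a stand-in, which is not how Disco decides which paths to hide, and the target of the path with frequency 8 is not named in the text, so a placeholder is used for it.

```python
# Path frequencies around "Amend Purchase Requisition" as described above.
# The successor of the frequency-8 path is a placeholder (not given in the text).
paths = {
    ("Analyze Purchase Requisition", "Amend Purchase Requisition"): 11,
    ("Amend Purchase Requisition", "<successor not named in the text>"): 8,
    ("Amend Purchase Requisition", "Create Request for Quotation"): 3,
}

def visible_paths(paths, min_frequency):
    """Simplified stand-in for the Paths slider: hide infrequent paths."""
    return {edge: n for edge, n in paths.items() if n >= min_frequency}

def outgoing_total(paths, activity):
    return sum(n for (source, _), n in paths.items() if source == activity)

simplified = visible_paths(paths, min_frequency=5)
print(outgoing_total(simplified, "Amend Purchase Requisition"))  # simplified map
print(outgoing_total(paths, "Amend Purchase Requisition"))       # all paths shown
```

With the infrequent path hidden, the outgoing count (8) is lower than the incoming count (11). Once all paths are shown, both sides add up to 11 again.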

If you are wondering why you would ever want to see anything less than 100% of your process, and to learn more about how the simplification sliders work, you can read this article on Simplification Strategies for Process Mining.

Step 4 - Inspect Statistics

With the process map, we have now obtained a bird’s eye view on the overall process. As a next step, let’s inspect some of the process statistics by changing to the ‘Statistics’ tab at the top (see Figure 12).


Figure 12: In the overview statistics we can see that there are a lot of long-running cases, taking 81 days from the beginning to the end of the process and more

On the right, you can find some overview statistics about your data set. For example, we can see that there are 608 cases (purchase orders) and 9119 events (rows in the data set). This is quite a small data set. You can analyze many millions of records with Disco. You can also see the timeframe of the process that is covered: The data runs from January 2011 to October 2011, so there are about ten months of data.

Now, recall that we have received complaints about the throughput time for this process. So, to take a look at the performance you can change from the Events over time to the Case duration statistics next to the chart (see Figure 12). The case duration shows the time from the very beginning to the very end of the case. When you move the mouse over the histogram, you can see that most cases are completed within 16 or 17 days in total. However, there are also quite a few that take much longer than that: 80 days, 90 days, and even more.
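
As a rough illustration of what the case duration means, the following Python sketch computes it as the time between the earliest start and the latest completion timestamp of each case. The two cases and their timestamps are made up; this is a sketch under these assumptions, not a description of how Disco computes its statistics internally.

```python
from datetime import datetime, timedelta

def t(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M")

# Made-up events: (case ID, start timestamp, completion timestamp).
events = [
    ("A", t("2011-01-03 09:00"), t("2011-01-03 09:30")),
    ("A", t("2011-01-19 09:00"), t("2011-01-19 09:30")),
    ("B", t("2011-02-01 08:00"), t("2011-02-01 08:15")),
    ("B", t("2011-04-22 08:00"), t("2011-04-22 08:15")),
]

def case_durations(events):
    """Time from the very first to the very last timestamp of each case."""
    first, last = {}, {}
    for case, start, complete in events:
        first[case] = min(first.get(case, start), start)
        last[case] = max(last.get(case, complete), complete)
    return {case: last[case] - first[case] for case in first}

durations = case_durations(events)
for case, duration in durations.items():
    print(case, duration.days, "days")
```

In this toy data, case A stays below the 21-day target, while case B is one of the long-running cases that we want to investigate.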

This does not seem like an exception but it looks like we have a serious problem with the throughput time in this process. As the process owner, you now want to know where in the process we are spending so much time that we end up with such long case durations. We will come back to that in a moment.

Step 5 - Inspect Cases

Before we focus on the discovered performance problem, let’s go one step deeper to inspect individual process instances by changing to the ‘Cases’ tab (see Figure 13).

In the right area, you see the concrete history of an individual case. Switch from the ‘Graph’ view to the ‘Table’ view to get a more compact representation. For example, in Figure 13 the history of purchase order 151 is shown: There were just two steps, first “Create Purchase Requisition” and then “Analyze Purchase Requisition”. If you select another case, then the history for that other case will be shown.


Figure 13: The cases view lets you inspect individual cases with all their additional attributes. It also shows you how many variants there are and shows you the different variants sorted based on their frequency. We can see that the third-most frequent scenario in this process are requests that were stopped early on

The Cases view is really important, because not only does process mining show you an objective process map based on the data, but for any problem that you discover in your analysis you can always go back to a concrete example case that has this problem. This makes it possible for you to do a root cause analysis and take action.

On the left side, the Cases view lets you inspect the variants of your process. A variant is a sequence of steps, from the very beginning to the very end. If two cases follow exactly the same path through the process in their order of activities then they belong to the same variant.

By looking at the most frequent variants you can often already get an understanding of the main scenarios, covering 60-80% of the process. For example, the purchasing process has 98 variants in total and the most frequent variant covers 88 cases (ca. 15% of the data set). Variant 2 covers 77 cases, and so on. When you select a variant at the left, then the second column shows you a list of all cases that belong to that particular variant.
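
The idea of a variant can be sketched in Python as follows. The traces are made-up toy data with shortened activity names; only the figures 88, 77, and 608 in the last lines come from the example above. Cases that share exactly the same sequence of activities are counted as one variant.

```python
from collections import Counter

# Made-up toy traces: six cases that fall into three variants.
traces = {
    "c1": ["Create", "Analyze", "Order", "Pay"],
    "c2": ["Create", "Analyze", "Order", "Pay"],
    "c3": ["Create", "Analyze", "Order", "Pay"],
    "c4": ["Create", "Analyze", "Amend", "Analyze", "Order", "Pay"],
    "c5": ["Create", "Analyze", "Amend", "Analyze", "Order", "Pay"],
    "c6": ["Create", "Analyze"],
}

def variants(traces):
    """Return (sequence, number of cases) pairs, most frequent first."""
    counts = Counter(tuple(steps) for steps in traces.values())
    return counts.most_common()

ranked = variants(traces)
for number, (sequence, cases) in enumerate(ranked, start=1):
    print(f"Variant {number}: {cases} cases, {len(sequence)} steps")

# Coverage of the two most frequent variants in the tutorial data set.
coverage_1 = round(100 * 88 / 608, 1)  # 14.5 (%)
coverage_2 = round(100 * 77 / 608, 1)  # 12.7 (%)
```

The last two lines simply redo the arithmetic for the example: 88 of 608 cases are about 14.5%, which matches the "ca. 15%" mentioned above.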

Interestingly, we can find that in Variant 3 the process has been stopped after just two steps. After the “Analyze Purchase Requisition” step, the process ended because the request was rejected. [2] Of course this can happen, but we would not expect a stopped request to be the third-most frequent variant in this process. You could say that this scenario is also a form of waste, because ideally we would not have started the purchasing request in the first place (avoiding these two steps altogether). A solution may be to update the purchasing guidelines to clarify which things employees can buy and which they cannot.


Figure 14: This early end point is visible in the process map through the dashed line that leads towards the end point

You can also find this early end point in the process map (see Figure 14). When you go back to the Map view, you can see a dashed line leading from activity “Analyze Purchase Requisition” to the end point. There is just one dashed line leading from the start point in the process, so all cases have started with the same activity “Create Purchase Requisition”. But next to the regular end activity “Pay invoice” you will find that there are two additional early end points in the process. One of them is the scenario of the stopped requests from Variant 3.
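
The start and end points correspond to the first and the last activity of each case. A small Python sketch with made-up toy traces (using activity names from the example) shows how they could be counted:

```python
from collections import Counter

# Made-up toy traces; the activity names are taken from the example.
traces = {
    "c1": ["Create Purchase Requisition", "Analyze Purchase Requisition", "Pay invoice"],
    "c2": ["Create Purchase Requisition", "Analyze Purchase Requisition", "Pay invoice"],
    "c3": ["Create Purchase Requisition", "Analyze Purchase Requisition"],
}

def endpoints(traces):
    """Count the first and the last activity over all cases."""
    starts = Counter(steps[0] for steps in traces.values() if steps)
    ends = Counter(steps[-1] for steps in traces.values() if steps)
    return starts, ends

starts, ends = endpoints(traces)
print(starts)
print(ends)
```

One start activity but more than one end activity is the pattern described above: every case begins in the same way, but some cases stop early.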


Figure 15: We have already answered one of our three questions and will now dive into the long-running cases

Now, let’s stop for a moment and review our analysis goals from the beginning (see Figure 15). We can see that we have answered the first question: The discovered process map gives us a complete overview of the actual process. Furthermore, we have already found some opportunities for process improvement. We have seen that there were a lot of amendments in the process (the rework loop we have discovered in the process map) and that there are lots of stopped requests (the Variant 3 scenario).

We have also seen that there are indeed quite a few cases that take much longer than the expected 21 days. This is what we want to investigate next.

Step 6 - Filter on Performance

You can use filters to focus on particular questions about your process. To investigate why some of the cases are taking so long we will use the Performance filter.


Figure 16: To focus on the long-running cases, add a filter and select the Performance filter

You can add a Performance filter by clicking on the filter symbol in the lower left corner and then choosing the filter from the list as shown in Figure 16.


Figure 17: Then, move the left end of the slider to the 21 day mark and apply the filter

Then move the left end of the slider to the right around the 21 day mark (see Figure 17). The blue area now covers all cases that we want to focus on: The cases that take longer than 21 days. We can see that ca. 15% of all cases in the data set fall outside of the service level target for this process.

Now click the ‘Copy and filter’ button in the lower right corner and give your analysis a short name (we used “SLA Analysis”) to save it in your project and press ‘Create’.

Step 7 - Visualize Bottlenecks

As a reminder that you are not looking at the full data set at the moment, you can see a pie chart in the lower left corner (see Figure 18). It indicates that you are currently looking at 15% of the data. So, the process map now shows you the process flow only for the 92 cases that take longer than 21 days (out of the 608 in total).
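
Conceptually, the Performance filter keeps the cases whose duration exceeds a threshold. The following Python sketch illustrates this with made-up case durations; the function name `long_running` is hypothetical, and only the figures 92, 608, and 21 days come from the example.

```python
from datetime import timedelta

# Made-up case durations (case ID -> time from first to last event).
durations = {
    "c1": timedelta(days=12),
    "c2": timedelta(days=21),
    "c3": timedelta(days=35),
    "c4": timedelta(days=81),
}

def long_running(durations, limit):
    """Keep only the cases that took longer than the given limit."""
    return sorted(case for case, d in durations.items() if d > limit)

slow_cases = long_running(durations, timedelta(days=21))
print(slow_cases)

# Share of long-running cases in the tutorial data set: 92 of 608 cases.
share = round(100 * 92 / 608, 1)
print(share, "%")
```

The last lines redo the arithmetic behind the pie chart: 92 of 608 cases are about 15.1%.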


Figure 18: You now see the process map only for the 15% of the cases that took longer than 21 days. To inspect the delays in the process, rather than the frequencies, you can switch to the Performance view

We notice that the rework loop around activity “Amend Request for Quotation” has become even more dominant than before. We are now going through this loop almost 3 times per case on average! [3] However, this is a situation where we are much more interested in the timing information. We want to know where in the process we are losing so much time that we miss our 21-day target.

To investigate this, you can change to the Performance view in the lower right (see Figure 18). The timestamps in the data set are now analyzed to project the execution times (the time someone actively works on a particular step in the process, shown within the activity boxes) as well as the waiting times (the delay between the completion of one activity and the start of the next activity, shown along the paths) on the process map. Initially, the Total duration (the cumulative delay over all cases) is shown, which is great to quickly find the high-impact areas in the process. To inspect the average delays, you can switch from Total duration to Mean duration in the drop-down list to the right (see Figure 19).
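
The difference between execution times and waiting times, and between total and mean durations, can be sketched in Python. This is an illustration under stated assumptions and not Disco's implementation: the two cases, the shortened activity names, and all timestamps are made up, and each event is assumed to carry both a start and a completion timestamp.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def t(text):
    return datetime.strptime(text, "%Y-%m-%d %H:%M")

# Made-up cases: ordered lists of (activity, start, complete).
cases = {
    "c1": [("Amend", t("2011-03-01 09:00"), t("2011-03-01 09:10")),
           ("Analyze", t("2011-03-11 09:10"), t("2011-03-11 09:40"))],
    "c2": [("Amend", t("2011-03-02 10:00"), t("2011-03-02 10:20")),
           ("Analyze", t("2011-03-22 10:20"), t("2011-03-22 10:50"))],
}

def performance(cases):
    execution = defaultdict(list)  # activity -> time spent working on it
    waiting = defaultdict(list)    # (from, to) -> idle time between the steps
    for events in cases.values():
        for activity, start, complete in events:
            execution[activity].append(complete - start)
        for (a, _, done), (b, begin, _) in zip(events, events[1:]):
            waiting[(a, b)].append(begin - done)
    return execution, waiting

def total(durations):
    return sum(durations, timedelta())

def mean(durations):
    return total(durations) / len(durations)

execution, waiting = performance(cases)
print("Mean execution time:", mean(execution["Amend"]))
print("Mean waiting time:", mean(waiting[("Amend", "Analyze")]))
print("Total waiting time:", total(waiting[("Amend", "Analyze")]))
```

The total duration grows with the number of cases that take a path, which is why it points to high-impact areas, while the mean duration shows the typical delay per case.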


Figure 19: In the performance view, change the drop-down list from Total duration to Mean duration to see how much time is spent in each part of the process on average

We can see that not only are we going through this loop unnecessarily often, but it also causes significant delays. The “Amend Request for Quotation” step itself does not take particularly long (just 9.8 minutes on average). However, after this step has been completed, there is an average waiting time of 14.9 days before the normal process continues. Also in other parts of the process, we can see huge delays.

Clearly, we have discovered a bottleneck around activity “Analyze Request for Quotation”. Process mining cannot tell us why we have that bottleneck. We need to go outside of the process mining tool, and outside of the data, to speak with the people who are involved in the process. One reason might be that there is a very high workload and not enough resources available for that part of the process. Another reason could be that this step is performed by a manager as a low-priority task every four weeks and in the meantime cases are piling up.

What process mining can do is show us where we have problems in our process. Because our analysis results are based on data, we can see objectively where we need to focus. Figure 19 shows a quite typical example in the sense that the waiting times (here more than 14 days on average) are often orders of magnitude higher than the execution times (here just 10 minutes on average). In most process improvement projects, the focus is not on making people work faster but on organizing the process in a smarter way. For example, the manager might not know that the way they organize their work (performing a particular task just once every four weeks), while convenient and efficient for them, has a big impact on the process overall.

This brings us to a second huge benefit of process mining: In addition to showing objectively where the problem areas are, process mining helps to communicate those findings to the people who are working in the corresponding business units. Change initiatives are hard, because nobody really likes to change. Furthermore, processes are complex and hard to understand. Charts and statistics are only meaningful in a limited way and are often too abstract.

Process Mining allows you to provide a visual representation to the process owner and to other people working in the process. This helps you to engage them in your improvement initiative. In many situations, you can also profit from their domain knowledge in interactive analysis workshops, because they will be intuitively able to understand the process maps of “their” process and point out additional information you were not aware of.

Step 8 - Animate Process

Similar to the process visualization in the process maps, the animation can be extremely helpful in the communication of any process problems you have found.

Click on the Animation button in the middle at the bottom of the process map to get to the Animation view (see Figure 20). Then press the Play-button in the lower left corner.


Figure 20: Rather than displaying average durations and waiting times, the animation provides a dynamic view on the process flow

We are still looking at just the 15% slow-running cases. But instead of seeing the average delays as before in the process map, you can now see a dynamic replay [4] of the process over the 10 months of data that we have. Every yellow dot represents one case that is moving through the process at its actual, relative speed based on the timestamps in the imported data set.

This way, we can make the discovered bottleneck really tangible for people and “bring it to life”. You might be surprised how many people only truly understand that they are actually looking at a process once they have seen the animation move cases around.


Figure 21: Two out of the three questions have already been answered

Let’s take a look again at our original questions (see Figure 21). You can see that we have answered two of the three questions. We now know that there is a bottleneck around activity “Analyze Request for Quotation” and that we need a process change to address this.

Step 9 - Compliance Check

To look at a compliance example, let’s go back to the full process. To do this, simply click on the drop-down list with the +1 at the top and change back to the initial data set (see Figure 22).


Figure 22: To get back to the full process, click on the drop-down list at the top and return to the initial data set

If you then scroll towards the end of the process, you can see that at some point the invoice is being sent and ultimately paid. There is also one activity called “Release Supplier’s Invoice”, which is an extra step in the process to prevent fraud. This activity is mandatory and should always be executed. However, you can now see that there are actually 10 cases bypassing this mandatory step, directly going on to the step “Authorize Supplier’s Invoice payment” (see Figure 23).


Figure 23: In the frequency view we can see that the mandatory activity ‘Release Supplier’s Invoice’ has actually been skipped 10 times!

We were not aware that it was even possible to skip this mandatory process step. Now we know that it is happening and how frequently (10 times for 608 cases). But to take action we need to know which ten cases are bypassing the fraud prevention step.

To find this out, you can click on the path leading from “Send invoice” to “Authorize Supplier’s Invoice payment” (see Figure 24). This will bring up an overview badge with all the different frequency and performance metrics for this path.


Figure 24: To find out which 10 cases deviated from the expected process, you can click on the path and use the ‘Filter this path…’ shortcut at the bottom of the overview badge

At the bottom you see a ‘Filter this path…’ button. This is a shortcut to add a pre-configured filter that only keeps all those cases that exactly follow that particular path in the process. Click the ‘Filter this path…’ shortcut in the overview badge and then click the ‘Copy and filter’ button in the lower right corner to save the new analysis (see Figure 25).
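
What such a path filter does can be sketched in Python. The sketch assumes that "following the path" means that the first activity is directly followed by the second one somewhere in the case; the function names and the toy cases are made up, while the activity names come from the example.

```python
# Made-up toy traces; c2 skips the mandatory release step.
traces = {
    "c1": ["Send invoice", "Release Supplier's Invoice",
           "Authorize Supplier's Invoice payment", "Pay invoice"],
    "c2": ["Send invoice", "Authorize Supplier's Invoice payment", "Pay invoice"],
    "c3": ["Send invoice", "Release Supplier's Invoice",
           "Authorize Supplier's Invoice payment", "Pay invoice"],
}

def follows_path(steps, source, target):
    """True if source is directly followed by target somewhere in the case."""
    return any(a == source and b == target for a, b in zip(steps, steps[1:]))

def filter_path(traces, source, target):
    return {case: steps for case, steps in traces.items()
            if follows_path(steps, source, target)}

deviating = filter_path(traces, "Send invoice",
                        "Authorize Supplier's Invoice payment")
print(sorted(deviating))
```

The result is the list of case IDs that took the shortcut, which is exactly the kind of list you need to follow up with the people involved.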


Figure 25: Through the ‘Filter this path…’ shortcut, a pre-configured Follower filter will be automatically added and you can simply apply these filter settings to focus your analysis on the 10 cases that went through the path of the process that you clicked on

Then, change from the Map view to the Cases tab to see which ten cases have deviated from the prescribed process (see Figure 26).


Figure 26: After applying the filter and changing to the Cases view, you can take a deeper look at exactly the 10 cases that skipped the mandatory process step

This provides us with an opportunity to speak with the people who are involved and find out why the mandatory process step was skipped. Perhaps there was a good reason and we can take their explanation into account. If the deviation should not have happened, we can start thinking of ways to prevent such compliance deviations in the future. For example, we might provide targeted training. Or we can implement a system change that enforces the mandatory step from now on.


Figure 27: We have now answered all three questions and defined follow-up activities to address the discovered problems

If we look again at our original questions one last time (see Figure 27), we can see that we have answered all three questions. In addition to the inefficiencies and the bottleneck discovered earlier, we have now also found a deviation from the prescribed process that we did not expect.

To conclude the tutorial, let’s take one last step to see how we can also take different perspectives on the data.

Step 10 - Organizational View

Imagine that we have found out everything we wanted to know about the activity flows in this process. We now want to understand how the process works from an organizational view.


Figure 28: To take a different view on the data, you can go to the Project view and press the grey ‘Reload’ button in the right area

To shift to this perspective, you can go to the project view (we have not been there before) by clicking on the second-left button at the top (see Figure 28). Then find the grey ‘Reload’ button in the area on the right and click it to get back to the import screen.

You can now change the import configuration in the following way. Select the Activity column and change the configuration at the top to ‘Other’ (this will make it just a regular attribute in the data set). Then select the Role column and configure it as ‘Activity’. Figure 29 shows this new configuration.


Figure 29: This brings you back to the import screen, where you can switch the configuration for the Activity and Role columns: The Role column is now configured as the activity

Now press the ‘Start import’ button again and your data set will be imported from a different perspective. In the process map you now don’t see the activity sequences displayed but rather the hand-over of work between different functions, or departments, in your organization (see Figure 30).
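
The change of perspective can be illustrated with a small Python sketch: the same rows are turned into traces once based on the Activity column and once based on the Role column. The role names and the two toy cases are made up, and counting only the transitions between different roles is a choice made for this sketch, not a description of how Disco draws the map.

```python
from collections import Counter

# Made-up rows per case; the role names are assumptions for illustration.
cases = {
    "c1": [
        {"Activity": "Create Purchase Requisition", "Role": "Requester"},
        {"Activity": "Analyze Purchase Requisition", "Role": "Manager"},
        {"Activity": "Amend Purchase Requisition", "Role": "Requester"},
        {"Activity": "Analyze Purchase Requisition", "Role": "Manager"},
        {"Activity": "Create Request for Quotation", "Role": "Purchasing Agent"},
    ],
    "c2": [
        {"Activity": "Create Purchase Requisition", "Role": "Requester"},
        {"Activity": "Analyze Purchase Requisition", "Role": "Manager"},
    ],
}

def project(cases, column):
    """Build one trace per case from the chosen column."""
    return {case: [row[column] for row in rows] for case, rows in cases.items()}

def handovers(traces):
    """Count transitions between two different values (hand-overs of work)."""
    counts = Counter()
    for steps in traces.values():
        counts.update((a, b) for a, b in zip(steps, steps[1:]) if a != b)
    return counts

role_view = handovers(project(cases, "Role"))
print(role_view)
```

In this toy data, the work bounces from the manager back to the requester once, which is the kind of ping-pong behavior that becomes visible in the organizational view.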


Figure 30: As a result, we can see the same process from an organizational perspective. The hand-over of work between different organizational units becomes visible

In this view, you can see ping pong behavior and delays between the different roles. Often, inefficiencies emerge in the hand-over of work between organizational units, because one group is not responsible anymore and the new team has not claimed ownership yet.

This is just one example. You can take many different perspectives on your process, even based on the same data set. This flexibility to explore a process from different viewpoints is one of the big powers of process mining, right next to the ability to quickly focus on particular questions through the various filters.

Take-away Points and Next Steps

There are a few things you can learn from this exercise.

  • First of all, real processes are often more complex than you might expect. For example, who would have thought that this purchasing process has 98 variants? People typically underestimate how complicated their processes really are. That’s why process mining adds a lot of depth to the classical, manual process analysis. With a manual process mapping approach you can usually get a good overview of the main process (the “sunny day scenario”) but you will never be able to get a hold of all the exceptions.
  • Secondly, there is not one “right” model. You can look at your process with varying levels of detail, you can focus on particular subsets based on the questions that you have, and often you can take different perspectives already during import. In any given analysis, you will create multiple views to explore different questions (see also Take Different Perspectives On Your Process).
  • Thirdly, process mining is not about mining a data set to create one process model and then you are done. Once you have imported your data, this is only the starting point of your process mining analysis. Without programming, you can explore your process very quickly by translating your process questions into the right combination of filters. Process mining is an interactive activity that requires an analyst with process mining experience who has access to domain knowledge about the process to interpret what they are seeing.

To continue practicing, you can download the demo logs that come with Disco. If you want to start analyzing some of your own data right away, you can jump to the Data Requirements section. You will be surprised about the new insights that process mining can bring you!

If you first want to learn more about the typical Process Mining Use Cases, simply continue to the next chapter.

[1] The events are sorted in this example, but keep in mind that normally the events coming out of a database would not be sorted and at least one timestamp will be needed to bring them into the right order.
[2] If you are thinking that another reason could be that these cases are not completed yet, you are right! It is a typical data preparation step to filter out incomplete cases. You can use the Endpoints Filter to do that. However, in this example only closed cases have been extracted from the ERP system in the first place (based on a status attribute in the database). So, there are no incomplete, or “in progress”, cases in this scenario.
[3] If you want to see how many cases go through that loop, then you can switch from Absolute frequency to Case frequency in the drop-down list to the right.
[4] Note that a process mining animation should not be confused with a simulation, which you might know from process modeling tools. They are not the same. In fact, you could argue that they are the opposite: Simulation is built on manually constructed models to play out future “What if” scenarios. In contrast, process mining shows the actual process by replaying the current reality. Both are complementary. You can find more information in this article about process mining and simulation.