Self-service health check reports for SQL Server and PostgreSQL with Jupyter Notebook

Alpay Kurbaloğlu
Published in Trendyol Tech
Jun 14, 2021 · 6 min read


Hello everyone,

In this story, we will talk about how we do health checks on our SQL Server and PostgreSQL databases at Trendyol.

Health checks are important for any kind of system. Periodic checks can save you from many crashes and failures during busy times. For the systems where you store and use your data, health checks are even more important.

At Trendyol, we have lots of applications and lots of databases, and we review these systems with our application teams every week. Checking them one by one by connecting, querying and writing the results into some kind of document would be painful on top of our other weekly tasks. Therefore we developed a process that checks databases and generates reports on request.

Our process is based on Jupyter Notebook. It is a very useful tool when you want to create and share documents with live code: you can run Python code alongside comments and create documents from the notebook.

To start, you have to install Jupyter Notebook. If you already have Python in your environment, you can install the notebook with pip:

pip install notebook

In a shell, you can start the notebook with this command:

jupyter notebook

After executing this command, Jupyter Notebook should open in your browser automatically. You can find it at http://localhost:8888/.

Creating a new Python 3 notebook

After creating a new notebook with a meaningful name, you can write Python code in a “code” cell and run it to see its output.

Running a “code” cell in Jupyter Notebook

If you want to add a comment about this code and its output, you can change the next cell's type to “markdown” and run the cell.

Adding free text with a markdown cell

So far so good: we can put some code in a notebook and add comments about it. But we said we would run queries, right? How can we connect to a database and run a query against it?

Now ipython-sql comes on stage. It “introduces a %sql (or %%sql) magic.” You can find the project page here: https://pypi.org/project/ipython-sql

You can install it via pip:

pip install ipython-sql

You also need drivers for the databases. We will install pyodbc for SQL Server and psycopg2 for PostgreSQL. You can use PyMySQL for MySQL and cx_Oracle for Oracle databases.

pip install pyodbc
pip install psycopg2

For Linux environments, to install pyodbc you should first install the unixODBC-devel, gcc-c++ and python36-devel packages.

Also, for the SQL Server ODBC driver, you can add Microsoft's repository and install the msodbcsql17 package by following these instructions: https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/installing-the-microsoft-odbc-driver-for-sql-server?view=sql-server-ver15

You can check the installed drivers in the file /etc/odbcinst.ini.

We like working with pandas DataFrames. Who doesn't? To do that, we also have to install pandas with this command:

pip install pandas

After that, you can connect to a database using SQLAlchemy connection URLs and then issue SQL commands within IPython or a Jupyter notebook.

A few parameters are required to connect to a database: the connection address, database name, username (the user should have rights to connect to the database and view server state) and, of course, its password.

First, we define variables for the connection info in a separate cell. (You will understand later why we create these variables in their own cell.)

Second, load the sql extension and import pandas.

Third, change some ipython-sql and pandas parameters to make your report prettier.

Finally, prepare your query cell with the “%%sql” prefix to execute queries against the database and push the run button.
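Put together, a minimal sketch of these cells could look like the following. Each “cell N” comment marks a separate notebook cell; the connection values, the %config options and the ODBC driver name (ODBC Driver 17 for SQL Server, matching the msodbcsql17 package above) are placeholders, so adjust them to your own environment.

# --- cell 1: connection info (we will tag this cell "parameters" later for Papermill) ---
host = 'sample-host'
port = 1433
db = 'sampledbname'
user = 'sampleuser'
password = 'somestrongpassword'

# --- cell 2: load the sql extension and import pandas ---
%load_ext sql
import pandas as pd

# --- cell 3: make the report prettier ---
%config SqlMagic.autopandas = True    # return query results as pandas DataFrames
%config SqlMagic.displaycon = False   # do not print the connection string above each result
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_colwidth', None)

# --- cell 4: connect to SQL Server ---
%sql mssql+pyodbc://{user}:{password}@{host}:{port}/{db}?driver=ODBC+Driver+17+for+SQL+Server

# --- cell 5: a query cell (the %%sql magic must be the first line of the cell) ---
%%sql
SELECT @@SERVERNAME AS server_name, @@VERSION AS version_info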

We can add any query required for the database health check. Generally, we use Glenn Berry's wonderful SQL Server diagnostic queries.

https://glennsqlperformance.com/resources/

You can find queries for every SQL Server version that gather version info, hardware info, free space per drive, I/O metrics, missing indexes, bad indexes, unused indexes, all tables with their sizes, availability group information, information about SQL Agent jobs, and the heaviest queries and stored procedures from a CPU, RAM and I/O perspective.
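For instance, a simplified free-space-per-drive check (a stripped-down example in that style, not Glenn's original query) can be run straight from a query cell:

%%sql
-- free and total space per volume, via sys.dm_os_volume_stats
SELECT DISTINCT
       vs.volume_mount_point,
       CAST(vs.total_bytes / 1073741824.0 AS DECIMAL(18, 2))     AS total_gb,
       CAST(vs.available_bytes / 1073741824.0 AS DECIMAL(18, 2)) AS free_gb
FROM sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id) AS vs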

If you want to assign query results to a variable, the << operator captures the results in a local variable and can be used with multi-line queries.

In [3]: %%sql names << SELECT name
...: FROM employee
...:
34 rows affected.
Returning data to local variable names
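Since we like pandas DataFrames, the captured result set can then be turned into one with its DataFrame() method (unless you have already set SqlMagic.autopandas, in which case results come back as DataFrames directly):

# convert the ipython-sql result set captured above into a pandas DataFrame
names_df = names.DataFrame()
names_df.head()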

To query PostgreSQL, you just need to change the driver part of the connection string.

%sql postgresql://{user}:{password}@{host}:{port}/{db}

For PostgreSQL we gather the cache hit ratio, unused indexes, table sizes, bloat ratios, database sizes, tables without a primary key, some PostgreSQL parameters, and the most frequently used and longest-running queries.

You can find some queries here;
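For illustration, one of the simpler checks, a rough buffer cache hit ratio taken from pg_stat_database, could look like this (an example of the style, not our exact query):

%%sql
-- rough buffer cache hit ratio across all databases
SELECT sum(blks_hit) * 100.0 / nullif(sum(blks_hit) + sum(blks_read), 0) AS cache_hit_ratio
FROM pg_stat_database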

OK, so we can connect to SQL Server and PostgreSQL and run queries against our beloved databases, and we have already added some SQL queries for the health check. But how can we get our results as a file? Jupyter Notebook helps us with nbconvert.

You can use the “jupyter nbconvert” command to export our notebook in HTML format.

jupyter nbconvert --to html --no-input --output "my_healthcheck_report" myhealthchecknotebook.ipynb

Now we have our report in HTML format. We can examine it and share it as we like.

Here comes another question: do we have to open Jupyter Notebook and change the connection parameters every time we want to get a report for a database? Of course not. We will use Papermill to automate this process. Papermill helps us parameterize and execute notebooks. https://pypi.org/project/papermill/

First, we need to install papermill.

pip install papermill

Second, do you remember the famous cell where we defined the variables for the connection info? We have to give this cell the “parameters” tag: from the View menu, enable the cell tag toolbar, type “parameters” into the box and add the tag.

Now we can call our Jupyter Notebook with parameters. You can leave the notebook's parameter values empty; every time you call it, Papermill will replace the variables with the ones you give.

Let's write a Python script and call our notebook with Papermill:

import papermill as pm

hostip = 'sampleip'
dbname = 'sampledbname'
dbuser = 'sampleuser'
dbpw = 'somestrongpassword'

# execute the health check notebook with our connection parameters
# and save the executed copy as output.ipynb
pm.execute_notebook(
    'myhealthchecknotebook.ipynb',
    'output.ipynb',
    parameters=dict(host=hostip, db=dbname, user=dbuser, password=dbpw)
)

Papermill will use our parameters, execute the notebook and save it with the name “output.ipynb”. Now you can convert it to HTML format.

jupyter nbconvert --to html --no-input --output "my_healthcheck_report" output.ipynb

We are using Patroni for our PostgreSQL clusters, so we can write a basic script that takes these parameters as input, sends a request to Patroni, and creates HTML reports for the master and every replica.
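A rough sketch of that idea might look like the following. It assumes the Patroni REST API is reachable on its default port 8008 and that the /cluster endpoint returns each member's name, role, host and port; the function and file names are placeholders.

import subprocess

import papermill as pm
import requests

def report_for_patroni_cluster(patroni_host, dbname, dbuser, dbpw, patroni_port=8008):
    # ask Patroni for the cluster topology (leader + replicas)
    members = requests.get(f"http://{patroni_host}:{patroni_port}/cluster").json()["members"]
    for member in members:
        out_nb = f"output_{member['name']}.ipynb"
        # run the health check notebook against this member via Papermill
        pm.execute_notebook(
            "myhealthchecknotebook.ipynb",
            out_nb,
            parameters=dict(host=member["host"], port=member["port"],
                            db=dbname, user=dbuser, password=dbpw),
        )
        # export the executed notebook to HTML, hiding the code cells
        subprocess.run(
            ["jupyter", "nbconvert", "--to", "html", "--no-input",
             "--output", f"healthcheck_{member['name']}_{member['role']}", out_nb],
            check=True,
        )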

For Microsoft SQL Server Availability Groups, you can connect to the primary via pyodbc and get the replica names with a query (if they are registered in DNS; if not, you can use their IP addresses), then iterate over them in a for loop and create a report for each of them.
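A short sketch of that step, with placeholder server names and credentials, could be:

import pyodbc

# connect to the Availability Group primary and list the replica server names
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=my-ag-primary;"
    "DATABASE=master;UID=sampleuser;PWD=somestrongpassword"
)
replicas = [row.replica_server_name
            for row in conn.execute("SELECT replica_server_name FROM sys.availability_replicas")]
conn.close()

# ...then feed each replica name as the host parameter into the same
# Papermill + nbconvert loop shown above.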

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Thanks for reading.

You can find me at LinkedIn: https://www.linkedin.com/in/alpaykurbal/

Feel free to ask any questions and discuss technology topics.

Alpay
