A Practical Guide to SQL, Python, and Databricks Queries

Whether you’re wrangling data, running analyses, or crafting reports, having a solid understanding of SQL, Python, and Databricks queries is essential. In this guide, we’ll walk through practical commands in each of these domains to help you level up your data skills.

SQL Essentials:

Structured Query Language (SQL) is the go-to language for managing and querying relational databases. Let’s dive into some practical SQL commands:

1. SELECT Statement:

-- Retrieve all columns from a table
SELECT * FROM employees;

-- Retrieve specific columns
SELECT employee_id, first_name, last_name FROM employees;

-- Filter results based on a condition
SELECT * FROM orders WHERE order_status = 'Shipped';

-- Order results
SELECT * FROM products ORDER BY price DESC;

2. JOIN Operations:

-- INNER JOIN
SELECT employees.employee_id, employees.first_name, departments.department_name
FROM employees
INNER JOIN departments ON employees.department_id = departments.department_id;

-- LEFT JOIN
SELECT customers.customer_id, customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;

3. Aggregate Functions:

-- Calculate average salary
SELECT AVG(salary) AS average_salary FROM employees;

-- Find the total number of orders
SELECT COUNT(*) AS total_orders FROM orders;

-- Group by and aggregate
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id;

Python Basics:

Python is a versatile programming language with a rich ecosystem of libraries. Let’s cover some fundamental Python commands:

1. Variables and Data Types:

# Variables
name = "John"
age = 30
is_student = False

# Data Types
num = 42                 # Integer
price = 19.99            # Float
text = "Hello, World!"   # String

2. Lists and Loops:

# Lists
fruits = ["apple", "banana", "orange"]

# Loop through a list
for fruit in fruits:
    print(fruit)

# List comprehension
squared_numbers = [x**2 for x in range(5)]

3. Functions:

# Define a function
def greet(name):
    return f"Hello, {name}!"

# Call the function
result = greet("Alice")
print(result)

Databricks Queries:

Databricks provides a collaborative environment for big data analytics using Apache Spark. Let’s explore some basic Databricks commands:

# Read data from CSV
csv_path = "/FileStore/tables/sample_data.csv"
df_csv = spark.read.csv(csv_path, header=True, inferSchema=True)

# Read data from JSON
json_path = "/FileStore/tables/sample_data.json"
df_json = spark.read.json(json_path)

# Read data from an external source (e.g., Parquet format)
parquet_path = "/FileStore/tables/sample_data.parquet"
df_parquet = spark.read.parquet(parquet_path)

# Read data from an external JDBC server
# (for MySQL Connector/J 8+, the driver class is com.mysql.cj.jdbc.Driver)
jdbc_url = "jdbc:mysql://your-jdbc-server:3306/your_database"
jdbc_properties = {"user": "your_username", "password": "your_password", "driver": "com.mysql.jdbc.Driver"}

# A parenthesized subquery can be passed in place of a table name
query = "(SELECT * FROM your_table) AS temp_table"
df_jdbc = spark.read.jdbc(jdbc_url, query, properties=jdbc_properties)

# Read data from cloud files (e.g., Azure Blob Storage)
# Register the storage account key with the Spark session before reading
spark.conf.set("fs.azure.account.key.your-storage-account.blob.core.windows.net", "your-access-key")
azure_blob_url = "wasbs://your-container@your-storage-account.blob.core.windows.net/path/to/data.csv"
df_azure_blob = spark.read.csv(azure_blob_url, header=True, inferSchema=True, sep=",")

# Apply Databricks transformations in SQL
df_csv.createOrReplaceTempView("csv_table")
df_json.createOrReplaceTempView("json_table")
df_parquet.createOrReplaceTempView("parquet_table")
df_jdbc.createOrReplaceTempView("jdbc_table")
df_azure_blob.createOrReplaceTempView("azure_blob_table")

# SQL transformation
df_result_sql = spark.sql("""
    SELECT
        c.column_name,
        j.other_column
    FROM csv_table c
    JOIN jdbc_table j ON c.id = j.id
    WHERE c.value > 100
""")

# Apply Databricks transformations in Python
df_result_python = (df_csv
    .join(df_json, df_csv.id == df_json.id, "inner")
    .filter(df_csv.value > 100)
    .select(df_csv.column_name, df_json.other_column)
)

# Display the results
display(df_result_sql)
display(df_result_python)
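
Beyond displaying results, a common next step is persisting them for downstream use. The line below is a minimal sketch using the DataFrame writer API; the table name is a hypothetical placeholder.

# Save the joined result as a managed table (placeholder table name)
df_result_python.write.mode("overwrite").saveAsTable("query_results")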

In Databricks, you primarily work with Apache Spark, which does not have traditional database views and triggers the way relational databases do. However, Databricks provides similar functionality through temporary views and global temporary views, and recurring or continuous transformations can be automated with scheduled jobs or streaming triggers (a brief sketch follows the view examples below). Let’s explore these concepts:

Temporary Views:

Temporary views in Databricks are similar to SQL views, but they exist only for the lifetime of the Spark session that created them.

Creating a Temporary View:

# Creating a temporary view from a DataFrame
df.createOrReplaceTempView("my_temp_view")

# Querying the temporary view
result = spark.sql("SELECT * FROM my_temp_view")
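
If you prefer to stay entirely in SQL, the same kind of view can also be defined with DDL through spark.sql. A minimal sketch, assuming a hypothetical source table named some_source_table:

# Equivalent definition using SQL DDL (some_source_table is a placeholder)
spark.sql("""
    CREATE OR REPLACE TEMP VIEW my_temp_view AS
    SELECT * FROM some_source_table
""")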

Global Temporary Views:

Global temporary views are shared across Spark sessions on the same cluster and are registered in the global_temp database.

Creating a Global Temporary View:

# Creating a global temporary view from a DataFrame
df.createOrReplaceGlobalTempView("my_global_temp_view")

# Querying the global temporary view (note the global_temp prefix)
result = spark.sql("SELECT * FROM global_temp.my_global_temp_view")
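
Scheduled Jobs and Streaming Triggers:

As noted above, recurring transformations are usually automated either with the Databricks Jobs scheduler (configured through the workspace UI or Jobs API) or, for continuously arriving data, with Structured Streaming triggers. The snippet below is a minimal sketch of a processing-time trigger; the schema, input path, output path, and checkpoint location are all hypothetical placeholders.

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Hypothetical schema for incoming JSON files
event_schema = StructType([
    StructField("id", IntegerType()),
    StructField("value", StringType()),
])

# Read new files as a stream (placeholder input path)
stream_df = (spark.readStream
    .schema(event_schema)
    .json("/FileStore/tables/streaming_input/"))

# Write out micro-batches every 5 minutes (placeholder output and checkpoint paths)
streaming_query = (stream_df.writeStream
    .format("parquet")
    .option("path", "/FileStore/tables/streaming_output/")
    .option("checkpointLocation", "/FileStore/checkpoints/streaming_demo")
    .trigger(processingTime="5 minutes")
    .start())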

Addend Analytics is a Microsoft Gold Partner based in Mumbai, India, with a branch office in the U.S.

Addend has successfully implemented 100+ Microsoft Power BI and Business Central projects for 100+ clients across sectors like Financial Services, Banking, Insurance, Retail, Sales, Manufacturing, Real Estate, Logistics, and Healthcare, in markets including the US, Europe, Switzerland, and Australia.

Get a free consultation now by emailing or contacting us.