Whether you’re wrangling data, running analyses, or crafting reports, having a solid understanding of SQL, Python, and Databricks queries is essential. In this guide, we’ll walk through practical commands in each of these domains to help you level up your data skills.
SQL Essentials:
Structured Query Language (SQL) is the go-to language for managing and querying relational databases. Let’s dive into some practical SQL commands:
1. SELECT Statement:
-- Retrieve all columns from a table
SELECT * FROM employees;

-- Retrieve specific columns
SELECT employee_id, first_name, last_name FROM employees;

-- Filter results based on a condition
SELECT * FROM orders WHERE order_status = 'Shipped';

-- Order results
SELECT * FROM products ORDER BY price DESC;
2. JOIN Operations:
-- INNER JOIN
SELECT employees.employee_id, employees.first_name, departments.department_name
FROM employees
INNER JOIN departments ON employees.department_id = departments.department_id;

-- LEFT JOIN
SELECT customers.customer_id, customers.customer_name, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id;
3. Aggregate Functions:
-- Calculate average salary
SELECT AVG(salary) AS average_salary FROM employees;

-- Find the total number of orders
SELECT COUNT(*) AS total_orders FROM orders;

-- Group by and aggregate
SELECT department_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id;
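A common companion to GROUP BY is the HAVING clause, which filters groups after aggregation (whereas WHERE filters rows before it). A minimal sketch against the same hypothetical employees table, using an illustrative 50,000 threshold:
-- Keep only departments whose average salary exceeds 50,000
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
GROUP BY department_id
HAVING AVG(salary) > 50000;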
Python Basics:
Python is a versatile programming language with a rich ecosystem of libraries. Let’s cover some fundamental Python commands:
1. Variables and Data Types:
# Variables
name = "John"
age = 30
is_student = False

# Data Types
num = 42                # Integer
price = 19.99           # Float
text = "Hello, World!"  # String
2. Lists and Loops:
# Lists
fruits = ["apple", "banana", "orange"]

# Loop through a list
for fruit in fruits:
    print(fruit)

# List comprehension
squared_numbers = [x**2 for x in range(5)]
3. Functions:
# Define a function
def greet(name):
    return f"Hello, {name}!"

# Call the function
result = greet("Alice")
print(result)
Databricks Queries:
Databricks provides a collaborative environment for big data analytics using Apache Spark. Let’s explore some basic Databricks commands:
# Read data from CSV
csv_path = "/FileStore/tables/sample_data.csv"
df_csv = spark.read.csv(csv_path, header=True, inferSchema=True)

# Read data from JSON
json_path = "/FileStore/tables/sample_data.json"
df_json = spark.read.json(json_path)

# Read data from an external source (e.g., Parquet format)
parquet_path = "/FileStore/tables/sample_data.parquet"
df_parquet = spark.read.parquet(parquet_path)
# Read data from an external JDBC source
jdbc_url = "jdbc:mysql://your-jdbc-server:3306/your_database"
jdbc_properties = {
    "user": "your_username",
    "password": "your_password",
    "driver": "com.mysql.cj.jdbc.Driver",  # "com.mysql.jdbc.Driver" on older Connector/J versions
}
# A parenthesized subquery with an alias can be passed anywhere a table name is expected
query = "(SELECT * FROM your_table) AS temp_table"
df_jdbc = spark.read.jdbc(jdbc_url, query, properties=jdbc_properties)
# Read data from cloud storage (e.g., Azure Blob Storage)
# The account key must be set on the Spark configuration before reading
spark.conf.set(
    "fs.azure.account.key.your-storage-account.blob.core.windows.net",
    "your-access-key",
)
azure_blob_url = "wasbs://your-container@your-storage-account.blob.core.windows.net/path/to/data.csv"
df_azure_blob = spark.read.csv(azure_blob_url, header=True, inferSchema=True, sep=",")
# Apply Databricks transformations in SQL
df_csv.createOrReplaceTempView("csv_table")
df_json.createOrReplaceTempView("json_table")
df_parquet.createOrReplaceTempView("parquet_table")
df_jdbc.createOrReplaceTempView("jdbc_table")
df_azure_blob.createOrReplaceTempView("azure_blob_table")
# SQL Transformation
df_result_sql = spark.sql("""
SELECT
c.column_name,
j.other_column
FROM csv_table c
JOIN jdbc_table j ON c.id = j.id
WHERE c.value > 100
""")
# Apply Databricks transformations in Python (DataFrame API)
df_result_python = (df_csv
    .join(df_json, df_csv.id == df_json.id, "inner")
    .filter(df_csv.value > 100)
    .select(df_csv.column_name, df_json.other_column)
)
# Display the results
display(df_result_sql)
display(df_result_python)
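All of the examples above read data into DataFrames; writing results back out follows the same pattern through df.write. A minimal sketch, assuming you want the joined result stored as Parquet (the output path is hypothetical):
# Write the joined result back to storage, overwriting any previous run
output_path = "/FileStore/tables/joined_output"  # hypothetical path
df_result_python.write.mode("overwrite").parquet(output_path)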
In Databricks you primarily work with Apache Spark, which does not have traditional database views and triggers the way relational databases do. Databricks provides similar functionality through temporary views and global temporary views, and scheduled jobs (or streaming triggers) can stand in for database triggers; a sketch appears at the end of this guide. Let's explore these concepts:
Temporary Views:
Temporary views in Databricks are similar to SQL views, but they are session-scoped: each one exists only within the Spark session that created it.
Creating and Querying a Temporary View:
# Creating a temporary view from a DataFrame
df.createOrReplaceTempView("my_temp_view")

# Querying the temporary view
result = spark.sql("SELECT * FROM my_temp_view")
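A temporary view lives until its Spark session ends, but it can also be dropped explicitly through the catalog API:
# Drop the view when it is no longer needed
spark.catalog.dropTempView("my_temp_view")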
Global Temporary Views:
Global temporary views are shared across all Spark sessions in the same application and live in the reserved global_temp database.
Creating a Global Temporary View:
# Creating a global temporary view from a DataFrame
df.createOrReplaceGlobalTempView("my_global_temp_view")
# Querying the global temporary view
result = spark.sql("SELECT * FROM global_temp.my_global_temp_view")
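Finally, as mentioned above, classic database triggers have no direct equivalent in Spark. The usual substitutes are Databricks scheduled jobs (configured through the Jobs UI or API) or a Structured Streaming trigger that re-runs a transformation on an interval. A minimal streaming sketch, assuming a source table named events and a checkpoint path that are both hypothetical:
# Re-process new rows from the source table every 5 minutes
stream = (spark.readStream
    .table("events")  # hypothetical source table
    .filter("value > 100")
    .writeStream
    .option("checkpointLocation", "/FileStore/checkpoints/events")  # hypothetical path
    .trigger(processingTime="5 minutes")
    .toTable("events_filtered"))  # hypothetical target table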