Specializing in Large-Scale Spark Processing

Seongju Hwang

Data Engineer leading hybrid-cloud analytics, managing 300+ Spark pipelines that process 1B+ rows daily with Spark JDBC and DataFrames.

  • 100M+ rows via JDBC ingestion
  • 300+ Oozie workflows
  • 1B+ rows processed daily
  • ~4h end-to-end latency reduction

Professional Experience

Hanwha General Insurance

Jul 2024 – Present

Data Engineer | Seoul, Korea

Platform & Pipeline

  • Large-Scale Ingestion: Built parallel extraction pipelines using Spark JDBC to ingest 100M+ rows from MySQL/Tibero into Azure HDFS.
  • Workflow Management: Designed and maintained 300+ Oozie workflows processing ~1B rows/day with robust retry and failure isolation.
  • Optimization: Migrated Sqoop-based CDC to Spark JDBC, reducing end-to-end latency from 7h to 3h.
  • Platform Ops: Managed Azure HDInsight-based Spark/Hive platforms with YARN-level resource tuning.
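
The parallel JDBC ingestion above relies on Spark splitting a numeric key range into per-partition reads. A pure-Python sketch of that split logic (simplified from Spark's JDBCRelation.columnPartition; the `id` column and bounds here are illustrative, not the production values):

```python
# Sketch of how Spark JDBC turns (column, lowerBound, upperBound, numPartitions)
# into one WHERE predicate per parallel reader. Simplified: Spark also handles
# uneven strides and non-numeric partition columns.

def jdbc_partition_predicates(column, lower, upper, num_partitions):
    """Return one WHERE predicate per partition, covering [lower, upper]."""
    stride = (upper - lower) // num_partitions
    preds = []
    bound = lower
    for i in range(num_partitions):
        lo, hi = bound, bound + stride
        if i == 0:
            # First partition also sweeps up NULL keys.
            preds.append(f"{column} < {hi} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # Last partition is open-ended so no rows above upper are lost.
            preds.append(f"{column} >= {lo}")
        else:
            preds.append(f"{column} >= {lo} AND {column} < {hi}")
        bound = hi
    return preds

# Example: 4 parallel readers over ids 0..100,000,000
preds = jdbc_partition_predicates("id", 0, 100_000_000, 4)
```

In Spark itself the same split is requested declaratively, e.g. `spark.read.jdbc(url, table, column="id", lowerBound=0, upperBound=100_000_000, numPartitions=4, properties=props)`.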

Real-time & BI

  • IoT Streaming: Developed Spark Structured Streaming pipelines ingesting 100K events/sec from 700K devices via Kafka.
  • BI Infrastructure: Built end-to-end Power BI infrastructure, including an on-premises data gateway and security coordination.
  • Modern Stack: Led PoCs for Apache Superset and Databricks to modernize the analytical ecosystem.
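
The IoT streaming bullet maps to a fairly compact Kafka-to-HDFS path in Spark Structured Streaming. A minimal sketch, assuming hypothetical broker, topic, and path names (the Kafka source option names themselves are standard ones from Spark's Kafka integration):

```python
# Sketch of the Kafka -> HDFS Structured Streaming path. Broker, topic, and
# output paths are hypothetical placeholders.

def kafka_source_options(brokers: str, topic: str) -> dict:
    """Options for a spark.readStream Kafka source."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        "startingOffsets": "latest",
        # Cap rows per micro-batch so a backlog cannot blow up latency.
        "maxOffsetsPerTrigger": "500000",
    }

def start_iot_stream(spark, brokers, topic, out_path, checkpoint_path):
    """Start a parquet-sink stream; `spark` is an existing SparkSession."""
    # pyspark is imported lazily so the option helper above stays usable
    # without a Spark installation.
    from pyspark.sql.functions import col

    reader = spark.readStream.format("kafka")
    for key, value in kafka_source_options(brokers, topic).items():
        reader = reader.option(key, value)
    events = reader.load().select(
        col("key").cast("string"),
        col("value").cast("string"),
        col("timestamp"),
    )
    return (
        events.writeStream.format("parquet")
        .option("path", out_path)
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="30 seconds")
        .start()
    )
```

The `checkpointLocation` is what lets the job restart from its last committed offsets after a failure, which matters at 100K events/sec.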

Technical Architecture: Hanwha Data Platform

graph TD
    subgraph Sources [Data Sources]
        direction LR
        MySQL[(MySQL)]
        Tibero[(Tibero)]
        IoT[700K IoT Devices]
    end

    subgraph Compute [Spark Processing Engine]
        direction TB
        Kafka[Kafka Cluster]
        JDBC[Spark JDBC Ingestion]
        DF[Spark DataFrame Transform]
        SSS[Structured Streaming]
        
        JDBC --> DF
        Kafka --> SSS
    end

    subgraph Storage [Azure HDInsight Lake]
        HDFS[Azure HDFS]
        Hive[Hive Metastore]
        HDFS --- Hive
    end

    subgraph Consumption [Analytics Layer]
        JH[Jupyter Hub]
        PB[Power BI]
        SS[Apache Superset]
    end

    MySQL --> JDBC
    Tibero --> JDBC
    IoT --> Kafka
    DF --> HDFS
    SSS --> HDFS
    HDFS --> JH
    HDFS --> PB
    HDFS --> SS

    style Sources fill:#1e293b,stroke:#475569,color:#fff
    style Compute fill:#312e81,stroke:#818cf8,color:#fff
    style Storage fill:#0f172a,stroke:#4f46e5,color:#fff
    style Consumption fill:#064e3b,stroke:#10b981,color:#fff

Visualization of the hybrid cloud data flow and transformation layers.

Carrot Insurance

Apr 2024 – Jul 2024

DW/BI Intern | Seoul, Korea

  • Data Mart Design: Developed Hive SQL-based data marts aligned with business KPIs.
  • Performance Migration: Migrated legacy Tez-based workflows to Spark, reducing batch execution time from 2 hours to 30 minutes (75% improvement).
  • BI Automation: Established automated refresh environments by connecting Power BI dashboards with on-premise data gateways.
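
The Tez-to-Spark migration above largely meant running the same HiveQL mart refreshes through Spark's engine. A minimal sketch, with hypothetical table and column names:

```python
# Sketch of a daily KPI mart refresh run on Spark instead of Tez.
# Table, partition, and column names are hypothetical.

def kpi_mart_sql(fact_table: str, mart_table: str, ds: str) -> str:
    """Build the INSERT OVERWRITE statement for one daily partition."""
    return (
        f"INSERT OVERWRITE TABLE {mart_table} PARTITION (ds = '{ds}') "
        f"SELECT product_id, COUNT(*) AS policy_cnt, SUM(premium) AS premium_amt "
        f"FROM {fact_table} WHERE ds = '{ds}' GROUP BY product_id"
    )

def refresh_mart(spark, ds: str) -> None:
    # `spark` is a SparkSession built with .enableHiveSupport(); executing
    # the existing HiveQL through spark.sql() is what replaced the Tez jobs.
    spark.sql(kpi_mart_sql("dw.fact_contract", "mart.daily_kpi", ds))
```

Because Spark accepts the same HiveQL against the same metastore, a migration like this can swap the execution engine without rewriting the mart logic.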

Technical Skills

Advanced Spark

  • Spark JDBC Ingestion
  • DataFrame API Optimization
  • Structured Streaming
  • YARN / Spark Tuning

Platforms

  • Azure HDInsight
  • Airflow & Oozie
  • HDFS / Hive Metastore
  • Databricks (PoC)

Databases

  • MySQL & Tibero
  • PostgreSQL
  • Kafka (Pub/Sub)

Visualization

  • Power BI (Expert)
  • JupyterHub
  • Apache Superset