Legal Data Acquisition Engineer (Scraping & Extraction)

Omnilex

Information services

Zürich

  • Employment type: Full-time
  • CHF 8,000 – 12,000 (company-provided figure)
  • On-site

About this job

🔍 The Role (Behind the Scenes, Mission-Critical)

We’re looking for an engineer who loves the messy reality of web data: dynamic pages, broken markup, inconsistent PDFs, changing source structures, missing metadata, rate limits, anti-bot protections, and jurisdiction-specific publishing habits.

This is a behind-the-scenes role with huge product impact. You’ll build and maintain the systems that continuously collect and extract legal content from websites, APIs, bulk files, and document repositories and turn it into reliable inputs for our AI products.

If you enjoy scraping, parsing, reverse-engineering content structures, and designing robust ingestion pipelines that survive real-world change, this role is for you.

🏛️ About Omnilex

Omnilex is a young, dynamic AI legal tech startup with roots at ETH Zurich. Our interdisciplinary team is building AI-native tools for legal research and answering complex legal questions across jurisdictions.

A core reason we stand out is our data foundation: combining external legal sources, customer-internal sources, and our own AI-first legal content. This role strengthens that foundation.

Tasks

⚙️ What You’ll Work On

Your focus will be source acquisition, scraping, parsing, and extraction reliability for legal data.

Core responsibilities

Build and maintain resilient pipelines to ingest legal content from:

  • public websites
  • APIs
  • document portals
  • bulk datasets
  • PDFs / HTML / XML / DOCX-like formats

Design scraping systems that are robust to:

  • layout changes
  • pagination quirks
  • JavaScript-rendered sites
  • inconsistent metadata
  • rate limits and retry behavior

Implement parsers and extractors for legal documents (statutes, decisions, guidance, commentaries, etc.)
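
To give a flavor of the retry behavior mentioned above: a minimal TypeScript sketch of fetching from a rate-limited source with exponential backoff. All names, defaults, and the overall shape are illustrative assumptions, not part of an existing Omnilex codebase.

```typescript
// Delay before the next attempt: honor a Retry-After header (in seconds)
// when the source provides one, otherwise fall back to exponential backoff.
function backoffMs(
  attempt: number,
  retryAfterHeader: string | null,
  baseMs = 500,
): number {
  const retryAfter = Number(retryAfterHeader);
  if (Number.isFinite(retryAfter) && retryAfter > 0) return retryAfter * 1000;
  return baseMs * 2 ** (attempt - 1); // 500ms, 1s, 2s, 4s, ...
}

// Fetch with retries on 429 (rate limited) and transient 5xx responses.
async function fetchWithRetry(url: string, maxAttempts = 5): Promise<Response> {
  let res!: Response;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    res = await fetch(url);
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt < maxAttempts) {
      await new Promise((r) =>
        setTimeout(r, backoffMs(attempt, res.headers.get("retry-after"))),
      );
    }
  }
  return res; // give up after maxAttempts, returning the last response
}
```

In production such a helper would usually also cap the maximum delay and add jitter so parallel workers don't retry in lockstep.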

Extract and structure:

  • document text
  • headings/sections
  • citations and references
  • dates, courts, authorities, identifiers
  • language / jurisdiction metadata

Further responsibilities

  • Build source-specific adapters and reusable extraction components (rather than one-off scripts)
  • Monitor source health and detect breakage quickly (e.g., selector failures, coverage drops, schema drift)
  • Improve data quality with validation checks, deduplication, canonicalization, and content versioning
  • Work closely with AI/data/search teammates so extracted data is optimized for downstream indexing, RAG, and analytics
  • Document source behavior and operational playbooks so ingestion remains maintainable as we scale
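
The deduplication and canonicalization work above could be sketched as follows in TypeScript. The record shape, field names, and hashing choice are hypothetical assumptions for illustration, not the actual schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical record shape for an extracted legal document.
interface LegalDoc {
  sourceId: string;
  title: string;
  text: string;
  jurisdiction?: string;
  language?: string;
}

// Canonicalize text before hashing so trivial whitespace or encoding churn
// on the source site does not register as a "new" document version.
function canonicalize(text: string): string {
  return text.normalize("NFC").replace(/\s+/g, " ").trim();
}

function contentHash(doc: LegalDoc): string {
  return createHash("sha256").update(canonicalize(doc.text)).digest("hex");
}

// Deduplicate a batch by content hash, keeping the first occurrence.
function dedupe(docs: LegalDoc[]): LegalDoc[] {
  const seen = new Set<string>();
  return docs.filter((doc) => {
    const hash = contentHash(doc);
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}
```

The same content hash doubles as a versioning key: when a re-crawl of a known source URL produces a new hash, the document has genuinely changed and a new version can be recorded.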

🧱 What Success Looks Like

In this role, success is not “number of scrapers written.” Success looks like:

  • high source coverage across target jurisdictions
  • fast detection and repair when sources change
  • clean, structured extractions with fewer downstream fixes
  • stable ingestion SLAs and predictable runtimes/costs
  • reusable tooling that makes adding new sources progressively faster

Requirements

✅ Minimum Qualifications

  • Degree in Computer Science, Data Science, Software Engineering, or related field — or equivalent practical experience
  • Strong hands-on engineering experience with TypeScript (backend/data pipeline context)
  • Real experience building web scraping / crawling / extraction pipelines in production
  • Strong understanding of HTML/DOM parsing, HTTP, pagination, sessions/cookies, and common web data edge cases
  • Experience working with messy document formats (especially PDFs) and text extraction challenges
  • Good SQL skills (PostgreSQL) and experience storing structured/unstructured content
  • Strong debugging skills and a pragmatic mindset: you can make unreliable sources reliable
  • Ability to work with ownership in a fast-moving startup
  • Availability full-time; on-site in Zurich at least two days per week (hybrid)

🌟 Preferred Qualifications

  • Familiarity with modern scraping and browser automation tools (e.g. Playwright, Puppeteer)
  • Experience with PDF/document tooling, OCR pipelines, and parsing libraries
  • Experience designing queue-based or worker-based ingestion systems
  • Experience with Azure (including storage/search services), Docker, and CI/CD
  • Working proficiency in German and proficiency in English
  • Swiss work permit or EU/EFTA citizenship
  • Experience with legal or regulatory document structures (Switzerland / Germany / EU / US is a plus)
  • Familiarity with downstream AI/search use cases (chunking, embeddings, indexing, citation traceability)

🧠 Nice-to-Have Strengths (But Not Required)

  • You enjoy source forensics: inspecting network calls, hidden endpoints, export formats, and content variants
  • You think in terms of reusable extraction architecture, not just one-off fixes
  • You care about observability and operational quality, not just “it ran once on my machine”
  • You like collaborating with product/AI teams to understand what metadata actually matters downstream

Benefits

🤝 Benefits

  • High leverage impact: your work directly improves coverage, freshness, and trust in legal AI answers
  • Ownership: own the ingestion/scraping layer end-to-end for key legal sources
  • Real engineering challenges: dynamic websites, parsing complexity, document extraction, reliability at scale
  • Interdisciplinary team: work closely with engineers, legal experts, and AI specialists
  • Compensation: CHF 8’000–12’000 per month + ESOP (employee stock options), depending on experience and skills

If you’re excited about building the invisible infrastructure that powers great legal AI products, we’d love to hear from you. Apply via the Apply button.
