Dear Candidate,

Greetings from Merit DATA & TECHNOLOGY!

Position: Team Lead
Experience: 10-14 Years
Location: PAN INDIA
Work Mode: Permanent Remote
Job Type: Full Time Employment
Budget: Best in the Market
Notice: Immediate Joiners

Job Description:
- Primary (70%): Web scraping architecture (anti-bot, distributed crawling, proxy/CAPTCHA strategy, compliance)
- Secondary (30%): Data engineering (pipelines, lakehouse, orchestration, dbt)
- Mandatory ownership of the technical solution and effort estimation for every new scraping proposal/RFP
- Senior profile: 15+ years overall, 10+ years in scraping, 3+ years in data engineering

Solution Architect – Web Scraping & Data Engineering
RFP Reference Document

Role Title: Solution Architect – Web Scraping (Primary) & Data Engineering (Secondary)
Engagement Type: Contract / RFP-based deployment
Reporting To: Program / Delivery Manager
Location: [Onsite / Hybrid / Remote – to be specified]
Primary Focus: Web Scraping & Data Extraction Architecture (~70%)
Secondary Focus: Data Engineering & Pipeline Architecture (~30%)

1. Role Summary:
We are seeking a senior Solution Architect to lead the design and delivery of large-scale web scraping and data extraction solutions, with strong supporting expertise in data engineering. The architect will be the technical authority on all scraping initiatives, from pre-sales solutioning and effort estimation through to architecture, build, and stabilization.

The primary mandate is web scraping: designing resilient crawlers, anti-bot strategies, distributed extraction systems, and compliance frameworks. The secondary mandate is data engineering: ensuring extracted data flows reliably into well-modelled, query-ready storage layers and downstream analytical or operational systems.

2. Pre-Sales, Proposal & Estimation Responsibilities:
This is a non-negotiable part of the role.
The Solution Architect is mandatorily involved in every new scraping opportunity from the proposal stage onward, and is jointly accountable with the sales / delivery leadership for the technical correctness of all proposals.

Mandatory Involvement Areas:
- Author or co-author the technical solution section of every new RFP / RFI / proposal document for scraping projects.
- Lead technical discovery calls with prospective clients to understand target sources, data SLAs, volume, and compliance constraints.
- Conduct target-site feasibility assessments (anti-bot complexity, dynamic content, login walls, geo-restrictions, rate limits) and document findings before commercials are committed.
- Own end-to-end effort estimation for scraping engagements: crawler build effort, infrastructure sizing, proxy and CAPTCHA cost projections, maintenance overhead, and contingency.
- Produce solution architecture diagrams, tech stack recommendations, and assumption logs as part of every proposal.
- Define and document SLAs, KPIs, and acceptance criteria proposed to the client.
- Participate in client orals, technical defence sessions, and commercial negotiations as the technical SPOC.
- Maintain an internal estimation knowledge base (reusable estimation templates, complexity matrices, target-site classification, and historical actuals) and continuously refine it after every project closure.
- Sign off on the technical feasibility and risk profile of every proposal before submission. No scraping proposal goes out without architect approval.

3. Core Responsibilities

3.1 Scraping Architecture & Design (Primary)
- Design end-to-end scraping solutions covering crawl orchestration, extraction, parsing, storage, and downstream data consumption.
- Define reference architectures for high-volume, high-velocity, and high-variety scraping use cases.
- Architect resilient systems handling JavaScript-heavy sites, CAPTCHAs, rate limiting, IP blocking, and frequent DOM changes.
- Evaluate and select frameworks, proxy networks, and anti-bot bypass strategies aligned with cost, performance, and compliance.
- Design data quality, deduplication, validation, and schema-evolution strategies for scraped data.

3.2 Data Engineering Architecture (Secondary)
- Design downstream data pipelines that ingest, transform, and serve scraped data to analytics, ML, or operational consumers.
- Architect lakehouse / warehouse layers, define data modelling standards (dimensional, Data Vault, or hybrid), and govern schema evolution.
- Define ELT/ETL patterns, orchestration strategy, and SLA-backed data freshness commitments.
- Establish data quality, observability, lineage, and cataloguing practices across the platform.
- Recommend storage formats, partitioning, and indexing strategies for cost and query performance.

3.3 Delivery Leadership
- Translate business requirements into technical specifications, sprint plans, and implementation roadmaps.
- Produce HLD, LLD, and Architecture Decision Records (ADRs) for each engagement.
- Provide hands-on guidance, perform code reviews, and mentor scraping and data engineers.
- Drive proof-of-concept (PoC) builds for complex or high-risk targets before full-scale rollout.

3.4 Compliance, Risk & Governance
- Ensure all scraping work adheres to applicable laws and platform terms (GDPR, CCPA, DPDP, robots.txt, copyright).
- Define and enforce ethical scraping practices, request throttling, and PII handling guidelines.
- Conduct risk assessments for each target source and recommend mitigations.

3.5 Performance, Cost & Operations
- Establish SLAs for crawl freshness, completeness, and accuracy.
- Design monitoring, alerting, and self-healing mechanisms for scraping pipelines.
- Optimise infrastructure cost (compute, proxies, storage) without compromising delivery KPIs.

4. Required Technical Skills – Primary (Web Scraping)
These are non-negotiable.
The candidate must demonstrate deep, hands-on expertise in each of the following:

4.1 Scraping Frameworks & Tooling
- Expert-level Python: Scrapy, BeautifulSoup, lxml, Requests, httpx, parsel.
- Headless browser automation: Playwright, Puppeteer, Selenium, Pyppeteer.
- Node.js scraping stack (where applicable): Puppeteer, Cheerio, Crawlee.

4.2 Anti-Bot & Evasion Strategy
- Proven experience bypassing Cloudflare, Akamai Bot Manager, DataDome, PerimeterX, Imperva, Kasada.
- CAPTCHA handling: reCAPTCHA v2/v3, hCaptcha, FunCaptcha, image / audio solvers; integration with 2Captcha, Anti-Captcha, CapSolver.
- TLS / JA3 / JA4 fingerprinting awareness; HTTP/2 fingerprint evasion; browser fingerprint spoofing.
- Stealth plugins, user-agent rotation, header normalisation, cookie / session management at scale.

4.3 Proxy & Network Infrastructure
- Hands-on experience with rotating, residential, mobile, ISP, and datacenter proxies.
- Integration with providers: Bright Data, Oxylabs, Smartproxy, NetNut, IPRoyal, SOAX.
- Proxy pool design, health-checking, geo-targeting, sticky sessions, and cost optimisation.

4.4 Parsing & Extraction
- XPath, CSS selectors, regex, JSON-LD, microdata, RDFa.
- Reverse-engineering of internal / mobile APIs, GraphQL endpoints, and XHR traffic.
- ML/LLM-assisted extraction for unstructured layouts (nice-to-have, increasingly expected).

4.5 Distributed Crawling & Orchestration
- Scrapy-Redis, Scrapy Cluster, Frontera, Crawlee, or equivalent distributed crawling frameworks.
- Job scheduling and orchestration with Apache Airflow, Prefect, Dagster, or Celery.
- Queue-based architectures using Kafka, RabbitMQ, AWS SQS, GCP Pub/Sub.

5. Required Technical Skills – Secondary (Data Engineering)
Working-to-strong proficiency expected. The architect must be able to design, review, and guide data engineering work without depending on a separate data architect.

5.1 Data Pipelines & Orchestration
- ETL / ELT design patterns, idempotent pipelines, CDC (Change Data Capture), incremental loads.
- Apache Airflow, Prefect, Dagster, AWS Glue, Azure Data Factory, GCP Dataflow.
- Stream processing: Kafka Streams, Apache Flink, Spark Structured Streaming.
- Batch processing: Apache Spark (PySpark), Databricks, EMR, Dataproc.

5.2 Data Modelling & Storage
- Dimensional modelling (Kimball), Data Vault 2.0, normalised vs. denormalised trade-offs.
- Data warehouses: Snowflake, BigQuery, Redshift, Synapse, Databricks SQL Warehouse.
- Data lakes / lakehouses: Delta Lake, Apache Iceberg, Apache Hudi on S3 / GCS / ADLS.
- OLTP databases: PostgreSQL, MySQL; NoSQL: MongoDB, DynamoDB, Cassandra, Elasticsearch, Redis.
- File formats: Parquet, Avro, ORC, JSON, CSV; partitioning, bucketing, compaction strategies.

5.3 Transformation & Quality
- dbt (data build tool) for transformation, testing, and documentation.
- Data quality frameworks: Great Expectations, Soda, Deequ, custom validators.
- Data lineage and cataloguing: DataHub, OpenMetadata, Amundsen, Atlan, Collibra.

5.4 Cloud, DevOps & Observability
- Strong on at least one of AWS, GCP, Azure (compute, storage, IAM, networking, serverless).
- Containerisation and orchestration: Docker, Kubernetes (EKS / GKE / AKS), ECS.
- Infrastructure as Code: Terraform, Pulumi, CloudFormation.
- CI/CD: GitHub Actions, GitLab CI, Jenkins, Argo CD.
- Observability: Prometheus, Grafana, ELK / OpenSearch, Datadog, Sentry, OpenTelemetry.

6. Experience & Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, or related field.
- 10+ years of overall software engineering experience.
- 5+ years dedicated to large-scale web scraping / data extraction (primary).
- 3+ years of hands-on data engineering experience covering pipelines, warehouses / lakehouses, and orchestration (secondary).
- Proven track record of architecting scraping platforms processing millions of pages per day across diverse target sites.
- Prior experience as Solution Architect, Tech Lead, or Principal Engineer leading teams of 5+ engineers.
- Demonstrated experience in pre-sales / proposal authoring / RFP responses for scraping or data engineering engagements.
- Client-facing consulting experience strongly preferred.

7. Soft Skills
- Excellent written and verbal communication; able to defend technical proposals to CXO-level audiences.
- Strong commercial acumen: understands the cost / risk / quality trade-offs in estimation.
- Analytical, structured problem-solving mindset.
- Ownership-driven, comfortable being the single point of technical accountability.
- High standards for documentation, knowledge transfer, and reusability.

8. Expected Deliverables (RFP Scope)
The Solution Architect deployed under this RFP shall be responsible for producing, at minimum, the following artifacts:

Pre-Sales / Proposal Phase
- Technical solution sections of all proposal documents.
- Target-site feasibility and risk assessment reports.
- Effort estimation models, assumption logs, and complexity matrices.
- Solution architecture diagrams and tech stack recommendations for proposals.

Delivery Phase
- High-Level Design (HLD) and Low-Level Design (LLD) documents per initiative.
- Architecture Decision Records (ADRs).
- Reference implementations / PoCs for complex extraction scenarios.
- Code review records and engineering standards documentation.
- Operational runbooks, monitoring playbooks, and incident response procedures.
- Compliance and data governance documentation.
- Knowledge transfer sessions and final handover documentation.

9. Nice-to-Have
- ML / NLP-based extraction, entity resolution, or LLM-assisted parsing experience.
- Exposure to GraphQL, gRPC, mobile API reverse-engineering, mitmproxy / Charles workflows.
- Domain experience: e-commerce price intelligence, market research, financial data, real estate, travel aggregation, hospitality.
- Open-source contributions to scraping or data engineering ecosystems.
- Familiarity with cross-jurisdictional legal frameworks for automated data collection.
- Experience with reverse ETL tools (Hightouch, Census) and feature stores.

10. Evaluation Criteria (RFP Response)
Vendors proposing candidates against this role will be evaluated on:
1. Depth and recency of relevant scraping architecture experience (primary weight).
2. Strength of supporting data engineering experience (secondary weight).
3. Quality of past project case studies: scale, complexity, business outcomes.
4. Demonstrated capability in proposal authoring and technical estimation.
5. Technical depth in interviews, whiteboarding, and architecture discussions.
6. Communication and stakeholder management capability.
7. Commercial competitiveness of the rate card.
8. Availability and ramp-up timeline.

Interested candidates, kindly share the details below:
Relevant Experience:
CCTC:
ECTC:
Notice Period:
Location:

Role: Technical Lead
Industry Type: BPM / BPO
Department: Engineering - Software & QA
Employment Type: Full Time, Permanent
Role Category: Software Development

Education
UG: Any Graduate