Scraping mosque data with multi threads


This content originally appeared on DEV Community and was authored by Eko Priyanto

An attempt at scraping mosque data from SIMAS Kemenag using PHP, with multi cURL handling the "multi threads" part by running requests concurrently.

For each mosque, the scraper records an ID, the link to its detail page, and the link to its location (coordinates) on Google Maps.

The script:

<?php
$startPage   = 1;
$endPage     = 100;
$maxThreads  = 10;
$namaFile    = "hasil_masjid_page{$startPage}-{$endPage}.txt";

// Open the output file and write the CSV header row
$fp = fopen($namaFile, 'w');
if ($fp === false) {
    exit("Could not open $namaFile for writing\n");
}
fwrite($fp, "\"s_id\",\"link_detil\",\"link_peta\"\n");

// Fetch multiple pages in parallel
function fetchPagesMultiCurl($urls) {
    $multiHandle = curl_multi_init();
    $curlHandles = [];
    $results = [];

    foreach ($urls as $key => $url) {
        $ch = curl_init();
        curl_setopt_array($ch, [
            CURLOPT_URL => $url,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_USERAGENT => 'Mozilla/5.0',
            CURLOPT_TIMEOUT => 10,
        ]);
        curl_multi_add_handle($multiHandle, $ch);
        $curlHandles[$key] = $ch;
    }

    $running = null;
    do {
        $status = curl_multi_exec($multiHandle, $running);
        if ($running > 0) {
            // Wait for activity on any handle instead of busy-looping
            curl_multi_select($multiHandle, 1.0);
        }
    } while ($running > 0 && $status === CURLM_OK);

    foreach ($curlHandles as $key => $ch) {
        $results[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multiHandle, $ch);
        curl_close($ch);
    }

    curl_multi_close($multiHandle);
    return $results;
}

// Batch processing
for ($batchStart = $startPage; $batchStart <= $endPage; $batchStart += $maxThreads) {
    $batchEnd = min($batchStart + $maxThreads - 1, $endPage);
    $urls = [];

    for ($i = $batchStart; $i <= $batchEnd; $i++) {
        $urls[$i] = "https://simas.kemenag.go.id/page/profilmasjid/0/0/0/0/0?page=$i";
    }

    printf("🔄 Processing pages %d - %d\n", $batchStart, $batchEnd);
    $responses = fetchPagesMultiCurl($urls);

    foreach ($responses as $page => $html) {
        if (!$html) {
            echo "⚠️ Failed to fetch page $page\n";
            continue;
        }

        libxml_use_internal_errors(true); // suppress warning
        $dom = new DOMDocument();
        $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
        libxml_clear_errors();

        $xpath = new DOMXPath($dom);
        $linksDetil = $xpath->query("//a[contains(text(), 'Lihat Detil')]");
        $linksPeta  = $xpath->query("//a[contains(text(), 'Lihat di Peta')]");

        $limit = min($linksDetil->length, $linksPeta->length);

        for ($i = 0; $i < $limit; $i++) {
            $detilHref = $linksDetil->item($i)->getAttribute('href');
            $petaHref  = $linksPeta->item($i)->getAttribute('href');

            $row = [
                "{$page}-" . ($i + 1),
                "https://simas.kemenag.go.id" . $detilHref,
                $petaHref
            ];

            fwrite($fp, '"' . implode('","', $row) . '"' . "\n");
        }

        // Free the DOM memory
        unset($dom, $xpath, $linksDetil, $linksPeta);
    }
}

fclose($fp);
echo "🎉 Done! File created: $namaFile\n";
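
The parsing step can be tried offline, without sending any request to the site. The following is a minimal sketch under stated assumptions: extractMasjidLinks and the sample HTML fragment are hypothetical and not part of the original script, and the real page markup may differ. Only the two XPath queries are taken from the script above.

```php
<?php
// Hypothetical helper (not in the original script): pairs up the
// "Lihat Detil" and "Lihat di Peta" links found in one HTML page.
function extractMasjidLinks(string $html): array {
    libxml_use_internal_errors(true); // keep malformed-HTML warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($html, LIBXML_NOERROR | LIBXML_NOWARNING);
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $detil = $xpath->query("//a[contains(text(), 'Lihat Detil')]");
    $peta  = $xpath->query("//a[contains(text(), 'Lihat di Peta')]");

    // Pair the links by position, as the script does
    $rows  = [];
    $limit = min($detil->length, $peta->length);
    for ($i = 0; $i < $limit; $i++) {
        $rows[] = [
            'detil' => $detil->item($i)->getAttribute('href'),
            'peta'  => $peta->item($i)->getAttribute('href'),
        ];
    }
    return $rows;
}

// Made-up fragment for illustration only
$sample = '<html><body><div>'
        . '<a href="/detail/1">Lihat Detil</a>'
        . '<a href="https://maps.example.com/?q=-6.2,106.8">Lihat di Peta</a>'
        . '</div></body></html>';

$rows = extractMasjidLinks($sample);
print_r($rows);
```

Pairing by position only works if every mosque entry on the page has both links, which is why the script takes the smaller of the two counts as its limit.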

Pages are scraped in batches of a configurable size, and the results are saved to a .txt file in CSV format.
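
Since every field is quoted, a line of the output can be read back with PHP's CSV functions. The sketch below also tries to pull a latitude and longitude pair out of the map link. It rests on assumptions: extractLatLng is a hypothetical helper, the sample line is made up, and the actual format of the map links is not shown in this article, so the pattern may need adjusting.

```php
<?php
// Hypothetical helper: returns [lat, lng] if the link contains a
// "decimal,decimal" pair, or null otherwise. The link format is an assumption.
function extractLatLng(string $url): ?array {
    if (preg_match('/(-?\d{1,3}\.\d+),\s*(-?\d{1,3}\.\d+)/', $url, $m)) {
        return [(float) $m[1], (float) $m[2]];
    }
    return null;
}

// Sample line in the same layout the script writes; the values are made up.
$line = '"1-1","https://example.com/detail/1","https://maps.example.com/?q=-6.175392,106.827153"';

// str_getcsv() strips the surrounding quotes and keeps the comma inside the URL.
[$sId, $linkDetil, $linkPeta] = str_getcsv($line, ',', '"', '\\');
$coords = extractLatLng($linkPeta);

echo $sId, ': ', $coords === null ? 'no coordinates found' : implode(', ', $coords), "\n";
```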

To run it, execute the script from CMD or a terminal (nohup, available on Linux and macOS, keeps it running after the terminal is closed):

nohup php scrape.php
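
To scrape a different range without editing the file, the page range could be read from the command line. This is only a sketch: parsePageRange is a hypothetical helper that is not part of the original script, and its defaults simply mirror the $startPage and $endPage values used above.

```php
<?php
// Hypothetical helper: reads "start end" from the command-line arguments,
// falling back to the defaults used in the script (1 and 100).
function parsePageRange(array $argv, int $defaultStart = 1, int $defaultEnd = 100): array {
    $start = isset($argv[1]) ? (int) $argv[1] : $defaultStart;
    $end   = isset($argv[2]) ? (int) $argv[2] : $defaultEnd;

    if ($start < 1 || $end < $start) {
        throw new InvalidArgumentException("Invalid page range: $start-$end");
    }
    return [$start, $end];
}

// $argv is only populated when PHP runs from the command line.
[$startPage, $endPage] = parsePageRange($argv ?? []);
echo "Pages $startPage-$endPage\n";
```

With something like this in place, running php scrape.php 101 200 would cover pages 101 to 200, while running it without arguments keeps the original range.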

Hope this is useful.

APA

Eko Priyanto | Sciencx (2025-04-10T00:45:24+00:00) Scrape data masjid dengan multi threads. Retrieved from https://www.scien.cx/2025/04/10/scrape-data-masjid-dengan-multi-threads/
