What is the best way to do this?

**S. Renders** · 20-08-2007, 16:16

Dear SD members,

For a client I am looking for a system that automates the following:

- Browsing to a website with press releases (e.g. http://www.shell.com/home/content/me..._11012007.html)
- Downloading the content of all press releases
- Saving it in separate files with the date as the file name

What would be the best way to build this, and who can help me with it (possibly for a fee)?

Sam Renders

**I. van Zon** · 20-08-2007, 17:01

file_get_contents?
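A rough sketch of what this suggestion could look like: `file_get_contents` can fetch a page over HTTP when `allow_url_fopen` is enabled, and `file_put_contents` can store it under a date-based name. The helper names, the `press` directory and the `.html` extension are made up for illustration, and where each release's date comes from depends on the site.

```php
<?php
// Hypothetical helper: build a file name such as "2007-01-11.html" from a date string.
function dateFilename(string $date, string $ext = 'html'): string
{
    $ts = strtotime($date);
    if ($ts === false) {
        throw new InvalidArgumentException("Unrecognised date: $date");
    }
    return date('Y-m-d', $ts) . '.' . $ext;
}

// Hypothetical helper: store one press release under its date and return the path.
function savePressRelease(string $dir, string $date, string $content): string
{
    if (!is_dir($dir) && !mkdir($dir, 0775, true)) {
        throw new RuntimeException("Cannot create directory: $dir");
    }
    $path = rtrim($dir, '/') . '/' . dateFilename($date);
    if (file_put_contents($path, $content) === false) {
        throw new RuntimeException("Cannot write: $path");
    }
    return $path;
}

// Usage (needs allow_url_fopen; the URL is a placeholder, not a real page):
// $html = file_get_contents('http://www.example.com/press/release.html');
// if ($html !== false) {
//     savePressRelease('press', '11 January 2007', $html);
// }
```

Two releases on the same day would overwrite each other with this naming, so a real version would probably need a suffix.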

**Robbert S** · 20-08-2007, 17:08

or cURL
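For comparison, a minimal sketch of the cURL route, assuming the PHP curl extension is installed. `fetchUrl` is a made-up helper name and the timeout values are arbitrary.

```php
<?php
// Hypothetical helper: fetch a URL with cURL; returns the body, or null on failure.
function fetchUrl(string $url, int $timeout = 5): ?string
{
    $ch = curl_init($url);
    if ($ch === false) {
        return null;
    }
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);   // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
    curl_setopt($ch, CURLOPT_TIMEOUT, $timeout * 2);  // cap the whole transfer as well
    curl_setopt($ch, CURLOPT_FAILONERROR, true);      // treat HTTP status >= 400 as a failure
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (placeholder URL):
// $html = fetchUrl('http://www.example.com/press/release.html');
// if ($html === null) { /* log the failure and skip this page */ }
```

Compared with `file_get_contents`, this gives explicit control over timeouts and redirects and does not depend on `allow_url_fopen`.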

**Edward deLeau** · 25-08-2007, 23:33

With a script that uses, among other things, cURL and a few clever regexes.

Example: scrape a Hyves blog and turn it into an RSS feed. I think you can copy and paste all the code you need from this.

In exchange for a carton of Benson & Hedges I'll even do the copying and pasting for you *grin*

Code:

<?php

//###########################################################################
//#
//# Edward de Leau (http://www.cogmios.nl), 21 september 2006
//#
//# I was working on a page to show the blogs of all my friends
//# (http://friends.hotwebsite.net) and noticed that the blogs from hyves
//# didn't have an rss feed so i wrote and stole the pieces to quickly
//# enable me to do so, took me about 3 hours so probably there is
//# some stuff to improve. The RSS scraping i took basically from the
//# wordpress RSS import script.
//#
//# v0.1 Placed this script on http://kijker.tv/
//# v0.2 Michel on Hyves asked me to give the source free (mklijmij.hyves.nl)
//#      so... i now have to live with the shame of this bad scripting...
//#      Please: if you make improvement e-mail them to me so i can
//#      incorporate them in version ... v.0.3 at first i think we need
//#      to add dates to the http header also.
//# v0.3 Modifications to make pubdate work better (MK)
//# v0.4 Adapted to new hyves layout (3/4/2007) (MK)
//#      Small modifications to filter smilies from title (MK)
//# v0.8 Minor improvements for encoding, implemented tags (MK)
//#      Can't think of much more to add, so bumped up to near-1.0
//# v0.9 Caching
//###########################################################################

$version = "0.9 (beta)";

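// Read the request parameters without triggering undefined-index notices
$hyvesname = isset($_POST['h']) ? $_POST['h'] : '';
$show = isset($_GET['s']) ? addslashes($_GET['s']) : '';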

// Clean cache older than 2 weeks
if ($handle = opendir('cache')) {
    while (false !== ($file = readdir($handle))) {
        // ereg() was removed in PHP 7; preg_match() is the replacement
        if (preg_match("/html/",$file) && filemtime("cache/" . $file) < (time()-(14*24*60*60))) unlink("cache/" . $file);
    }

    closedir($handle);
}

function xmlentities($string, $quote_style=ENT_QUOTES)
{
    $string=html_entity_decode($string);
    static $trans;
    if (!isset($trans)) {
        $trans = get_html_translation_table(HTML_ENTITIES, $quote_style);
        foreach ($trans as $key => $value) {
            $trans[$key] = '&#'.ord($key).';';
        }
        // dont translate the '&' in case it is part of &xxx;
        $trans[chr(38)] = '&';
    }
    // after the initial translation, _do_ map standalone '&' into '&#38;'
    return preg_replace("/&(?![A-Za-z]{0,4}\w{2,3};|#[0-9]{2,3};)/","&#38;" , strtr($string, $trans));
}

if (empty($hyvesname)) {
    $hyvesname = isset($_GET['h']) ? $_GET['h'] : '';
}
if (empty($hyvesname)) {
    ?>
    <html>
    <head>
    <title>Hyves to RSS Converter</title>
    <link href="style.css" rel="stylesheet" type="text/css" />
    <!-- no, this html really is so invalid it would make you cry,
    i will get around to it some day, i only had 20 minutes -->
    </head>
    <body>
    <p><img src="hyvesrss.png" alt="Hyves RSS" /></p>
    <h1>Hyves Blog 2 RSS Converter.</h1>
    <h2>By <a href="http://www.cogmios.nl">cogmios</a> and <a href="http://mklijmij.hyves.nl/">Michel</a>.</h2>
    <p>
    <form>
    Hyves name:
        <input maxlength=100 name="h" size=50 type=Text value="cogmios" />
    <input type=Submit title="Submit" value="Submit">
    </form>
    </p>
    <p>
    This service gives you an <a href="http://nl.wikipedia.org/wiki/RSS">RSS feed</a>
    of someone's weblog on Hyves.<br /><br />
    Fill in the box, press submit and you can put the generated url into
    your own programs.<br /><br />
    Comments, questions and the like can go via ... <a href="http://mklijmij.hyves.nl/">my Hyves page</a> or that of <a href="http://cogmios.hyves.nl/index.php?l3=bl&l4=it&blogitem_id=644838&blogitem_secret=GuTa">Cogmios</a>.
    (where else).<br/><br />
    </p>

    <p align="right">Version <?php echo $version;?> (<a href="index.txt">source code</a>)</p>
    </body>
    </html>
    <?php
} else {
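    // Keep only characters that are safe in a host name and a cache file name,
    // so user input cannot escape the cache directory
    $blogname = preg_replace('/[^A-Za-z0-9_-]/', '', $hyvesname);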
    $fullblog = 'http://'.$blogname.'.hyves.nl/blog/';
    $cache = "cache/" . $blogname . ".html";
    if (file_exists($cache) && filemtime($cache) > (time()-(20*60))) {
        $file_contents = file_get_contents($cache);
    } else {
        $ch = curl_init();
        $timeout = 5; // set to zero for no timeout
        curl_setopt ($ch, CURLOPT_URL, $fullblog);
        curl_setopt ($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt ($ch, CURLOPT_CONNECTTIMEOUT, $timeout);
        curl_setopt ($ch, CURLOPT_CRLF, 0);
        $file_contents = curl_exec($ch);
        curl_close($ch);

        $handle = fopen($cache,"w");
        fwrite($handle,$file_contents);
        fclose($handle);
    }

    //    $file_contents = utf8_encode(file_get_contents($fullblog));
    //    $file_contents = utf8_encode($file_contents);
    //    $file_contents = iconv("UTF-8","UTF-8",file_get_contents($fullblog));
    //    $file_contents = file_get_contents($fullblog);


    $enc = mb_detect_encoding($file_contents);
    //    if (empty($enc)) $enc="ISO-8859-1";
    if (empty($enc)) $enc="CP1252";
    //    $file_contents = iconv($enc,"UTF-8",$file_contents);
    //    $file_contents = utf8_encode($file_contents);
    //    $enc="UTF-8";

    $file_contents = stripslashes(str_replace("\\r","",str_replace("\\n","",$file_contents)));

    if (preg_match_all('|class="singlecontent_block">(.*?)</rdf:RDF>|is', $file_contents, $matches)) {
    unset($matches[0]);

    if ($show!="source") {
        header ("Content-Type: text/xml; charset=" . $enc);
        printf("<?xml version=\"1.0\" encoding=\"" . $enc . "\"?>"."\n");
        printf("<rss version=\"2.0\"" . "\n");
        printf("    xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"" . "\n");
        printf("    xmlns:wfw=\"http://wellformedweb.org/CommentAPI/\""."\n");
        printf("    xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"."\n");

    } else {
        header("Content-Type: text/plain");
    };

    printf("<channel>\n");
    printf("<title>Hyves weblog ".$blogname."</title>\n");
    printf("<link>http://" . $blogname . ".hyves.nl/blog/</link>\n");
    printf("<description>Hyves Blog van " . $blogname. "</description>\n");
    printf("<generator>hyves2rss " . $version . " (http://hyvesblog2rss.klijmij.net/)</generator>\n");

    foreach ($matches as $post2) {
        $p=0;
        foreach ($post2 as $post) {
            $p++;
            if ($p>10) break;
            if ($show=="tags" && $p>5) break;

            printf("<item>\n");
            // =======================
            // get the titlepart (containing title and permalink)
            // =======================
            preg_match('|personal_header BlogTitle">(.*?)</div>|is', $post, $post_titlepart);
            $post_titlepart = trim($post_titlepart[1]);

            // =======================
            // get the title
            // =======================
            preg_match('|personal_link personal_header">(.*?)</a>|is', $post_titlepart, $post_title);
            $post_title = xmlentities(strip_tags(trim($post_title[1])));
            printf("<title>".$post_title."</title>\n");

            // =======================
            // get the permalink
            // =======================
            preg_match('|<a href="(.*?)"|is', $post_titlepart, $post_permalink);
            $post_permalink = trim($post_permalink[1]);
            printf("<link>");
            echo $post_permalink;
            printf("</link>\n");

            // =======================
            // get the date
            // =======================
            preg_match('|style="display: block;padding-top: 3px;">(.*?)</span>|is', $post, $post_date);
            $post_date = trim($post_date[1]);

            // Problem: strtotime doesn't understand dates like
            // 25 Sep 16:48 because there's no year.
            // Parenthesise (date("Y")-1) so the subtraction happens before the concatenation
            if (!preg_match('/'.date("Y").'|'.(date("Y")-1).'/',$post_date) && preg_match("/^[0-9]/",$post_date)) $post_date = str_replace(","," ".date("Y"),$post_date);

            $post_date = preg_replace("/vandaag|today/",date("Y-m-d"),$post_date);

            $post_date = str_replace(",", "", $post_date);

            $post_date = strtotime($post_date);

            // New Year's Bug
            if ($post_date>(time()+(7*86400))) $post_date = $post_date - (365*86400);

            // Problem: Hyves reports dates as day names, which are sometimes
            // converted to the future, and sometimes to the past. If the
            // date is in the future, subtract a week.
            if ($post_date>(time()+43200)) $post_date = $post_date - (7*86400);

            $post_date = date("r", $post_date);
            printf("<pubDate>".$post_date."</pubDate>\n");

            // =======================
            // get the content
            // =======================
            preg_match('|singlecontent_block personal_text">(.*?)</div>|is', $post, $post_content);
            $post_content = trim($post_content[1]);
            printf("<description><![CDATA[");
            echo substr(xmlentities(strip_tags($post_content)),0,250) . " [...]";
            printf("]]></description>\n");

            printf("<content:encoded>");
            echo "<![CDATA[<p>" . $post_content . "</p>\n";
            printf("]]></content:encoded>\n");

            //      if ($show=="tags") {
            $cache = "cache/" . $blogname . "-" . strtolower(preg_replace('/[ .\/]/',"_",$post_title)) . ".html";
            if (file_exists($cache) && filemtime($cache) > (time()-(20*60))) {
                $blogpost = file_get_contents($cache);
            } else {
                $blogpost = file_get_contents($post_permalink);
                $handle = fopen($cache,"w");
                fwrite($handle,$blogpost);
                fclose($handle);
            }
            if (preg_match_all('|<a href="http://www.hyves.nl/blog/tags/[^/]+/" class="[^"]+">(.*?)</a>|is', $blogpost, $blogtags)) {
                foreach ($blogtags as $tags) {
                    foreach ($tags as $taglink) {
                        preg_match('|<a href="http://www.hyves.nl/blog/tags/[^/]+/" class="[^"]+">(.*?)</a>|is',$taglink,$tag);
                        $tag = trim($tag[1]);
                        if (empty($tag)) continue;
                        printf("<category>" . xmlentities($tag) . "</category>\n");
                    }
                }
            }

            printf("</item>\n");
        } // foreach $post
    } // foreach $matches

    printf("</channel>\n");
    printf("</rss>\n");

} else {
    echo "Sorry, Hyves could not be reached or the name (" . $blogname. ") you supplied does not exist!";

    echo "<pre>" . htmlspecialchars($file_contents) . "</pre>";

    /*
    header ("Content-Type: text/xml");
    printf("<?xml version=\"1.0\" encoding=\"" . $enc . "\"?>"."\n");
    printf("<rss version=\"2.0\"" . "\n");
    printf("    xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"" . "\n");
    printf("    xmlns:wfw=\"http://wellformedweb.org/CommentAPI/\""."\n");
    printf("    xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"."\n");

    printf("<channel>\n");
    printf("<title>Hyves weblog ".$blogname."</title>\n");
    printf("<link>http://" . $blogname . ".hyves.nl/blog/</link>\n");
    printf("<description>Hyves Blog van " . $blogname . " - converted by www.cogmios.nl - alfa 0.1 still working on it!!</description>\n");

    printf("<item><title>Sorry, Hyves could not be reached or the name you supplied does not exist!</title></item>\n");
    printf("</channel>\n");
    printf("</rss>\n");
    */
}
}
?>

What takes more work:

a) you will probably need different regexes for each website, so you may eventually end up with a meta regex library
b) press releases usually come with an RSS feed, which you can simply pull in with various open source libraries, e.g. MagpieRSS
c) you probably want to dump those files in a generic format, so that will become a component too
d) possibly archiving and the like as well

About the above: this is based on a LAMP platform. Perl is a handy addition because it lets you write somewhat more advanced cron jobs, but if need be that can also be done in PHP.
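To make point a) a bit more concrete, here is a minimal sketch of a per-site pattern table. The site key, the field names and the regexes are all hypothetical; real sites will need their own patterns, and regexes like these break easily on nested markup.

```php
<?php
// Hypothetical per-site patterns; each regex must capture the wanted text in group 1.
$patterns = [
    'example.com' => [
        'title' => '|<h1[^>]*>(.*?)</h1>|is',
        'date'  => '|<span class="date">(.*?)</span>|is',
        'body'  => '|<div class="article">(.*?)</div>|is',
    ],
];

// Apply one site's patterns to a page; fields that do not match come back as null.
function extractFields(string $html, array $sitePatterns): array
{
    $fields = [];
    foreach ($sitePatterns as $name => $regex) {
        $fields[$name] = preg_match($regex, $html, $m) === 1
            ? trim(strip_tags($m[1]))
            : null;
    }
    return $fields;
}

// Made-up sample page to show the call.
$html = '<h1>Q4 results</h1><span class="date">11 January 2007</span>'
      . '<div class="article"><p>Profit rose.</p></div>';
print_r(extractFields($html, $patterns['example.com']));
```

Adding a site then means adding one entry to the table instead of copying the whole script.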

cogmios - www.cogmios.nl

**S. Renders** · 26-08-2007, 10:03

Hm, thanks for your reply. I have no experience at all with scripting or the like, so I'll have a look at what I can do with it. First I'll check whether RSS is an option.
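If the site does offer a feed, the RSS route could look roughly like the sketch below. It uses PHP's built-in SimpleXML rather than MagpieRSS, assumes a plain RSS 2.0 feed, and `rssItems` is a made-up helper name.

```php
<?php
// Parse an RSS 2.0 document and return its items as date/title/body arrays.
function rssItems(string $xml): array
{
    libxml_use_internal_errors(true);            // collect parse errors instead of emitting warnings
    $feed = simplexml_load_string($xml);
    if ($feed === false || !isset($feed->channel)) {
        return [];
    }
    $items = [];
    foreach ($feed->channel->item as $item) {
        $ts = strtotime((string) $item->pubDate);
        $items[] = [
            'date'  => $ts === false ? null : gmdate('Y-m-d', $ts),  // null when pubDate is missing or unparsable
            'title' => (string) $item->title,
            'body'  => (string) $item->description,
        ];
    }
    return $items;
}

// Usage: one file per item, named after its date (index added to avoid clashes).
// The feed URL is a placeholder.
// $xml = file_get_contents('http://www.example.com/press/rss.xml');
// foreach (rssItems($xml) as $i => $item) {
//     $name = ($item['date'] ?? 'undated') . '-' . $i . '.txt';
//     file_put_contents($name, $item['title'] . "\n\n" . $item['body']);
// }
```

Many feeds only carry a summary in `description`, so the full text may still have to be fetched from each item's link.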

**Henry K.** · 26-08-2007, 13:04

Dear Sam,

If you would rather outsource this script to a scripter who does know how to do this, I would be glad to receive a PM from you.

Regards,
Henry
