
Bug with attached_url #164

Open
yingzheng1980 opened this issue Jun 7, 2022 · 5 comments

Comments

@yingzheng1980

At line 2114 of phpspider.php, the download should use `$collect_url`:
`$html = requests::$method($collect_url, $params);`

Otherwise the attached_url pages are never downloaded.

@owner888
Owner

> At line 2114 of phpspider.php, the download should use `$collect_url`: `$html = requests::$method($collect_url, $params);` Otherwise the attached_url pages are never downloaded.

Could you make the change and send me a patch?
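The reported fix amounts to a one-line change. A sketch of the corrected line, taken from the report above (the exact surrounding code at line 2114 has not been checked against the repository):

```php
// phpspider.php, around line 2114: issue the attached_url request
// against $collect_url rather than the page's own URL
$html = requests::$method($collect_url, $params);
```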

@kavt

kavt commented Jul 31, 2022

Thank you so much!!
I spent ages trying to work out why the attached pages loaded but never downloaded, and only the main page came through; it turns out the code itself was at fault!
@owner888 You really led us astray here!! Your code saves us a lot of time, but please at least test it. I spent three days and nights hunting for the cause.

@kavt

kavt commented Jul 31, 2022

@yingzheng1980 One more question: some detail pages are paginated and some are not. How can I tell which is which?

```php
'fields' => array(

    array(
        'name' => "contents",
        'selector' => "//div[contains(@class,'art-pre')]/a/@href",
        'repeated' => true,
        'required' => true, // required field

        'children' => array(

            array(
                // extract the URLs of the other pages for later use
                'name' => 'content_page_url',
                'selector' => "//text()"
            ),

            array(
                // extract the content of the other pages
                'name' => 'page_content',
                'source_type' => 'attached_url',
                'attached_url' => 'content_page_url', // e.g. 'attached_url' => "https://www.zhihu.com/r/answers/{comment_id}/comments"
                'selector' => "//div[contains(@class,'textWrap')]"
            ),
        ),
```

@kavt

kavt commented Jul 31, 2022

Made the change, but it is still not right:

1. Pages without bottom pagination are silently skipped: the selector `"//div[contains(@class,'art-pre')]/a/@href"` matches nothing on a page that has no "next page" links at the bottom, so nothing on that page gets collected at all.

2. On pages that do have pagination, only the following pages ("next page" and the rest) are collected; the current page's own content is not.

```php
<?php
require_once __DIR__ . '/../autoloader.php';

use phpspider\core\phpspider;
use phpspider\core\requests;
use phpspider\core\selector;

/* Do NOT delete this comment */

/* [Important: simulated login] */
$cookies = "ASP.NET_SessionId=uqbyzahwaa5fedgldcawsogx; Hm_lvt_782a719ae16424b0c7041b078eb9804a=1657892367,1658402814,1658581663,1658932747; Hm_lvt_29f14b13cac2f8b4e5fc964806f3ea52=1657892367,1658402820,1658581663,1658932747; Hm_lpvt_782a719ae16424b0c7041b078eb9804a=1658932755; Hm_lpvt_29f14b13cac2f8b4e5fc964806f3ea52=1658932755; UserToken=nrbjZ+ZFD3ulIoEX50957cwO1CrVaO5/NLAFj6bcy1Gx6rsh; LoginUserName=kavt12; LoginPassword=NRW/PSbsXFo=";

requests::set_cookies($cookies, 'www.qikan.com.cn');

/*
 * [For scan_urls: compute the current year and week number and
 *  append them to each request.]
 */
$year = date('Y');
$week = date('W') - 2; // 电脑报 usually comes out Monday afternoon, hence week number - 2

// attempt 7: mainly trying to add pagination support

$configs = array(
    'name' => 'diannaobao',
    'log_show' => true,
    'max_fields' => 2, // collect at most 2 items per run
    'domains' => array(
        'www.qikan.com.cn'
    ),

    // entry point
    'scan_urls' => array(
        "http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/{$year}/{$week}.html" // e.g. http://www.qikan.com.cn/magdetails/683A509F-26A9-46BC-B01D-3EFE1BABD7D8/2022/27.html
    ),

    // content pages (these work)
    'content_url_regexes' => array(
        "http://www.qikan.com.cn/article/[\s\S]+", // e.g. http://www.qikan.com.cn/article/dinb20222701.html
        "http://www.qikan.com.cn/articleinfo/[\s\S]+"
    ),

    'fields' => array(

        array(
            'name' => "contents",
            //'selector_type' => 'regex',
            'selector' => "//div[contains(@class,'art-pre')]/a/@href",
            'repeated' => true,
            'required' => true, // required field

            'children' => array(

                array(
                    // extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "//text()"
                ),

                array(
                    // extract the content of the other pages
                    'name' => 'page_content',
                    'source_type' => 'attached_url',
                    'attached_url' => 'content_page_url', // e.g. 'attached_url' => "https://www.zhihu.com/r/answers/{comment_id}/comments"
                    'selector' => "//div[contains(@class,'textWrap')]"
                ),
            ),
        ),

        // extract the article title from the content page
        array(
            'name' => "title",
            'selector' => "//div[contains(@class,'article')]//h1", // fallback: //*[@id="form1"]/div[6]/div/div[2]/div[1]/h1
            'required' => true
        ),

        // body text: //div[contains(@class,'textWrap')]
        /*
        array(
            'name' => "text",
            'selector' => "//div[contains(@class,'textWrap')]",
        ),
        */

        /*
        array(
            'name' => "contents",
            'selector' => "//html",
            'repeated' => true,
            'children' => array(
                array(
                    // extract the URLs of the other pages for later use
                    'name' => 'content_page_url',
                    'selector' => "div[contains(@class,'art-pre')]//a//@href"
                ),
                array(
                    // extract the content of the other pages;
                    // the attached_url request uses the content_page_url captured above
                    'name' => 'page_content',
                    'source_type' => 'attached_url',
                    'attached_url' => 'http://www.qikan.com.cn/{content_page_url}',
                    'selector' => "//div[contains(@class,'textWrap')]"
                )
            )
        ),
        */

        // images
        array(
            'name' => "pic",
            'selector' => "//figure[contains(@class,'image')]//img",
            // returns an array of images; may need to pick one out with $data = $data[0]
            // (show it directly if there is only one, the first if there are several);
            // so far leaving it unprocessed also works
        ),
    ),

    'export' => array(
        'type'  => 'sql',
        'file'  => './data/8.sql',
        'table' => '数据表', // placeholder table name ("data table")
    ),
);

$spider = new phpspider($configs);

// [How to post-process an extracted field?] Use on_extract_field.
$spider->on_extract_field = function ($fieldname, $data, $page) {
    if ($fieldname == 'contents') {
        $contents = $data;
        $data = "";

        $num = count($contents) - 1;
        for ($i = 0; $i < $num; $i++) {
            $data .= $contents[$i]['page_content'];
        }

        /*
        foreach ($contents as $content) {
            $data .= $content['page_content'];
        }
        */
    }
    return $data;
};

$spider->start();
```
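One way to address both problems above might be to make the pagination field optional, so that pages with no `art-pre` block are not discarded, and to extract the current page's body into its own field. The sketch below is untested against phpspider, and the field name `current_page_content` is hypothetical:

```php
// Sketch only; not tested. Field name "current_page_content" is hypothetical.

// 1. Don't discard pages that have no pagination block:
array(
    'name' => "contents",
    'selector' => "//div[contains(@class,'art-pre')]/a/@href",
    'repeated' => true,
    'required' => false, // was true, which skipped pages without pagination
    'children' => array( /* ... as above ... */ ),
),

// 2. Always capture the current page's own body in a separate field:
array(
    'name' => "current_page_content",
    'selector' => "//div[contains(@class,'textWrap')]",
    'required' => true,
),
```

The attached pages could then be appended after the current page's body in `on_extract_field`. Note also that the `for` loop there stops at `count($contents) - 1` and therefore drops the last attached page; the commented-out `foreach` would include all of them.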
