cookies loaded but login page detected #51

Closed
evilsh3ll opened this issue Mar 20, 2023 · 23 comments · Fixed by #53 or #55

Comments

@evilsh3ll

evilsh3ll commented Mar 20, 2023

Hello, when I use this command:
crawley -depth -1 -dirs only -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt
crawley scrapes the forum's login page instead of the thread I selected, and it doesn't return any errors:

2023/03/20 16:08:58 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/20 16:08:58 [*] crawling url: https://XXXXXX.net/viewtopic.php?p=5018859
2023/03/20 16:09:00 [*] complete

I tried with different user agents and headers; every time the result is the same: the forum's login page. The cookies are OK - I copied them using the EditThisCookie Google Chrome extension. I'm not using any VPN/proxy.
I also tested other forums, and the result is always the same: I can't log in.
Do you know if there is a problem loading the cookies?

@s0rg
Owner

s0rg commented Mar 20, 2023

The cookies work fine - they are covered by tests. Try headless mode with the -headless flag; it disables HEAD requests, which can help.
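
For illustration (a sketch only - the cookie values and host are the placeholders from the original command), that would be:

crawley -depth -1 -dirs only -headless -cookie "phpbb3_XXXXXX_sid=XXXXXX; phpbb3_XXXXXX_u=XXXXXX; phpbb3_XXXXXX_k=XXXXXX;" "https://XXXXXX.net/viewtopic.php?p=5018859" > urls.txt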

@evilsh3ll
Author

evilsh3ll commented Mar 20, 2023

Yes, even with -headless the result is the same login page. If you want to try, the forum is "ddunlimited.net". I don't know if I need some special header for the request, but I suppose that if it can scrape the login page, it should be able to scrape any other page using the same settings plus the logged-in cookies.

@s0rg
Owner

s0rg commented Mar 20, 2023

I failed to register at ddunlimited.net - since I can't speak Italian, I cannot understand the registration questions. I can try to help you with this issue, but I need credentials (and URLs) for testing. Can you provide them? My email is al3x.s0rg(at)gmail.com

@evilsh3ll
Author

OK, I created a test account with a temp email and sent you the full cookie in Netscape format. I just tested the cookie in two different browsers; it works.

@s0rg
Owner

s0rg commented Mar 20, 2023

OK, you need to use the "Semicolon separated Cookie File" format on export, and right now you also need to manually remove all comments from the result. I will add comment parsing in future releases - this is the first encounter )
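
For illustration (a sketch - the comment lines and cookie names below are made up; only the final semicolon-separated name=value line should be kept and passed to -cookie):

# HTTP Cookie File
# Exported by EditThisCookie
# https://XXXXXX.net/
phpbb3_XXXXXX_sid=XXXXXX;phpbb3_XXXXXX_u=XXXXXX;phpbb3_XXXXXX_k=XXXXXX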

@evilsh3ll
Author

I exported the cookies from the EditThisCookie extension using the "semicolon separated" format and removed the first 3 comment lines, but I still get the login page.
This is the exported string (it's almost the same as the one I built manually in the previous tests 🥲):
(screenshot of the exported cookie string)

@s0rg
Owner

s0rg commented Mar 20, 2023

Please try again, but set the -headless flag and drop -dirs only.

@evilsh3ll
Author

evilsh3ll commented Mar 20, 2023

I get these 48 links (the same as before, when I set -headless without -dirs only): https://bin.disroot.org/?6bf486611c974dc6#BFLGkPZatknTvchGpe1UQbofmhZ4ToCR7RbHw3G4Eihb
They are the parts of the login page - for example https://ddunlimited.net/ucp.php?mode=sendpassword (the "I lost my password" button on the login page) and the other buttons/links of that same page.
(screenshot: crawley output listing the login-page links)
Instead, the page I'm trying to scrape should contain >400 forum threads (it's a page with a huge index).
(screenshot: the target index page)

@evilsh3ll
Author

Any idea? Did you get more links than me?

@s0rg
Owner

s0rg commented Mar 21, 2023

Nope. I hope to get some time to investigate this next weekend. For now, please try changing the user agent for crawley: take your current browser's user-agent string (from any request) and pass it to crawley with the -user-agent "your-user-agent-here" flag.
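
For instance (a sketch - the user-agent string is just an example, and the cookie values/host are the placeholders from the earlier commands):

crawley -depth -1 -headless -user-agent 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'phpbb3_XXXXXX_sid=XXXXXX;phpbb3_XXXXXX_u=XXXXXX;phpbb3_XXXXXX_k=XXXXXX' 'https://XXXXXX.net/viewtopic.php?p=5018859' > urls.txt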

@evilsh3ll
Author

evilsh3ll commented Mar 22, 2023

I'll run tests with other headers too. When you have time: do you know how to get the header string used by my browser (Brave, Arch Linux)? I've been searching but there isn't much info.

@s0rg
Owner

s0rg commented Mar 22, 2023

Take a look at a good tool for any HTTP debugging - https://httpbin.org/ - for example, you can see all the headers your browser sends here: https://httpbin.org/headers
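
For example, fetching that endpoint from the command line echoes back the request headers as JSON; opening the same URL in the browser shows the browser's own headers, including User-Agent (output shown roughly):

curl -s https://httpbin.org/headers
# -> {"headers": {"Accept": "*/*", "Host": "httpbin.org", "User-Agent": "curl/8.x", ...}}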

@evilsh3ll
Author

evilsh3ll commented Mar 26, 2023

I found a way to make the cookies work with curl:

curl --cookie "phpbb3_ddu4final_k=XXX;phpbb3_ddu4final_sid=XXXX;phpbb3_ddu4final_u=XXXX;" -j -L "https://ddunlimited.net/viewtopic.php?f=440" -o test.html

Using the -j parameter, the page is the correct one (not the login page). It only works for this website - I don't know why.

From the curl man page:
New cookie session
Instead of telling curl when a session ends, curl features an option that lets the user decide when a new session begins.
A new cookie session means that all the old session cookies will be thrown away. It is the equivalent of closing a browser and starting it up again.
Tell curl a new cookie session starts by using -j, --junk-session-cookies

Do you know if there is a similar feature for crawley (something like clearing all previous cookies before using the new ones)?

@s0rg
Owner

s0rg commented Mar 26, 2023

Thank you for this investigation. I will inspect this more closely and add such a feature to crawley.

@s0rg
Owner

s0rg commented Mar 28, 2023

Please check out the new release, https://github.com/s0rg/crawley/releases/tag/v1.5.12 - it should fix this issue.

@evilsh3ll
Author

evilsh3ll commented Mar 29, 2023

I tried again using this command:

crawley -depth -1 -headless -cookie "phpbb3_ddu4final_k=XXX" -cookie "phpbb3_ddu4final_sid=XXX" -cookie "phpbb3_ddu4final_u=XXX" "https://ddunlimited.net/viewtopic.php?p=5018859" > test.txt

getting this output:

Scraping all urls

2023/03/29 23:46:12 [*] config: workers: 4 depth: -1 delay: 150ms
2023/03/29 23:46:12 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/03/29 23:46:13 [*] complete

The cookies are loaded, but I'm not sure whether I enabled the "cleaning" feature or it's on by default.
Crawley v1.5.12-a1f6de2 (archlinux)

Thanks for your support and patience

@s0rg
Owner

s0rg commented Mar 30, 2023

It's on by default. Thank you for your detailed reports - they are very helpful.

@evilsh3ll
Author

I forgot to say that the page is still the login page; it seems the cookies are not loaded correctly.

@s0rg
Owner

s0rg commented Apr 1, 2023

Hello again )
Please check out the new release: https://github.com/s0rg/crawley/releases/tag/v1.5.13
It seems I fixed it )

Please note, you still need -user-agent to make phpBB happy, e.g.:

crawley \
-headless \
-user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' \
-cookie 'phpbb3_2yptu_sid=ef44a131d691d096c21add5db0eb2bd7;phpbb3_2yptu_u=2' \
'http://host/viewtopic.php?p=2'

@evilsh3ll
Author

evilsh3ll commented Apr 5, 2023

I upgraded to Crawley v1.5.13-3c672bd (archlinux)

  • problem 1 (scraping still not working, the file list.txt is empty):
crawley \
-headless \
-user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' \
-cookie 'phpbb3_ddu4final_k=xxx; phpbb3_ddu4final_sid=xxx; phpbb3_ddu4final_u=xxx' \
'https://ddunlimited.net/viewtopic.php?p=5018859' > list.txt

output:

2023/04/06 00:19:00 [*] config: workers: 4 depth: 0 delay: 150ms
2023/04/06 00:19:00 [*] crawling url: https://ddunlimited.net/viewtopic.php?p=5018859
2023/04/06 00:19:05 [-] GET https://ddunlimited.net/viewtopic.php?p=5018859: Get "https://ddunlimited.net/viewtopic.php?p=5018859": context deadline exceeded
2023/04/06 00:19:05 [*] complete

I tried a lot of parameter combinations, and I get the same error above every time.

  • problem 2:

the following command worked with Crawley v1.5.8-84c5855, but now it's not working anymore:

crawley -headless -user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'CENSORED_key=a%3A4%3A%7Bi%3A0%3Bs%3A6%3A%22153366%22%3Bi%3A1%3Bs%3A40%3A%225b6hf75839hf5832g6f4cf2%22%3Bi%3A2%3Bi%3A1868962414%3Bi%3A3%3Bi%3A0%3B%7D' 'CENSORED_URL' > list.2.txt

output:

2023/04/06 00:26:00 [*] config: workers: 4 depth: 0 delay: 150ms
2023/04/06 00:26:00 [*] crawling url: CENSORED_URL
2023/04/06 00:26:00 net/http: invalid byte ';' in Cookie.Value; dropping invalid bytes
2023/04/06 00:26:01 [*] complete

The cookie value doesn't include any literal semicolon - maybe it's an encoded character (%..) in the cookie value (the censored key and URL don't include semicolons either). The error is also still present with the user-agent parameter removed.
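
A likely explanation (just a guess, not confirmed against crawley's source): the censored value contains %3B, which is the URL-encoded form of ';' - if the cookie value gets URL-decoded before being set, the resulting literal semicolon is what Go's net/http complains about. Decoding a shortened stand-in for that value (via python3 here, purely for illustration) shows it:

python3 -c 'from urllib.parse import unquote; print(unquote("a%3A4%3A%7Bi%3A0%3Bs%3A6%3A%22153366%22%3B%7D"))'
# -> a:4:{i:0;s:6:"153366";}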

@s0rg
Owner

s0rg commented Apr 8, 2023

Please, check version https://github.com/s0rg/crawley/releases/tag/v1.5.14
Cookie decoding is now gone, so it will load any cookie. And please keep in mind that you need to use the same user-agent with which you obtained those cookies - so copy both the cookies AND your user-agent from the request.

@evilsh3ll
Author

evilsh3ll commented Apr 12, 2023

I'm using 1.5.14 now. The cookie is loading correctly, but I get the context deadline exceeded error - only on https://ddunlimited.net/viewtopic.php?p=5018859. I extracted my exact user-agent string and used a fresh cookie (from the same browser), but the same error is still present.

2023/04/12 10:18:18 [*] config: workers: 4 depth: 0 delay: 150ms
2023/04/12 10:18:18 [*] crawling url: https://ddunlimited.net/viewtopic.php?t=3610730
2023/04/12 10:18:24 [-] GET https://ddunlimited.net/viewtopic.php?t=3610730: Get "https://ddunlimited.net/viewtopic.php?t=3610730": context deadline exceeded
2023/04/12 10:18:24 [*] complete
Time: 5 s.

I also get the same error without cookies, and with the -delay 2000ms option. I can't find any info online about the context deadline exceeded error (except that it seems to be a timeout error).

Thanks for your support

@s0rg
Owner

s0rg commented Apr 13, 2023

Please check the fresh release: https://github.com/s0rg/crawley/releases/tag/v1.6.0 - it has a -timeout option, so you can increase the default timeout (5 seconds).
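
For example (a sketch - the exact value syntax for -timeout is assumed to be a Go-style duration such as 30s, and the cookie values are placeholders):

crawley -headless -timeout 30s -user-agent 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/111.0' -cookie 'phpbb3_ddu4final_sid=XXX;phpbb3_ddu4final_u=XXX' 'https://ddunlimited.net/viewtopic.php?p=5018859' > list.txt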
