Focus on learning the definition of information gathering for newcomers that want to upgrade their level of understanding and practice it properly.

In web security testing, information gathering is the first vital phase before you do deeper tests such as finding vulnerabilities or exploiting a site. It's like studying a building's layout before checking its doors or windows for weaknesses.

Information gathering help testers understand "what technologies are used?", "what services are exposed?", "what potential weaknesses may exist?".

This phase does not involve exploitation. It is focused on observation and analysis.

What is Information Gathering?

Information gathering is the process of collecting data about a target web application. The purpose is to build an initial understanding of the system.

In OWASP WSTG, this phase includes several activities such as:

  • Search engine reconnaissance
  • Web server fingerprinting
  • Reviewing server metafiles
  • Enumerating applications
  • Reviewing web content

Each activity helps reveal different types of information.

Why Information Gathering Is Important

Information gathering is important because it prepares testers before they start deeper security testing. It helps them understand how the web application works, what technologies are used, and what parts of the system are exposed.

Without proper information gathering, testers might miss important areas of the application. Some vulnerabilities may stay hidden simply because the tester did not know where to look. Testing can also become inefficient, since time may be spent on the wrong parts of the system.

When reconnaissance is done carefully, it makes the next steps, such as vulnerability analysis and further testing more accurate and more focused.

1. Conduct Search Engine Discovery Reconnaissance

Search engines can contain indexed information about a website. Sometimes, sensitive or internal files are accidentally exposed and indexed.

This is called Google Dorking, it lets you use advanced search operators to dig deeper, uncovering data that's otherwise hidden or difficult to find.

This technique is passive because it does not directly interact with the target server.

Example

Using Google search operators:

site:example.com filetype:pdf
site:example.com intitle:"admin"
site:example.com inurl:backup

These queries may reveal:

  • Public documents
  • Admin panels
  • Backup directories
  • Configuration files

If a backup file appears in search results, this indicates information leakage.

See what i'll find just by typing that to the google search. For instance i typed: "mahasiswa site:ubl.ac.id filetype:pdf".

None
This command narrows down your search to only PDF files related to "Mahasiswa" from only the ubl.ac.id site.

The goal of what i've done just now is to acknowledge the people related to the information i'm searching and what kind of information they might gave me.

Different Google Dorking Techniques

Google Dorking techniques primarily involve using specific search operators. Below are some of the most commonly used methods:

  1. Filetype: This operator searches for specific file types. For example, `filetype:pdf` would return PDF files.
  2. Inurl: The `inurl:` operator can be used to find specific words within the URL of a page. For example, `inurl:login` would return pages with 'login' in the URL.
  3. Intext: With the `intext:` operator, you can search for specific text within the content of a web page. For example, `intext:"password"` would yield pages that contain the word "password".
  4. Intitle: The `intitle:` operator is used to search for specific terms in the title of a webpage. For example, `intitle:"index of"` could reveal web servers with directory listing enabled.
  5. Link: The `link:` operator can be used to find pages that link to a specific URL. For example, `link:example.com` would find pages linking to example.com.
  6. Site: The `site:` operator allows you to search within a specific site. For example, `site:example.com` would search within example.com.

2. Fingerprint the Web Server

Web server fingerprinting is the process of identifying:

  • The web server type (Apache, Nginx, IIS)
  • The server version
  • Operating system indicators
  • Technologies used

This information helps testers identify possible known vulnerabilities related to specific versions.

Identity HTTP headers (using curl):

curl -I https://example.com

The goal is to understand the software server and basic config.

Example output:

None
  • The server software

This means the website uses OpenResty, OpenResty is built on Nginx, It works as a web server and reverse proxy, The version number is hidden (this is good for security)

  • The backend language

The header shows /wp-json/,This indicates the website uses WordPress, WordPress is built with PHP and usually uses MySQL as its database

So we can assume the Backend language is PHP and Database is MySQL

  • Security Configuration

Strict-Transport-Security is enabled, this means HSTS is active, the site forces HTTPS connections thus improves security.

Another tool example (Using Nmap):

Using Nmap is one of the best tools to helps us understand what is running behind the web app. It supports vulnerability research, giving us a grasp understanding of the service and version to help map possible CVEs. And so to understand the attack surface, giving us a broader view beyond just the website interface.

Nmap helps testers move from "what I see in the browser" to "what is actually running on the server."

In this practice, i'll be using commonly used Nmap Options.

#detect service versions running on open ports.
nmap -sV example.com
None
From this scan, we know that nmap sends requests to port 80 and 443, and it is said here that they used OpenResty web app server as their service version.
#Nmap full TCP port scan (0–65535)
nmap -Pn -sT -sV -p0-65535 example.com

-Pn
Skip host discovery. Treat the target as online and scan directly.

-sT
TCP connect scan. Completes full TCP handshake. Slower but works without root.

-p-
Scan all 65535 ports.
None
The Nmap full TCP port scan (0–65535) identified only ports 80 (HTTP) and 443 (HTTPS) as open. All other ports were filtered, and no services were detected running
#Check DNS records, Shows which DNS servers are authoritative for the domain (basic detail)
host -t ns example.com

#Check dns detailed information
dig example.com
dig (Domain Information Groper) retrieves detailed DNS information.

#basic dns queries (medium detail)
nslookup example.com

#retrieves Domain registration information (registrar info)
whois example.com
None
The results shows that the server ubl.ac.id uses Cloudflare name servers
None
The results shows it was identified that the domain ubl.ac.id resolves to IP address 45.249.216.238.
None
same with dig
None
From the results, we identified domain ownership is belongs to digitalregistra, the registrar's name, name servers, and it didnt have any DNSSEC.

Here are some DNS misconfiguration and it's security impact.

  1. Zone Transfer Enabled (AXFR)

Entire DNS zone can be downloaded -> Full infrastructure exposure (subdomains, internal hosts).

2. Dangling DNS Records

DNS points to deleted/unclaimed service -> Subdomain takeover

3. Missing SPF/DKIM/DMARC

No email authentication protection -> Email spoofing & phishing

4. Open Recursive Resolver

DNS server answers public recursive queries -> DNS amplification (DDoS abuse)

DNS misconfiguration can create serious security risks because DNS controls how users reach your infrastructure. If configured incorrectly, it can expose internal systems, enable attacks, or leak sensitive information.

3. Review Web Server Metafiles for Information Leakage

Web servers may contain public metafiles that provide technical information.

Common metafiles:

  • robots.txt
  • sitemap.xml
  • security.txt
  • .git directories

Example: robots.txt

Accessing:

https://example.com/robots.txt

Example content:

I'm trying to see if there's any metafiles in ubl.ac.id website.

None

Although this file is meant for search engines, it reveals internal directories.

If sensitive folders appear here, it indicates potential information exposure.

4. Enumerate Applications on the Web Server

Enumeration means identifying:

  • Subdirectories
  • Subdomains
  • Admin interfaces
  • CMS platforms
  • APIs

The objective is to understand the structure of the application.

Example

I'm using Wappalyzer to identify the technologies used by a website.

None
None
  • WordPress WordPress is detected, it means the website uses a CMS (Content Management System). This suggests there may be themes, plugins, and an admin panel. These components can become possible attack surfaces, especially if the version is outdated.
  • Rank Math SEO Rank Math SEO is detected, it means the site uses a WordPress plugin. Plugins can introduce vulnerabilities if not updated or properly configured.
  • jQuery jQuery is identified, it shows the site uses a JavaScript library. Older versions of jQuery may have known vulnerabilities, such as XSS.
  • PHP PHP is detected, it means the backend is built using PHP. This helps testers understand the server-side logic and possible common PHP misconfigurations.
  • MySQL MySQL is detected, it means the website uses a MySQL database. This indicates how data is stored and may suggest risks like SQL Injection if the application is not properly secured.
  • Nginx Nginx is detected, it means the site uses it as a web server or reverse proxy. This helps identify server configuration, request handling, and possible misconfigurations.
  • HSTS (HTTP Strict Transport Security) HSTS is detected, it means the site enforces HTTPS connections. This is a security feature that helps prevent downgrade attacks and man-in-the-middle attacks.

Another example:

dirsearch -u https://example.com

This tool attempts to discover hidden directories.

Enumeration helps testers understand the attack surface.

5. Review Web Page Content for Information Leakage

Reviewing the page source of a website is an important step in the information gathering phase. The objective is to identify information that is not visible in the normal user interface but is still publicly accessible through the browser.

Sometimes sensitive information is visible in:

  • HTML comments
  • Hidden input fields
  • JavaScript files
  • Metadata

Example

By using the "View Page Source" feature, testers can examine the underlying HTML, scripts, and metadata of a webpage

None
HTML comments may include developer notes or internal references.

Hidden input fields inside forms may reveal how data is processed on the server. JavaScript files can expose API endpoints, internal routes, or configuration variables. Metadata inside the <head> section may reveal the technologies or versions used by the application.

Viewing page source may show:

<!-- TODO: remove test credentials -->
<input type="hidden" value="internalKey123">

These comments or hidden values should not be publicly accessible.

Even if no critical data is exposed, reviewing page content helps testers understand the system structure and prepares them for deeper analysis in later testing phases.

Even small pieces of leaked information may assist later testing phases.

Sources:

WSTG — Latest — Information Gathering

https://owasp.org/www-project-web-security-testing-guide/latest/4-Web_Application_Security_Testing/01-Information_Gathering/README