[This is a Guest Diary by Gregory Weber, an ISC intern as part of the SANS.edu BACS program]
For the last 5 months, as part of my BACS internship with SANS, I have monitored two deployments of a DShield Sensor, sometimes referred to as a honeypot. The DShield sensor offers multiple attack surfaces including Telnet and SSH ports but one of its features is a public-facing web server. One of my deployments sits on a cloud instance and this web server sees a large volume of traffic, making it ideal for research on web server attacks.
Many of the web “attacks” I have observed are rapid-fire URL submissions to the WordPress server meant to see if the server will reveal any of its “secrets” like encryption key files, user accounts, or back-end logic. Moreover, the submissions are automated and often appear to be “just passing by and saw you were a web server so thought I would try” opportunity checks (like a crook pulling door handles in a parking lot to see if anything happens to open for a quick snag). As a community, information security professionals are probably more concerned with targeted attacks on their organizations, but crimes of opportunity can be just as damaging, particularly where they reveal the existence of weaknesses to an attack group that might otherwise never bother with that specific organization.
While tending to my daily analysis, I have also been progressing through SEC595 “Applied Data Science and AI/Machine Learning for Cybersecurity Professionals”. I enjoy the challenges of coding and I am fascinated with data driven decisions; particularly where carefully thought-out data science logic can help us separate out those things which our human problem-solving skills and expertise need to focus on versus the thousands of things they do not.
As such, I decided to experiment with applying frequency analysis to the DShield data I had been collecting, just to see whether I could write a simple classification program. I chose to focus on the web honeypot URL data and write a program that parses a URL and determines whether it represents an intrusive request or what I call a legitimate request. The experiment differs from many other URL classification programs in that those classifiers are often focused on user-initiated connections to external sites. In other words, those programs attempt to determine if a URL a user is clicking or typing is malicious based on metrics such as “known bad” IP address lists or name lists. This program is focused on those URLs that may get submitted to a public-facing web server in attempts to scope the server’s logic, perform command injection, perform server-side request forgeries, or retrieve restricted files from a database or file directory that trusts the server.
Why this project?
I should state up front that I am aware this is not groundbreaking: WAFs and lots of other goodies organizations purchase use sophisticated methods to tag-team this task as part of a layered defense. Anyone reading this is no doubt aware that Web Application Firewalls are designed to apply many checks to intercept malicious web requests before they can ever reach the server. A strong defense-in-depth strategy for any public-facing server will also harden the server itself with input validation logic, parameterized queries, and similar measures to protect it from anything that makes it past the firewall. In the spirit of least-privilege design, it will also restrict the permissions of the server account and remove any files from its directories that are not needed to perform its functions. However, web servers remain among the most commonly attacked systems, and this would not be so if the attacks did not work some of the time. There is another important consideration as well.
Information security in 2025 has continued to shift away from the idea that keeping attackers “out” is the strongest strategy. Keeping attackers out is still the ideal goal, but the community has recognized that with so many applications, patches, coding libraries, and systems that need to talk to each other, no organization bats 1.000 at keeping attackers out. Applying statistical data science to an already robustly protected web application gives security monitors another tool. Consider an as-yet undiscovered vulnerability or exposed resource that allows a maliciously crafted URL to work successfully. The WAF does not intercept it and the server processes the request and returns a 200 response, making it much more likely that no one will be aware the attack was successful. A statistical approach that alerts the SOC based on the probability that a URL is crafted to be intrusive does not rely on a logged 400-level response or a WAF rule; it simply alerts someone to put eyes on the request. Security tools rely on rules, and engineers update those rules as new exploits are discovered. Until the new rule is written, the tool does not alert anyone because it does not “know” that it should. Statistical models can alert based on probability, and these models can discern abnormal requests based on features, a topic that is extremely fascinating and promising for the blue team.
For me, this was a learning experience and an attempt to experiment with DShield data while simultaneously gaining more experience with malicious web queries; it is not sophisticated but offers an example of how these models function.
Approach
For continued learning, I want to apply various descriptive statistical models, probability theory, and eventually more sophisticated machine learning models to URL suffixes to categorize them as either legitimate or intrusion-oriented using techniques practiced in SEC595. This diary focuses on the simplest approach using frequency analysis as the basis to classify the URL. Frequency is simply the idea of how often something occurs in a set of data (in this case URL suffixes).
It is a fair question to ask how and why something like this would be expected to work. Leaving out the “how” for a moment, the “why” comes down to the entire concept of attacking a machine that accepts user input. The attacks highlighted by OWASP share the commonality that they use malformed input to trick an application. Therefore, it seemed possible to me that if I could build a dictionary of words or phrases that are ‘malformed’ along with a separate dictionary of words and phrases common to everyday URL submissions, I could take any URL, parse it into its pieces, and then compare how many of those pieces appear in the “normal” dictionary versus how many appear in the “malformed” dictionary. Whichever dictionary contained more pieces of the URL would determine whether the URL was likely an attack. Although I don’t explore it in this experiment, it would be relatively easy to move beyond a simple majority vote and set a different threshold (say 70/30, or whatever seems to work best).
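The majority vote and the alternative threshold described above can be sketched as follows. This is a minimal illustration and not code from the experiment: the function name `classify_by_share`, its parameters, and the fallback for URLs with no dictionary hits are all assumptions made for the example.

```python
def classify_by_share(legit_hits, intrusive_hits, intrusive_threshold=0.5):
    """Label a URL from its dictionary hit counts (hypothetical helper).

    The URL is labelled 'intrusive' when the share of its matched pieces
    found in the intrusive dictionary is strictly greater than
    intrusive_threshold. The default of 0.5 is a simple majority vote in
    which ties go to 'legit'.
    """
    total_hits = legit_hits + intrusive_hits
    if total_hits == 0:
        # Assumption: with no evidence either way, fall back to 'legit'
        return 'legit'
    intrusive_share = intrusive_hits / total_hits
    return 'intrusive' if intrusive_share > intrusive_threshold else 'legit'


# 2 pieces matched the legit dictionary, 3 matched the intrusive dictionary
print(classify_by_share(2, 3))                           # majority vote
print(classify_by_share(2, 3, intrusive_threshold=0.7))  # stricter 70/30 rule
```

Raising the threshold trades missed intrusive URLs for fewer false alarms, so the value would need to be tuned against known data rather than picked in advance.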
Below I will walk through the specific steps and provide the code used (feedback is welcome). The program specifically focuses on the content contained after “www.website.tld”
Overview of Steps for Frequency Classification
1. Obtain URL requests generally deemed malicious (DShield 404 logs).
2. Obtain URL requests that are legitimate to normal website traversal
3. Isolate the specific suffixes and create a dictionary of words/phrases present in the URLs, one based on the legitimate URLs and the other on known malicious URLs.
4. Create frequency function that classifies a random URL based on the comparison of parts in each dictionary.
5. See if the accuracy has revealed anything useful.
6. Refine and enhance with other more sophisticated machine learning methods as time allows in the future
Steps 1 and 2: Obtaining the Data
To obtain malicious web requests, I utilized a DShield sensor deployed as part of BACS 4499 that contains a web server and logs URL requests to the server. Since this device is specifically deployed to be attacked and, indeed, has been “attacked” throughout the last 6 months, my experiment makes the assumption all of these URL requests except as noted below are attempts to expose restricted access rather than legitimate web requests. The exception will be any “/” only requests as those represent the root of the web server and are common to all top page-level requests.
The more difficult challenge was to obtain URLs of legitimate website interaction. The challenge is two-fold:
1. Locating a source of aggregate data like this without visiting random websites, which would likely build only a very narrow list.
2. Websites are mapped specific to an organization. There are no ‘rules’ for how this is done, though there are fairly conventional URL structures based on the use of LAMP stacks and typical application programming.
To overcome this obstacle, I decided to use a dataset generated as part of a demonstration on how to use Python code to map websites. The code crawled various legitimate websites, creating URL links of full depth into the sites. Again, due to the wide variety of site structures prevalent across the internet, this data does not provide a complete training set (neither does the DShield intrusive URL data), but it did provide a great starting point to begin experimenting. The dataset of “legit” URLs is courtesy of Elias Dabbas, published on the Kaggle.com repository.
The URLs were then separated into two folders of CSV files containing links: one legit and one intrusion-oriented. For both folders, some of the data is left unused until it is time to test the function's accuracy on known data. The first 5 files of each type are shown below. The reader will observe there are more “Intrusive” files than “Legit” files; however, the legitimate URL data files are significantly larger (they contain more URLs each).
print(f'Num intrusive URL files: {len(intrusive_files)}\tNum of legit URL files: {len(legit_files)}')
print(intrusive_files[0:5], '\n', legit_files[0:5])
Num intrusive URL files: 25 Num of legit URL files: 6
['./Attack Observations/Project/Intrusive/404reports_2025-01-25.csv', './Attack Observations/Project/Intrusive/404reports_2025-01-30.csv', './Attack Observations/Project/Intrusive/404reports_2025-02-13.csv', './Attack Observations/Project/Intrusive/404reports_2025-02-14.csv', './Attack Observations/Project/Intrusive/404reports_2025-02-15.csv']
['./Attack Observations/Project/Legit/sitemap_2022_12_30_google_com_cleaned.csv', './Attack Observations/Project/Legit/sitemap_2023_01_03_searchenginejournal_com_cleaned.csv', './Attack Observations/Project/Legit/sitemap_2023_01_08_foreignpolicy_com_cleaned.csv', './Attack Observations/Project/Legit/sitemap_2023_01_11_apple_com_cleaned.csv', './Attack Observations/Project/Legit/sitemap_2023_01_31_washingtonpost_com_cleaned.csv']
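The listing above relies on two lists, intrusive_files and legit_files, whose construction is not shown. One way they might be built is sketched here, assuming the folder layout visible in the printed paths. The helper name `list_csv_files` is hypothetical, and sorting by file name is an assumption (it keeps the date-stamped reports in chronological order so that slices of the list correspond to date ranges).

```python
import glob
import os


def list_csv_files(folder):
    """Return the CSV file paths in folder, sorted by name (hypothetical helper)."""
    return sorted(glob.glob(os.path.join(folder, '*.csv')))


# Folder names taken from the paths printed in the output above
intrusive_files = list_csv_files('./Attack Observations/Project/Intrusive')
legit_files = list_csv_files('./Attack Observations/Project/Legit')
```

If a folder does not exist, glob simply returns an empty list, so it is worth checking the counts before moving on.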
Step 3 (part 1): Isolate the specific suffixes
I reviewed hundreds of the URLs from the DShield sensor throughout my internship as part of my daily analysis. This experiment relies on creating a useful list of words, phrases, or other structural commonalities in intrusively sent web requests as well as a useful list for legit web requests. I have very little experience setting up public-facing web applications or websites meant for user interaction. To keep things manageable, I decided to focus on two keys when parsing the URLs: specific words/phrases contained in the URL, and the presence of either a period ‘.’ or a dash ‘-’ in the suffix. There is certainly room for improvement, but I have found many malformed URLs contain periods and dashes as a way to trip logic. There are of course other characters used for SQLi and similar attacks, but again, I kept it simple to start.
The code block that follows defines a function “parse_url_words” that takes a single URL, uses a regular expression to extract the suffix, and returns a Python list of the individual words/phrases parsed from it.
import re

def parse_url_words(url='https://www.coffee4all.com/subpage1/subpage2'):
    # Returns a list of the unique words/phrases (plus '.' and '-' markers) found in the URL suffix

    # Regex block to account for an FQDN as well as truncated records containing only '/sub1/sub2/etc.'
    extract_domain = re.match(r'https?://.{,3}\..+?\..+?/', url)
    if extract_domain:
        # re.escape keeps characters such as '.' in the matched prefix from acting as regex operators
        regex = re.compile(f'(?:{re.escape(extract_domain.group())}|/)(.*?)$')
    else:
        regex = re.compile('/(.*?)$')
    url_suffix = re.findall(regex, url)

    # Error check and account for lines that have no returned value
    if url_suffix == [] or url_suffix == ['']:
        return []

    if '/' in url_suffix[0]:
        url_word_list = url_suffix[0].split('/')
    else:
        url_word_list = url_suffix

    # Extract words and break phrases into smaller parts by splitting on '-' or '.'
    additionals = []
    period = False
    dash = False
    for phrase in url_word_list:
        if '.' in phrase:
            additionals += phrase.split('.')
            period = True
        if '-' in phrase:
            additionals += phrase.split('-')
            dash = True
    if period:
        url_word_list.append('.')
    if dash:
        url_word_list.append('-')
    url_word_list += additionals

    # Reduce to unique values and drop the empty string left by leading or trailing separators
    url_word_list = list(set(url_word_list))
    if '' in url_word_list:
        url_word_list.remove('')
    return url_word_list
# Test the function
test_legit_URL = parse_url_words('https://www.coffee4me.plzzz/top/order/pay/to-us')
print(len(test_legit_URL), '\n', test_legit_URL)
7
['us', '-', 'pay', 'order', 'top', 'to', 'to-us']
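For comparison only, the standard library can also separate the host from the path without a hand-written regular expression. The sketch below is not the method used in this experiment. It uses urllib.parse.urlsplit, and the helper name `path_pieces` is hypothetical. It handles both full URLs and records that contain only the path.

```python
from urllib.parse import urlsplit


def path_pieces(url):
    """Split the path of a URL into its non-empty '/'-separated pieces."""
    path = urlsplit(url).path
    return [piece for piece in path.split('/') if piece]


print(path_pieces('https://www.coffee4me.plzzz/top/order/pay/to-us'))
print(path_pieces('/wp-content/debug.log'))
print(path_pieces('http://api.ipify.org/'))
```

One caveat: urlsplit can raise ValueError on some malformed inputs, so honeypot data would likely need a try/except around the call.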
Step 3 (part 2): Build Dictionaries of Words
Using the parse_url_words function, I then looped on both directories to build datasets called “legit” and “intrusive”. The reader may observe the code block below uses a Counter() dictionary. Although the data sources were not extremely large (a potential weakness in the experiment discussed further in the conclusion), it was apparent that using Python lists would result in large numbers of repeats. For efficiency of memory and time, the lists are effectively reduced to unique values automatically by using a counter dictionary. It would have been possible to accomplish this using Python sets (as is done when parsing an individual URL); however, I like the added functionality and speed a dictionary provides for larger data sets. A counter dictionary specifically allowed me to store the frequency of each word/phrase in the values (since the words/phrases themselves are the keys). Although this diary's frequency classification does not make use of the counter’s added functionality, there is no doubt that when I build on the experiment later, having that added functionality will be nice.
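As a small illustration of the behavior described above, the sketch below feeds two made-up word lists into a Counter. The word lists are invented for the example and are not taken from the datasets.

```python
from collections import Counter

words = Counter()
# update() accepts any iterable of hashable items and adds one count per item
words.update(['wp-config.php', 'wp', 'config', 'php'])
words.update(['config.php', 'config', 'php'])

print(len(words))            # number of unique words/phrases stored as keys
print(words['php'])          # how many times 'php' was added
print(words['index.html'])   # a missing key reports 0 instead of raising KeyError
print(words.most_common(2))  # the two most frequent entries
```

Because parse_url_words returns each piece at most once per URL, the count stored for a piece is effectively the number of URLs in which that piece appeared.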
# Legit words from the first 4 files (preserving 2 for testing)
from collections import Counter

legit = Counter()
for file in legit_files[0:4]:
    with open(file, 'rb') as fh:
        for url_line in fh:  # Iterates on lines as many sample datasets are large enough to consume memory
            content = url_line.decode().lower().strip()
            legit.update(parse_url_words(content))

# Intrusive words from the first 20 files (preserving 5 for testing)
intrusive = Counter()
for file in intrusive_files[0:20]:
    with open(file, 'rb') as fh:
        for url_line in fh:  # Iterates on lines as many sample datasets are large enough to consume memory
            content = url_line.decode().lower().strip()
            intrusive.update(parse_url_words(content))

# Display 100 words contained in each of the dictionaries
print(f'Total number of words in Legit: {len(legit)}\n')
print(list(legit.keys())[100:200], '\n')
print(f'Total number of word in Intrusive: {len(intrusive)}\n')
print(list(intrusive.keys())[100:200])
Total number of words in Legit: 490321 [TRUNCATED]
['zh_hk', 'signup_complete', 'signup_complete.html,', '.', 'html,', 'mediatools', 'get', 'develop', 'develop.html,', 'engage.html,', 'engage', 'gather.html,', 'gather', 'publish', 'publish.html,', 'resources.html,', 'resources', 'search', 'search.html,', 'visualize', 'visualize.html,', 'videoqualityreport', 'm', 'faq.html,', 'faq', 'faster', 'web.html,', 'faster-web', 'faster-web.html,', 'how.html,', 'how', 'methodology', 'methodology.html,', 'youtube', 'youtube.html,', 'cardboard', 'apps', 'android', 'buy', 'buy-cardboard-android', 'ios', 'buy-cardboard-ios', 'buy-cardboard', 'developers', 'download', 'get-cardboard', 'jump', 'manufacturers', 'product-safety', 'safety', 'product', 'sundance', 'viewerprofilegenerator', 'es_mx', 'fr_ca', 'pt_br', 'pt_pt', 'plastic', 'expire.html,', 'expire', 'journalismfellowship', 'thankyou', 'thankyou.html,', 'noto', 'feedback', 'cjk', 'help', 'emoji', 'activities', 'activities.html,', 'nature.html,', 'animals-nature', 'animals-nature.html,', 'animals', 'flags.html,', 'flags', 'food-drink.html,', 'drink.html,', 'food-drink', 'food', 'objects.html,', 'objects', 'smileys', 'smileys-people', 'smileys-people.html,', 'people.html,', 'symbols', 'symbols.html,', 'travel-places.html,', 'travel', 'places.html,', 'travel-places', 'guidelines', 'install', 'updates', 'projectlink', 'google2ba69e9df6ccb5fb.html,', 'google2ba69e9df6ccb5fb', 'spectrumdatabase', 'business']
Total number of word in Intrusive: 12608 [TRUNCATED]
['config.ini', 'hnap1', 'secrets', 'secrets.json', 'wp-config.php', 'products', 'view.php', 'src', 'settings.js', 'main.js', 'main', 'server', 'server-info', 'bundleconfig.json', 'bundleconfig', 'wp-content', 'debug.log', 'content', 'phpversion', 'phpversion.php', 'secrets.yml', 'services.php', 'debug.php', 'production.json', 'production', 'php~', 'config.php~', 'wp-config.php~', '.env.local', 'local', 'broadcasting.php', 'broadcasting', 'settings.json', 'server.js', 'config.env', 'env.json', 'file', 'status', 'server-status', 'keys.js', 'keys', 'application.properties', 'properties', 'test1.php', 'test1', 'mail.php', 'mail', 'environment', 'environment.ts', 'ts', 'acl', 'acl.config.php', 'library', 'global.php', 'autoload', 'global', 'phpconf.php', 'phpconf', 'session.php', 'session', '.env.bak', 'bak', 'wp-config.org', 'config.org', 'org', 'database.yml', 'database', 'config.json', 'default.json', 'index.js', 'bootstrap', 'resources', 'bootstrap.yml', 'test.json', 'dev', 'php.php', 'php_info.php', 'php_info', 'test2.php', 'test2', 'database.php', '.env.dev', 'tmp', 'index.html', 'server.php', 'test.config.php', 'front', 'queue.php', 'queue', 'config.properties', 'aws.json', 'config.bak', 'wp-config.bak', 'crm', 'xampp', 'users', 'prod', 'admins', 'infos.php', 'infos']
Step 4: Using Frequency to Categorize a URL as Intrusive or Legit
The reader may question whether the significant size difference between the two dictionaries is going to skew the results. This is fair and is ideally addressed by finding more sources of malformed URLs (which is something I will do as I move forward). However, I would expect the malformed URL dictionary to be much smaller: there are likely many more variations of legitimate URLs than intrusive ones (an assumption only at this point), and the only source of intrusive URLs so far is the DShield sensor, whose logs reveal a high number of repeated attempts using lists of URLs (much like lists of common passwords).
Frequency is simply a measure of how often something occurs in a set of data. To classify a random URL as either Intrusive or Legit by frequency means determining whether its word and phrase structure, when parsed into a list, has more items that appear in the “Legit” dataset, or more items that appear in the “Intrusive” dataset.
The code block that follows defines a function called “url_word_count” that passes an individual URL to the URL parsing function above and returns a tuple count of the number of words/phrases present in the legit dictionary as well as the intrusive dictionary.
The function called “frequency_classifier” will receive a full file of URLs, make use of the url_word_count and then classify each URL as legitimate or intrusive. It will then display the results.
def url_word_count(url='https://www.coffee4all.com/subpage1/subpage2'):
    # Returns (number of URL pieces found in the legit dictionary, number found in the intrusive dictionary)
    all_words = parse_url_words(url)
    url_legit_words = [word for word in all_words if word in legit]
    url_intrusive_words = [word for word in all_words if word in intrusive]
    return len(url_legit_words), len(url_intrusive_words)

def frequency_classifier(filename):
    # Classifies every URL in a file and prints the totals for each category
    legit_urls = 0
    intrusive_urls = 0
    total = 0
    with open(filename, 'rb') as fh:
        for url_line in fh:
            content = url_line.decode().lower().strip()
            if content == '/':
                continue  # Root-only requests are skipped
            # Local names avoid shadowing the global legit/intrusive dictionaries
            legit_count, intrusive_count = url_word_count(content)
            if legit_count >= intrusive_count:  # Ties are classified as legit
                legit_urls += 1
            else:
                intrusive_urls += 1
            total += 1
    print(f'File Name: {filename}')
    print(f'Total URLs in the file: {total}')
    if total == 0:
        return  # Nothing to report, and avoids dividing by zero below
    print(f'Predicted Legit URLs: {legit_urls}\tPercentage: {legit_urls/total:.2%}')
    print(f'Predicted Intrusive URLs: {intrusive_urls}\tPercent Intrusive: {intrusive_urls/total:.2%}\n')
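To make the counting and the tie rule concrete, here is a self-contained toy version. The two small dictionaries and the pre-parsed piece lists are invented for illustration, and the helper names `count_hits` and `label` are hypothetical. The real dictionaries from Step 3 are far larger.

```python
from collections import Counter

# Toy dictionaries standing in for the much larger ones built in Step 3
toy_legit = Counter(['news', 'world', 'index.html', '.', 'html', 'index'])
toy_intrusive = Counter(['.env', '.', 'env', 'wp-config.php', 'php', 'index'])


def count_hits(pieces):
    """Return (legit_hits, intrusive_hits) for an already-parsed URL."""
    legit_hits = sum(1 for piece in pieces if piece in toy_legit)
    intrusive_hits = sum(1 for piece in pieces if piece in toy_intrusive)
    return legit_hits, intrusive_hits


def label(pieces):
    """Apply the same rule as frequency_classifier: ties are labelled legit."""
    legit_hits, intrusive_hits = count_hits(pieces)
    return 'legit' if legit_hits >= intrusive_hits else 'intrusive'


print(count_hits(['news', 'world']), label(['news', 'world']))
print(count_hits(['.env', '.', 'env']), label(['.env', '.', 'env']))
print(count_hits(['unseen']), label(['unseen']))
```

The last call shows a consequence of the tie rule: a URL whose pieces appear in neither dictionary scores 0 to 0 and is labelled legit.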
As a reminder for clarity, the test data was set aside and not used to generate the legit/intrusive datasets, but the data comes from the same sources so it is known which category the frequency classifier “should” come up with. This provides a way to check the classifier against known data.
The code blocks below run the function on the legit and intrusive test data files, listing the percentage of each the classifier found. An ideal score would be 100% of URLs predicted Legit for the legitimate URL test data and similar for the intrusive URL test data.
# Test data for Legit files are items 5-6 (the last two files)
print("Test data results for Legitimate Files\n", 50*'-')
for file in legit_files[len(legit_files)-2:]:
    frequency_classifier(file)
print("\nTest data results for Intrusive Files\n", 50*'-')
for file in intrusive_files[len(intrusive_files)-5:]:
    frequency_classifier(file)
Test data results for Legitimate Files
--------------------------------------------------
File Name: ./Attack Observations/Project/Legit/sitemap_2023_01_31_washingtonpost_com_cleaned.csv
Total URLs in the file: 847794
Predicted Legit URLs: 847792 Percentage: 100.00%
Predicted Intrusive URLs: 2 Percent Intrusive: 0.00%
File Name: ./Attack Observations/Project/Legit/sitemap_2023_03_16_economist_com_cleaned.csv
Total URLs in the file: 190424
Predicted Legit URLs: 190424 Percentage: 100.00%
Predicted Intrusive URLs: 0 Percent Intrusive: 0.00%
Test data results for Intrusive Files
--------------------------------------------------
File Name: ./Attack Observations/Project/Intrusive/404reports_2025-03-15.csv
Total URLs in the file: 591
Predicted Legit URLs: 128 Percentage: 21.66%
Predicted Intrusive URLs: 463 Percent Intrusive: 78.34%
File Name: ./Attack Observations/Project/Intrusive/404reports_2025-03-16.csv
Total URLs in the file: 137
Predicted Legit URLs: 10 Percentage: 7.30%
Predicted Intrusive URLs: 127 Percent Intrusive: 92.70%
File Name: ./Attack Observations/Project/Intrusive/404reports_2025-03-17.csv
Total URLs in the file: 579
Predicted Legit URLs: 2 Percentage: 0.35%
Predicted Intrusive URLs: 577 Percent Intrusive: 99.65%
File Name: ./Attack Observations/Project/Intrusive/404reports_2025-03-18.csv
Total URLs in the file: 1366
Predicted Legit URLs: 2 Percentage: 0.15%
Predicted Intrusive URLs: 1364 Percent Intrusive: 99.85%
File Name: ./Attack Observations/Project/Intrusive/404reports_2025-03-19.csv
Total URLs in the file: 348
Predicted Legit URLs: 2 Percentage: 0.57%
Predicted Intrusive URLs: 346 Percent Intrusive: 99.43%
Analysis of Test Results
The test runs showed that a simple frequency comparison of the words or phrases contained in the suffix of a URL accurately predicted that nearly every legitimate URL was in fact legitimate. However, the intrusive tests proved less accurate. Specifically, the honeypot logs for March 15 and March 16 contained significant classification errors: 21.66% and 7.30% of their URLs, respectively, were labelled legitimate.
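The per-file percentages can be rolled up into a single figure for the intrusive test set. The sketch below only re-adds the counts copied from the test output above. The variable names are chosen for the example.

```python
# (total URLs, predicted intrusive) per intrusive test file, copied from the output above
intrusive_results = {
    '2025-03-15': (591, 463),
    '2025-03-16': (137, 127),
    '2025-03-17': (579, 577),
    '2025-03-18': (1366, 1364),
    '2025-03-19': (348, 346),
}

total_urls = sum(total for total, _ in intrusive_results.values())
total_flagged = sum(flagged for _, flagged in intrusive_results.values())
overall_rate = total_flagged / total_urls

print(f'Intrusive URLs tested: {total_urls}')
print(f'Correctly flagged: {total_flagged}')
print(f'Overall detection rate: {overall_rate:.2%}')
```

Of the 144 intrusive URLs that were missed across the five files, 128 came from the March 15 file alone, which is why that file deserves the closest look.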
I performed an in-depth analysis of the files in question, utilizing a code block to print out those URLs the classifier had labelled as legitimate even though they were in the known intrusive test set. I was rather put off to discover a few of these URLs made it past the frequency classifier:
URL: http://api.ipify.org/ LEGIT number 108
URL: /credentials LEGIT number 109
URL: /robots.txt LEGIT number 110
URL: /ping_pong.php LEGIT number 111
URL: /robots.txt LEGIT number 112
URL: http://api.ipify.org/ LEGIT number 113
URL: http://the-cat.click/validate LEGIT number 114
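A listing like the one above could be produced along the following lines. This is a sketch and not the code block actually used for the analysis: the function name `report_missed` is hypothetical, it takes the counting function as a parameter so the example stays self-contained, and `demo_counter` is a stand-in used only to make it runnable.

```python
def report_missed(urls, count_hits):
    """Print and return URLs from a known-intrusive list that were labelled legit.

    count_hits is any callable returning (legit_hits, intrusive_hits) for a
    URL, for example the url_word_count function defined earlier.
    """
    missed = []
    for url in urls:
        url = url.lower().strip()
        if url == '/':
            continue  # Root-only requests are excluded, as in the classifier
        legit_hits, intrusive_hits = count_hits(url)
        if legit_hits >= intrusive_hits:
            missed.append(url)
            print(f'URL: {url}\tLEGIT number {len(missed)}')
    return missed


# Stand-in counter for demonstration only: treats anything containing '.env' as intrusive
def demo_counter(url):
    return (0, 1) if '.env' in url else (0, 0)


missed = report_missed(['/.env', '/robots.txt', '/', '/credentials'], demo_counter)
```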
This means that a URL like /robots.txt was not in the training data that built the word dictionaries, and it highlights the potential for incomplete data when using only one source (in this case DShield logs). Because the classifier labels a URL legit whenever its legit count is at least equal to its intrusive count, a URL whose pieces are missing from the intrusive dictionary defaults to legit. A larger disappointment to me was the presence of other URLs within the URL suffix. These represent the opportunity to perform server-side request forgeries, which can be used to reveal restricted information, have the server request something on the attacker's behalf from a third party that trusts it, or even reveal instance metadata (in the case of a cloud-based server or container).
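One simple way to keep requests like these from slipping through would be a separate check for a full URL anywhere in the logged request string. The sketch below is an assumption about how that might look and is not part of the experiment. The name `has_embedded_url` is hypothetical, and the third sample request is invented for illustration (169.254.169.254 is the link-local address commonly used by cloud metadata services).

```python
import re

# Matches an http:// or https:// scheme anywhere in the request string
EMBEDDED_URL = re.compile(r'https?://', re.IGNORECASE)


def has_embedded_url(request):
    """Return True when the logged request string contains a full URL."""
    return bool(EMBEDDED_URL.search(request))


samples = ['/robots.txt', 'http://api.ipify.org/', '/proxy?url=http://169.254.169.254/']
for request in samples:
    print(request, has_embedded_url(request))
```

A check like this would run alongside the frequency vote rather than replace it, since a full URL in the request line is a structural signal that the word dictionaries do not capture.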
Final Thoughts
The frequency classifier performed well for a first-time run by an admitted rookie to intrusion analysis. Though simple, it proved to be a useful and different way for me to gain experience as part of my internship program, and this is only the beginning. There are several weaknesses I need to address, among them: the extremely broad set of legitimate URL structures out there, my simplistic regular expression that parses the URLs, and the statistical model used. I plan to improve upon the experiment, attempting more sophisticated models as I gain experience in both intrusion analysis and machine learning techniques.
As a final note, I would like to acknowledge the work of SANS Instructor David Hoelzer in authoring the material for SEC595. It is this work that provided me the idea and starting point to attempt techniques from those class labs to apply to my DShield we
[1] https://www.sans.org/cyber-security-courses/applied-data-science-machine-learning/
[2] https://www.sans.edu/cyber-security-programs/bachelors-degree/
———–
Guy Bruneau IPSS Inc.
My GitHub Page
Twitter: GuyBruneau
gbruneau at isc dot sans dot edu
(c) SANS Internet Storm Center. https://isc.sans.edu Creative Commons Attribution-Noncommercial 3.0 United States License.